GHI Data Integration
GHI Data Integration (v2.05) is a Python, Pentaho, SQL, and AWS-based system that standardizes and combines health datasets from WHO, Global Health Data Exchange, UNICEF, and the World Bank.
Team Objectives
- Eliminate manual data aggregation steps across health data sources
- Allow users to select specific datasets (diseases, years, indicators) without pulling entire sources
- Normalize country names to ISO 3166 codes for consistency
- Convert numeric measures (thousands/millions to units) and standardize labels
- Ensure uniform column headers and data structure across all datasets
- Apply database normalization principles to consolidated outputs
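The country-name and unit-conversion objectives above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the lookup table `ISO3_BY_NAME`, the `UNIT_MULTIPLIERS` mapping, the function names, and the row layout are all assumptions made for this example. The project lists `country_converter` as a required module, which would presumably take the place of the hand-written lookup shown here.

```python
# Hypothetical stand-in for a full country-name lookup (assumption for
# illustration; a library such as country_converter would replace it).
ISO3_BY_NAME = {
    "united states of america": "USA",
    "united states": "USA",
    "viet nam": "VNM",
    "vietnam": "VNM",
    "kenya": "KEN",
}

# Multipliers that turn a labeled magnitude into plain units.
UNIT_MULTIPLIERS = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def to_iso3(name):
    """Return the ISO 3166-1 alpha-3 code for a country name, or None if unknown."""
    return ISO3_BY_NAME.get(name.strip().lower())


def to_units(value, unit_label):
    """Scale a value reported in thousands or millions to plain units."""
    return value * UNIT_MULTIPLIERS[unit_label.strip().lower()]


def normalize_row(row):
    """Normalize one record with 'country', 'value', and 'unit' keys."""
    return {
        "country_code": to_iso3(row["country"]),
        "value": to_units(row["value"], row["unit"]),
        "unit": "units",
    }


if __name__ == "__main__":
    print(normalize_row({"country": " Viet Nam ", "value": 2.5, "unit": "Millions"}))
```

Returning `None` for an unrecognized name, rather than raising, lets a caller collect unmatched countries for review instead of halting a whole batch.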
Resources & Links
Google Doc
Automation Overview Document
Technical documentation for the automation pipeline
Link
GitHub — GHI Data Consolidation
Source code for the data integration pipeline
Video
Video Explanation (Google Drive)
Alex's walkthrough video of the data integration system
Link
Python Downloads
Download Python 3 — required to run the pipeline
Link
GBD Results Tool (GHDx)
Global Health Data Exchange — primary data source
Link
Database Normalization (Wikipedia)
Reference article on normalization principles used in the pipeline
Tech Stack
- Python 3
- Pentaho
- SQL
- AWS
Required Python Modules
- country_converter
- pandas
- numpy
- glob
Data Sources
WHO (World Health Organization)
Global Health Data Exchange (GHDx)
UNICEF
World Bank
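
As a rough sketch of how exports from these sources might be combined under uniform column headers, the example below uses `glob` (one of the listed modules) to discover CSV files and the standard-library `csv` module to read them. The file layout, the `HEADER_MAP` contents, and the `combine_csvs` function are assumptions for illustration only; the actual pipeline may well use pandas and a different mapping.

```python
import csv
import glob
import os

# Hypothetical mapping from source-specific column names to uniform headers.
HEADER_MAP = {
    "Country": "country",
    "country_name": "country",
    "Year": "year",
    "year": "year",
    "Value": "value",
    "val": "value",
}


def combine_csvs(pattern, header_map=HEADER_MAP):
    """Read every CSV matching `pattern` and return rows with uniform headers.

    Columns missing from `header_map` are dropped. Each row records the file
    it came from so the original source stays traceable.
    """
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle):
                row = {header_map[k]: v for k, v in record.items() if k in header_map}
                row["source_file"] = os.path.basename(path)
                rows.append(row)
    return rows
```

A call such as `combine_csvs("downloads/*.csv")` (the directory name is hypothetical) would return one list of dictionaries that all share the keys `country`, `year`, `value`, and `source_file`. Values stay as strings here; type conversion would be a separate step.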