GHI Data Integration

GHI Data Integration (v2.05) is built with Python, Pentaho, SQL, and AWS. It standardizes and combines health datasets from the WHO, the Global Health Data Exchange (GHDx), UNICEF, and the World Bank.

Team Objectives

  • Eliminate manual data aggregation steps across health data sources
  • Allow users to select specific datasets (diseases, years, indicators) without pulling entire sources
  • Normalize country names to ISO 3166 codes for consistency
  • Convert numeric measures reported in thousands or millions to base units, and standardize labels
  • Ensure uniform column headers and data structure across all datasets
  • Apply database normalization principles to consolidated outputs
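As a rough illustration of the header and unit objectives above, the following pandas sketch renames source-specific column headers to one shared set and rescales values reported in thousands or millions. The names HEADER_ALIASES, UNIT_MULTIPLIERS, and standardize are hypothetical, the header variants are guesses at what different exports might look like, and the rows are made-up sample values rather than real data. It is not the project's actual implementation.

```python
import pandas as pd

# Hypothetical header aliases; the real mapping depends on each source's export format.
HEADER_ALIASES = {
    "Country Name": "country",
    "location_name": "country",
    "Year": "year",
    "Value": "value",
    "val": "value",
}

# Multipliers for measures reported in scaled units.
UNIT_MULTIPLIERS = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def standardize(df: pd.DataFrame, unit: str = "units") -> pd.DataFrame:
    """Rename known header variants and rescale `value` to base units."""
    out = df.rename(columns=HEADER_ALIASES)  # returns a new frame
    out["value"] = out["value"] * UNIT_MULTIPLIERS[unit]
    return out


# Made-up sample rows, shaped like two differently formatted exports.
who = pd.DataFrame({"Country Name": ["Kenya"], "Year": [2019], "Value": [1.5]})
ghdx = pd.DataFrame({"location_name": ["Kenya"], "year": [2019], "val": [2300]})

combined = pd.concat(
    [standardize(who, unit="thousands"), standardize(ghdx)],
    ignore_index=True,
)
print(combined)
```

Keeping the aliases and multipliers in plain dictionaries means a new source can be supported by adding entries rather than changing the function.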

Resources & Links

Google Doc
Automation Overview Document

Technical documentation for the automation pipeline

Link
GitHub — GHI Data Consolidation

Source code for the data integration pipeline

Video
Video Explanation (Google Drive)

Alex's walkthrough video of the data integration system

Link
Python Downloads

Download Python 3 — required to run the pipeline

Link
GBD Results Tool (GHDx)

Global Health Data Exchange — primary data source

Link
Database Normalization (Wikipedia)

Reference article on normalization principles used in the pipeline

Tech Stack

Python 3
Pentaho
SQL
AWS

Required Python Modules

country_converter
pandas
numpy
glob
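The sketch below shows one way these modules might fit together: glob finds the input files, pandas stacks them, and country_converter maps country names to ISO codes. The helper names combine_csvs and add_iso3, the column names, and the file layout are all assumptions made for illustration, and the demo rows are made up. Note that glob ships with the Python standard library, while the other modules are installed separately.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(pattern: str) -> pd.DataFrame:
    """Read every CSV matching `pattern` and stack them into one frame."""
    paths = sorted(glob.glob(pattern))
    frames = [
        pd.read_csv(path).assign(source_file=os.path.basename(path))
        for path in paths
    ]
    return pd.concat(frames, ignore_index=True)


def add_iso3(df: pd.DataFrame, name_col: str = "country") -> pd.DataFrame:
    """Add an ISO 3166-1 alpha-3 column using country_converter."""
    import country_converter as coco  # third-party; imported only when needed

    out = df.copy()
    out["iso3"] = coco.convert(names=out[name_col].tolist(), to="ISO3")
    return out


# Demo with two throwaway files holding made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"country": ["Kenya"], "value": [1]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False
    )
    pd.DataFrame({"country": ["Peru"], "value": [2]}).to_csv(
        os.path.join(tmp, "b.csv"), index=False
    )
    merged = combine_csvs(os.path.join(tmp, "*.csv"))

print(merged)
```

With country_converter installed, calling add_iso3(merged) would append the code column. The source_file column is an optional extra that records which file each row came from.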

Data Sources

WHO (World Health Organization)
Global Health Data Exchange (GHDx)
UNICEF
World Bank