Ousmane Diallo, MPH-PhD
November 16, 2025
This project builds a full R pipeline to evaluate the quality of routine malaria data reported through a national DHIS2-based HMIS.
Using several years of health-facility–level data (2016–2022), the workflow quantifies missingness, outliers, internal consistency, and reporting completeness.
The outputs are designed to support national malaria program teams in monitoring data quality and prioritizing supervision and corrective actions.
This project highlights my expertise in real-world evidence (RWE) analytics, focusing on routine health data from DHIS2. It demonstrates skills in data wrangling, quality assessment, and visualization, which are key for roles in pharmacoepidemiology, health analytics, or public health research.
Key malaria indicators used in this project include:
| Indicator | Description | Age Groups | Example Use Case |
|---|---|---|---|
| allout | Outpatient visits (all causes) | All, <5y, >5y | Denominator for incidence rates |
| susp | Suspected malaria cases | All, <5y, >5y | Suspicion surveillance |
| test | Tested cases (RDT + microscopy) | All, <5y, >5y | Diagnostic coverage |
| conf | Confirmed malaria cases | All, <5y, >5y | Confirmed incidence |
| maltreat | Treated uncomplicated cases | All, <5y, >5y | Treatment access |
| confsev | Confirmed severe cases | All, <5y, >5y | Severity monitoring |
| maldth | Malaria-related deaths | All, <5y, >5y | Mortality tracking |
| IPTp/Stocks | Preventive interventions & commodities | Pregnant women | Activity flags |
Challenges Addressed: Recurrent missing data, French month names, facility name inconsistencies across years.
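One way the French month names could be handled is to split the reporting period with separate() and map each name to a month number with case_when(). The sketch below is illustrative only: the `raw` table, the `period` column, and its "<month> <year>" layout are assumptions, not the actual DHIS2 export format.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw export: period stored as "<French month> <year>"
raw <- tibble(
  facility = c("CS A", "CS B", "CS C"),
  period   = c("Janvier 2016", "Février 2017", "Août 2018")
)

clean <- raw %>%
  # Split the period string into a month name and a year
  separate(period, into = c("month_fr", "year"), sep = " ") %>%
  mutate(
    year = as.integer(year),
    # Map French month names to month numbers; unmatched names become NA
    month = case_when(
      month_fr == "Janvier"   ~ 1L,
      month_fr == "Février"   ~ 2L,
      month_fr == "Mars"      ~ 3L,
      month_fr == "Avril"     ~ 4L,
      month_fr == "Mai"       ~ 5L,
      month_fr == "Juin"      ~ 6L,
      month_fr == "Juillet"   ~ 7L,
      month_fr == "Août"      ~ 8L,
      month_fr == "Septembre" ~ 9L,
      month_fr == "Octobre"   ~ 10L,
      month_fr == "Novembre"  ~ 11L,
      month_fr == "Décembre"  ~ 12L
    )
  )
```

Leaving unmatched names as NA (rather than guessing) makes spelling variants visible so they can be added to the mapping explicitly.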
The R pipeline performs the following steps:
Cleaning and recoding
- Recoding of month and year (case_when and separate).

Construction of aggregate indicators
Using dplyr::rowwise() and mutate(), the script creates:
- susp, test (RDT + microscopy), conf (RDT + microscopy)
- maltreat (treated uncomplicated cases)
- pressev, confsev (presumed and confirmed severe cases)
- maldth (overall and age-specific)

Zero to NA recoding for analytical indicators
- Recoding to NA to distinguish true zeros from non-reporting.

Using group_by() and across():
\[ \text{Missingness }(\%) = \frac{N_{\text{facilities with NA}}}{N_{\text{facilities reporting in district}}} \times 100 \]
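The indicator construction, zero-to-NA recoding, and missingness formula can be sketched together as a small pipeline. This is a minimal illustration rather than the project's script: the component columns (`test_rdt`, `test_mic`, `conf_rdt`, `conf_mic`) and the toy data are hypothetical, and the denominator is assumed to be every facility listed in the district.

```r
library(dplyr)

# Hypothetical facility records; column names are illustrative
dat <- tibble(
  district = c("D1", "D1", "D1", "D2", "D2"),
  facility = c("F1", "F2", "F3", "F4", "F5"),
  test_rdt = c(10, 0, 5, 7, 0),
  test_mic = c(2, 0, NA, 1, 0),
  conf_rdt = c(4, 0, 2, 3, 0),
  conf_mic = c(1, 0, NA, 0, 0)
)

indicators <- dat %>%
  rowwise() %>%
  mutate(
    # Aggregate indicators: RDT + microscopy, ignoring missing components
    test = sum(c_across(c(test_rdt, test_mic)), na.rm = TRUE),
    conf = sum(c_across(c(conf_rdt, conf_mic)), na.rm = TRUE)
  ) %>%
  ungroup() %>%
  # Zero-to-NA recoding for the analytical indicators
  mutate(across(c(test, conf), ~ na_if(.x, 0)))

# Missingness (%) = facilities with NA / facilities in district * 100
missing_data <- indicators %>%
  group_by(district) %>%
  summarise(
    n_facilities = n(),
    across(c(test, conf), ~ 100 * mean(is.na(.x)), .names = "{.col}_missing_pct"),
    .groups = "drop"
  )
```

In this toy data one of three facilities in D1 and one of two in D2 has no usable `test` value, so the summary gives one missingness percentage per district and indicator.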
The resulting dataset (*_missing_data) includes:
The pipeline applies two complementary outlier-detection methods: a Hampel rule (median and MAD) and a standard-deviation (SD) rule.
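A Hampel-style rule and an SD rule can each be written as a short flagging function. The sketch below is an assumption about how such rules are commonly implemented, not the project's code: the function names `flag_hampel` and `flag_sd`, the threshold `k = 3`, and the example series are all hypothetical.

```r
# Hampel-style rule: flag values more than k scaled MADs from the median
flag_hampel <- function(x, k = 3) {
  med <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)  # mad() applies the 1.4826 scaling by default
  abs(x - med) > k * spread
}

# SD rule: flag values more than k standard deviations from the mean
flag_sd <- function(x, k = 3) {
  abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE)
}

# Hypothetical monthly confirmed-case series for one facility
conf <- c(12, 15, 11, 14, 13, 16, 12, 95, 14, 13, NA, 15)

hampel_flags <- flag_hampel(conf)
sd_flags <- flag_sd(conf)

# Position of the value flagged by the Hampel rule
which(hampel_flags)
```

Both functions return NA for missing inputs, so non-reported months are never counted as outliers. In practice they would typically be applied within each facility's own time series, for example inside group_by(facility) and mutate().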
Outliers are:
Programmatic rules derived from malaria surveillance practice are encoded in R; for example:
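Rules of this kind can be encoded with case_when(). The two checks below (confirmed cases should not exceed tested cases, and tested cases should not exceed suspected cases) are assumed for illustration and may differ from the rules actually used in the project; the labels and toy data are hypothetical.

```r
library(dplyr)

# Hypothetical facility-month records
records <- tibble(
  facility = c("F1", "F2", "F3", "F4"),
  susp = c(50, 40, 30, NA),
  test = c(45, 20, 35, 10),
  conf = c(20, 25, 10, 5)
)

checked <- records %>%
  mutate(
    consistency = case_when(
      # Rules cannot be evaluated when any component is missing
      is.na(susp) | is.na(test) | is.na(conf) ~ "not assessable",
      # Illustrative rule 1: confirmed cases cannot exceed tested cases
      conf > test ~ "confirmed > tested",
      # Illustrative rule 2: tested cases cannot exceed suspected cases
      test > susp ~ "tested > suspected",
      TRUE ~ "normal"
    )
  )
```

Because case_when() evaluates conditions in order, each record receives the first rule it violates, and only records passing every check are labeled "normal".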
For each facility-month, the script:
To distinguish inactive facilities from true zero values:
- active if at least one key malaria indicator is reported,
- inactive otherwise.

This produces district-level completeness trajectories over several years.
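The active/inactive classification and the resulting completeness measure can be sketched as follows. This is a simplified illustration: the choice of `susp`, `test`, and `conf` as the key indicators, the use of one row per facility-year instead of facility-month, and the toy data are all assumptions.

```r
library(dplyr)

# Hypothetical reports: one row per facility and period (year used for brevity)
reports <- tibble(
  district = c("D1", "D1", "D1", "D1", "D2", "D2"),
  facility = c("F1", "F2", "F1", "F2", "F3", "F3"),
  year     = c(2016, 2016, 2017, 2017, 2016, 2017),
  susp     = c(10, NA, 12, 3, NA, 8),
  test     = c(9, NA, 12, NA, NA, 8),
  conf     = c(4, NA, 5, NA, NA, 2)
)

activity <- reports %>%
  # Active if at least one key indicator is reported (non-missing)
  mutate(active = if_any(c(susp, test, conf), ~ !is.na(.x)))

# District-level completeness: share of active facility records per period
completeness <- activity %>%
  group_by(district, year) %>%
  summarise(completeness_pct = 100 * mean(active), .groups = "drop")
```

Plotting `completeness_pct` against `year` for each district would give the kind of completeness trajectory described above.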
Key outputs include:
Detailed interactive visualizations of this project are implemented in a dedicated Visualizations Chapter using Shiny (for dynamic R-based dashboards). These tools enable drill-downs by district, time-series filtering, and exportable insights for stakeholders. Access the Shiny app via [GitHub link] upon request.
Missingness: 15-25% for confirmed cases (higher in dry season); rural districts > urban.
Outliers: 5-10% flagged (Hampel more conservative).
Consistency: 92% “normal” records; issues tied to under-reported tests.
Completeness: Improved from 65% (2016) to 85% (2024); district trajectories guide interventions.
RWE Insights: Adjustments for reporting bias refine incidence estimates (e.g., +20% under-reported cases), enhancing evidence for intervention efficacy.
This project demonstrates:
| Category | Skills | R Tools / Examples |
|---|---|---|
| Data Wrangling | Cleaning, reshaping, indicator building | dplyr, tidyr, lubridate |
| Quality Assessment | Missingness, outliers (Hampel/SD), consistency rules | Custom functions, mad(), logical flags |
| Visualization | Heatmaps, time series, scatterplots | ggplot2, viridis |
| RWE-Specific | Bias adjustment in routine data, longitudinal trends | DHIS2 handling, facility-level aggregation |
Ousmane Diallo, MPH-PhD – Biostatistician, Data Scientist & Epidemiologist based in Chicago, Illinois, USA. Specializing in SAS programming, CDISC standards, and real-world evidence for clinical research.