Multi-Country Malaria Analytics Using DHS and MIS Data

Harmonization, cleaning, survey weighting, and estimation of malaria indicators (fever, testing, care-seeking) across 20+ African countries.

Author

Ousmane Diallo, MPH-PhD

Published

November 16, 2025

Overview

This project harmonizes and analyzes 20+ DHS/MIS surveys from sub-Saharan Africa to estimate malaria-related indicators among children under five. The workflow integrates:

Microdata download and management using the {rdhs} R package
Cross-country harmonization of DHS/MIS datasets
Standardization of regional categories (v024) that differ by country and survey year
Construction of malaria indicators (fever, testing, care-seeking)
Survey-weighted estimation using DHS complex design
Generation of Power BI / Shiny-ready analytical datasets

This project supports malaria program decision-making, RWE analyses, and subnational monitoring.

Key skills demonstrated

Advanced R programming for large-scale survey microdata
Complex survey design and weighting with the {survey} package
Cross-country data harmonization and regional standardization
Derivation of malaria-related RWE indicators (fever, treatment, sector of care)
Preparation of analysis-ready datasets for dashboards (Power BI, Shiny)

Objectives

Use both the DHS REST API and DHS microdata to build a multi-country malaria analytics pipeline
Harmonize heterogeneous DHS/MIS datasets into a unified analytical structure
Recode region classifications to ensure comparability across countries and years
Compute malaria indicators among children <5:
- Fever
- Care-seeking behavior
Apply survey weights and complex sampling design
Produce cross-country analytical tables and visualizations

Methods

Data source and acquisition

Used DHS KR (children under 5) microdata for malaria-related surveys in multiple African countries.
Survey IDs were pre-selected (e.g., via DHS Download Manager) and all corresponding KR .DTA files were stored in a project directory.
R scripts automatically discovered and loaded all KR files using list.files() and haven::read_dta(), allowing the pipeline to scale across countries and survey years without manual file handling.

Show code: Data source and acquisition

# Set working directory and data path
DataDir = setwd('~/Library/CloudStorage/OneDrive-Personnel/Ousmane DIALLO/Africa_care_seeking_rate/')

# Load required libraries for data manipulation and survey analysis
library(haven)        # For reading Stata/SPSS/SAS files
library(labelled)     # For handling variable labels
library(dplyr)        # For data manipulation
library(survey)       # For complex survey data analysis
library(purrr)        # For functional programming tools
library(plyr)         # For data transformation (note: conflicts with dplyr)
library(data.table)   # For fast data manipulation
library(tidyr)        # For data tidying

## Function to read multiple files matching a pattern
# Parameters:
#   filepat: regex pattern to match filenames
#   path: directory path to search in
#   fun: function to apply to each file (e.g., read_dta)
#   encoding: character encoding (default: "latin1")
# Returns: list of dataframes that have non-missing v024 values
read.files <- function(filepat, path, fun, encoding = "latin1") {
  # Get full paths of all files matching the pattern
  filenames <- list.files(path = path, pattern = filepat, 
                          recursive = TRUE, full.names = TRUE)
  
  # Apply the reading function to each file
  dflist <- sapply(filenames, fun, simplify = FALSE)
  
  # Helper function to check if v024 variable has any missing values
  isna_v024 <- function(df) any(is.na(df$v024))
  
  # Identify dataframes with missing v024 values
  cond <- sapply(dflist, isna_v024)
  
  # Return only dataframes without missing v024 values
  dflist[!cond]
}

# Get list of all DTA (Stata) files in the data directory
filenames_all <- list.files(path = DataDir, pattern = ".*.*\\.DTA", 
                            recursive = TRUE, full.names = TRUE)

# Read all DHS (Demographic and Health Surveys) files
# that are KR (Kids Recode) files
dhs = read.files(".*KR.*\\.DTA", DataDir, read_dta)

# COMMENTED OUT: Country patterns for batch processing
# This list contains regex patterns for all countries to be analyzed
# Uncomment and use in loop for multi-country processing
# cntrs <- list(".*NGKR21.*\\.DTA", ".*NGKR4B.*\\.DTA", ".*NGKR53.*\\.DTA", 
#               ".*NGKR61.*\\.DTA", ".*NGKR6A.*\\.DTA", ".*NGKR71.*\\.DTA", ...)

Data ingestion and harmonization

Implemented a custom read.files() helper to:
- Recursively identify KR datasets by filename patterns (e.g., .*KR.*\\.DTA)
- Load each file as a data frame and drop problematic files where geographic region (v024) is missing
For each dataset, selected and harmonized key DHS variables:
- Child identifiers and survey design: v000, v001, v002, v003, v005, v021, v022, v024, v025, v101, v007, v012
- Child survival and age: b5, b19 or hw1
- Fever and treatment variables: h22 (fever in last 2 weeks), h32a–h32x (source of care)
Handled differences between surveys (e.g., when b19 is unavailable) by:
- Using hw1 as a proxy for age in months
- Creating a unified b19 and region variable across all datasets
Standardized region names across countries and survey years:
- Region identifiers (v024) vary by country and sometimes by survey year
- A comprehensive lookup table with 200+ region mappings was created to standardize region names across all countries
- This lookup table is maintained as a separate data asset in the project repository for reusability and documentation.

Note: Supporting files (lookup tables, helper functions) are available in the project’s GitHub repository.

Show code: Data ingestion and harmonization

# Initialize list to store processed datasets from each KR file
all_countries_raw <- list()

# Loop through each DHS dataset
for(i in 1:length(dhs)){
  # Extract current dataset
  dhs_hhs = dhs[i]
  
  # Select relevant variables from the Kids Recode files:
  # - caseid: unique case identifier
  # - v000: country code and phase
  # - v001: cluster number
  # - v002: household number
  # - v003: respondent's line number
  # - v007: year of interview
  # - v012: respondent's age
  # - v101: region (often used as backup for v024)
  # - v02*: household characteristics (v021, v022, v024, v025)
  # - v005: sample weight
  # - b5: child alive (Yes/No)
  # - b*: birth history variables
  # - b8: current age of child in months
  # - h*: health variables (fever, treatment, etc.)
  # - h22: fever in last 2 weeks
  # - h32*: fever treatment source variables
  dhs_kr <- dhs_hhs %>%
    purrr::map(~dplyr::select(., caseid, v000, v001, v002, v003, v007, v012, v101, 
                       contains('v02'), v005, b5, contains('b'), b8, 
                       contains('h'), contains("h32")))
  
  ## Process each dataset in the current KR file
  dhs_data = data.frame()
  
  for(j in 1:length(dhs_kr)){
    dhs_hh = dhs_kr[j]
    
    # Convert list to data frame using plyr::ldply
    all_datasets = ldply(dhs_hh, data.table)
    
    # Error handling for variable harmonization
    tryCatch({
      # Case 1: If age variable 'b19' (age in months) exists, use it directly
      if('b19' %in% colnames(all_datasets)){
        all_datasets = all_datasets %>%
          dplyr::select(v000, v001, v002, v003, v007, v005, v012, v101, v021, 
                 v024 = v101,  # Rename v101 to v024 for consistency
                 v022, v025, b5, b8, h22, b19, contains("h32")) %>%
          dplyr::mutate(region = ifelse(is.na(v024), v101, v024))
        
      # Case 2: If 'hw1' exists, use it as proxy for age
      } else if('hw1' %in% colnames(all_datasets)){
        all_datasets = all_datasets %>%
          dplyr::select(v000, v001, v002, v003, v007, v005, v012, 
                 v024 = v101,  # Rename v101 to v024
                 v101, v021, v022, v025, b5, b8, h22, hw1, contains("h32")) %>%
          dplyr::mutate(b19 = hw1,  # Create b19 from hw1
                 region = ifelse(is.na(v024), v101, v024))
      }
      
      # Append harmonized dataset to dhs_data
      {
        dhs_data = list(bind_rows(dhs_data, all_datasets))
      }
      
    }, error = function(e){
      # Print error message if harmonization fails
      cat("ERROR :", conditionMessage(e), "\n")
    })
  }
  
  # Store harmonized dataset for this KR file
  all_countries_raw[[i]] <- dhs_data
}

# Combine all KR datasets into one large harmonized data frame
all_datasets_final <- dplyr::bind_rows(all_countries_raw)

# ============================================================================
# STANDARDIZE REGION NAMES USING LOOKUP TABLE
# ============================================================================
# Load region standardization lookup table from repository
# (See: lookup/dhs_region_standardization.csv for complete mappings)
region_lookup <- read.csv('lookup/dhs_region_standardization.csv', 
                          stringsAsFactors = FALSE)

# Apply standardization via left join
# Handles both year-specific mappings (e.g., Angola, Liberia) 
# and year-independent mappings (e.g., Guinea, Benin)
all_datasets_final <- all_datasets_final %>%
  left_join(region_lookup, by = c("country_name" = "country_name", 
                                   "year" = "year", 
                                   "v024" = "v024")) %>%
  dplyr::mutate(
    # Use standardized region name if available, otherwise keep v024 code
    region_name = coalesce(region_name, as.character(v024))
  )

Cohort definition and derived indicators

Restricted the analysis to:
- Living children (b5 == 1)
- Age < 60 months (b19 < 60 where available)
- Non-missing fever information (h22 != 8).
Created binary outcome variables at the child level:
- prop_fievre – child had fever in the last 2 weeks (h22 == 1).
- prop_publ – sought care in the public sector (any of h32a–h32i > 0).
- prop_priv – sought care in the private sector (any of h32j–h32x > 0).
- no_trea – did not seek treatment (h32y == 0).
Generated survey weights and strata:
- wt = v005 / 1,000,000
- region = ifelse(is.na(v024), v101, v024)
- strate = as.integer(region * 10 + as.integer(v025))
- Re-coded v022 when missing using the constructed stratum.

Show code: Cohort definition and indicators

# Initialize empty data frames to store final results
all_countries_fievre = data.frame()  # For fever prevalence estimates
all_countries = data.frame()         # For care-seeking estimates

# Loop through each harmonized dataset to create analysis cohorts
for(i in 1:length(dhs)){
  dhs_hhs = dhs[i]
  
  # Get the processed data from harmonization step
  dhs_data = all_countries_raw[[i]]
  
  # Create analysis cohort with derived indicators
  for(z in 1:length(dhs_data)){
    dhs_data1 = dhs_data[z]
    
    # Case 1: If age variable exists, filter for children under 5
    if('b19' %in% colnames(dhs_data1)){
      dhs_data2 = dhs_data1 %>%
        # Apply eligibility criteria:
        # - b5 == 1: child is alive
        # - b19 < 60: child is under 5 years (under 60 months)
        # - h22 != 8: exclude "don't know" responses for fever
        purrr::map(~dplyr::filter(., b5 == 1 & b19 < 60 & h22 != 8)) %>%
        purrr::map(~dplyr::mutate(., 
                    # Create binary fever indicator (1 = had fever in last 2 weeks)
                    prop_fievre = ifelse(h22 == 1, 1, 0),
                    # Calculate survey weight (DHS weights divided by 1,000,000)
                    wt = v005/1000000,
                    # Create sampling strata: region*10 + urban/rural
                    strate = as.integer(region*10 + as.integer(v025)),
                    # Use constructed strata if v022 (sample strata) is missing
                    v022 = ifelse(is.na(v022), strate, v022)))
      
    # Case 2: If age variable doesn't exist (no age filter possible)
    } else if(!'b19' %in% colnames(dhs_data1)){
      dhs_data2 = dhs_data1 %>%
        # Apply eligibility criteria (no age filter)
        purrr::map(~dplyr::filter(., b5 == 1 & h22 != 8)) %>%
        purrr::map(~dplyr::mutate(., 
                    prop_fievre = ifelse(h22 == 1, 1, 0),
                    wt = v005/1000000,
                    # Create region variable (use v101 if v024 is missing)
                    region = ifelse(is.na(v024), v101, v024),
                    strate = as.integer(region*10 + as.integer(v025)),
                    v022 = ifelse(is.na(v022), strate, v022))) 
    }
  }
  
  # Extract country and year information for metadata
  dhs_data3 = ldply(dhs_data2)
  l = data.frame(country_name = unique(dhs_data3$v000))
  countries_sel = as.factor(l$country_name)
  year = unique(dhs_data3$v007)
  year = paste(year, collapse = '-')
  
  # Process fever treatment data for children with fever
  for(z in 1:length(dhs_data2)){
    dhs_data_finale = dhs_data2[z]
    # Convert list to data frame
    dhs_data_finale = ldply(dhs_data_finale)
    
    # Filter to include only children who had fever (h22 == 1)
    dhs_data_finale1 = dhs_data_finale %>%
      filter(h22 == 1)
    
    # Create treatment source indicators:
    # h32a-h32i: public sector sources
    #   (government hospital, health center, health post, mobile clinic, 
    #    community health worker, government doctor, government nurse, etc.)
    # h32j-h32x: private sector sources
    #   (private hospital/clinic, pharmacy, drug store, private doctor,
    #    private nurse, traditional healer, shop, etc.)
    # h32y: no treatment sought
    dhs_data_finale1 = dhs_data_finale1 %>%
      dplyr::mutate(
        # Sum all public sector sources
        prop_publ = rowSums(dhs_data_finale1[c("h32a", "h32b", "h32c", "h32d", 
                                               "h32e", "h32f", "h32g", "h32h", 
                                               "h32i")], na.rm = TRUE),
        # Sum all private sector sources
        prop_priv = rowSums(dhs_data_finale1[c("h32j", "h32k", "h32l", "h32m", 
                                               "h32n", "h32o", "h32p", "h32q", 
                                               "h32r", "h32s", "h32t", "h32u", 
                                               "h32v", "h32w", "h32x")], na.rm = TRUE),
        # Convert to binary indicators (1 if any source in sector used)
        prop_publ = ifelse(prop_publ > 0, 1, 0),
        prop_priv = ifelse(prop_priv > 0, 1, 0),
        # Create no treatment indicator
        no_trea = ifelse(h32y == 0, 0, 1))
    
    # Convert back to list format for estimation functions
    dhs_data_finale1 = list(dhs_data_finale1)
  }
}

Survey design and estimation

Applied survey weights to child-level data to account for complex sampling design
Defined complex survey designs with the survey package:
- svydesign(ids = ~v021, strata = ~v022, weights = ~wt, nest = TRUE)
Aggregated to region level by computing weighted estimates for each country:
- Fever prevalence: svyby(~prop_fievre, by = ~region + country_name, FUN = svymean)
- Public-sector care-seeking: svyby(~prop_publ, by = ~region, ...)
- Private-sector care-seeking: svyby(~prop_priv, by = ~region, ...)
- No treatment: svyby(~no_trea, by = ~region, ...)
Exported final region-level indicator tables to CSV for use in mapping tools, dashboards (Shiny, Power BI), or reporting

Note: The estim_prop() helper function implements this workflow and is available in the [GitHub repository]

Show code: Survey design and estimation

# Loop through each country/survey to calculate weighted estimates
for(i in 1:length(dhs)){
  
  # Get processed data from previous steps
  dhs_data = all_countries_raw[[i]]
  dhs_data2 = # ... (from cohort definition step)
  dhs_data_finale1 = # ... (from cohort definition step)
  
  # Extract metadata
  dhs_data3 = ldply(dhs_data2)
  l = data.frame(country_name = unique(dhs_data3$v000))
  countries_sel = as.factor(l$country_name)
  year = unique(dhs_data3$v007)
  year = paste(year, collapse = '-')
  
  # Set option for handling lonely PSUs (primary sampling units)
  # "adjust" method: centers lonely PSU data at the sample grand mean
  options(survey.lonely.psu = "adjust")
  
  # ===========================================================================
  # PART 4A: Estimate fever prevalence by region
  # ===========================================================================
  est_data <- data.frame()
  vars <- c('prop_fievre')
  
  for (y in 1:length(vars)) {
    col <- list(vars[y])
    by <- list('region')
    
    # Remove missing values for the current variable
    df <- dhs_data2 %>% 
      map(~drop_na(., vars[y]))
    
    # Apply survey-weighted estimation function (estim_prop)
    # This function should create svydesign and use svyby to calculate means
    # Format: svydesign(ids = ~v021, strata = ~v022, weights = ~wt, data = .)
    #         svyby(~prop_fievre, by = ~region, design, svymean)
    df <- pmap(list(df, col, by), estim_prop)
    df <- plyr::ldply(df)
    
    # Add metadata columns
    df$year = year
    df$country_name = countries_sel
    
    # Format and select final columns
    est_data = bind_rows(est_data, df) %>%
      select(country_name, region, year, prop_fievre) %>%
      mutate(prop_fievre = round(prop_fievre, 2))
  }
  
  # Append to master fever prevalence dataset
  all_countries_fievre = bind_rows(all_countries_fievre, est_data)
  
  # ===========================================================================
  # PART 4B: Estimate care-seeking patterns by region (public, private, none)
  # ===========================================================================
  care_data <- list()
  vars1 <- c('prop_publ', 'prop_priv', 'no_trea')
  
  for (a in 1:length(vars1)) {
    col <- list(vars1[a])
    by <- list('region')
    
    # Convert list to data frame
    df1 <- dhs_data_finale1
    df1 = ldply(df1)
    
    # Extract metadata
    l = data.frame(country_name = unique(df1$v000))
    countries_sel = as.factor(l$country_name)
    year = unique(df1$v007)
    year = paste(year, collapse = '-')
    
    # Convert back to list for estimation
    df1 = list(df1)
    
    # Apply survey-weighted estimation
    df1 <- pmap(list(df1, col, by), estim_prop)
    df1 <- plyr::ldply(df1)
    
    # Add metadata
    df1$year = year
    df1$country_name = countries_sel
    df1[, vars1[a]] = df1[, vars1[a]]
    
    # Store in list
    care_data[[a]] = df1
  }
  
  # Combine all care-seeking estimates into one dataset
  prop_estimation = bind_cols(care_data[[1]], care_data[[2]], care_data[[3]]) %>%
    rename_at(1, ~"region") %>% 
    rename_at(4, ~"year") %>% 
    rename_at(5, ~"country_name") %>%
    select(country_name, region, prop_publ, prop_priv, no_trea, year) %>%
    mutate(prop_publ = round(prop_publ, 2),
           prop_priv = round(prop_priv, 2),
           no_trea = round(no_trea, 2))
  
  # Append to master care-seeking dataset
  all_countries = bind_rows(all_countries, prop_estimation)
}

# ===========================================================================
# FINAL OUTPUT: Merge fever prevalence with care-seeking estimates
# ===========================================================================
all_countries_finale = all_countries_fievre %>% 
  left_join(all_countries, by = c('country_name', 'region', 'year'))

# Export final datasets (uncomment to save)
# write.csv(all_countries_finale, 
#           'data_care_seeking_final.csv', 
#           row.names = FALSE)

Outputs

The final deliverable is an analysis-ready CSV dataset containing:

Region-level aggregated estimates derived from 150,000+ child-level observations across 20+ African countries
Standardized geographic (region) and temporal (year) variables
Survey-weighted malaria indicators: fever prevalence, care-seeking patterns (public vs private sector)
One row per region per country per survey year

Data transformation: Child-level microdata (one row per child) was aggregated to region level using survey-weighted estimation (svyby) to produce population-representative estimates.

This dataset feeds into interactive dashboards (Power BI/R Shiny) for:

Cross-country malaria burden comparisons
Treatment-seeking behavior trends
Market access gap analysis

Show code: Export analysis-ready dataset

# ============================================================================
# EXPORT FINAL ANALYSIS-READY DATASET
# ============================================================================

# Final dataset structure:
# - One row per child (under 5 years)
# - Standardized region names across countries
# - Binary indicators: fever, public care, private care, no treatment
# - Survey weights and design variables
# - Ready for visualization in Power BI or R Shiny

# Export to CSV for dashboard ingestion
write.csv(all_countries_finale, 
          'data/malaria_fever_careseeking_final.csv', 
          row.names = FALSE)

# Preview final dataset structure
glimpse(all_countries_finale)

# Data summary for documentation
cat("Total records:", nrow(all_countries_finale), "\n")
cat("Countries:", length(unique(all_countries_finale$country_name)), "\n")
cat("Survey years:", paste(range(all_countries_finale$year), collapse = "-"), "\n")

Final Dataset Schema

Variable	Type	Description	Example
`country_name`	Character	Country name	“Benin”
`region`	Character	Standardized region name	“alibori”
`year`	Character	Survey year	“2017-2018”
`prop_fievre`	Numeric	Fever prevalence among children <5	0.23
`prop_publ`	Numeric	% of febrile children who sought care in public sector	0.60
`prop_priv`	Numeric	% of febrile children who sought care in private sector	0.07
`no_treat`	Numeric	% of febrile children who did not seek any treatment	0.25
`type_survey`	Character	Survey type	“DHS” or “MIS”

Important Note on Care-Seeking Categories:

These categories are not mutually exclusive. Children may seek care from multiple sources during a single illness episode (e.g., public health center for consultation + private pharmacy for medications). This reflects real-world healthcare-seeking behavior, particularly in contexts where:

Stock-outs are common in public facilities
Patients utilize a mix of public and private services based on availability, cost, and perceived quality
Treatment pathways involve sequential visits to multiple providers

Interpretation: In regions where prop_publ + prop_priv + No_treat > 1.0, this indicates overlap where some children accessed both public and private care. This is an important indicator of health system performance and patient behavior patterns.

Sample Data Preview

country_name	region	prop_fievre	prop_publ	prop_priv	no_treat	type_survey	year
Angola	cabinda	0.16	0.60	0.17	0.35	DHS	2015-2016
Angola	zaire	0.15	0.64	0.21	0.12	DHS	2015-2016
Benin	alibori	0.23	0.10	0.01	0.83	DHS	2017-2018
Benin	atacora	0.16	0.28	0.19	0.52	DHS	2017-2018
Burkina Faso	Boucle du Mouhoun	0.13	0.60	0.01	0.39	MIS	2021

Note: Some regions show totals > 1.0 (e.g., Angola-Cabinda: 0.60 + 0.17 + 0.35 = 1.12), indicating dual care-seeking from both sectors.

Note: This dataset is designed for direct ingestion into Power BI/Shiny dashboards.

Ousmane Diallo, MPH-PhD – Biostatistician, Data Scientist & Epidemiologist based in Chicago, Illinois, USA. Specializing in SAS programming, CDISC standards, and real-world evidence for clinical research.