Converting Raw Clinical Trial Data into Study Data Tabulation Model (SDTM)-Compliant Datasets
This page outlines the key steps in converting raw clinical trial data into SDTM-compliant datasets.
I intentionally kept this page simple to demonstrate my understanding of the process of creating datasets that follow the CDISC SDTM standard.
Note: In the corresponding GitHub repository https://github.com/Ousmanerabi/clinical-trials-programming-portfolio/tree/main, I used SAS macros to create, format, and validate the SDTM domains.
These macros were adapted from reference books I studied while preparing for the SAS Clinical Trials Programming Certification.
1. Create Empty Datasets (Metadata-Driven)
Start from SDTM metadata (stdm_metadata.csv)
Use a DATA step to define the structure (names, labels, types, lengths)
Programs ensure all domains have the correct structure before they are populated
Code Example
1. Demographics (DM) Domain
This code creates an empty dataset for the Demographics (DM) domain. The DM domain includes a set of essential standard variables that describe each subject in the clinical trial, and its structure is one record per subject. Among these, the required variables are those that must be included in the submitted dataset and cannot have missing values. First come the identifier variables (study ID, domain, and unique subject ID), then the topic variable, which is the ID of each subject within the clinical study. The required qualifier variables include the site ID, the sex of the subject, and the country of the trial site.
/* Example: Create empty DM structure */
/* Assumes the libref EMPTY has already been assigned with a LIBNAME statement */
data empty.dm;
/* Length for each of the 25 DM variables */
length STUDYID $15 DOMAIN $2 USUBJID $25 SUBJID $7 RFSTDTC $16
RFENDTC $16 RFXSTDTC $16 RFXENDTC $16 RFICDTC $16 RFPENDTC $16 DTHDTC $16
DTHFL $2 SITEID $7 BRTHDTC $16 AGE 8 AGEU $10 SEX $2 RACE $80 ARMCD $8
ARM $40 ACTARMCD $8 ACTARM $40 ARMNRS $40 ACTARMUD $40 COUNTRY $3;
/* Label for each variable, assigned with a LABEL statement */
label STUDYID='Study Identifier'
DOMAIN='Domain Abbreviation'
USUBJID='Unique Subject Identifier'
SUBJID='Subject Identifier for the Study'
RFSTDTC='Subject Reference Start Date/Time'
RFENDTC='Subject Reference End Date/Time'
RFXSTDTC='Date/Time of First Study Treatment'
RFXENDTC='Date/Time of Last Study Treatment'
RFICDTC='Date/Time of Informed Consent'
RFPENDTC='Date/Time of End of Participation'
DTHDTC='Date/Time of Death'
DTHFL='Subject Death Flag' SITEID='Study Site Identifier'
BRTHDTC='Date/Time of Birth' AGE='Age' AGEU='Age Units' SEX='Sex'
RACE='Race' ARMCD='Planned Arm Code'
ARM='Description of Planned Arm'
ACTARMCD='Actual Arm Code' ACTARM='Description of Actual Arm'
ARMNRS='Reason Arm and/or Actual Arm is Null'
ACTARMUD='Description of Unplanned Actual Arm'
COUNTRY='Country';
/* Stop before the implicit OUTPUT so the dataset has zero observations */
stop;
run;
2. Supplemental (SUPP) Domain
The Supplemental Qualifiers special-purpose dataset model is used to capture non-standard variables.
Supplemental domains are associated with parent records in the general observation class domains (events, findings, interventions) or in Demographics. Each one is named SUPP followed by the parent domain code (for example, SUPPDM for DM) and holds the sponsor-defined variables for that domain.
/* Create a temporary empty dataset using a data step */
data empty_suppDM;
/* Define the length for each variable*/
length STUDYID $15 RDOMAIN $2 USUBJID $25 IDVAR $8 IDVARVAL $200 QNAM
$8 QLABEL $40 QVAL $200 QORIG $20 QEVAL $8;
/*Define the label for each variable*/
label STUDYID='Study Identifier' RDOMAIN='Related Domain Abbreviation'
USUBJID='Unique Subject Identifier' IDVAR='Identifying Variable'
IDVARVAL='Identifying Variable Value' QNAM='Qualifier Variable Name'
QLABEL='Qualifier Variable Label' QVAL='Data Value' QORIG='Origin'
QEVAL='Evaluator';
/* Stop the DATA step before it writes a blank observation */
stop;
run;
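The two examples above hard-code the lengths and labels. A metadata-driven variant could build the same statements from a metadata table instead. The sketch below is only an illustration under stated assumptions: the metadata dataset work.dm_meta and its columns (varnum, varname, vartype, varlen, varlabel) are hypothetical and do not reflect the actual layout of the metadata file, and only four variables are shown.

```sas
/* Hypothetical metadata: one row per variable (column names are assumptions) */
data work.dm_meta;
  infile datalines dsd truncover;
  length varname $8 vartype $4 varlabel $40;
  input varname $ vartype $ varlen varlabel $;
  varnum = _n_;   /* keep the intended variable order */
  datalines;
STUDYID,char,15,Study Identifier
DOMAIN,char,2,Domain Abbreviation
USUBJID,char,25,Unique Subject Identifier
AGE,num,8,Age
;
run;

/* Build the LENGTH and LABEL specifications as macro variables */
proc sql noprint;
  select catx(' ', varname,
              ifc(vartype = 'char', cats('$', varlen), cats(varlen))),
         cats(varname, '=', quote(strip(varlabel)))
    into :lengths separated by ' ',
         :labels  separated by ' '
    from work.dm_meta
    order by varnum;
quit;

/* Create the zero-observation dataset from the generated statements */
data work.dm_empty;
  length &lengths;
  label &labels;
  stop;
run;
```

With this approach, adding or resizing a variable means editing the metadata rather than the program, which is the main reason to drive the structure from metadata.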
2. Map Raw Variables (to SDTM Variables)
- Load raw data
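As a rough illustration of this step, the sketch below loads a small raw dataset and maps its columns onto a few DM variables. Everything specific here is an assumption made for the example: the raw dataset work.raw_demog, its column names (subject, site, gender, trt), the study identifier XYZ123, and the USUBJID convention of joining study, site, and subject with hyphens.

```sas
/* Hypothetical raw demographics data; names and values are illustrative only */
data work.raw_demog;
  input subject $ site $ gender trt;
  datalines;
101 001 1 0
102 001 2 1
;
run;

/* Map raw columns onto SDTM identifier variables */
data work.dm_mapped;
  length STUDYID $15 DOMAIN $2 USUBJID $25 SUBJID $7 SITEID $7;
  set work.raw_demog;
  STUDYID = 'XYZ123';                            /* assumed study identifier */
  DOMAIN  = 'DM';
  SUBJID  = subject;
  SITEID  = site;
  USUBJID = catx('-', STUDYID, SITEID, SUBJID);  /* assumed naming convention */
  keep STUDYID DOMAIN USUBJID SUBJID SITEID gender trt;
run;
```

The coded columns gender and trt are kept at this stage so they can be decoded into SEX, ARMCD, and ARM once controlled terminology is applied.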
3. Create Formats (Controlled Terminology)
Define SAS formats for coded values
Improves consistency and simplifies mapping
/*Creating custom formats for the variables sex, race, arm, and armcd*/
proc format;
value sex 1='M' 2='F' .='U';
value race 1='White' 2='Black or African American' 3='Asian';
value armcd 0='PLACEBO' 1='ALG123';
value arm 0='Placebo' 1='Analgezia HCL 30mg';
run;
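One way these formats might be used is with the PUT function, which converts a coded numeric value into its controlled-terminology text. The sketch below assumes the PROC FORMAT step above has already been run in the same session; the raw dataset work.raw_codes and its column names (gender, race_cd, trt) are hypothetical.

```sas
/* Hypothetical coded raw data; a period denotes a missing value */
data work.raw_codes;
  input subject $ gender race_cd trt;
  datalines;
101 1 1 0
102 2 3 1
103 . 2 1
;
run;

/* Decode the raw codes into SDTM character variables */
data work.dm_coded;
  set work.raw_codes;
  length SEX $2 RACE $80 ARMCD $8 ARM $40;
  SEX   = put(gender,  sex.);    /* missing gender becomes 'U' */
  RACE  = put(race_cd, race.);
  ARMCD = put(trt,     armcd.);
  ARM   = put(trt,     arm.);
run;
```

A code that has no entry in a format comes through as the number converted to text, so it is worth checking the output for values that fall outside the controlled terminology.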
4. Derive Variables (Logic/Calculations)
Derive analysis-ready variables not present in the raw data (ISO 8601 dates, study day, flags)
Handle missing or partial dates, safety flags, baseline derivations
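Two of these derivations can be sketched as follows: converting a SAS date to an ISO 8601 character value, and computing the study day relative to the reference start date. The dataset work.raw_visits, its columns, and the VSDTC/VSDY variable names are assumptions chosen for illustration. The sketch handles fully missing dates only; partial dates would need additional logic.

```sas
/* Hypothetical raw visit data with SAS date values */
data work.raw_visits;
  input subject $ visit_date :date9. first_dose :date9.;
  format visit_date first_dose date9.;
  datalines;
101 10JAN2020 15JAN2020
101 15JAN2020 15JAN2020
101 20JAN2020 15JAN2020
102 . 16JAN2020
;
run;

data work.derived;
  set work.raw_visits;
  length VSDTC RFSTDTC $10;

  /* ISO 8601 date text (YYYY-MM-DD); left blank when the date is missing */
  if not missing(visit_date) then VSDTC   = put(visit_date, yymmdd10.);
  if not missing(first_dose) then RFSTDTC = put(first_dose, yymmdd10.);

  /* Study day: there is no day 0, so dates on or after the reference add 1 */
  if not missing(visit_date) and not missing(first_dose) then do;
    if visit_date >= first_dose then VSDY = visit_date - first_dose + 1;
    else VSDY = visit_date - first_dose;
  end;
run;
```

For subject 101 this gives study days of -5, 1, and 6 for the three visits, while subject 102 has a blank VSDTC and a missing VSDY because the visit date is missing.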