While AI has exploded in popularity across industries, including pharma and life sciences, many companies have launched models that have not lived up to their full potential.
Often, the culprit behind underwhelming AI results is poor data preparation. This post delves into the essential phases of preparing pharmaceutical data for successful AI integration.
1. Data structure analysis
Ensuring data is consistent, well-organized, and ready for AI algorithms.
1.1 Data consistency and integrity
Establishing trust in the data by identifying and rectifying issues:
- Missing values. Addressed with imputation techniques such as mean imputation for continuous variables or last observation carried forward (LOCF) for longitudinal studies; see the sketch after this list.
- Duplicates. Cleansed by identifying and merging duplicates based on unique identifiers.
- Errors. Managed through data validation rules, enforcing data type restrictions and expected value ranges.
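A minimal pandas sketch of these three checks, assuming a hypothetical longitudinal lab-results table; the column names, patient IDs, and plausibility range are illustrative only, not part of any real dataset.

```python
import pandas as pd

# Hypothetical longitudinal lab-results table; columns and ranges are illustrative.
df = pd.DataFrame({
    "patient_id": ["P001", "P001", "P002", "P002", "P002"],
    "visit_date": pd.to_datetime(["2023-01-10", "2023-02-10",
                                  "2023-01-12", "2023-02-12", "2023-02-12"]),
    "alt_u_l":    [34.0, None, 52.0, 700.0, 700.0],  # ALT in U/L
})

# Missing values: LOCF within each patient's visit history (longitudinal data),
# then fall back to the cohort mean for anything still missing.
df = df.sort_values(["patient_id", "visit_date"])
df["alt_u_l"] = df.groupby("patient_id")["alt_u_l"].ffill()
df["alt_u_l"] = df["alt_u_l"].fillna(df["alt_u_l"].mean())

# Duplicates: collapse rows that share the same unique identifiers.
df = df.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")

# Errors: enforce an expected value range and flag violations for review.
valid_range = (0, 500)  # illustrative plausibility bounds
df["alt_out_of_range"] = ~df["alt_u_l"].between(*valid_range)
print(df)
```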
1.2 Normalization
Assessing data organization, types, formats, and redundancies:
- Data types. Aligning data types with those typical of pharmaceutical datasets, including consistent date formats and standardized measurement units.
- Redundancy. Minimized using data normalization techniques to streamline structure while preserving integrity.
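To illustrate both points, the sketch below parses mixed date formats into one datetime type and splits a redundant, denormalized table into a reference table plus a foreign key. All table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical denormalized prescriptions table with mixed date formats
# and drug attributes repeated on every row.
raw = pd.DataFrame({
    "prescription_id": [1, 2, 3],
    "drug_code":       ["D01", "D01", "D02"],
    "drug_name":       ["Metformin", "Metformin", "Atorvastatin"],
    "dose_mg":         [500, 500, 20],
    "issued_on":       ["2023-03-01", "01-Apr-2023", "2023-05-12"],
})

# Consistent date format: parse everything into one datetime type
# (format="mixed" requires pandas >= 2.0).
raw["issued_on"] = pd.to_datetime(raw["issued_on"], format="mixed")

# Redundancy: move repeated drug attributes into a reference table (normalization),
# keeping only the drug_code foreign key in the prescriptions table.
drugs = raw[["drug_code", "drug_name", "dose_mg"]].drop_duplicates()
prescriptions = raw[["prescription_id", "drug_code", "issued_on"]]
print(drugs, prescriptions, sep="\n\n")
```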
1.3 Analyzing data relationships between tables
Understanding data connections across databases:
- Relationships. Identifying primary and foreign keys linking data points across tables.
- Entity-relationship diagrams (ERDs). Visualizing connections between patients, drugs, diagnoses, and outcomes.
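Before drawing an ERD or joining tables for modeling, it helps to confirm the keys actually line up. A small referential-integrity check over two hypothetical tables (patients and prescriptions), with made-up IDs:

```python
import pandas as pd

# Hypothetical tables; "patient_id" is the primary key in patients
# and a foreign key in prescriptions.
patients = pd.DataFrame({"patient_id": ["P001", "P002", "P003"]})
prescriptions = pd.DataFrame({
    "rx_id":      [10, 11, 12],
    "patient_id": ["P001", "P004", "P002"],  # P004 has no matching patient
    "drug_name":  ["Metformin", "Atorvastatin", "Lisinopril"],
})

# Foreign-key check: every prescription must reference an existing patient.
orphans = prescriptions[~prescriptions["patient_id"].isin(patients["patient_id"])]
print(orphans)  # rows that would silently drop out of an inner join
```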
1.4 Adherence to predefined standards
Creating unified naming conventions and schema designs:
- Standardized naming. Implementing controlled vocabularies for drug names, diagnoses, and procedures.
- Data dictionaries. Defining data elements with type, allowed values, and units specific to pharmaceutical research.
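One lightweight way to make such a dictionary executable is to keep it next to the data and validate incoming records against it. The fields, types, and allowed values below are illustrative, not a real standard.

```python
import pandas as pd

# Illustrative data dictionary: element -> expected type, unit, allowed values.
data_dictionary = {
    "route":   {"dtype": "string", "allowed": {"oral", "intravenous", "topical"}},
    "dose_mg": {"dtype": "float",  "unit": "mg", "min": 0.0, "max": 1000.0},
}

records = pd.DataFrame({
    "route":   ["oral", "Oral", "intravenous"],  # "Oral" violates the vocabulary
    "dose_mg": [500.0, 20.0, -5.0],              # -5.0 violates the range
})

# Validate each column against its dictionary entry.
bad_route = ~records["route"].isin(data_dictionary["route"]["allowed"])
bad_dose = ~records["dose_mg"].between(data_dictionary["dose_mg"]["min"],
                                       data_dictionary["dose_mg"]["max"])
print(records[bad_route | bad_dose])  # rows violating the dictionary
```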
1.5 Defining schema for report usage
Designing data structures for both AI analysis and report generation:
- Descriptive naming. Using clear column names reflecting data meaning.
- Schema comments. Including explanations for tables and columns.
- Data lineage. Tracking data origin and transformations so the structure stays consistent over time and any changes are properly accounted for.
- Schema design for reporting. Using optimized designs like star or snowflake schemas for efficient data extraction and informative reporting.
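As one sketch of what a simple star schema could look like for reporting, the example below keeps a fact table of dispensing events keyed to hypothetical patient and drug dimensions; the keys and attributes are invented for illustration.

```python
import pandas as pd

# Hypothetical dimension tables (descriptive attributes).
dim_patient = pd.DataFrame({"patient_key": [1, 2],
                            "sex": ["F", "M"],
                            "age_band": ["40-49", "60-69"]})
dim_drug = pd.DataFrame({"drug_key": [1, 2],
                         "drug_name": ["Metformin", "Atorvastatin"]})

# Fact table (events/measurements) referencing dimensions by surrogate keys.
fact_dispense = pd.DataFrame({
    "patient_key":   [1, 1, 2],
    "drug_key":      [1, 2, 2],
    "dose_mg":       [500, 20, 40],
    "dispense_date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-01"]),
})

# Reporting query: join the star and aggregate for a summary table.
report = (fact_dispense
          .merge(dim_patient, on="patient_key")
          .merge(dim_drug, on="drug_key")
          .groupby(["age_band", "drug_name"])["dose_mg"].sum()
          .reset_index())
print(report)
```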
2. Data accuracy
Making sure all training data is accurate for reliable AI insights.
2.1 Reflecting real-world properties
Assessing if data accurately represents real-world objects:
- Drug properties. Checking if data captures chemical and biological properties, such as molecular structure, solubility, and interactions with proteins.
- Clinical trial data. Verifying that data reflects patient demographics, treatment regimens, and outcomes.
- EHR data. Ensuring accurate capture of diagnoses, medications, and patient responses.
2.2 Data normalization
Applying consistent principles and conventions for data normalization:
- Standardized units. Ensuring consistent measurement units (e.g., mg/mL for drug concentration).
- Controlled vocabularies. Maintaining consistent terminology for diagnoses, medications, and procedures to avoid misinterpretations.
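A minimal sketch of unit standardization, assuming concentrations arrive in a mix of µg/mL and mg/mL and should be normalized to mg/mL; the sample IDs, values, and conversion table are illustrative.

```python
import pandas as pd

# Hypothetical drug-concentration measurements recorded in mixed units.
df = pd.DataFrame({
    "sample_id":     ["S1", "S2", "S3"],
    "concentration": [250.0, 0.5, 1200.0],
    "unit":          ["ug/mL", "mg/mL", "ug/mL"],
})

# Conversion factors to the standard unit (mg/mL).
to_mg_per_ml = {"mg/mL": 1.0, "ug/mL": 0.001}

df["concentration_mg_ml"] = df["concentration"] * df["unit"].map(to_mg_per_ml)
df["unit"] = "mg/mL"
print(df)
```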
2.3 Typos in data
Identifying and rectifying typos and data entry errors:
- Critical fields. Detecting typos in drug names, dosages, and patient identifiers to prevent significant model inaccuracies.
- Domain-specific validation. Implementing validation rules for pharmaceutical data, checking for valid dosage ranges and correct anatomical terminology.
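A sketch of both checks, assuming a hypothetical reference vocabulary of drug names and per-drug dose ranges (the ranges here are illustrative, not prescribing guidance); difflib is used only to suggest likely corrections for typos.

```python
import difflib
import pandas as pd

# Hypothetical reference vocabulary and illustrative per-drug dose ranges (mg).
known_drugs = ["Metformin", "Atorvastatin", "Lisinopril"]
dose_ranges_mg = {"Metformin": (250, 2550), "Atorvastatin": (10, 80), "Lisinopril": (2.5, 40)}

records = pd.DataFrame({
    "drug_name": ["Metformin", "Atorvastatn", "Lisinopril"],  # note the typo
    "dose_mg":   [500, 20, 400],                               # 400 mg is out of range
})

for _, row in records.iterrows():
    name, dose = row["drug_name"], row["dose_mg"]
    if name not in known_drugs:
        suggestion = difflib.get_close_matches(name, known_drugs, n=1)
        print(f"Unknown drug '{name}', did you mean {suggestion}?")
    else:
        low, high = dose_ranges_mg[name]
        if not (low <= dose <= high):
            print(f"{name}: dose {dose} mg outside expected range {low}-{high} mg")
```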
2.4 Anomalies within data
Detecting and addressing anomalous data points:
- Clinical trial outliers. Investigating outliers in treatment responses to determine if they indicate biological effects or data collection errors.
- Biochemical outliers. Identifying abnormal lab results that may indicate errors or rare medical conditions.
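A common first pass for such outliers is a simple interquartile-range (IQR) rule per analyte. The glucose values below are made up, and flagged points should go to clinical review rather than automatic removal, since they may reflect real rare conditions.

```python
import pandas as pd

# Hypothetical fasting-glucose results in mg/dL.
glucose = pd.Series([92, 88, 101, 95, 410, 90, 97], name="glucose_mg_dl")

q1, q3 = glucose.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = glucose[(glucose < lower) | (glucose > upper)]
print(outliers)  # candidates for review: data error vs. rare condition
```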
2.5 Missing data
Analyzing and managing missing values:
- Patterns in missing data. Identifying systematic absences, such as missing patient demographics, which may point to broader data collection issues.
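One way to surface systematic absences is to tabulate missingness by a grouping variable such as trial site; the sites and columns below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical enrollment data; site "B" systematically fails to record age.
df = pd.DataFrame({
    "site":   ["A", "A", "B", "B", "B"],
    "age":    [54, 61, np.nan, np.nan, np.nan],
    "weight": [70.0, 82.5, 91.0, np.nan, 77.3],
})

# Share of missing values per column, broken down by site.
missing_by_site = df.drop(columns="site").isna().groupby(df["site"]).mean()
print(missing_by_site)  # a column near 1.0 for one site suggests a systematic gap
```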
3. Data uniqueness check
Preventing duplicate data points to avoid inflated sample sizes and misleading AI-produced insights.
3.1 Identifying duplicates
Taking steps to identify duplicate data objects:
- Matching criteria. Establishing criteria for identifying duplicates, such as patient ID, compound structure, demographic details, and clinical trial identifiers.
- Fuzzy matching. Implementing techniques to account for variations in data entry, like slight name spelling differences or date format inconsistencies.
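A minimal fuzzy-matching pass over hypothetical patient records, normalizing date formats and comparing names with difflib; real pipelines typically combine several fields, blocking strategies, and tuned thresholds.

```python
import difflib
import pandas as pd

# Hypothetical patient records that may describe the same person.
df = pd.DataFrame({
    "record_id":  [1, 2, 3],
    "name":       ["Anna Kowalski", "Ana Kowalski", "John Smith"],
    "birth_date": ["1980-05-02", "May 2, 1980", "1975-11-20"],
})

# Normalize dates so format differences don't hide matches
# (format="mixed" requires pandas >= 2.0).
df["birth_date"] = pd.to_datetime(df["birth_date"], format="mixed").dt.date

# Pairwise comparison: same birth date + high name similarity => candidate duplicate.
pairs = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        a, b = df.iloc[i], df.iloc[j]
        score = difflib.SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if a["birth_date"] == b["birth_date"] and score > 0.85:
            pairs.append((a["record_id"], b["record_id"], round(score, 2)))
print(pairs)  # records 1 and 2 come out as a likely duplicate pair
```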
3.2 Analyzing duplicate sources
Investigating the root causes of duplicate records, focusing on:
- Data integration issues. Detecting issues and standardizing processes to prevent duplicates arising from integrating different databases (e.g., EHRs and clinical trial systems).
- Human error. Addressing data entry errors by implementing validation rules and controlled vocabularies.
3.3 Strategies for handling duplicates
Determining the most appropriate approach to handle duplicates:
- Merging duplicates. Merging duplicates while retaining relevant data points from each instance.
- Flagging and removal. Flagging duplicates for further investigation or removing them if the "correct" record is unclear.
- Domain-specific considerations. Tailoring duplicate-management strategies to specific data types, such as patient demographics or compound structure data.
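A sketch of the "merge while retaining relevant data points" option, collapsing hypothetical duplicate patient rows by keeping the first non-null value per field:

```python
import numpy as np
import pandas as pd

# Two hypothetical records for the same patient, each with different gaps.
dupes = pd.DataFrame({
    "patient_id": ["P001", "P001"],
    "weight_kg":  [72.0, np.nan],
    "smoker":     [np.nan, "no"],
})

# Merge: take the first non-null value per field within each patient.
merged = dupes.groupby("patient_id", as_index=False).first()
print(merged)  # one row: weight_kg=72.0, smoker="no"
```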
3.4 Preventing future duplicates
Making sure duplicates don’t occur in the future:
- Standardized data collection. Using standardized forms and electronic data capture systems to minimize human error.
- Data cleansing routines. Scheduling regular data cleansing to identify and address duplicates before they impact analysis.
4. Data existence check
Ensuring there is complete data across time, location, and user contexts to avoid biased models and inaccurate outputs.
4.1 Time-based data check
Verifying the presence of complete data points across the relevant timeframe for analysis:
- Clinical trials. Verifying complete data capture for all trial phases, including enrollment, dosing, and adverse event reporting.
- EHR data. Ensuring there are comprehensive records of patient histories, treatment courses, lab results, and diagnoses.
- Pharmacovigilance. Ensuring there is exhaustive reporting of adverse events throughout post-marketing surveillance.
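One way to run a time-based completeness check, assuming a hypothetical visit schedule where every enrolled patient should have baseline, week 4, and week 8 records:

```python
import pandas as pd

# Expected visits per patient in a hypothetical trial schedule.
expected_visits = {"baseline", "week_4", "week_8"}

visits = pd.DataFrame({
    "patient_id": ["P001", "P001", "P001", "P002", "P002"],
    "visit":      ["baseline", "week_4", "week_8", "baseline", "week_8"],
})

# Which patients are missing which scheduled visits?
observed = visits.groupby("patient_id")["visit"].apply(set)
gaps = observed.apply(lambda s: expected_visits - s)
print(gaps[gaps.map(len) > 0])  # P002 is missing week_4
```

The same pattern extends to location- and user-based checks by grouping on site, region, or data-entry personnel instead of patient.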
4.2 Location-based data check
Verifying the completeness of geographical information:
- Clinical trial sites. Confirming complete capture of patient enrollment location, including country, region, and specific sites.
- Pharmacovigilance. Verifying the location of adverse events to identify geographic trends.
- Supply chain tracking. Assessing the completeness of geolocation data for drug manufacturing and raw materials.
4.3 User-based data check
Checking whether the data associated with specific collectors or users is comprehensive enough:
- Clinical trial data. Ensuring complete data at each trial site, including dosage records by research personnel.
- EHR data. Checking the data entered by healthcare professionals, such as diagnoses by physicians.
5. Data augmentation and synthetic data
Fighting data scarcity, which can substantially hinder research and model effectiveness.
5.1 Data augmentation
Manipulating existing data (e.g., medical images, EHR) to create variations (rotation, noise) for better model generalizability.
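A numpy-only sketch of the imaging side of this, applying a rotation and Gaussian noise to a dummy grayscale "scan"; in practice, dedicated augmentation libraries offer richer, label-aware pipelines.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Dummy grayscale "scan" standing in for a real medical image.
image = rng.random((64, 64)).astype(np.float32)

# Rotation by 90 degrees (label-preserving for many imaging tasks).
rotated = np.rot90(image)

# Additive Gaussian noise, clipped back to the valid intensity range.
noisy = np.clip(image + rng.normal(0.0, 0.05, size=image.shape), 0.0, 1.0)

augmented_batch = [image, rotated, noisy]  # original plus two variants
```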
5.2 Synthetic data generation
Creating entirely new, realistic data points for:
- Rare disease research (diagnosis, treatment)
- Clinical trial design (patient selection)
- Drug discovery (virtual screening)
Important considerations:
- Data quality. Biases in the original data will likely be amplified in the generated data.
- Validation. Ensure generated data reflects reality to prevent introducing misleading patterns.
- Regulations. Transparency and process documentation are crucial for regulatory compliance.
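As a deliberately simple sketch of synthetic tabular data: fit per-column normal distributions to a small hypothetical cohort and sample new rows. Real synthetic-data work relies on far richer generative models and must preserve inter-column correlations, which this toy example ignores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Small hypothetical cohort (real projects would start from a governed dataset).
real = pd.DataFrame({
    "age":         [54, 61, 47, 70, 58],
    "systolic_bp": [128, 141, 119, 150, 133],
})

# Fit independent normal distributions per column and sample synthetic rows.
synthetic = pd.DataFrame({
    col: rng.normal(real[col].mean(), real[col].std(), size=10)
    for col in real.columns
})
print(synthetic.round(1).head())
```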
Beyond these fundamental data preparation stages, data annotation and anonymization play important roles. Data annotation adds meaning to the data, enabling AI models to interpret it effectively. Anonymization safeguards patient privacy through techniques like pseudonymization and data minimization. For comprehensive data protection, we also recommend regular risk assessments and review processes.
If you're interested in learning more about preparing pharmaceutical data for AI, check out this blog post where we break down each preparation phase in detail.