This is the companion post to Biomedical Basics for AI/ML People, which covers molecular biology, genomics, proteins, and drug discovery. Here we focus on the clinical practice side — the data that comes from hospitals and patient care.

We’ll cover two major areas where AI is making an impact: medical imaging and electronic health records (EHR).

Medical Imaging

Imaging Modalities

| Modality | What It Captures | Typical Data Format |
| --- | --- | --- |
| X-ray | 2D projection image. Dense tissue such as bone absorbs more radiation than soft tissue or air-filled lungs, which is what creates the contrast. | 2D grayscale image |
| CT (Computed Tomography) | 3D volume reconstructed from X-ray projections taken at many angles. Good for anatomy. | 3D volume (stack of 2D slices) |
| MRI (Magnetic Resonance Imaging) | 3D soft tissue detail using magnetic fields. Multiple contrast types (T1, T2, FLAIR). | 3D volume, often multi-channel |
| Ultrasound | Real-time imaging using sound waves. Common for obstetrics, cardiac. | 2D or 3D, often video |
| PET (Positron Emission Tomography) | Metabolic activity using radioactive tracers. Often combined with CT (PET/CT). | 3D volume, lower spatial resolution than CT or MRI |
| Histopathology | Microscopic images of tissue (biopsies). Stained and digitized as whole slide images. | Very large 2D images (gigapixel-scale) |
| Fundoscopy / OCT | Retinal imaging. Fundus photos (2D) or OCT scans (3D cross-sections). | 2D or 3D |

Key Terms

| Term | What It Is |
| --- | --- |
| DICOM | The standard file format for medical images. Contains pixel data + rich metadata (patient info, acquisition parameters). |
| Segmentation | Delineating structures (organs, tumors) in an image. |
| CAD (Computer-Aided Detection/Diagnosis) | Systems that flag suspicious regions for a radiologist to review. |
| ROI (Region of Interest) | A marked area in an image for analysis. |
| Whole Slide Image (WSI) | A digitized pathology slide, often 100,000 × 100,000 pixels. |
| Radiomics | Extracting quantitative features from medical images. |
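
To make the whole slide image numbers concrete, here is a quick back-of-the-envelope sketch. The 100,000 × 100,000 size comes from the entry above; the 256-pixel tile size and the 8-bit RGB assumption are illustrative choices, not properties of any particular scanner or file format.

```python
import math

def tile_grid(width, height, tile_size):
    """Return (columns, rows, total tiles) for non-overlapping square tiles."""
    cols = math.ceil(width / tile_size)
    rows = math.ceil(height / tile_size)
    return cols, rows, cols * rows

def uncompressed_bytes(width, height, channels=3, bytes_per_channel=1):
    """Raw size of an image with no compression (default: 8-bit RGB)."""
    return width * height * channels * bytes_per_channel

# Assumed tile size of 256 pixels; real pipelines vary.
cols, rows, total = tile_grid(100_000, 100_000, 256)
print(cols, rows, total)                             # 391 391 152881
print(uncompressed_bytes(100_000, 100_000) / 1e9)    # 30.0 (GB)
```

A single slide at this size does not fit comfortably in memory or in a standard CNN input, which is why pathology pipelines usually work on tiles rather than the full image.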

Where AI Fits In

  • Classification: Disease detection from images (e.g., diabetic retinopathy from fundus photos, pneumonia from chest X-rays).
  • Segmentation: Organ and tumor delineation in CT/MRI (U-Net and variants dominate).
  • Detection: Localizing lesions, nodules, or fractures (object detection on medical images).
  • Registration: Aligning images from different time points or modalities.
  • Report generation: Generating radiology reports from images (vision-language models).
  • Digital pathology: Classifying cancer subtypes and grading from whole slide images (multiple instance learning).
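
The multiple instance learning idea in the last bullet can be sketched in a few lines: a slide is a "bag" of tiles, only the slide has a label, and tile-level scores are pooled into one slide-level score. This is the simplest possible version, assuming per-tile probabilities already exist; the function name and the example values are hypothetical, and practical systems typically pool learned tile embeddings (for example with attention) rather than raw probabilities.

```python
def slide_score(tile_probs, pooling="max"):
    """Aggregate per-tile tumor probabilities into one slide-level score."""
    if not tile_probs:
        raise ValueError("a slide (bag) needs at least one tile")
    if pooling == "max":
        # One confident tumor tile is enough to flag the slide.
        return max(tile_probs)
    if pooling == "mean":
        # Reflects how much of the slide looks abnormal.
        return sum(tile_probs) / len(tile_probs)
    raise ValueError(f"unknown pooling: {pooling}")

# Hypothetical outputs of some tile-level classifier for one slide.
tile_probs = [0.02, 0.05, 0.91, 0.03]
print(slide_score(tile_probs, "max"))   # 0.91
print(slide_score(tile_probs, "mean"))
```

Max pooling matches the clinical intuition that a small tumor region makes the whole slide positive, while mean pooling dilutes it; that trade-off is one reason learned pooling is popular.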

Clinical Data & Electronic Health Records

Key Terms

| Term | What It Is |
| --- | --- |
| EHR (Electronic Health Record) | A patient’s digital medical record: diagnoses, medications, labs, notes, imaging. |
| ICD codes | Standardized codes for diagnoses (e.g., ICD-10 E11.9 = type 2 diabetes without complications). The US clinical modification, ICD-10-CM, has ~70,000 codes. |
| CPT codes | Codes for medical procedures and services, used mainly for billing in the US. |
| Lab values | Numeric test results (blood glucose, hemoglobin, etc.) with reference ranges. |
| Clinical notes | Free-text notes written by clinicians. Rich but unstructured. |
| FHIR (Fast Healthcare Interoperability Resources) | A modern standard for exchanging healthcare data via APIs. |
| Cohort | A group of patients defined by shared characteristics for a study. |
| Phenotyping | Identifying patients with a specific condition from EHR data (not always straightforward). |
| De-identification | Removing personally identifiable information from medical data. Usually required before data can be shared for research. |
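
To show why phenotyping is "not always straightforward", here is a minimal rule-based sketch for type 2 diabetes. Everything in it is an assumption made for illustration: the `PatientRecord` structure is hypothetical, and the rule (two or more E11.* codes, or one code plus an HbA1c of at least 6.5%) is a simplified example, not a validated phenotype definition.

```python
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's structured EHR data."""
    patient_id: str
    icd10_codes: list = field(default_factory=list)    # one entry per coded encounter
    hba1c_percent: list = field(default_factory=list)  # HbA1c lab results, in %

def has_t2d_phenotype(record, min_coded_encounters=2, hba1c_threshold=6.5):
    """Illustrative rule: repeated E11.* codes, or one code backed by a high HbA1c."""
    coded = sum(1 for code in record.icd10_codes if code.startswith("E11"))
    high_lab = any(value >= hba1c_threshold for value in record.hba1c_percent)
    return coded >= min_coded_encounters or (coded >= 1 and high_lab)

cohort = [
    PatientRecord("p1", ["E11.9", "I10", "E11.9"], [7.2]),
    PatientRecord("p2", ["E11.9"], [5.4]),   # single code, normal lab: possibly a rule-out
    PatientRecord("p3", ["E11.65"], [8.1]),
    PatientRecord("p4", ["I10"], [6.9]),     # high lab but never coded
]
print([p.patient_id for p in cohort if has_t2d_phenotype(p)])  # ['p1', 'p3']
```

Patients p2 and p4 are the interesting cases: a single code or a single lab value alone is ambiguous, so the choice of rule directly changes who ends up in the cohort.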

Key Challenges for ML

  • Data access: Medical data is heavily regulated (HIPAA in the US, GDPR in the EU). Getting access is hard and slow.
  • Missing data: Clinical data is collected for care, not research. Missingness is the norm, not the exception — and it’s often informative (a missing test may mean the doctor didn’t think it was needed).
  • Irregular time series: Vital signs and labs are recorded at irregular intervals, unlike the uniform grids ML models typically expect.
  • Label noise: Diagnoses in EHRs are usually recorded as billing codes, not ground-truth labels. A code may be entered only to justify a “rule-out” test, even when the patient turns out not to have the condition.
  • Class imbalance: Rare diseases are, by definition, rare. Most clinical prediction tasks are highly imbalanced.
  • Distribution shift: Patient populations, clinical practices, and coding conventions vary across hospitals.
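
The missing-data and irregular-time-series points can be illustrated together. One common workaround, sketched below under simple assumptions, is to bin measurements onto an hourly grid, carry the last observation forward, and keep a separate mask so a model can still see where values were actually measured. The function name and the heart-rate readings are made up for the example.

```python
import math

def to_hourly_grid(observations, n_hours):
    """Resample irregular (time_in_hours, value) pairs onto an hourly grid.

    Returns (values, observed): values uses last-observation-carried-forward,
    and observed[i] is 1 only if a measurement actually fell in hour i.
    """
    values = [None] * n_hours
    observed = [0] * n_hours
    for t, v in sorted(observations):
        hour = math.floor(t)
        if 0 <= hour < n_hours:
            values[hour] = v      # a later reading in the same hour overwrites
            observed[hour] = 1
    last = None
    for i in range(n_hours):
        if observed[i]:
            last = values[i]
        else:
            values[i] = last      # stays None until the first measurement
    return values, observed

# Hypothetical heart-rate readings at irregular times (hours since admission).
hr = [(0.5, 88), (0.9, 92), (3.2, 101), (5.7, 97)]
values, observed = to_hourly_grid(hr, 7)
print(values)    # [92, 92, 92, 101, 101, 97, 97]
print(observed)  # [1, 0, 0, 1, 0, 1, 0]
```

Feeding the mask alongside the filled values lets a model exploit informative missingness instead of treating carried-forward numbers as fresh measurements.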

Where AI Fits In

  • Clinical prediction: Readmission, mortality, sepsis, and deterioration risk scoring from EHR time series.
  • NLP on clinical notes: Named entity recognition, relation extraction, and summarization (ClinicalBERT, Clinical-T5).
  • Medical coding: Auto-assigning ICD/CPT codes to clinical encounters.
  • Federated learning: Training models across hospitals without sharing patient data.
  • Foundation models: Large models pre-trained on broad biomedical or clinical text and then adapted to specific tasks (Med-PaLM, BioGPT).
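
For the federated learning bullet, the core aggregation step can be sketched as a weighted average of each hospital's locally trained parameters, in the style of federated averaging (FedAvg). This is a toy version: model weights are flat lists of floats, the hospitals and numbers are hypothetical, and real deployments add secure communication, multiple rounds, and often privacy mechanisms on top.

```python
def federated_average(client_updates):
    """Weighted average of client parameter vectors (FedAvg-style aggregation).

    client_updates: list of (n_samples, weights) pairs, where weights is a
    list of floats with the same length for every client.
    """
    total = sum(n for n, _ in client_updates)
    if total == 0:
        raise ValueError("no training samples across clients")
    dim = len(client_updates[0][1])
    if any(len(w) != dim for _, w in client_updates):
        raise ValueError("all clients must send vectors of the same length")
    # Each client's contribution is proportional to its local sample count.
    return [
        sum(n * w[i] for n, w in client_updates) / total
        for i in range(dim)
    ]

# Hypothetical updates from three hospitals after one round of local training.
updates = [
    (100, [1.0, 0.0]),
    (300, [2.0, 4.0]),
    (600, [3.0, 1.0]),
]
print(federated_average(updates))  # [2.5, 1.8]
```

Only the parameter vectors and sample counts leave each site; the patient records themselves stay inside the hospital.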

Clinical Databases & Benchmarks

| Resource | What It Contains |
| --- | --- |
| MIMIC | De-identified ICU patient records (EHR benchmark) |
| CheXpert / NIH ChestX-ray14 | Chest X-ray datasets with labels |
| TCGA (The Cancer Genome Atlas) | Multi-omics cancer data (bridges research and clinical) |
| UK Biobank | Large-scale health and genomics data (~500K participants) |