Three Out of Four Patients Are Missing Smoking Data. That’s Not the Real Problem

Key takeaways:

  • Smoking status is one of the most clinically significant variables in drug development. It affects metabolism, disease progression, treatment response, and trial eligibility. In most structured clinical datasets, it’s wrong or missing for the majority of patients. 
  • This isn’t a niche data quality problem. It’s a representative one. The way smoking data get lost illustrates exactly how structured clinical datasets systematically fail to capture what’s clinically meaningful, and why that matters for every AI model built on them. 
  • The fix exists. TriNetX’s AI-driven, human-in-the-loop extraction from clinical notes has demonstrated that these data can be recovered at scale with the rigor required for regulatory and scientific use, raising smoking status coverage from 22.3% to 66.9% in the datasets where it’s been applied. 

The first post in this blog series made the case that artificial intelligence (AI) performance in clinical development is determined by data quality, not algorithm sophistication, and that four characteristics (comprehensiveness, quality, recency, and transparency) determine whether a data foundation can support trustworthy AI. (If you’re starting here, that post, Why Two Clinical Teams Can Run the Same AI and Get Completely Different Results, is worth reading first.) 

That’s the framework. But frameworks can stay abstract. Sometimes the clearest way to understand a data quality problem is to look at one specific variable: clinically important, well understood, and routinely missing. And then ask how it gets lost, and what it costs. 

Smoking status is that variable. 

Why Smoking Status Matters More Than Most Teams Realize

Smoking is not a lifestyle footnote. It is a clinically significant variable that affects how drugs are metabolized, how diseases progress, how patients respond to treatment, and whether they are eligible for a wide range of trials. Across oncology, respiratory disease, cardiovascular disease, metabolic disease, and many other therapeutic areas, smoking status is not optional information. It’s essential. 

Which makes the following finding striking: in structured clinical datasets, smoking status coverage from coded data alone is often low. In the TriNetX data described later in this post, the coded baseline is 22.3%. In other words, for roughly three out of four patients, the structured record, which is the data stream most commonly used for clinical development analytics, simply doesn’t indicate whether a patient smokes. 

How Clinically Critical Data Get Lost

The reason smoking data get lost isn’t that clinicians don’t ask about smoking. They do. The reason is where the answer gets recorded. 

Structured electronic health record (EHR) fields, the coded data that most clinical datasets are built on, capture smoking status inconsistently. Documentation practices vary across institutions, specialties, and individual clinicians. The same patient may have smoking status recorded at one visit and not another. The information is there, but it’s buried in clinical notes rather than coded fields: a mention in a progress note, a detail in a patient history, a line in a consultation summary. 

Structured data systems, by design, don’t capture what’s in the notes. So the information disappears from view, not because it wasn’t collected, but because the data infrastructure wasn’t built to find it. 

This is the broader pattern behind the data quality pillar introduced in the first post of this series. The problem isn’t only data that are inaccurate or outdated. It’s data that exist in clinically rich, unstructured form and never make it into the datasets that feed AI systems. When those AI systems make feasibility predictions, generate site recommendations, or assess protocol eligibility, they’re working from a picture of the patient population that is systematically incomplete. Not randomly, but in ways that track specific clinical variables. 

What Fixing It Actually Looks Like

Recovering smoking status from clinical notes at scale requires two things working together: AI capable of finding and extracting the relevant information from unstructured text, and human clinical expertise to validate and refine what the AI finds. 

TriNetX’s AI-driven, human-in-the-loop active learning pipeline, applied to clinical notes, has demonstrated the ability to increase smoking status coverage from 22.3% (the coded data baseline) to 66.9%. That’s not just more data. It’s substantially more complete phenotypes for the patients in the dataset. 
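To make the size of that change concrete, here is a short arithmetic sketch using only the two coverage figures quoted above. The function name and the derived metrics (percentage-point gain, relative coverage, share of the original gap closed) are illustrative choices, not figures or methods reported by TriNetX.

```python
# Coverage figures quoted in the article: percent of patients with a known smoking status.
CODED_BASELINE_PCT = 22.3
AFTER_EXTRACTION_PCT = 66.9


def coverage_summary(before_pct: float, after_pct: float) -> dict:
    """Summarize a coverage change in percentage points and in relative terms."""
    missing_before = 100.0 - before_pct
    missing_after = 100.0 - after_pct
    gain = after_pct - before_pct
    return {
        "missing_before_pct": round(missing_before, 1),
        "missing_after_pct": round(missing_after, 1),
        "gain_points": round(gain, 1),
        "relative_coverage": round(after_pct / before_pct, 2),
        # Fraction of the originally missing patients whose status was recovered.
        "share_of_gap_closed_pct": round(100.0 * gain / missing_before, 1),
    }


summary = coverage_summary(CODED_BASELINE_PCT, AFTER_EXTRACTION_PCT)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Read this way, the share of patients with no smoking status falls from 77.7% to 33.1%: a gain of 44.6 percentage points, or three times the coded baseline.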

What gets recovered goes beyond a binary yes/no. The pipeline extracts smoking status, type, pack-years, duration, frequency, and quantity: the clinical detail that matters for trial eligibility and outcome prediction, not just a checkbox. 
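A minimal sketch of why that detail matters, assuming a hypothetical record layout: the `SmokingRecord` class, its field names, and the 20 pack-year cut-off are illustrative only and do not represent TriNetX's schema or any specific protocol. The one fixed fact it relies on is the standard definition of pack-years: packs smoked per day multiplied by years smoked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokingRecord:
    """Hypothetical container for the fields listed above (not a real schema)."""
    status: str                             # e.g. "current", "former", "never"
    tobacco_type: Optional[str] = None      # e.g. "cigarette"
    pack_years: Optional[float] = None      # as documented in the note, if present
    duration_years: Optional[float] = None
    frequency: Optional[str] = None         # e.g. "daily"
    packs_per_day: Optional[float] = None   # quantity

    def derived_pack_years(self) -> Optional[float]:
        """Prefer documented pack-years; otherwise packs per day x years smoked."""
        if self.pack_years is not None:
            return self.pack_years
        if self.packs_per_day is not None and self.duration_years is not None:
            return self.packs_per_day * self.duration_years
        return None


def meets_min_pack_years(record: SmokingRecord, minimum: float) -> Optional[bool]:
    """Return True or False when decidable, None when the record lacks the detail."""
    if record.status == "never":
        return False
    pack_years = record.derived_pack_years()
    if pack_years is None:
        return None
    return pack_years >= minimum


detailed = SmokingRecord(status="former", packs_per_day=1.5, duration_years=20)
checkbox_only = SmokingRecord(status="current")

print(meets_min_pack_years(detailed, 20))       # decidable from quantity and duration
print(meets_min_pack_years(checkbox_only, 20))  # undecidable: status alone is not enough
```

The second record is the checkbox case: the patient is known to smoke, but a criterion expressed in pack-years still cannot be evaluated.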

The validation process matters as much as the extraction. Clinician-reviewed annotations (in this case, more than 3,300 notes) continuously refine the named entity recognition and assertion models, targeting precision of 0.9 or above. The outputs are confidence-scored and fully traceable, which means they can be tuned for sensitivity and specificity and meet the standards required for model governance, auditability, and reproducibility. 
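One way confidence scores can be used to hit a precision target is to pick a cut-off from clinician-reviewed examples. The sketch below is a generic illustration of that idea under stated assumptions, not TriNetX's pipeline: the function names are hypothetical, the reviewed pairs are made-up toy data, and "recall" here means only the share of correct reviewed extractions that survive the cut-off.

```python
from typing import List, Optional, Tuple

# Each pair is (model confidence, whether a clinician judged the extraction correct).
Reviewed = List[Tuple[float, bool]]


def precision_recall_at(scored: Reviewed, threshold: float) -> Tuple[float, float]:
    """Precision and recall of the extractions kept at or above a confidence cut-off."""
    kept = [ok for conf, ok in scored if conf >= threshold]
    true_positives = sum(1 for ok in kept if ok)
    total_correct = sum(1 for _, ok in scored if ok)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


def lowest_threshold_meeting(scored: Reviewed, target_precision: float = 0.9) -> Optional[float]:
    """Lowest observed confidence whose kept set meets the precision target, if any."""
    for threshold in sorted({conf for conf, _ in scored}):
        precision, _ = precision_recall_at(scored, threshold)
        if precision >= target_precision:
            return threshold
    return None


# Toy data, invented for illustration only.
reviewed = [
    (0.95, True), (0.91, True), (0.88, True), (0.80, False),
    (0.76, True), (0.60, False), (0.55, True), (0.40, False),
]

cutoff = lowest_threshold_meeting(reviewed, target_precision=0.9)
precision, recall = precision_recall_at(reviewed, cutoff)
print(cutoff, precision, recall)
```

On this toy set, meeting the 0.9 precision target means discarding two correct extractions. That trade-off between precision and coverage is what confidence scoring makes tunable.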

The result is data that are fit for the uses that matter most: assessing protocol feasibility more accurately, selecting sites more efficiently, and building real-world evidence (RWE) that holds up to regulatory scrutiny. 

What This Reveals About the Broader Problem

Smoking status is instructive not because it’s uniquely difficult to capture, but because it illustrates a pattern that runs across clinical variables. 

Clinically significant information gets collected during patient encounters. It lives in notes, summaries, and histories written by clinicians who know what they observed. Structured data systems capture a fraction of it. The rest stays invisible to the analytical systems (including AI systems) that are supposed to use it. 

Every gap in structured data coverage is a version of the smoking problem. It’s data that exist but can’t be seen. AI trained on structured data alone inherits every one of those blind spots. 

This is what the data quality pillar from the first post looks like in practice. Rigor in data preparation isn’t about cleaning up a messy spreadsheet. It’s about building the capability to recover clinically meaningful information that structured systems systematically miss, and doing it with the validation and traceability that high-stakes use cases demand. 

The full picture of what that capability enables across protocol design, site selection, recruitment, and RWE, along with the evaluation framework for assessing whether your current data foundation can support it, is in The Real-World Data Advantage: Why Clinical Operations Teams Are Rethinking AI Strategy.

About Mike Temple 

Mike Temple, MD, MS, is an expert in clinical informatics. With extensive experience as a practicing pediatrician and a data scientist, he leads TriNetX’s team of data scientists and clinical annotators in the extraction of high-quality, research-ready data from unstructured clinical notes.