Acronyms
- ITTA - Intention-to-treat analysis
- NNH - Number needed to harm
- NNT - Number needed to treat
- NPV - Negative predictive value
- OTA - On-treatment analysis
- PPA - Per-protocol analysis
- PPV - Positive predictive value
- PSM - Propensity score matching
- RCT - Randomized controlled trial
- RRCT - Registry-based randomized controlled trial
CASE-CONTROL STUDY
- Overview
- A case-control study is an observational study where researchers identify two groups of subjects: one group has an outcome of interest (cases), and the other does not (controls). The prevalence of an exposure is then compared between the groups. The exposure can be just about anything (e.g. medications, environmental toxins, medical conditions), and the outcome is something thought to be related to the exposure (e.g. death, heart disease, cancer). If the exposure is more common in one of the groups, it may be associated with the outcome in some way.
- Example
- Researchers want to evaluate if cell phone use can cause brain cancer, so they identify two groups of subjects:
- Case group: 1000 people with brain cancer
- Control group: 1000 people without brain cancer
- Each group is given a questionnaire about their past cell phone use. The case group reports much higher usage, so the researchers conclude that cell phone use may be associated with brain cancer.
- Advantages
- Cheap and easy - case-control studies are far cheaper and easier to perform than randomized controlled trials. Nowadays, entities that provide healthcare (e.g. governments, HMOs, insurance companies) maintain large databases of medical information on their clients, providing researchers with an enormous amount of data that can be used to perform studies with minimal effort and resources.
- Immediate results - unlike a prospective cohort study, there is no observation period, so results are available as soon as the data is analyzed
- Low-incidence outcomes - outcomes with a low incidence (< 3%) can be evaluated. This may be unfeasible in a randomized controlled trial because it requires a very large number of participants.
- Unethical exposures - case-control studies can evaluate exposures that would be unethical in a randomized controlled trial. For example, no trial is going to randomize patients to smoking or other known carcinogens.
- Impractical exposures - case-control studies can evaluate exposures that would be impractical in a randomized controlled trial. For example, it would be impossible to randomize people to drinking alcohol or abstinence, as these behaviors are unlikely to change overnight.
- Medical conditions and outcomes - case-control studies can evaluate associations between medical conditions and outcomes. You can’t randomize people to diseases like rheumatoid arthritis or ulcerative colitis, so the only way to look for a link between these conditions and an outcome like cancer or heart disease is with observational data.
- Disadvantages
- Cannot prove causality
- Data for patient variables may be lacking or missing and must be estimated
- The incidence of the outcome is not measured, so group comparisons are expressed as odds ratios, which are less intuitive for quantifying risk
- Patients are not randomized, so there is no way to control for unmeasured covariates, confounders, and hidden bias (see randomization for definitions)
COHORT STUDY
- Overview
- A cohort study is an observational study that evaluates the association between an exposure and an outcome. The exposure can be just about anything (e.g. medications, tobacco smoke, sunlight, environmental toxins), and the outcome is something thought to be related to the exposure (e.g. death, heart disease, cancer). Groups of people called cohorts are identified based on whether or not they have been exposed to the factor of interest. The cohorts are then followed for a period of time, and the incidence of the outcome is measured. If the outcome is more common in one of the groups, the exposure may be associated with the outcome in some way.
- Cohort studies may be prospective or retrospective. In prospective studies, cohorts are identified before the observation period begins. In retrospective studies, cohorts and outcomes are identified after the observation period has occurred; this is typically done with a medical registry or database.
- Example
- Researchers want to see if smoking is associated with bladder cancer, so they identify two cohorts of people:
- Exposed cohort: 1000 smokers
- Control cohort: 1000 nonsmokers
- They follow the two cohorts for 10 years and compare the incidence of bladder cancer over that time. Bladder cancer is more common in the smoker cohort, so they determine that smoking may cause bladder cancer.
- Advantages
- Cheap and easy - cohort studies are far cheaper and easier to perform than randomized controlled trials. Nowadays, entities that provide healthcare (e.g. governments, HMOs, insurance companies) maintain large databases of medical information on their clients, providing researchers with an enormous amount of data that can be used to perform studies with minimal effort and resources.
- Low-incidence outcomes - outcomes with a low incidence (< 3%) can be evaluated. This may be unfeasible in a randomized controlled trial because it requires a very large number of participants.
- Unethical exposures - cohort studies can evaluate exposures that would be unethical in a randomized controlled trial. For example, no trial is going to randomize patients to smoking or other known carcinogens.
- Impractical exposures - cohort studies can evaluate exposures that would be impractical in a randomized controlled trial. For example, it would be impossible to randomize people to drinking alcohol or abstinence, as these behaviors are unlikely to change overnight.
- Medical conditions and outcomes - cohort studies can evaluate associations between medical conditions and outcomes. You can’t randomize people to diseases like rheumatoid arthritis or ulcerative colitis, so the only way to look for a link between these conditions and an outcome like cancer or heart disease is with observational data.
- Disadvantages
- Cannot prove causality
- Data for patient variables may be lacking or missing and must be estimated
- Patients are not randomized, so there is no way to control for unmeasured covariates, confounders, and hidden bias (see randomization for definitions)
META-ANALYSIS
- Overview
- A meta-analysis is an observational study where the results from a number of related studies are pooled and analyzed in order to draw conclusions from a broader set of data. Study types may include randomized controlled trials, cohort studies, case-control studies, or a combination of the three. Meta-analyses use special statistical techniques to manage variance within and across studies and to combine outcome measures that differ in type (e.g. means, odds ratios, correlations).
- Example
- A meta-analysis from the Cochrane review [PMID 24953955] evaluated the effects of vitamin D supplementation on cancer risk. Study authors searched the medical literature for randomized controlled trials comparing vitamin D to placebo or no treatment that included cancer as an outcome. They found 18 studies that met their criteria and were able to pool results for 50,623 patients. Using special statistical techniques, they analyzed the data and found no significant overall effect of vitamin D on cancer occurrence.
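- The core pooling step can be illustrated with a short sketch. Below is a minimal Python example of fixed-effect, inverse-variance pooling; the effect sizes and standard errors are invented for illustration, not data from the cited review:

```python
import numpy as np

# Fixed-effect, inverse-variance pooling: each study's log risk ratio is
# weighted by the inverse of its variance, so precise studies count more.
# The numbers below are illustrative only.
log_rr = np.array([-0.10, 0.05, -0.20, 0.00])  # per-study log risk ratios
se = np.array([0.08, 0.12, 0.15, 0.10])        # per-study standard errors

weights = 1 / se**2
pooled = np.sum(weights * log_rr) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled RR = {np.exp(pooled):.2f}, "
      f"95% CI {np.exp(low):.2f} to {np.exp(high):.2f}")
```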
- Advantages
- When studies on the same or similar topic come to different conclusions, a meta-analysis offers a way to combine results so that an overall effect can be estimated
- If a topic has been evaluated in a number of small trials, a meta-analysis can pool data from those trials, increasing study power and the chance of finding a significant effect
- The effect size in a meta-analysis, which is often less than what is seen in individual trials, may provide a truer estimate of the overall effect in a large population
- Disadvantages
- Combining results from trials that used different criteria, protocols, and outcome measures lowers precision and increases the chance of erroneous conclusions
- Meta-analyses are subject to publication bias, which occurs when only studies that found a significant or large effect are published. To account for this, researchers often try to obtain unpublished studies.
- Meta-analyses are only as good as the studies they evaluate. If there are not enough good studies to perform a valid meta-analysis, the authors should be willing to abandon their effort.
- If a large randomized controlled trial has been performed on the topic, its results generally carry more weight than a meta-analysis of smaller studies
NETWORK META-ANALYSIS
- Overview
- A network meta-analysis is a unique type of meta-analysis that compares interventions indirectly across a network of connected studies. Stated in the abstract, this can be difficult to grasp, so it's best illustrated with an example.
- Example
- There are few head-to-head antidepressant trials, so researchers decide to compare medications using a network meta-analysis. They identify five studies comparing Paxil to placebo, three comparing Lexapro to placebo, and one comparing Lexapro to Zoloft. In a network meta-analysis, Paxil and Lexapro are indirectly compared by contrasting each drug's effect size against their combined placebo groups. Furthermore, a comparison between Zoloft and Paxil can be made because they are connected through the network. The illustration below shows how the drugs are connected, with the size of the nodes being proportional to the number of participants in each group and the width of the lines relative to the number of trials between the drugs. The dotted lines represent indirect comparisons.

- Advantages
- The main advantage of a network meta-analysis is the ability to compare therapies that have not been evaluated in head-to-head trials
- Disadvantages
- A network meta-analysis has all the same disadvantages as a standard meta-analysis, along with an additional layer of error and uncertainty that comes with making indirect comparisons [13]
OBSERVATIONAL STUDY
- Overview
- An observational study is an analysis where subjects are grouped according to their exposure to a select factor (e.g., received an intervention, possess a disease risk factor). Outcomes are then compared between the exposed and unexposed groups. The main types of observational studies are cohort studies and case-control studies. Observational studies differ from randomized controlled trials in that participants are not randomly assigned to the factor of interest.
- Advantages
- Far cheaper and easier to perform than randomized controlled trials
- Able to analyze outcomes with a low incidence (<3%)
- Able to evaluate exposures that would be unethical (e.g., tobacco smoke, medications in pregnancy) or impractical (e.g., long-term alcohol use) in an RCT
- Evaluate the association between medical conditions and outcomes (e.g., rheumatoid arthritis and cancer)
- Disadvantages
- Because subjects are not randomized, observational studies are always susceptible to confounding and bias
RANDOMIZED CONTROLLED TRIAL (RCT)
- Overview
- An RCT is a study where subjects are randomly assigned to receive an intervention or not. Outcomes are then compared between recipients and nonrecipients. When performed correctly, RCTs minimize confounding and bias because both measured and unmeasured covariates are distributed equally between groups, something that is unlikely in an observational study. RCTs are considered the gold standard of medical studies because they are the only study type that can prove causality.
- Advantages
- The randomization process minimizes confounding and bias
- Inclusion criteria and outcomes are clearly defined and accurately measured
- Large RCTs are the only type of study that can prove causality
- Disadvantages
- Expensive and time-consuming
- Can take many years to complete
- If an outcome has a low incidence, RCTs are not typically feasible because many participants must be enrolled, making the trial expensive and complicated
- RCTs are unethical (e.g., smoking vs none) and/or impractical (e.g. long-term alcohol intake vs none) in some situations
PRAGMATIC TRIALS
- Studies referred to as pragmatic trials have become more common recently. The definition of a pragmatic trial varies, but generally, it is defined as a study that seeks to evaluate the real-world effects of an intervention. This differs from traditional trials (also called explanatory trials), whose primary purpose is to assess the efficacy of an intervention under ideal conditions.
- Pragmatic trials often have less restrictive inclusion/exclusion criteria, allowing for a broader patient population more reflective of a typical provider's practice. They are also frequently performed within a single healthcare or clinic system using open-label interventions and simple designs.
- Advocates of pragmatic trials claim their findings are a better measure of how interventions perform in the real world. Critics argue that less restrictive criteria and open-label designs increase the risk of bias and confounding.
REGISTRY-BASED RANDOMIZED CONTROLLED TRIAL (RRCT)
- Overview
- RRCTs are randomized controlled trials that use large medical registries/databases to help streamline and simplify data collection. Certain entities (e.g., HMOs, countries) maintain extensive databases of information on patients in their healthcare system. RRCTs use this information to identify eligible patients, log baseline information, determine randomization, and track outcomes.
- Advantages
- Less expensive and time-consuming than a traditional RCT
- The centralized process makes it easier to enroll and recruit patients
- Registry information helps to identify more eligible patients
- Follow-up rates may be higher as outcomes are tracked in the registry, and investigator-specific follow-up is not required
- Results may be more pragmatic since RRCTs occur during the course of typical patient care and do not impose trial-specific procedures
- Disadvantages
- Outcome events are based on registry entries, which means they are not adjudicated and must typically be objective and broad (e.g., overall mortality)
- Blinding and placebo control are difficult to implement
- Not amenable to all types of outcome data
PAIRED AVAILABILITY DESIGN STUDY
- Overview
- Paired availability design studies are observational studies that use the time period before an intervention is available (or widely available) as a historical control. Outcomes before and after the intervention is available are measured and compared in all patients who are candidates, even those who do not receive the intervention. By including all eligible subjects in the analysis, selection bias is theoretically limited. Data may be collected prospectively, retrospectively, or using a combination of the two.
- Advantages
- Cheaper and easier than randomized controlled trials
- Data collection is often facilitated by a registry
- Less prone to selection bias than a cohort study
- Disadvantages
- In order for the results to be valid, the following assumptions must hold true:
- Stable population meaning the before and after patient population does not change in some significant way. Because of this, these studies are typically performed at hospitals that serve a specific patient population or geographic region.
- Stable treatment meaning other treatments or care does not change significantly between the before and after periods
- Stable evaluation meaning diagnostic and monitoring modalities are not different between the two periods. Improvements in diagnosis and monitoring can improve outcomes.
- Stable preference meaning new data/information does not influence whether a patient chooses the intervention
- The availability of the intervention should not affect its efficacy. This applies when some people in the before group receive the intervention. Increased availability may lead to earlier treatment in the after group, affecting outcomes. Interventions with a learning curve may not perform as well in the before group compared to the after group. [9]
TEST-NEGATIVE STUDY DESIGN
- Overview
- Test-negative study design is a type of case-control study used to measure vaccine efficacy. It differs from traditional case-control studies in that only subjects who seek medical care and test negative for the condition of interest are included in the control group. Theoretically, test-negative controls reduce bias from healthcare-seeking behavior, which occurs when patients who are more likely to get vaccinated are also more likely to seek healthcare when they have symptoms. Patients with low healthcare-seeking behavior may not seek testing when they have the condition, making the vaccine appear less effective if they are selected as controls. By limiting controls to subjects who seek healthcare when they feel sick, test-negative design reduces healthcare-seeking bias.
- Example
- Researchers want to evaluate the effects of the latest COVID vaccine. Using a medical registry, they identify patients who tested positive for COVID over a three-month period (cases). For the control group, they select patients who were seen in a medical setting and tested negative for COVID. They then compare vaccine rates between groups.
- By limiting controls to patients who sought medical care, they reduce healthcare-seeking bias.
BLINDING
- In blinded trials, participants are masked to their assigned treatment. Blinding is essential when outcomes are subjective and can even influence results when they are objective. There are three types of blinding (see below); studies with no blinding are called open-label. A measure called the blinding index is sometimes used to assess whether trial blinding was successful.
- Single-blinded study - only the patient is unaware of their treatment group assignment
- Double-blinded study - patients and researchers in direct contact with patients are unaware of treatment group assignments
- Triple-blinded study - patients, researchers in direct contact, and research committees in charge of monitoring outcomes are unaware of treatment group assignments
- Open-label study - everyone is aware of treatment group assignments
COMPETING RISKS
- The Kaplan–Meier estimator is a survival analysis method often used to evaluate outcomes in prospective studies. It was originally developed to measure survival over a period of time, with overall mortality being the primary outcome. However, its use has expanded, and it is frequently applied to outcomes other than mortality, creating issues with its accuracy. To understand the problem, it's important to consider how patients are treated in survival analysis. In traditional survival analysis, where mortality is the endpoint, all subjects are assumed to experience the primary outcome. Censored patients, i.e., those lost during follow-up, are assumed to die at some point after they are censored, and subjects who are alive at the end of the study are assumed to die after the study ends. When the primary endpoint is something other than death, this assumption may not hold true, causing the risk of the outcome to be overestimated.
- Consider a study in elderly patients using survival analysis to estimate the lifetime risk of a hip replacement. Patients who die before they receive a hip replacement are no longer at risk for one, however, Kaplan-Meier methods assume they are, causing hip replacement risk to be overestimated. To correct for this, researchers often use competing risk techniques (e.g., death as a competing risk) to adjust risk estimates for subjects who die before experiencing the primary event.
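- The difference can be shown with a small Python sketch (invented data). The naive estimate treats deaths as censored and overestimates the risk of hip replacement relative to a cumulative incidence estimate that treats death as a competing event:

```python
import numpy as np

# event codes: 1 = hip replacement, 2 = death (competing risk), 0 = censored
time = np.array([1, 2, 2, 3, 4, 5, 6, 7, 8, 9])
event = np.array([1, 2, 0, 1, 2, 1, 2, 0, 1, 2])

def naive_km_risk(time, event):
    """1 - Kaplan-Meier, treating deaths like censored subjects (overestimates)."""
    surv = 1.0
    for t in np.unique(time):
        at_risk = np.sum(time >= t)
        hips = np.sum((time == t) & (event == 1))
        surv *= 1 - hips / at_risk
    return 1 - surv

def cumulative_incidence(time, event):
    """Aalen-Johansen-style estimate with death as a competing risk."""
    surv, cif = 1.0, 0.0
    for t in np.unique(time):
        at_risk = np.sum(time >= t)
        hips = np.sum((time == t) & (event == 1))
        deaths = np.sum((time == t) & (event == 2))
        cif += surv * hips / at_risk           # only event-free subjects can fail
        surv *= 1 - (hips + deaths) / at_risk  # death removes subjects from risk
    return cif

print(f"naive: {naive_km_risk(time, event):.2f}, "
      f"competing-risk: {cumulative_incidence(time, event):.2f}")
```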
HIERARCHICAL WIN RATIO
- Hierarchical win ratio is a method used to compare two or more treatment groups in a clinical trial with multiple outcomes of varying importance. It's particularly useful when some outcomes are considered more significant than others. Instead of analyzing each outcome separately, it combines them into a single composite measure that respects a pre-defined hierarchy. Hierarchical win ratio is performed using the following steps:
- 1. Defining the Hierarchy of Outcomes: The first step is to establish a clear and clinically meaningful order of importance for the outcomes being measured. For example, in a cardiovascular trial, the hierarchy might be:
- All-cause mortality (most important)
- Non-fatal stroke
- Non-fatal myocardial infarction
- Hospitalization for heart failure (least important)
- 2. Pairwise Comparisons: Each participant in one treatment group is paired with each participant in the other group, and their outcomes are compared along the hierarchy.
- 3. Determining the "Winner" in Each Pair: The participant in a pair is declared a "winner" based on the following rules:
- If one participant experiences a better outcome at the highest level of the hierarchy, they are the winner, regardless of the outcomes at lower levels
- If the outcomes are the same at the highest level, the comparison moves to the next level in the hierarchy
- This process continues down the hierarchy until a difference is observed or all outcomes have been compared
- If all outcomes are the same for both participants, it's considered a "tie"
- 4. Calculating the Win Ratio: After comparing all possible pairs of participants between the treatment groups, the win ratio is calculated as:
- Win Ratio = # of pairs where the experimental group wins / # of pairs where the control group wins
- 5. Interpreting the Win Ratio:
- A win ratio greater than 1 suggests that the experimental treatment is more beneficial overall, as more participants in the experimental group experienced better outcomes according to the hierarchy
- A win ratio less than 1 suggests that the control treatment is more beneficial
- A win ratio equal to 1 suggests no overall difference between the treatments
- 6. Statistical Significance: To determine if the observed win ratio is statistically significant, statistical methods (often based on non-parametric tests like the Mann-Whitney U test or extensions of it) are used to calculate a confidence interval and a p-value for the win ratio
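- As a rough sketch of steps 2 through 4, the pairwise comparisons and the ratio can be computed as follows. The Python example below uses invented outcome tuples, ordered from most to least important, where 1 = event occurred and 0 = no event (fewer events is better):

```python
from itertools import product

# Each participant is a tuple: (death, stroke, MI, HF hospitalization)
treatment = [(0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 0)]
control = [(1, 0, 0, 0), (0, 0, 1, 1), (0, 0, 0, 0)]

def compare(a, b):
    """Walk down the hierarchy; the first difference decides the winner."""
    for x, y in zip(a, b):
        if x < y:
            return 1   # a wins (avoided the more important event)
        if x > y:
            return -1  # b wins
    return 0           # tie at every level

results = [compare(t, c) for t, c in product(treatment, control)]
wins, losses = results.count(1), results.count(-1)
print(f"win ratio = {wins}/{losses} = {wins / losses:.2f}")
```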
- Advantages
- Handles multiple outcomes: Effectively combines several outcomes into a single, interpretable measure
- Incorporates clinical importance: Respects the relative importance of different outcomes by prioritizing them in the analysis
- Potentially more sensitive: Can be more sensitive at detecting a treatment effect when the treatment has a greater impact on the more important outcomes
- Clinically Meaningful Interpretation: The win ratio provides a straightforward way to understand which treatment is "winning" more often in terms of clinically relevant outcomes
- Disadvantages
- Requires Pre-defined Hierarchy: The results are highly dependent on the chosen hierarchy, which needs to be carefully justified and defined a priori (before analyzing the data)
- Loss of Information: By focusing on the "winner" in each pair, effect sizes are not evaluated
- Dilution by less important outcomes: If a treatment has a strong effect on a less clinically relevant outcome lower in the hierarchy, this might drive the win ratio result even if there's no significant impact on more important outcomes. This could lead to a statistically significant result that isn't clinically meaningful.
- Complexity: The calculation and statistical inference can be more complex than analyzing individual outcomes
INSTRUMENTAL VARIABLE ANALYSIS
- Overview
- A major limitation of observational studies is the inability to detect and control for unmeasured confounders. Randomization, which is not possible in observational studies, is the only method that can completely eliminate the effects of unmeasured confounders.
- A technique called instrumental variable analysis has been developed to help reduce the effects of unmeasured confounders in observational studies. An instrumental variable is a patient variable that is strongly associated with the treatment a patient receives while at the same time having no association with the outcome being measured. For example, a new diabetes drug may only be covered by a few insurance companies. Patients with insurance that covers the drug are far more likely to receive the drug than patients without coverage. In this case, insurance drug coverage is an instrumental variable because it is strongly associated with treatment received but should have no association with diabetic outcomes like A1C values and CVD unless the new drug is truly superior to established therapies.
- Once an instrumental variable is identified, researchers can "randomize" patients by comparing outcomes between patients grouped by the instrumental variable. [11]
- Example
- Researchers want to see if gallbladder removal for cholecystitis within 12 hours of presentation is associated with better outcomes than removal beyond twelve hours
- They collect data on patients who have had their gallbladder removed and divide them into 2 groups based on whether they had it removed within 12 hours or after 12 hours
- The two groups differ on a number of covariates so the researchers decide to look for an instrumental variable. They find that patients admitted on the weekend were far more likely to have their gallbladder removed after twelve hours when compared to patients admitted during the week. Since day of hospital admission should not have a direct effect on outcomes, they decide to use it as their instrumental variable. An analysis of the data is then performed with patients grouped by whether they were admitted on the weekend or on a weekday.
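- A minimal two-stage (Wald) sketch in Python shows the idea with simulated data; the variable names mirror the example above, but the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
weekend = rng.integers(0, 2, n)  # instrument: weekend admission
severity = rng.normal(size=n)    # unmeasured confounder
# exposure: surgery after 12 hours, driven by both instrument and confounder
late_surgery = (weekend + severity + rng.normal(size=n) > 0.5).astype(float)
# outcome score, with a true treatment effect of 2
outcome = 2.0 * late_surgery + 1.5 * severity + rng.normal(size=n)

# Stage 1: exposure on instrument; Stage 2: outcome on instrument.
# Their slope ratio is the Wald/IV estimate of the treatment effect.
beta_zx = np.polyfit(weekend, late_surgery, 1)[0]
beta_zy = np.polyfit(weekend, outcome, 1)[0]
print("naive OLS:", round(np.polyfit(late_surgery, outcome, 1)[0], 2))  # confounded
print("IV estimate:", round(beta_zy / beta_zx, 2))                      # near 2
```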
INVERSE PROBABILITY OF CENSORING WEIGHTING
- Inverse probability of censoring weighting (IPCW) is a statistical technique used to correct for nonadherence in RCTs. Most RCTs use intention-to-treat analysis, meaning participant outcomes are counted towards their original assigned therapy regardless of whether they are adherent to it. Nonadherence occurs when participants stop their assigned treatment, crossover to competing treatments, or violate study protocol. If nonadherence is significant (e.g., > 10% crossovers), study results may be biased toward the null. Per-protocol analysis, as-treated analysis, and on-treatment analysis are methods used to adjust for nonadherence, but they have many shortcomings.
- IPCW, described in the steps below, is another method used to address nonadherence
- Data from nonadherent participants is censored at the time they become nonadherent; data from that point on is excluded
- Participants in treatment arms are grouped based on their risks (e.g., age, sex, lifestyle habits, comorbidities) for developing the primary outcome
- Within each risk group, adherent and nonadherent participants are identified. Outcomes for adherent subjects are given more weight (upweighted) than nonadherent subjects through inverse probability, helping to replace data lost when nonadherent participants are censored. [12]
- For IPCW to be valid, investigators must identify participant characteristics associated with the outcome both at the beginning and during the trial. The method also assumes there are no unmeasured confounders.
- Like other techniques used to account for nonadherence, IPCW shares some of the same shortcomings, including reduced randomization, possible bias toward responders, and unmeasured confounders.
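- A minimal sketch of the weighting step, using invented risk groups and adherence flags:

```python
import pandas as pd

# Within each risk group, adherent participants are upweighted by the inverse
# of the probability of remaining adherent, standing in for censored peers.
df = pd.DataFrame({
    "risk_group": ["high", "high", "high", "low", "low", "low", "low"],
    "adherent": [True, True, False, True, True, True, False],
})

p_adherent = df.groupby("risk_group")["adherent"].transform("mean")
df["weight"] = 0.0  # censored (nonadherent) rows get no weight
df.loc[df["adherent"], "weight"] = 1 / p_adherent[df["adherent"]]
print(df)  # high-risk adherers weigh 1.5 (1/0.67); low-risk weigh 1.33 (1/0.75)
```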
MENDELIAN RANDOMIZATION
- Overview
- Mendelian randomization is an observational study method that uses natural variations in genetic makeup as a means of randomization. For example, HDL cholesterol levels are inversely associated with heart disease risk, with higher levels indicating a lower risk and vice versa. Individual factors that affect levels include exercise, diet, and genetic variants that raise or lower levels. Through Mendelian inheritance, these variants are randomly distributed across a large population, along with other potential confounders, theoretically simulating randomization. To see if HDL levels directly affect cardiovascular disease (CVD) risk, researchers can form patient cohorts based on HDL variants and measure their association with CVD outcomes. If HDL lowers CVD risk, people with HDL-promoting variants will have fewer CVD events and vice versa. In fact, this study was performed, and it found that certain HDL-promoting variants were not associated with lower risk, suggesting HDL is a marker of disease risk but not a modifier. [PMID 22607825] This finding is supported by studies showing HDL-raising drugs (e.g., niacin, fibrates) do not prevent CVD.
- For Mendelian randomization to be valid, the variants of interest must only be associated with the risk factor in question and not other variables that affect the outcome. For example, in the HDL study, if HDL-promoting variants are also associated with lower LDL levels, confounding exists, and incorrect conclusions may be drawn.
- Advantages
- Simulates randomization in an observational study
- May be used for exploratory analysis to see if a proposed risk factor is associated with outcomes. For example, before developing HDL-raising drugs, researchers could have used a Mendelian randomization study to see if HDL is directly related to CVD outcomes.
- Disadvantages
- If a variant is associated with more than one risk factor, confounding is present
- Requires genotyping of a large number of people, which can be expensive [8]
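- As a sketch, the simplest Mendelian randomization estimator (the Wald ratio) divides the variant-outcome association by the variant-exposure association; the numbers below are invented, not from the cited HDL study:

```python
# Per-allele effect of a variant on HDL (exposure) and on CVD (outcome, log odds)
beta_variant_hdl = 0.10
beta_variant_cvd = 0.002

# If HDL causally affects CVD, the variant's effect on CVD should equal its
# effect on HDL times the causal effect of HDL on CVD.
causal_estimate = beta_variant_cvd / beta_variant_hdl
print(f"estimated effect of HDL on CVD: {causal_estimate:.3f} log odds per unit")
```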
AS-TREATED ANALYSIS
- Overview
- In an as-treated analysis, subjects are counted toward the therapy they actually received, meaning people assigned to treatment A who crossover to treatment B are counted toward treatment B. This differs from intention-to-treat analysis, where crossovers count toward their original assigned group, and per-protocol analysis, where crossovers are excluded.
- As-treated analyses are often performed in studies with high crossover rates, and on the surface, they seem to be a logical way to handle crossovers. However, they are very prone to bias (see example below).
- Example
- Researchers want to compare surgery to physical therapy for sciatic nerve pain, so they randomize 200 patients with sciatica to physical therapy or surgery. The primary outcome is back pain and disability at one year.
- Over the course of the study, 30 people assigned to physical therapy end up receiving surgery, and 10 people assigned to surgery never have surgery and instead receive physical therapy (crossovers)
- The researchers decide to do an as-treated analysis so that crossovers count toward the therapy they received
- Bias in as-treated analyses
- In this example, results from 30 people in the physical therapy group are counted toward surgery, and results from 10 in the surgery group are counted toward physical therapy. This appears logical since the outcomes for crossovers now count toward the treatment they received. In reality, the analysis is likely biased. People often enter trials because they hope to receive one of the treatments being offered. Researchers know this, so they allow crossovers to boost enrollment. In the example above, patients who believe invasive treatments are better may be disappointed if they are assigned to physical therapy. They are biased toward surgery, and if they cross over, the as-treated analysis will also be biased. Another potential source of bias is the 10 subjects who did not receive surgery and counted toward physical therapy. It's possible these patients were sicker, making them poor surgical candidates. Applying their results to physical therapy shifts the balance of debilitated patients to one side.
INTENTION-TO-TREAT ANALYSIS (ITTA)
- Overview
- In ITTA, participant outcomes count toward their original assigned group regardless of whether they stop their treatment or crossover to a competing therapy. ITTA is considered the gold standard of outcome analysis because it is the most unbiased method. Its main drawback occurs when crossover rates are high, as they can skew results toward the null or no effect. Other analytical methods that adjust for crossovers (e.g., per-protocol analysis, as-treated analysis, on-treatment analysis) may seem logical but are prone to bias and overestimating effect size.
- See also modified intention-to-treat analysis
- Example
- Researchers want to compare surgery to physical therapy for sciatic nerve pain, so they randomize 200 patients with sciatica to physical therapy or surgery. The primary outcome is back pain and disability at one year.
- Over the course of the study, 30 people assigned to physical therapy end up receiving surgery, and 10 people assigned to surgery never have surgery and instead receive physical therapy (crossovers)
- Intention-to-treat analysis
- In ITTA, outcomes for the 30 people who crossed over from physical therapy to surgery still count toward physical therapy, and outcomes for the 10 people who never received surgery still count toward surgery.
MODIFIED INTENTION-TO-TREAT ANALYSIS
- A modified intention-to-treat analysis (ITTA) is an ITTA where the inclusion criteria for the participants have been changed in some way. Modified ITTAs may be completely harmless or extremely bad form, depending on what they modify. The most common type is the benign form, where researchers only count participants who received at least one dose of the study drug; typically, only a handful of participants are excluded, and it eliminates people who were randomized but never participated in the study. In other cases, a modified ITTA may be more nefarious. Sometimes researchers don't like their results, so they look for patient subgroups with lower response rates. If they identify one, they may perform a modified ITTA that excludes these people so their outcomes improve. This is bad form and should always be viewed with skepticism.
ON-TREATMENT ANALYSIS (OTA)
- Overview
- In an on-treatment analysis (OTA), only data from periods where a subject was compliant with their assigned treatment are counted, i.e., periods where the patient stopped their assigned therapy are excluded.
- While an OTA seems like a logical method for handling protocol violators, it introduces bias in many cases
- Example
- Researchers want to compare a new migraine prevention drug to an older one, so they randomize 200 people to the two medications. Patients are told to take the drugs for 6 months, and the number of migraines in that period is recorded.
- Over the course of the trial, 30 people assigned to the new drug and 10 assigned to the old drug stop taking their assigned treatment. Researchers decide to do an on-treatment analysis so that time periods where patients were not compliant with their medication are excluded.
- Bias in on-treatment analysis
- In this example, the on-treatment analysis is biased. Subjects who felt their treatment was working (responders) are more likely to keep taking it, while those who did not perceive a benefit are more likely to stop it. An on-treatment analysis is biased toward responders because they account for more data.
PER-PROTOCOL ANALYSIS (PPA)
- Overview
- In a per-protocol analysis (PPA), only subjects who follow their assigned study protocol are included in the outcome data. Patients who vary from the protocol (e.g., stop treatment, crossover, drop out) are excluded, and their outcomes do not count. PPAs differ from on-treatment analyses and as-treated analyses in that patients who violate protocol are not counted at all, whereas in the other two methods, data from protocol violators is still used, but it is truncated or applied to different treatment groups.
- PPA may seem like a logical method for handling protocol violators, but it can introduce bias (see Example 1 and 2 below). PPAs are more meaningful in conditions that are difficult to treat and/or when there are few treatment options (see Example 3 below).
- In noninferiority trials where established treatments are compared to new treatments, PPAs should always be reported with intention-to-treat analyses because they test the sensitivity of the results to protocol violations
- Example #1
- Researchers want to compare a new migraine prevention drug to an older one, so they randomize 200 people to the two medications. Patients are told to take the drugs for 6 months, and the number of migraines in that period is recorded.
- Over the course of the trial, 30 people assigned to the new drug and 10 assigned to the old drug stop taking their assigned treatment
- The researchers decide to do a PPA so that people who stopped their medication are excluded
- Bias in PPA
- In this example, the PPA is biased. Subjects stop study medications for various reasons, including side effects and lack of perceived efficacy, and excluding data from nonadherers biases the results toward responders.
- Example #2
- Researchers want to evaluate aspirin for the prevention of DVT, so they randomize 3000 people to aspirin or placebo and measure the incidence of DVT over two years
- Over the course of the trial, 300 subjects in the aspirin group stop their assigned treatment
- Researchers decide to do a PPA, excluding people who stopped aspirin. The analysis shows that compliant aspirin patients had a lower rate of DVT than the placebo group.
- Bias in PPA
- On the surface, this analysis seems legitimate. The outcome is objective, and for the most part, people tolerate aspirin well, so comparing compliant aspirin patients to controls makes sense. The problem is past studies have shown that compliant patients have better outcomes than noncompliant ones, even when they are receiving a placebo. If noncompliant aspirin patients are excluded, the aspirin group becomes biased toward subjects with inherently better outcomes.
- Example #3
- Researchers want to evaluate a new drug for a difficult-to-treat cancer, so they randomize 100 patients to the new medication or standard therapy
- The new drug has many side effects, causing half the participants to stop taking it
- Researchers decide to do a PPA to see if the new drug improved survival in patients who completed therapy
- In this case, the PPA is meaningful because it shows whether patients who can tolerate the therapy benefit from it
PROPENSITY SCORE MATCHING (PSM)
- Overview
- PSM is a statistical technique used in observational studies to control for selection bias. In PSM, study subjects are assigned a score that reflects their probability of receiving a treatment; the score is derived from individual covariates and their association with the treatment. Untreated and treated patients with the same propensity score are then matched, and their outcomes are compared. By controlling for the patient's "propensity" to receive the treatment, selection bias is theoretically limited.
- Procedure
- Probabilities of receiving the treatment based on different covariates (e.g., age, sex, medical conditions) are determined
- For each subject, individual-specific covariate probabilities are combined to derive a propensity score
- Subjects are grouped based on their propensity score, and untreated and treated subjects with similar scores are compared
- Techniques for incorporating propensity scores into data analysis include the following:
- Direct comparison
- In direct comparisons, treated and untreated individuals with the same score are matched and compared. Matching can be one-to-one, one-to-as-many that match, and so on. Subjects can also be stratified by their propensity scores and compared across strata.
- A major disadvantage of direct matching is that not all of the available data may be used. Furthermore, if strata are compared, residual confounding may occur.
- Propensity score as a covariate
- In a multivariate model, the propensity score and the treatment can be used as independent variables, with the outcome as the dependent variable.
- If the propensity score is strongly associated with the outcome, adjustment for it will decrease the strength of association between the treatment and the outcome
- Propensity score weighting
- In propensity score weighting, treated subjects are weighted by the inverse of their propensity score and untreated subjects by the inverse of one minus their score. This causes subjects who were likely to end up in their actual group to count less and those who were unlikely to count more. In theory, this helps to neutralize selection bias by creating a sample where covariate distribution is independent of treatment selection. [5,6]
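- A minimal Python sketch of score estimation and 1:1 nearest-neighbor matching on simulated data (the covariates and coefficients are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))  # standardized covariates, e.g., age, sex, comorbidity
treated = rng.random(n) < 1 / (1 + np.exp(-X @ [0.8, 0.4, 0.6]))

# Propensity score: modeled probability of receiving treatment given covariates
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Greedy 1:1 matching without replacement on the score
controls = list(np.flatnonzero(~treated))
pairs = []
for i in np.flatnonzero(treated):
    if not controls:
        break
    j = min(controls, key=lambda c: abs(ps[c] - ps[i]))  # closest control score
    pairs.append((i, j))
    controls.remove(j)
print(f"matched {len(pairs)} treated/control pairs")
```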
RANDOMIZATION
- Definition and importance
- Randomization is the process of arbitrarily assigning subjects to a study arm. When treatment and control groups are formed, it's important that they be as close to identical as possible so that any outcome difference is entirely attributable to the treatment. If the groups differ in some way, bias and confounding may be present, leading to erroneous findings. Many patient characteristics are easy to measure (e.g. age, sex, race, cholesterol, A1C), but others are hard to define. For example, some patients have biases toward treatments. We've all had the patient who "doesn't like to take medications" and stops them at the first sign of a side effect. Conversely, there are the ones who don't get better unless they receive some kind of intervention, even if it's a placebo. Providers are biased too. If a patient is educated and compliant, their doctor may be more inclined to prescribe them a new therapy that requires frequent lab monitoring as opposed to a noncompliant patient where things are kept simple. These types of biases, which are difficult or impossible to measure, can greatly influence trial outcomes. This is why randomization is so important. When patients are randomly assigned to treatment groups that are large enough, each group will have an equal amount of measured and unmeasured variables, and their effects will cancel each other out. The main weakness of observational studies is that subjects are not randomized, creating the possibility that unmeasured variables aren't balanced.
- Cluster randomization
- Cluster randomization is a process where entire groups of people (e.g. villages, hospitals, clinic systems) are randomized to an intervention. This differs from normal randomization, where participants are assigned to treatment groups at the individual level. Cluster randomization simplifies the randomization and treatment processes because every subject in a single locale receives the same intervention. It is mostly used for studies with objective outcomes that do not require blinding. It also helps to prevent scenarios where cross-contamination can occur. For example, vaccine studies often utilize cluster randomization by assigning treatments to different geographic regions. If normal randomization were used, individuals within the same region would receive vaccine or control, and vaccinated subjects may protect control patients (cross-contamination), causing the true effect of the vaccine to be underestimated.
- Stepped wedge cluster randomization
- Stepped wedge cluster randomization (SWCR) is a study design where clusters are randomized to receive an intervention at different time points over the course of the trial. Initially, all clusters enter the study without the intervention. Then, at regular intervals, clusters are randomly assigned to crossover to the intervention. It differs from parallel cluster randomization in that all clusters receive the intervention at some point, and initiation of the intervention is staggered across clusters. The two diagrams below illustrate the difference and how the name was derived.
- SWCR is primarily used to evaluate institutional or policy interventions such as screening tools (e.g., universal depression screening with PHQ-9), disease management protocols (e.g., inpatient CHF protocols, nosocomial infection prevention measures), and healthcare allocation (e.g., public funding for HIV medications, vaccines).
- Advantages
- Because SWCR clusters serve as their own controls (i.e., repeated measures), study power is increased if clusters are homogeneous (i.e., low intra-cluster variability among subjects) or large.
- Interventions that are restricted by resource availability (e.g., vaccines) or are logistically complex (e.g., public health insurance) have more time for allocation
- Disadvantages
- If there is an underlying improvement in care over the course of the study, the effects of the intervention may be inflated because data from intervention periods is weighted toward the end of the study. For example, suppose Entresto was approved in the middle of a CHF treatment protocol study. Improvements in CHF outcomes may appear to be from the protocol when, in reality, they were from Entresto. Researchers must adjust for these types of temporal trends, which can lower study power.
- If clusters are heterogeneous (i.e., large intra-cluster variability among subjects), study power is reduced, and parallel cluster studies are superior [17]

ABSOLUTE VS RELATIVE RISK
- Overview
- Absolute and relative risk are related measures (typically expressed as percentages) used to convey differences between groups. When dealing with percentages, most people think in terms of absolute risk, where the difference between groups is the difference in the overall incidence of an outcome. Relative risk is the ratio of event rates between the groups, so it reflects only participants who experienced the event. This can cause the two measures to differ greatly, even though they describe the same data. The example below illustrates the point.
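- Example
- Suppose a hypothetical drug, Livelong, is tested for heart attack prevention, and 5% of placebo patients have a heart attack versus 2.7% of Livelong patients (figures invented to match the conclusions below). The absolute risk reduction is 5% - 2.7% = 2.3%, while the relative risk reduction is 2.3/5 = 46%.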
- Conclusions
- As one can see from the example above, relative and absolute risk reductions are far apart at 46% and 2.3%, respectively. It's important to understand the difference because drug companies often emphasize a large relative risk reduction, while the absolute reduction may be small and insignificant. Livelong's manufacturer will likely say it "reduces the risk of heart attack by 46%." Without any context, this number is amorphous. It's more intuitive to say it reduces the overall risk of a heart attack by 2.3%.
- Relative risk and hazard ratios
- Hazard ratios are often interpreted as relative risks, but they are not entirely the same. For example, if the hazard ratio for an outcome is 0.75 for treated versus control patients, it may be assumed the treatment lowers the relative risk of the event by 25%. In most cases, hazard ratios and relative risks are close but not identical. The difference is that relative risk is based on the cumulative number of events over the entire study period regardless of when they occurred, while the hazard ratio reflects the instantaneous event rate, averaged over the course of the study.
BLINDING INDEX (BI)
- The blinding index (BI) is a measure used to assess the success of blinding in a clinical trial. To calculate the BI, subjects are asked at some point during the trial to guess their treatment arm, with responses categorized as correct, incorrect, or don't know. Each response is assigned a numerical value, and an index is calculated that reflects the degree of successful blinding.
- Two methods for calculating the BI are the James BI and the Bang BI, which have the following definitions:
- James BI
- Range: 0 to 1, with 0 meaning all responses were correct and 1 meaning they were all incorrect
- A value of 0.5 means half were correct and half incorrect, indicating random guessing and successful blinding
- If the upper limit of the BI confidence interval is less than 0.5, unblinding may be declared
- Bang BI
- Range: -1 to 1, with -1 meaning all responses were incorrect and 1 meaning they were all correct
- A value of 0 means half were correct and half incorrect, indicating random guessing and successful blinding
- In general, a Bang BI of -0.2 to 0.2 is considered successful blinding. If the BI confidence interval does not include 0, unblinding may be declared.
- If all responses are incorrect (James BI of 1 or Bang BI of -1), subjects may be "opposite guessing" [14,15,16]
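- A minimal sketch of the Bang BI for a single arm, using the common formulation (correct - incorrect) / total, where "don't know" responses count in the denominator; the counts are invented:

```python
correct, incorrect, dont_know = 55, 30, 15  # guesses from one treatment arm
total = correct + incorrect + dont_know

bang_bi = (correct - incorrect) / total
print(f"Bang BI = {bang_bi:.2f}")  # 0 suggests random guessing; 0.25 hints at unblinding
```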
E-VALUE
- Overview
- The E-value is a statistical measure used in observational studies to assess the potential effect of unmeasured confounders. It is a risk ratio that quantifies the minimum strength of association (while controlling for measured covariates) that an unmeasured confounder must have with both the exposure and the outcome to negate the observed association between the two.
- E-values close to one mean it is more likely an unmeasured confounder could affect the observed results, while higher E-values make confounding less likely. [10]
- Example
- An observational study evaluating the association between exercise and cancer finds that exercise significantly lowers cancer risk with a hazard ratio of 0.70 (95%CI [0.50 - 0.88]). To test their findings, the authors calculate an E-value of 3.2, meaning an unmeasured confounder would have to be associated with both exercise and cancer by a risk ratio of at least 3.2 to explain away the observed result. Since the E-value is high, it supports the findings.
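- For reference, the standard formula (VanderWeele and Ding) is E-value = RR + sqrt(RR x (RR - 1)), computed on the risk ratio scale with protective ratios inverted first. A minimal sketch with an illustrative input:

```python
import math

def e_value(rr):
    """E-value for a point estimate; hazard ratios are often treated as
    approximate risk ratios when the outcome is uncommon."""
    rr = 1 / rr if rr < 1 else rr  # invert protective ratios
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(0.50), 2))  # a risk ratio of 0.50 yields an E-value of ~3.41
```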
INCIDENCE AND PREVALENCE
- Incidence - incidence is the proportion of people who experience an outcome over a specified period of time. For example, researchers follow 100 people for a year, and during that time, six of them develop diabetes. The incidence of diabetes over one year is 6%.
- Prevalence - prevalence is the proportion of people who have an outcome at a given time point. For example, researchers survey 100 people, and ten report having diabetes. The prevalence of diabetes in the population is 10%.
LIKELIHOOD RATIO (LR)
- Overview
- Likelihood ratios are used to quantify the predictive value of a test. A likelihood ratio is the factor by which the odds of a condition being present are increased by a positive test (positive likelihood ratio) or decreased by a negative one (negative likelihood ratio).
- The likelihood ratio can be multiplied by the pre-test odds to determine the post-test odds of the condition (see example below)
- Mathematically, the likelihood ratios are calculated from the sensitivity and specificity of the test:
- Positive likelihood ratio = sensitivity / (1 - specificity)
- Negative likelihood ratio = (1 - sensitivity) / specificity
- Example
- Test A is a screening test for heart disease that has a negative likelihood ratio of 0.1 and a positive likelihood ratio of 1.9 (equivalent to a sensitivity of 95% and a specificity of 50%)
- Paul is a 60-year-old male who, based on his family history, age, and medical conditions, has a 25% probability of heart disease
- Convert 25% probability to odds = 25/75 = 0.33
- If Paul has a negative test A, his post-test odds of heart disease are 0.1 X 0.33 = 0.033
- If Paul has a positive test A, his post-test odds of heart disease are 1.9 X 0.33 = 0.627
- Converting back to probability:
- Negative test = 0.033/1.033 = 0.032 or 3.2% probability
- Positive test = 0.627/1.627 = 0.385 or 38.5% probability
- In Paul's case, a positive test increases his probability of heart disease from 25% to 38.5%, while a negative one decreases it from 25% to 3.2%. A negative test is more informative than a positive one because the test has a high sensitivity (low false negatives) and low specificity (high false positives).
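- The conversions above can be scripted. The sketch below reproduces Paul's numbers with exact (unrounded) arithmetic, so the positive post-test probability comes out 38.8% rather than the rounded 38.5% in the text:

```python
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

pre_test = 0.25            # Paul's pre-test probability of heart disease
lr_pos, lr_neg = 1.9, 0.1  # likelihood ratios for test A

for label, lr in [("positive", lr_pos), ("negative", lr_neg)]:
    post = odds_to_prob(prob_to_odds(pre_test) * lr)
    print(f"{label} test: post-test probability = {post:.1%}")
```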
NET RECLASSIFICATION INDEX/IMPROVEMENT (NRI)
- Overview
- The net reclassification improvement/index (NRI) is a measure used to evaluate whether adding a factor to a predictive model enhances the accuracy of the model. For example, researchers want to see if the Framingham risk model for heart disease is improved by adding coronary artery calcium scores (CACS). To do so, they compare the model's accuracy with and without CACS and calculate the NRI.
- The NRI is calculated in the following manner:
- Subjects who had the outcome are referred to as "events," and those who didn't are called "nonevents"
- 1. Run the model with and without the factor of interest
- 2. Take the number of events who were reclassified into a higher risk category when the factor was included and subtract the number of events who were reclassified into a lower risk category. Divide this number by the total number of events.
- 3. Take the number of nonevents who were reclassified into a lower-risk category when the new factor was added and subtract the number of nonevents who were reclassified into a higher risk category. Divide this number by the total number of nonevents.
- 4. Add the two proportions together, and you have the NRI
- The formula for the NRI is as follows:
- NRI = [(events reclassified higher - events reclassified lower) / total events] + [(nonevents reclassified lower - nonevents reclassified higher) / total nonevents]
- Example
- The Framingham risk model uses age, gender, total and HDL cholesterol, smoking status, and systolic blood pressure to predict a person's 10-year risk of heart attack. Researchers want to know if adding coronary artery calcium scores (CACS) to the model improves its accuracy, so they run the model with and without CACS on a cohort of 5878 patients and find the following:
- 209 subjects have a heart attack (events)
- 5669 subjects do not have a heart attack (nonevents)
- 71 events are reclassified as higher-risk
- 24 events are reclassified as lower-risk
- 790 nonevents are reclassified as lower-risk
- 657 nonevents are reclassified as higher-risk
- The NRI is calculated as follows:
- NRI = (71 - 24)/209 + (790 - 657)/5669 = 0.225 + 0.023 = 0.248, or about 25%
- In this example, the NRI of adding CACS to the model is 25%. It's important to note that this does not mean the addition of CACS correctly reclassifies 25% of the population. The addition of CACS correctly reclassified 22.5% of events (0.225 X 209 = 47) and 2.3% of nonevents (0.023 X 5669 = 130). The proportion of the total population correctly reclassified is 3% (177/5878 = .03). Whether 3% is clinically meaningful depends on the cost and convenience of the additional factor.
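- The same calculation as a short sketch:

```python
events, nonevents = 209, 5669
events_up, events_down = 71, 24          # events reclassified higher / lower
nonevents_down, nonevents_up = 790, 657  # nonevents reclassified lower / higher

nri_events = (events_up - events_down) / events              # 0.225
nri_nonevents = (nonevents_down - nonevents_up) / nonevents  # 0.023
print(f"NRI = {nri_events:.3f} + {nri_nonevents:.3f} "
      f"= {nri_events + nri_nonevents:.3f}")
```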
NUMBER NEEDED TO TREAT OR HARM (NNT/NNH)
- Overview
- The number needed to treat (NNT) is the number of people that must receive an intervention for one person to benefit, while the number needed to harm (NNH) is the number that must be exposed for one person to experience an adverse event.
- The NNT is calculated by taking the reciprocal of the absolute risk reduction, and the NNH is the reciprocal of the absolute increase in adverse events
- Example
- In the SELECT trial, Wegovy was compared to placebo for the secondary prevention of CVD events in overweight nondiabetics. Wegovy reduced the absolute risk of the composite outcome (MI, stroke, CVD death) by 1.5%, giving a NNT of 67 (1/0.015 = 67). In the same study, more patients discontinued Wegovy than placebo due to adverse events (16.6% vs 8.2%, diff 8.4%), yielding a NNH of 12 (1/0.084 = 12).
ODDS RATIO
- Overview
- Odds are the ratio of the number of times an event will occur to the number of times it won't, typically expressed as events:nonevents
- Example
- The odds of rolling a 2 on a 6-sided die are 1:5 because there are six possible outcomes, one event and five nonevents. 1:5 odds can be converted to a probability by dividing 1 by 6, which yields 16.6% (1/6 = 0.166). Notice that the calculation is not 1/5 because odds are not a fraction, which is a common misinterpretation.
- Formulas for converting between odds and probabilities are as follows:
- Probability = odds / (1 + odds)
- Odds = probability / (1 - probability)
- Odds ratio
- An odds ratio is a ratio of two odds, while a relative risk (hazard ratio) is a ratio of two fractions. Because of this, the two measures do not always approximate each other. As the prevalence of an outcome rises above 10%, the odds ratio and the relative risk start to diverge. This occurs because odds above 1:1 (0.50 probability) can run to infinity, whereas a fraction of a whole is always between 0 and 1.
- When the incidence of the outcome is < 10%, the odds ratio and the relative risk can be considered equivalent. When the incidence of the outcome is > 10% and the odds ratio is > 2.5 or < 0.5, the odds ratio will tend to exaggerate the magnitude of the association. In some circumstances, the odds ratio can be converted to a relative risk (see below).
- Odds ratio in medical studies
- Case-control studies
- Odds ratios are always reported in case-control studies because relative risk cannot be measured, i.e., the case group has 100% probability of the outcome, and the control group has 0%.
- Cohort studies
- Logistic regression, which is often used in cohort studies to control for covariates, yields an odds ratio. In some cases, the ratio can be converted to relative risk using the formula below. [2]
- Converting an odds ratio to relative risk
- If a cohort study uses logistic regression and reports an odds ratio, it can be converted to relative risk if the incidence of the outcome in the control group (P0) is known:
- Relative risk = odds ratio / [(1 - P0) + (P0 x odds ratio)]
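- A minimal sketch of the conversion (the Zhang-Yu approximation), with an illustrative odds ratio and control-group incidence:

```python
def or_to_rr(odds_ratio, p0):
    """Approximate relative risk from an odds ratio, given the incidence of
    the outcome in the control (unexposed) group, p0."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

# With a 20% control-group incidence, an odds ratio of 3.0 corresponds to a
# relative risk of about 2.1 -- the odds ratio exaggerates the association.
print(round(or_to_rr(3.0, 0.20), 2))
```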
P-VALUE
- Overview
- The p-value is a way of expressing statistical significance that has the following meaning: "If there really is no difference between two groups being compared, the p-value is the probability that the observed difference (or one greater) occurred by chance."
- Example
- Researchers want to evaluate a new drug for DVT prevention, so they randomize 1000 people to the drug and 1000 to placebo and compare the incidence of DVT over 1 year
- At the end of the year, the drug group had 20 fewer DVTs than the placebo group. Researchers compare the groups using statistical techniques and report a p-value of 0.02. This means if the new drug really is no better than placebo, there is a 2% probability that the difference observed (or one greater) occurred by random chance.
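- The definition can be made concrete with a simulation under the null (a minimal sketch; the 5% baseline DVT rate is an assumption for illustration, so the result only roughly matches the 0.02 in the example):
```python
import random

random.seed(42)

def dvt_difference(rate: float = 0.05, n: int = 1000) -> int:
    """Simulate one trial in which the drug truly does nothing: both groups
    share the same DVT rate. Returns placebo events minus drug events."""
    drug = sum(random.random() < rate for _ in range(n))
    placebo = sum(random.random() < rate for _ in range(n))
    return placebo - drug

runs = 5_000
as_extreme = sum(dvt_difference() >= 20 for _ in range(runs))
print(f"P(chance difference >= 20 under the null) ~ {as_extreme / runs:.3f}")  # ~0.02
```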
- Statistical significance
- Most medical studies use a p-value of < 0.05 to determine statistical significance. While researchers like to declare results significant or nonsignificant based on the 0.05 cutoff, it's better to consider the value and its meaning as opposed to absolutes. After all, a p-value of 0.049 is considered significant, while a value of 0.050 is not, meaning a difference of 0.001 can make or break a finding. Because of this, some clinicians feel the following categories are more reasonable:
p-value | Significance
---|---
< 0.001 | Very highly significant
0.001 to < 0.01 | Highly significant
0.01 to < 0.05 | Significant
0.05 to 0.10 | Trend toward significance
> 0.10 | Nonsignificant
POSITIVE PREDICTIVE VALUE (PPV) AND NEGATIVE PREDICTIVE VALUE (NPV)
- Overview
- PPV and NPV are measures of test accuracy that have the following meanings:
- Positive predictive value (PPV) - the proportion of patients with a positive test who have the disease
- Negative predictive value (NPV) - the proportion of patients with a negative test who do not have the disease
- PPV and NPV combine the prevalence of the condition in the population being studied with test sensitivity and specificity to provide a more meaningful description of the test results (see example below). Calculations for PPV and NPV are as follows: PPV = true positives / (true positives + false positives), and NPV = true negatives / (true negatives + false negatives)
- Example
- Assume a colon cancer screening test has a sensitivity of 100% and a specificity of 97%. This test would be positive in everyone with colon cancer and 3% of those without (3% false-positive). If someone has a positive test, they may assume they likely have colon cancer because the test is only wrong 3% of the time. This assumption is false.
- To understand the real meaning of a positive test, the prevalence of colon cancer in the screened population must be considered. Assume the prevalence is 0.5%, meaning there are 50 cases of colon cancer for every 10,000 people. Our test will be positive in the 50 people with colon cancer (100% sensitivity) and in 299 of those without it (97% specificity, 0.03 X 9950 ≈ 299). The proportion of people with a positive test who have colon cancer (PPV) is only 14% (50/349 = 0.143).
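- A minimal sketch of this worked example (inputs as given above, in a population of 10,000):
```python
def ppv(prevalence: float, sensitivity: float, specificity: float, n: int = 10_000) -> float:
    cases = prevalence * n                       # 50 people with colon cancer
    true_pos = sensitivity * cases               # all 50 test positive (100% sensitivity)
    false_pos = (1 - specificity) * (n - cases)  # 0.03 X 9950 ~ 299 false positives
    return true_pos / (true_pos + false_pos)

print(f"PPV: {ppv(0.005, 1.00, 0.97):.1%}")  # ~14.3%: most positive tests are false positives
```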
SENSITIVITY AND SPECIFICITY
- Overview
- Sensitivity and specificity are two measures used to describe the accuracy of a test in identifying a condition
- Sensitivity
- Sensitivity is the percentage of people with the condition who get a positive result. If a test has high sensitivity, almost everyone with the disease will have a positive test. Cologuard has a sensitivity of 92%, which means only 8% of people with colon cancer will get a negative result (false negatives). High sensitivity tests are most useful when negative because it means the disease is very unlikely. The significance of a positive result depends on the specificity.
- Specificity
- Specificity is the percentage of people without the disease who get a negative result. Cologuard has a specificity of 90%, which means only 10% of people without colon cancer will get a positive result (false positives). A test with high specificity is most useful when positive because it makes the disease likely. The illustration at the link below helps to explain the relationship.
- Sensitivity, specificity, and predictive value
- It's important to understand how to apply sensitivity and specificity because it can be easy to misinterpret their meanings. For example, assume an ovarian cancer screening test has a sensitivity of 100% and a specificity of 97%. This test would be positive in everyone with ovarian cancer and 3% of those without (3% false-positive). If someone has a positive test, they may assume they likely have ovarian cancer because the test is only wrong 3% of the time. This assumption is false.
- To understand the real meaning of a positive test, the prevalence of ovarian cancer in the screened population must be considered. Assume the prevalence is 0.5%, meaning there are 50 cases of ovarian cancer for every 10,000 people. Our test will be positive in the 50 people with ovarian cancer (100% sensitivity) and in 299 of those without it (97% specificity, 0.03 X 9950 ≈ 299). The proportion of people with a positive test who have ovarian cancer is 14% (50/349 = 0.143), which is a measure called the positive predictive value.
HAWTHORNE EFFECT
- Overview
- The Hawthorne effect is a phenomenon where subjects in a study modify their behavior because they know they are being observed. It is particularly relevant in unblinded studies because participants may behave differently depending on their treatment group assignment. It can also inflate the effect size in shorter trials because it tends to wane over time.
- Example one (usual care)
- Researchers have developed an EMR prompt system that provides doctors with evidence-based information on congestive heart failure (CHF) when they enter CHF orders. To test their system, they randomize a group of hospitals to receive the prompts and another group to usual care. The Hawthorne effect may cause the usual-care doctors to order more evidence-based treatments because they know they are being observed, thus decreasing the observed impact of the prompt system.
- Example two (unblinded treatment groups)
- Studies in renal denervation exemplify the power of the Hawthorne effect. Renal denervation is a procedure where the sympathetic nerves innervating the kidneys are ablated, thus decreasing sympathetic tone and, in theory, lowering blood pressure. A study published in 2010 randomized 106 patients with resistant hypertension to renal denervation + continuation of current meds or continuation of current meds only. At 6 months, patients who received renal denervation had office-based blood pressure reductions of 32/12 mmHg from baseline, while the control group had no change. [PMID 21093036] In 2014, a similar study was published, except that patients in the control group received a sham procedure so that all subjects were blinded to their treatment. At 6 months, there was no significant difference in SBP reduction (14 vs 11 mmHg). [PMID 24678939] In both studies, blood pressure meds were continued unchanged during the 6-month follow-up period.
- So how does one explain the large discrepancy in results? The Hawthorne effect is one possible explanation. In the first trial, patients who did not receive the procedure were likely disappointed and lost interest in the study. This may have caused differences in medication compliance, with the treated group being more compliant than the untreated. In the second trial, patients were blinded to their treatment assignment, keeping them more engaged.
IMMORTAL TIME BIAS
- Overview
- Immortal time bias occurs when outcomes from a period of time are excluded in one study arm while being counted in the comparator arm(s). The period during which outcomes aren't counted is considered "immortal time" for the arm with the exclusion. It can happen when study enrollment and treatment initiation are asynchronous.
- Example
- Researchers want to evaluate whether Drug A prevents rehospitalization for heart failure, so they take inpatient heart failure patients and divide them into two cohorts based on whether they were prescribed Drug A at discharge. They follow the two cohorts for one year and compare heart failure hospitalization rates. In the Drug A group, they exclude outcomes from discharge to when the patient filled their prescription because the effects of the drug are absent during this period. This excluded period is "immortal time" for the Drug A group.
- In this example, excluding outcomes for Drug A from discharge to prescription fill biases the study toward Drug A because it selects for patients who are event-free during the immortal time. To reduce bias, all participants should count toward the control group during the immortal time.
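- A simulation makes the mechanism clear (a minimal sketch under the null: Drug A does nothing, every patient has the same daily hazard of rehospitalization, and the 0.3%/day hazard and 0-90-day fill window are invented for illustration):
```python
import random

random.seed(7)

def event_day(hazard: float = 0.003, days: int = 365):
    """Day of first rehospitalization, or None if event-free all year."""
    for day in range(days):
        if random.random() < hazard:
            return day
    return None

n = 20_000
control_events = sum(event_day() is not None for _ in range(n))

drug_events = 0
for _ in range(n):
    fill_day = random.randint(0, 90)  # days from discharge to prescription fill
    day = event_day()
    # Biased analysis: events during the immortal time (before the fill)
    # are not counted against Drug A
    if day is not None and day >= fill_day:
        drug_events += 1

print(f"control: {control_events / n:.1%} rehospitalized")  # ~67%
print(f"Drug A:  {drug_events / n:.1%} rehospitalized")     # ~54%: pure bias, no drug effect
```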
INVESTIGATOR BIAS
- Investigator bias occurs when the researchers desire a particular outcome for their study. It can occur at every level in the research process, including design, data gathering, data analysis, and outcome reporting.
- Double- and triple-blinding help reduce investigator bias during study execution, but researchers can still insert bias when they report their findings. Examples include emphasizing significant secondary outcomes when primary ones are nonsignificant and highlighting alternative outcome analyses (e.g., per-protocol, on-treatment) when intention-to-treat results are unfavorable.
PROTOPATHIC BIAS
- Overview
- Protopathic bias occurs when an intervention is used to treat the first symptoms of an undiagnosed disease, making the intervention appear to be a risk factor for the disease
- Example
- Researchers find a positive association between patients prescribed Depo-Provera and uterine fibroids. They may conclude that Depo-Provera is a risk factor for fibroids. However, the association is likely due to protopathic bias, as Depo-Provera is used to treat excessive vaginal bleeding, a presenting symptom of fibroids.
RECALL BIAS
- Overview
- Recall bias occurs when a person with an outcome is more likely to remember or overestimate an exposure than someone without the outcome. It is a significant concern in case-control studies, where it can bias results toward a false-positive effect.
- Example
- Researchers conduct a case-control study examining the association between over-the-counter NSAID use in pregnancy and birth defects. Women who have children with birth defects (cases) and those who have children without (controls) are asked about their over-the-counter NSAID use during pregnancy. The risk of recall bias in this study is high, as women who have children with birth defects may be more interested and willing to thoroughly contemplate their NSAID use during pregnancy than those without. This can make the exposure seem higher in the case group when there may be no underlying difference.
REPORTING BIAS
- Definition
- Reporting bias occurs when an event, typically a side effect, is more likely to be reported when it happens with therapy A as opposed to therapy B. This can make it appear as though the event has a stronger association with therapy A when there really may be no underlying difference. Reporting bias is often seen with new therapies because physicians and patients are unfamiliar with their effects.
- Example
- An example is the incretin-based diabetes drugs (GLP-1 analogs and DPP-4 inhibitors), which were introduced around 2005. After their release, postmarketing reports of pancreatitis and pancreatic cancer in GLP-treated patients started coming into the FDA, prompting the agency to issue a warning about a possible association. After years of collecting and reviewing data, the FDA published a joint paper with its European counterpart in 2013 that concluded there was no definitive link between the drugs and pancreatitis or pancreatic cancer.
- There are two possible sources of reporting bias here. First, doctors were probably more inclined to report pancreatic events in GLP-treated patients than in those receiving older, more familiar therapies like metformin, where no association was suspected. Second, the FDA warning raised awareness, increasing the likelihood that events would be identified and reported. Either way, warnings based on spontaneous reporting, like those in the FDA's Adverse Event Reporting System, should always be considered speculative, as they are susceptible to reporting bias and are often disproved when better data becomes available.
SAMPLING/ASCERTAINMENT BIAS
- Overview
- Sampling bias (also called "ascertainment bias") occurs when the procedure for sampling a population inadvertently leads to the over- or under-representation of a subgroup of the population
- Example
- Researchers want to study a new drug for chronic back pain, so they enroll patients presenting to pain management clinics with chronic back pain. Pain management clinics see a subgroup of patients who often have complicated histories that might include opioid-seeking behavior. If the new drug does not have opioid properties, it may not perform well simply because a large portion of the population is seeking opioids. Conversely, if the new drug is an opioid, it may perform better than expected because it is from the desired drug class. In either case, sampling bias has occurred because a subpopulation of patients with chronic back pain (opioid-seekers) is overrepresented in the sample
SELECTION BIAS
- Overview
- Selection bias occurs when an intervention or treatment is more likely to be given to one person over another because they differ in some way. The underlying difference between the individuals may not be obvious or even measurable. Selection bias is a major weakness of observational studies because patients are not randomized.
- Example
- Researchers want to evaluate the effects of Drug A, so they perform an observational study comparing two cohorts: one that received Drug A and one that did not. Drug A requires blood work every 3 months and is only covered through certain premium insurance plans. After following the cohorts for several years, investigators find that patients receiving Drug A had better outcomes.
- Selection bias is likely in this study, as Drug A's required lab monitoring may have caused doctors to prescribe it to more compliant, healthier patients. Furthermore, only patients with select insurance plans could obtain Drug A, which may select for patients in a higher socioeconomic class.
SURVEILLANCE BIAS
- Overview
- Surveillance bias occurs when a proposed risk factor for a condition is independently associated with medical testing (e.g., labs, X-rays, screening). The prevalence of the condition may appear to be greater in patients with the risk factor when in reality, differences in medical testing (surveillance) are the reason for the discrepancy.
- Example
- A cohort study published in the Annals of Rheumatic Diseases, where researchers looked for an association between polymyalgia rheumatica (PMR) and cancer, provides the perfect example of surveillance bias. [PMID 23842460] Using a database, researchers formed two cohorts: 2877 PMR patients and 9942 matched controls without PMR. They followed the cohorts for a median of 7.8 years and compared the incidence of cancer between them. At the end of the study, PMR patients were more likely to be diagnosed with cancer than controls, so the authors concluded that PMR might increase the risk of cancer.
- This study runs a high risk of surveillance bias because it's likely that PMR patients had more medical testing than controls. In fact, the authors found an interaction between time and cancer diagnosis that showed PMR patients were significantly more likely to be diagnosed with cancer within 6 months of their PMR diagnosis but not later. Cancers that were more prevalent in PMR patients included hematologic and urinary tract neoplasms, conditions that cause abnormalities on routine blood and urine testing. Most telling, though, is prostate cancer, which was almost 4 times more likely in PMR patients. There is no reason to check a PSA or do a rectal exam for PMR symptoms, suggesting that these patients simply received more screening because of their increased interactions with providers.
- In this example, PMR is the proposed risk factor, cancer is the condition, and surveillance bias exists because PMR is independently associated with more medical testing.
CONFOUNDING
- Definitions
- When discussing confounding, it's important to define several terms:
- Covariate - a covariate is an independent patient variable that may or may not influence the subject’s susceptibility to an outcome. Some covariates are easily measured (e.g. age, sex, race, medical history), while others are not (e.g. patient attitudes and personality traits, compliance). When designing a study, researchers must identify covariates they think are relevant and do their best to distribute them equally between study arms.
- Confounder - a confounder is a covariate that is associated with both the outcome and the exposure. A confounder may either increase or decrease the likelihood of the outcome.
- The figure below illustrates their relationship

- Example one
- Researchers want to see if vitamin D supplements prevent melanoma. Using a medical database, they identify two cohorts of patients: one with people who take a daily vitamin D supplement and one without. They follow the two cohorts for ten years and compare the incidence of melanoma between them.
- In our example, we have the following:
- Outcome - melanoma
- Exposure - daily vitamin D supplement
- There are a number of covariates that should be considered, but we are going to focus on two: Race and Sun exposure
- Using our illustration, we have the following:

- Covariate - Race
- Race is associated with melanoma risk, as light-skinned races have a much higher risk than dark-skinned races
- Race is not associated with taking a daily vitamin D supplement
- Race is a covariate but not a confounder
- The two cohorts were formed based on daily vitamin D use. If they are large enough, the racial composition of each cohort will be similar, and race will not influence the outcome. If the cohorts differ significantly in racial makeup, the analysis should be adjusted for race.
- Confounder - Sun exposure
- Sun exposure is associated with melanoma risk
- Sun exposure is also associated with taking daily vitamin D supplements. Possible reasons include the following: (1) people who take vitamin D supplements are more health-conscious, and therefore, they also avoid excessive sun exposure; (2) people who wear sunscreen and avoid the sun may have read that they are not making enough vitamin D in their skin, so they take supplements; (3) people who avoid the sun may have lower vitamin D levels, causing their doctor to recommend a supplement.
- Since sun exposure is associated with melanoma risk and taking daily vitamin D, it is a confounder
- When the data is analyzed, it shows that vitamin D users have a lower risk of melanoma. This may lead researchers to conclude that vitamin D supplementation lowers melanoma risk when, in fact, reduced sun exposure among users is the real reason for the lower risk.
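- A simulation shows how this plays out (a minimal sketch assuming vitamin D truly has no effect on melanoma; all probabilities are invented for illustration):
```python
import random
from collections import defaultdict

random.seed(1)

tallies = defaultdict(lambda: [0, 0])  # (vit_d, high_sun) -> [melanomas, total]

for _ in range(200_000):
    high_sun = random.random() < 0.5
    vit_d = random.random() < (0.1 if high_sun else 0.4)       # sun-avoiders take more vitamin D
    melanoma = random.random() < (0.03 if high_sun else 0.01)  # risk driven entirely by sun
    tallies[(vit_d, high_sun)][0] += melanoma
    tallies[(vit_d, high_sun)][1] += 1

def risk(vit_d: bool, sun_strata) -> float:
    events = sum(tallies[(vit_d, s)][0] for s in sun_strata)
    total = sum(tallies[(vit_d, s)][1] for s in sun_strata)
    return events / total

# Crude comparison: vitamin D users appear protected (confounded)
print(f"crude: users {risk(True, (True, False)):.2%} vs nonusers {risk(False, (True, False)):.2%}")
# Within each sun-exposure stratum, the 'effect' vanishes
for sun in (True, False):
    print(f"high_sun={sun}: users {risk(True, (sun,)):.2%} vs nonusers {risk(False, (sun,)):.2%}")
```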
- Example two
- Researchers want to evaluate if liraglutide, a GLP-1 analog used to treat diabetes, is associated with pancreatitis, so they form two cohorts: one with liraglutide-treated diabetics and one without. They follow the cohorts for 5 years and find that pancreatitis is more common in the liraglutide group.
- Confounding by indication is possible in this study because obesity is a risk factor for pancreatitis, and liraglutide causes weight loss, which means it may have been prescribed to more overweight diabetics. Obesity is independently associated with the outcome (pancreatitis) and the treatment (liraglutide), making it a confounder.

RESIDUAL CONFOUNDING
- Overview
- Residual confounding can occur when subjects are grouped or stratified by a covariate (e.g., age, BMI, blood pressure). If subjects in each group don't have similar risks for the outcome, the association between the group and the outcome can become skewed. Residual confounding is a particular concern when a covariate and an outcome have a nonlinear relationship (e.g., J-shaped or bimodal)
- Example
- Researchers want to evaluate the association between BMI and mortality, so they form two cohorts: one with a BMI < 27 and one with a BMI ≥ 27. After observing the two cohorts for 10 years, they find no difference in mortality and declare that body weight is not associated with death.
- Residual confounding is present in this study because subjects were not stratified appropriately. BMI has been shown to have a J-shaped relationship with mortality, with very-low and very-high BMI patients having a higher mortality rate than mid-range patients.
- Because the authors only formed two groups, they erroneously concluded that BMI is unrelated to mortality. If they had stratified patients into narrower ranges (e.g., < 20, 20-24.9, 25-29.9, 30-34.9, ≥ 35) so that subjects in each group shared similar risk, they would have come to the correct conclusion.
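- A minimal sketch of this example (the stratum-specific mortality rates are invented, chosen so the two crude groups land on roughly the same overall rate):
```python
import random

random.seed(3)

def ten_year_mortality(bmi: float) -> float:
    """Assumed J-shaped risk: highest at the extremes, lowest mid-range."""
    if bmi < 20: return 0.40
    if bmi < 25: return 0.08
    if bmi < 30: return 0.10
    if bmi < 35: return 0.15
    return 0.28

cohort = [(bmi, random.random() < ten_year_mortality(bmi))
          for bmi in (random.uniform(15, 45) for _ in range(200_000))]

def death_rate(lo: float, hi: float) -> float:
    group = [died for bmi, died in cohort if lo <= bmi < hi]
    return sum(group) / len(group)

# Two crude groups: mortality looks nearly identical (the erroneous conclusion)
print(f"BMI < 27 : {death_rate(15, 27):.1%}")
print(f"BMI >= 27: {death_rate(27, 45):.1%}")
# Finer strata reveal the J-shape
for lo, hi in ((15, 20), (20, 25), (25, 30), (30, 35), (35, 45)):
    print(f"BMI {lo}-{hi}: {death_rate(lo, hi):.1%}")
```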
OVERDIAGNOSIS
- Definition
- Overdiagnosis occurs when a condition is discovered, typically through a screening test, that otherwise would not have gone on to cause symptoms or death. It is a particular concern in cancer screening since almost all cancers are treated, and cancer therapies can affect quality of life, meaning patients suffer for no reason.
- Example
- Patients A and B are both 70-year-old men with slow-growing prostate cancers and heart failure. Patient A has a screening PSA test that detects his prostate cancer, leading him to undergo surgery and androgen deprivation therapy that leave him with impotence, incontinence, hot flashes, and gynecomastia. He dies five years later from his heart failure. Patient B is never screened for prostate cancer and never knows that he has it. He also dies five years later from heart failure.
- Patient A was overdiagnosed because his prostate cancer would not have caused symptoms or death
- Measuring overdiagnosis
- Overdiagnosis is best measured in randomized trials where one group is screened, and the other is not. Consider a trial comparing breast cancer screening with mammography to none in older women. Patients are randomized to mammography (screened group) or none (controls) over several years (screening phase) and then followed for 10 years (follow-up phase). At the end of the screening phase, the mammography group will have more breast cancer diagnoses because screening tests advance the time of cancer diagnosis.
- If overdiagnosis is not present, the number of breast cancers diagnosed in the control group will catch up to the number in the screened group during the follow-up phase
- If overdiagnosis is present, significantly more breast cancers will be diagnosed in the screening group than the control group
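- With invented counts, the catch-up logic looks like this (a minimal sketch; the numbers are purely illustrative):
```python
# Cumulative breast cancer diagnoses in each arm (invented counts)
screened_at_end_of_screening, control_at_end_of_screening = 600, 400  # screening advances diagnosis
screened_at_end_of_followup, control_at_end_of_followup = 650, 580    # controls only partially catch up

excess = screened_at_end_of_followup - control_at_end_of_followup
print(f"excess diagnoses in the screened arm: {excess}")                                # 70 overdiagnosed cancers
print(f"share of screened-arm diagnoses: {excess / screened_at_end_of_followup:.0%}")   # ~11%
```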
- Issues with overdiagnosis
- Measuring overdiagnosis in a trial can be difficult because it requires long follow-up periods to accurately predict cancer incidences in each group, as slow-growing cancers can take many years to manifest
- At the time of diagnosis, it is difficult to predict if the cancer will affect the patient before they die of something else
- Overdiagnosis assumes that undiagnosed cancers do not contribute to the patient's death or morbidity [3]
- Bias with overdiagnosis
- Overdiagnosis can lead to bias in survival statistics. If a screening test detects a significant number of people who never die from the condition (overdiagnosis), it will inflate the survival advantage of the screening test.
- The link below shows an illustration from Wikipedia that demonstrates this phenomenon
- As detailed in the illustration, the screening test detected many patients who did not die from the disease (pseudodisease), which increased the number of affected patients in the screening group and inflated the 10-year survival.
- Screening tests can also inflate survival statistics when they diagnose a condition earlier. For example, assume a cancer has a screening test but no effective treatment. Patients who are screened may have it diagnosed sooner than patients who are not. The screened group will have a longer "survival time" simply because they were diagnosed earlier when, in reality, screening had no real effect on cancer mortality.
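- This second mechanism (often called lead-time bias) is easy to simulate (a minimal sketch; the cancer is assumed untreatable, so screening moves diagnosis earlier without changing when anyone dies, and all time scales are invented):
```python
import random

random.seed(9)

n = 10_000
for screened in (False, True):
    total = 0.0
    for _ in range(n):
        clinical_dx = random.uniform(3, 6)           # years until symptoms would appear
        death = clinical_dx + random.uniform(1, 3)   # death is fixed by the biology
        dx = clinical_dx - 2 if screened else clinical_dx  # screening finds it 2 years earlier
        total += death - dx
    print(f"screened={screened}: mean 'survival after diagnosis' = {total / n:.1f} years")

# Unscreened: ~2 years; screened: ~4 years. Measured survival doubles even
# though screening changed nothing about mortality.
```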
IMPORTANT POINTS ABOUT DRUG INTERACTIONS
- Drug interactions are challenging
- Information on drug interactions can be difficult to assimilate
- Certain drug interactions and metabolic pathways are well-defined while others are not
- Factors that can make drug interactions challenging
- New drugs
- During drug development, the FDA requires interaction testing with medications that are known to have significant effects on CYP enzymes and transporters. Obviously, there is no way to test a new drug with every medication available, meaning most drugs come to market with incomplete interaction profiles. After the medication is prescribed to a large number of people, other drug interactions are inevitably discovered.
- Research
- Drug research often occurs in a laboratory setting (in vitro) with animals and cell cultures. Findings from these experiments do not always reflect what happens in the human body (in vivo).
- Evolving information
- Drug metabolism is an evolving field, and researchers are just beginning to understand all the different ways the body eliminates medications. Cytochrome P450 enzymes have been studied for decades, but cell transport systems (e.g. p-glycoprotein, OAT) are a relatively new area of pharmacology, and their role in drug elimination has not been completely elucidated.
- Important points
- Not all drug interactions are known or can be predicted
- Good information on possible drug interactions may not be available
- Not all drug interactions are clinically significant, and patients should consult their healthcare provider if they are concerned about a possible interaction
- Drug interaction checkers provide the most efficient and practical way to check for interactions among multiple medications. A free interaction checker is available from Drugs.com (see Drugs.com interactions checker).
BIBLIOGRAPHY
- 1 - Fundamentals of Biostatistics 7th ed.
- 2 - PMID 9832001 - What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes, JAMA (1998)
- 3 - PMID 20413742 - Overdiagnosis in cancer, J Natl Cancer Inst (2010)
- 4 - PMID 10355372
- 5 - PMID 26501539 - PSM JAMA
- 6 - PMID 17909367 - PS inverse weight
- 7 - PMID 22401135 - Regression dilution bias
- 8 - PMID 29164242 - Mendelian Randomization, JAMA (2017)
- 9 - PMID 11602018 - The paired availability design for historical controls, BMC Med Res Methodol (2001)
- 10 - PMID 30676631 - Using the E-Value to Assess the Potential Effect of Unmeasured Confounding in Observational Studies, JAMA (2019)
- 11 - PMID 31046064 - Using Instrumental Variables to Address Bias From Unobserved Confounders, JAMA (2019)
- 12 - PMID 34032845 - Adjusting for Nonadherence or Stopping Treatments in Randomized Clinical Trials, JAMA (2021)
- 13 - Chaimani A, Caldwell DM, Li T, Higgins JPT, Salanti G. Chapter 11: Undertaking network meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane, 2022. Available from https://training.cochrane.org/handbook/current/chapter-11
- 14 - PMID 8841652 - An index for assessing blindness in a multi-centre clinical trial: disulfiram for alcohol cessation--a VA cooperative study, Stat Med (1996)
- 15 - PMID 15020033 - Assessment of blinding in clinical trials, Control Clin Trials (2004)
- 16 - Blinding Assessment Indexes for Randomized, Controlled, Clinical Trials, Marc Schwartz et al, (2022)
- 17 - PMID 25662947 - The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting, BMJ (2015)