Observational Studies and Experimental Studies

Research studies can be divided into two types: observational studies and experimental studies. Observational studies simply observe the effect of a variable in a population. They can assess the strength of a relationship, for instance between dietary factors and disease. Are vegetarians less likely to develop cancer? Are patients treated with a new diabetes drug less likely to die of heart attacks than patients treated with older drugs? 

Two types of observational studies are cohort and case-control studies. In a cohort study, a set of people with defined characteristics (for instance, patients who have been diagnosed with diabetes) is observed over a period of time to determine whether a variable (such as treatment with a new drug vs. an older drug) is associated with better outcomes (such as fewer deaths or hospitalizations). Cohort studies can be prospective or retrospective. Retrospective cohort studies are generally less reliable because they depend on existing records or patient reports of past events, which may be incomplete or distorted by recall bias and faulty memory.

In a case-control study, individuals who have a disease (the cases) are matched to individuals who don’t have the disease (the controls) but otherwise appear to be identical, and the two groups are compared to see whether their past exposures differ. There’s a problem: they may not be as identical as they first appear to be. There is always a risk that confounding factors may have been missed. When matching individuals, no one would think to ask them what brand of spaghetti sauce they’ve been buying, but what if more people in the disease group have switched to a new brand that just happens to contain an herb that is effective in combatting the disease in question? As Donald Rumsfeld pointed out, there are known knowns and known unknowns, but the difficult ones are the unknown unknowns. If we don’t know we don’t know about some confounding factor, there’s no way we can control for it.

Correlation is not causation

Observational studies are useful for finding correlations. They may find that the more meat people eat, the more likely they are to die from a heart attack. But we can’t stress often enough that correlation is not causation. If A is correlated with B, that doesn’t tell us that A causes B. Maybe B causes A. Maybe both A and B are caused by another factor C. Maybe it’s not a true correlation but a false finding due to errors in the way the study was designed or carried out. Maybe the apparent correlation is a meaningless coincidence. That happens a lot.
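The “both A and B are caused by another factor C” case is easy to demonstrate with a toy simulation. What follows is only a sketch with made-up numbers: the variables, the sample size, and the noise levels are all assumptions chosen for illustration, not data from any study. A and B never influence each other, yet they come out clearly correlated until C is accounted for.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_confounded(n=5000, seed=1):
    """A and B each equal C plus independent noise; A never influences B."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]
    a = [ci + rng.gauss(0, 1) for ci in c]
    b = [ci + rng.gauss(0, 1) for ci in c]
    return a, b, c

a, b, c = simulate_confounded()
r_ab = pearson_r(a, b)

# Subtract C's contribution: what is left of A and B is independent noise.
a_resid = [ai - ci for ai, ci in zip(a, c)]
b_resid = [bi - ci for bi, ci in zip(b, c)]
r_resid = pearson_r(a_resid, b_resid)

print(f"r(A, B) = {r_ab:.2f}")
print(f"r(A, B) after removing C = {r_resid:.2f}")
```

In a real study, of course, nobody hands us C. The correction only works here because the simulation knows exactly what the confounder is.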

There is a whole website dedicated to spurious correlations, a hilarious compilation by Tyler Vigen. The website shows graphs and calculates correlation percentages for an astounding 30,000 examples. If you’ve never been there, it’s well worth a look. The divorce rate in Maine is correlated with the per capita consumption of margarine. The number of suicides by hanging, strangulation and suffocation is correlated with the number of lawyers in North Carolina. A widely cited example is the almost perfect correlation between diagnoses of autism and sales of organic foods (r=0.9971, p<0.0001). I don’t think anyone imagines that organic foods cause autism or that autism causes increases in organic food sales. The correlation is obviously a meaningless coincidence. And yet…

Many science journalists are not well versed in science and critical thinking. They persist in reporting correlations from observational studies as evidence of causation. The media are full of alarmist headlines like “study shows eating X causes disease Y” and “if you want to avoid disease Y, you should immediately stop eating X.” In my humble opinion, journalists who mis-report correlation as causation should be fired from their jobs and required to get remedial education.

The Bradford Hill criteria for causation are a list of nine principles that can help establish whether a correlation represents causation. They are:

  1. Strength (effect size)
  2. Consistency (reproducibility)
  3. Specificity
  4. Temporality (effect occurs after cause)
  5. Biological gradient (dose-response relationship)
  6. Plausibility
  7. Coherence between different kinds of evidence
  8. Experimental confirmation
  9. Analogy with other associations

Some authors add a 10th principle: Reversibility (if the cause is deleted, the effect should disappear).

Experimental studies

So observational studies can produce suggestive correlations but can’t establish causation. For that, we need the other kind of study: experimental studies. In an experimental study, the researchers introduce an intervention and study the effects. If eating X is correlated with Y, does changing the amount of X in the diet result in different outcomes? The gold standard experimental study is the randomized controlled trial (RCT), preferably double blind.
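Why does randomization earn that gold-standard status? A large part of the answer is that a coin flip cannot know about a confounder any more than we can, so hidden traits tend to end up evenly split between the groups. The sketch below illustrates this with invented numbers: the 30% trait prevalence and the 70%/30% self-selection rates are assumptions for the demonstration, and the function name is hypothetical.

```python
import random

def hidden_trait_rates(n=20000, seed=2):
    """Prevalence of a hidden trait in each group, under self-selection
    and under random assignment."""
    rng = random.Random(seed)
    has_trait = [rng.random() < 0.3 for _ in range(n)]

    # Self-selection: people with the trait choose the exposure more often.
    chose = [rng.random() < (0.7 if t else 0.3) for t in has_trait]
    # Randomization: a coin flip decides, regardless of the trait.
    assigned = [rng.random() < 0.5 for _ in range(n)]

    def rate(flags, group):
        members = [t for t, f in zip(has_trait, flags) if f == group]
        return sum(members) / len(members)

    return {
        "self_selected": (rate(chose, True), rate(chose, False)),
        "randomized": (rate(assigned, True), rate(assigned, False)),
    }

rates = hidden_trait_rates()
for design, (exposed, unexposed) in rates.items():
    print(f"{design}: trait prevalence {exposed:.2f} vs {unexposed:.2f}")
```

In the self-selected groups the hidden trait is badly unbalanced, so any difference in outcomes could be due to the trait rather than the exposure. In the randomized groups the prevalence is nearly the same on both sides, even though the “researchers” never measured it.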

Sometimes an experimental study is neither ethical nor feasible. Consider the question of whether smoking causes lung cancer. The ideal way to answer that question would be to randomize a large number of young people and make half of them start smoking and continue smoking for the rest of their lives while preventing the other half from ever smoking, and we would have to follow them over the many years it takes for lung cancer to develop. That can’t be done. Even if it were ethical, there’s no way to effectively control people’s behavior. Slavery is no longer legal, and even if we could experiment on a captive population of prisoners, there’s no way we could ensure that they wouldn’t find ways to avoid compliance. It’s not as if we could forcibly put lit cigarettes between their lips many times a day and make them inhale. And blinding would be impossible: people who smoke are well aware that they are smoking.

Fortunately, we have not had to resort to experimental studies. There is enough other evidence to have definitively established that smoking causes lung cancer. The evidence is strong, consistent, and biologically plausible. Ecological studies, epidemiologic studies, animal studies, and in vitro studies all agree. Cigarettes are known to contain carcinogens. And we know that when people stop smoking, their risk of lung cancer declines rapidly. The Bradford Hill principles are all amply met.

Problems with randomized controlled trials

Randomized controlled trials are the “gold standard” of research, but they’re not always appropriate. They may miss outcomes that take a long time to develop or that affect only a small minority of people. And they may inspire false confidence. We must remember that just because a study is a double-blind, randomized controlled trial, we can’t assume its conclusions are correct. 

I review a lot of dietary supplements and so-called alternative medicines; sometimes their proponents report positive results from a randomized controlled trial that may appear to be a gold-standard study but isn’t.

One of the biggest pitfalls is when they try to do good science on something that has never been shown to exist. I call this Tooth Fairy Science: you could study how much money the Tooth Fairy leaves to children in impoverished families vs. in well-to-do families; you could tabulate the median amount of money children receive for the first tooth vs. the tenth tooth lost. Your study could have all the trappings of science. Your results could be replicable and statistically significant. But your information is meaningless because there’s no such thing as the Tooth Fairy. You’ve been misinterpreting parental behavior and popular customs and misattributing them as the actions of an imaginary being. Other nonexistent things that I have frequently seen studied are acupoints, acupuncture meridians, craniosacral skull movements and rhythmic fluctuations in the cerebrospinal fluid, Kirlian photography, and the human energy field that therapeutic touch practitioners have deluded themselves into thinking they are detecting and manipulating. There have even been RCTs on homeopathy, which is incredibly silly and not only doesn’t work but couldn’t possibly work as claimed.

Another frequent pitfall is reliance on a faulty research design, often called “pragmatic.” The A + B versus B design usually compares an alternative treatment plus usual care (A + B) to usual care alone (B). If you add anything to the usual care, it is guaranteed that the combination will look better because of expectations, suggestion, the extra attention, and the placebo response. Edzard Ernst has repeatedly criticized the A + B versus B design, for instance in a trial among cancer survivors with chronic musculoskeletal pain, where electroacupuncture plus usual care and auricular acupuncture plus usual care produced greater pain reduction than usual care alone.
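A toy simulation shows why this design can’t lose. In the sketch below, the add-on treatment has no specific effect at all; it only carries the nonspecific effects (expectation, attention, placebo response) that any add-on would. Every number here is an assumption invented for illustration: the size of the usual-care benefit, the size of the nonspecific benefit, the noise, and the arm size.

```python
import random
from statistics import mean

def trial_arm(n, rng, usual_care=2.0, nonspecific=0.0, specific=0.0):
    """Simulated pain reduction (points on a 0-10 scale) for one trial arm."""
    return [usual_care + nonspecific + specific + rng.gauss(0, 1.5)
            for _ in range(n)]

rng = random.Random(3)
n = 2000

usual_only = trial_arm(n, rng)                          # B
addon_plus_usual = trial_arm(n, rng, nonspecific=1.0)   # A + B, with A inert
sham_plus_usual = trial_arm(n, rng, nonspecific=1.0)    # sham + B

# A + B versus B: the inert add-on appears to "work".
diff_open = mean(addon_plus_usual) - mean(usual_only)
# A + B versus sham + B: the apparent benefit disappears.
diff_sham = mean(addon_plus_usual) - mean(sham_plus_usual)

print(f"A + B vs B:        {diff_open:+.2f} points")
print(f"A + B vs sham + B: {diff_sham:+.2f} points")
```

The comparison against usual care alone credits the add-on with the whole nonspecific benefit. Only the comparison against a sham that delivers the same attention and expectations isolates what the treatment itself contributes, which in this simulation is nothing.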

Blinding can be difficult, but RCTs that omit blinding are suspect. In double blind studies, neither the patient nor the provider knows whether the patient got the test treatment or the placebo. In triple blind studies, the people who assess the outcomes are also blinded as to which group the patient was in.

Sometimes it is difficult to find an appropriate placebo that patients can’t distinguish from the real thing. The best way to tell if it’s a good placebo control is what I call an exit poll: after the trial is over, subjects are asked to guess whether they had been in the treatment group or the placebo group. If they can guess better than chance, either the placebo failed to fool them, or the information was somehow leaked to the participants.
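“Better than chance” can be checked with a simple exact binomial test. The following is a minimal sketch, assuming a two-arm trial with equal allocation so that a blind guess is right half the time; the function name and the example counts are hypothetical.

```python
from math import comb

def p_value_better_than_chance(correct, total, chance=0.5):
    """One-sided exact binomial p-value: the probability of at least
    `correct` right guesses out of `total` if every guess were blind."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical exit polls of 100 subjects each.
for correct in (50, 60, 75):
    p = p_value_better_than_chance(correct, 100)
    print(f"{correct}/100 correct guesses: p = {p:.4f}")
```

A small p-value suggests that the subjects were not guessing blindly, so the blinding deserves a closer look before the trial’s results are taken at face value.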

Pitfalls in all research

John Ioannidis famously argued that most published research findings are false. Research is done by human researchers, who are susceptible to human errors. To list just a few of the many things that can go wrong:

  • Technicians may consciously or unconsciously manipulate data to get the results they think their bosses want.
  • Data may be fabricated: no experiment was done; the researcher just made up the data.
  • There may be calculation errors in the math, or the wrong statistical test may have been used.
  • The published protocol may not have been followed correctly.
  • Reagents may not have been properly stored. 
  • Scientific misconduct and fraud; not always detected, but 1000 studies had to be retracted in 2014.
  • Experimenter bias.
  • Poor compliance of subjects.
  • Results are statistically significant but not clinically significant.
  • Test materials may have been contaminated. 
  • The equipment may not have been properly maintained or calibrated.
  • Even when the data are good, the researchers may have drawn the wrong conclusion.
  • Multiple endpoints may not have been corrected for.
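
The last item on that list is worth a quick calculation. If a study tests many endpoints, each at the usual 0.05 threshold, the chance that at least one comes up “significant” by luck alone grows quickly. The sketch below assumes the tests are independent and that no real effect exists; the Bonferroni correction shown is one common (and conservative) remedy, not the only one, and the function names are my own.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests
    when there is no real effect on any endpoint."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k, alpha=0.05):
    """Per-test significance threshold that keeps the overall
    false-positive chance at or below alpha."""
    return alpha / k

for k in (1, 5, 20):
    uncorrected = familywise_error(k)
    corrected = familywise_error(k, alpha=bonferroni_threshold(k))
    print(f"{k:2d} endpoints: {uncorrected:.3f} uncorrected, "
          f"{corrected:.3f} with Bonferroni")
```

With twenty uncorrected endpoints, a spurious “positive” finding is more likely than not, which is why a trial that reports one significant result out of many measured outcomes deserves skepticism.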

Avoiding errors

Peer review isn’t perfect, but it can help spot errors. It is usually a mistake to believe a single study that has not been replicated or corroborated by other research groups. Even good preliminary studies are all too often followed by larger, better studies that reverse the original findings. When multiple studies disagree with each other, a systematic review or meta-analysis can be done to help sort out the truth. But if the studies reviewed are not high-quality, it may be a matter of garbage in/garbage out (GIGO).

Science is the best tool we have for understanding reality, but it’s carried out by fallible humans and results can be false or misinterpreted. Science is difficult and complicated, but don’t despair! Even with all its flaws, it’s still far more reliable than any other way of knowing.

This article was originally published as a Reality Is the Best Medicine column in Skeptical Inquirer.

Dr. Hall is a contributing editor to both Skeptic magazine and the Skeptical Inquirer. She is a weekly contributor to the Science-Based Medicine Blog and is one of its editors. She has also contributed to Quackwatch and to a number of other respected journals and publications. She is the author of Women Aren’t Supposed to Fly: The Memoirs of a Female Flight Surgeon and co-author of the textbook, Consumer Health: A Guide to Intelligent Decisions.
