## Disappearing Truths or Vanishing Illusions?

Jonah Lehrer, in his provocative article in The New Yorker last month (December 13th), drew together an impressive array of scientists, researchers and expert commentators remarking on a puzzling phenomenon. Many empirically established scientific truths apparently become less certain over time. What initially appear to be striking or even robust research findings seem to become diluted by failures to replicate them, dwindling effect-sizes even in successful replications, and/or the appearance of outright counter-examples. I won’t repeat much of Lehrer’s material here; I commend his article for its clarity and scope, even if, according to some, it may overstate the case.

Instead, I’d like to add what I hope is some helpful commentary and additional candidate explanations. Lehrer makes the following general claim:

“Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws. But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain.”

His article surveys three manifestations of the dwindling-truth phenomenon:

- Failure to replicate an initially striking study
- Diminishing effect-sizes (e.g., dwindling effectiveness of therapeutic drugs)
- Diminishing effect-sizes over time in a single experiment or study

Two explanations are then advanced to account for these manifestations. The first is publication bias, and the second is regression to the mean.

Publication bias refers to a long-standing problem in research that uses statistical tests to determine whether, for instance, an experimental medical treatment group’s results differ from those of a no-treatment control group. A statistically significant difference usually is taken to mean that the experimental treatment “worked,” whereas a non-significant difference often is interpreted (mistakenly) to mean that the treatment had no effect. There is evidence in a variety of disciplines that significant findings are more likely to be accepted for publication than non-significant ones, whence “publication bias.”

As Lehrer points out, a simple way to detect publication bias is the “funnel plot,” which is a graph of the size of an experimental effect plotted against a function of the sample size of the study (typically the reciprocal of the standard error, or sometimes the reciprocal of the variance). As we move down the vertical axis, toward smaller samples, the effect-sizes should spread out symmetrically around their weighted mean if there is no publication bias. Deviations away from the mean will tend to be larger with smaller samples, but their direction does not depend on sample size. If there’s publication bias, the graph will be asymmetrical.
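To make the ingredients of a funnel plot concrete, here is a minimal sketch of how the plotted points and the weighted mean might be computed. The studies are made up for illustration, the function names are my own, and the standard error uses the rough approximation sqrt(2 / n) for a standardized mean difference with n cases per group.

```python
import math

def funnel_points(studies):
    """Map (effect_size, n_per_group) pairs to (effect_size, precision) points.

    Precision is 1 / SE, with SE approximated as sqrt(2 / n) for a
    standardized mean difference with n cases per group (an assumption).
    """
    return [(d, 1.0 / math.sqrt(2.0 / n)) for d, n in studies]

def weighted_mean_effect(studies):
    """Inverse-variance weighted mean effect; weight = 1 / SE^2 = n / 2."""
    weights = [n / 2.0 for _, n in studies]
    total = sum(w * d for w, (d, _) in zip(weights, studies))
    return total / sum(weights)

# Hypothetical studies: (effect size d, cases per group).
studies = [(1.4, 8), (0.3, 8), (1.0, 50), (0.8, 50), (0.9, 200)]

for d, precision in funnel_points(studies):
    print(f"d = {d:.2f}, precision = {precision:.2f}")
print(f"weighted mean d = {weighted_mean_effect(studies):.3f}")
```

Plotting effect size on the horizontal axis and precision on the vertical axis gives the funnel: low-precision studies sit at the bottom and scatter widely, and they barely move the weighted mean.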

Here’s an example that I compiled several years ago for teaching purposes from Bond and Smith’s (1996) collection of 97 American studies of conformity using the Asch paradigm. The horizontal axis is the effect-size (Cohen’s d), which is a measure of how many standard deviations the experimental group mean differed from the control group mean. The vertical bar corresponds to the weighted average Cohen’s d (.916—an unweighted average gives 1.00). There’s little doubt that the conformity effect exists.
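For readers who have not met Cohen’s d, a minimal sketch of the usual pooled-standard-deviation version follows. This is the textbook definition, not necessarily the exact computation Bond and Smith used, and the scores are invented.

```python
import math
import statistics

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    v1 = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Invented scores: the group means differ by 1 and the pooled SD is 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

A d of 1.0 therefore means the experimental group mean sits one pooled standard deviation above the control group mean.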

Nevertheless, there is some evidence of publication bias, as can be seen in the lower part of the graph where six studies with small sample sizes are skewed over to the right because their effect-sizes are the biggest of the lot. Their presence suggests that there should exist a matching handful of studies skewed over to the left, perhaps with effects in the opposite direction to the conformity hypothesis. It’s fairly likely that there are some unpublished conformity studies that failed to achieve significant results or, more worryingly, came out with “negative” results. Publication bias also suggests that the average effect-size is being slightly over-estimated because of the biased collection of studies.
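A small simulation shows how a significance filter can produce this kind of asymmetry and over-estimation. It is only a sketch under simplifying assumptions of my own: every study estimates the same true effect (d = 0.5), the estimate is drawn from a normal distribution with standard error sqrt(2 / n), and a study is "published" only if it is significant in the predicted direction.

```python
import math
import random
import statistics

def published_effects(true_d, n_per_group, n_studies, rng):
    """Return the observed effects of simulated studies that pass the filter."""
    se = math.sqrt(2.0 / n_per_group)  # approximate standard error of d
    published = []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)  # observed effect in one study
        if d_hat / se > 1.96:          # significant in the predicted direction
            published.append(d_hat)
    return published

rng = random.Random(1)
for n in (10, 40, 200):
    effects = published_effects(0.5, n, 5000, rng)
    print(f"n = {n:3d}: {len(effects)} of 5000 published, "
          f"mean published d = {statistics.mean(effects):.2f}")
```

In this setup the small-sample studies that survive the filter are exactly the ones that overshot the true effect, while the large-sample studies are nearly all published and average close to the truth.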

How could any of this account for a decline in effect-sizes over time? Well, here’s another twist to publication bias that wasn’t mentioned in Lehrer’s article and may have been overlooked by his sources as well. Studies further in the past tend to have been based on smaller samples than more recent studies. This is due, among other things, to increased computing power, ease and speed of gathering data, and the sheer numbers of active researchers working in teams.

Bigger samples have greater statistical power to detect smaller effects than smaller samples do. Given large enough samples, even trivially tiny effects become statistically significant. Tiny though they may be, significant effects are publishable findings. Ceteris paribus, larger-sample studies are less susceptible to publication bias arising from the failure to obtain significant findings. They’ll identify the same large effects their predecessors did (if the effects really are large) but they’ll also identify smaller effects that were missed by the earlier smaller-sample studies. Therefore, it is entirely likely that more recent studies, on average with larger sample sizes, will show on average smaller effects.
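The arithmetic behind this can be sketched in a few lines. Assuming a two-group comparison with n cases per group, a two-sided 5% test, and the rough standard error sqrt(2 / n) for Cohen’s d (all assumptions of mine, not figures from any particular study), the smallest observed effect that reaches significance shrinks with the square root of the sample size.

```python
import math

def smallest_significant_d(n_per_group, z_crit=1.96):
    """Smallest |d| that is significant at the given critical z value.

    Uses the approximation SE(d) ~ sqrt(2 / n) for n cases per group.
    """
    return z_crit * math.sqrt(2.0 / n_per_group)

for n in (12, 50, 200, 1000):
    print(f"n per group = {n:4d}: |d| must exceed "
          f"{smallest_significant_d(n):.2f}")
```

With 12 cases per group only effects of about 0.8 or more can be significant, whereas with 1,000 per group an effect under 0.1 will do, so the pool of publishable findings grows to include much smaller effects.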

Let’s put this to an impromptu test with my compilation from Bond and Smith. The next two graphs show cumulative averages of sample size and effect size over time. These suggest some support for my hypothesis. Average sample sizes did increase from the mid-50’s to the mid-60’s, after which they levelled out. This increase corresponds to a decrease in (unweighted) average effect-size over the same period. But then there’s a hiccup, a jump upwards from 1965 to 1966. The reason for this is publication in 1966 of two studies with the largest effect-sizes (and small samples, N = 12 in each). It then takes several more years for the effect of those two studies to be washed out. Three of the other “deviant” top six effect-size studies were published in 1956-7 and the remaining one in 1970.
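For anyone who wants to try the same thing on their own compilation, a cumulative average is easy to compute. The sketch below uses an invented sequence of effect-sizes in publication order (not the Bond and Smith data) with two large values in the middle, to show how a couple of outliers produce a jump that takes a while to wash out.

```python
from itertools import accumulate

def cumulative_means(values):
    """Running mean of the values seen so far, in the order given."""
    running_totals = accumulate(values)
    return [total / count
            for count, total in enumerate(running_totals, start=1)]

# Invented effect-sizes in publication order; the fifth and sixth are outliers.
effects = [1.2, 1.0, 0.9, 0.8, 2.5, 2.4, 0.8, 0.7, 0.8, 0.7]
print([round(m, 2) for m in cumulative_means(effects)])
```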

Now what about regression to the mean? Briefly, consider instructors or anyone else who rewards good performance and punishes poor performance. They will observe that the highest performers on one occasion (say, an exam) generally do not do as well on the second (despite having been rewarded) whereas the poorest performers generally do not do as badly on the second occasion (which the instructor may erroneously attribute to their having been punished). The upshot? Punishment appears to be more effective than reward.

However, these effects are not attributable to punishment being more effective than reward (indeed, a vast literature on behaviour modification techniques indicates the converse is true!). They are simply due to the fact that students’ performances on successive exams are not perfectly correlated, even in the same subject. Some of the good (poor) performers on the first exam had good (bad) luck. Next time around they regress back toward the class average, where they belong.
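Regression to the mean is easy to reproduce in a simulation with no rewards or punishments at all. In this sketch each exam score is a stable ability plus independent luck; the class mean of 70 and the standard deviations of 10 are arbitrary choices of mine.

```python
import random
import statistics

def simulate_exams(n_students, rng, ability_sd=10.0, luck_sd=10.0):
    """Return (exam1, exam2) scores; each is stable ability plus fresh luck."""
    pairs = []
    for _ in range(n_students):
        ability = rng.gauss(70.0, ability_sd)
        exam1 = ability + rng.gauss(0.0, luck_sd)
        exam2 = ability + rng.gauss(0.0, luck_sd)
        pairs.append((exam1, exam2))
    return pairs

def group_means(pairs):
    """Mean exam1 score and mean exam2 score for a group of students."""
    return (statistics.mean([p[0] for p in pairs]),
            statistics.mean([p[1] for p in pairs]))

rng = random.Random(2)
pairs = sorted(simulate_exams(2000, rng))  # sorted by exam1 score
for label, group in (("bottom 10%", pairs[:200]), ("top 10%", pairs[-200:])):
    first, second = group_means(group)
    print(f"{label}: exam 1 mean = {first:.1f}, exam 2 mean = {second:.1f}")
```

The top group drops and the bottom group rises on the second exam purely because the luck component does not repeat.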

Note that in order for regression to the mean to contribute to an apparent dwindling-truth phenomenon, it has to operate in conjunction with publication bias. For example, it could account for the declining trend in my effect-size graph after the 1965-6 hiccup. Nevertheless, considerations of regression to the mean for explaining the dwindling-truth phenomena probably haven’t gone quite far enough.

Seemingly the most puzzling manifestation of the dwindling-truth phenomenon is the diminution of an effect over the course of several studies performed by the same investigators or even during a single study, as in Jonathan Schooler’s attempts to replicate one of J.B. Rhine’s precognition experiments. Schooler claims to have seen numerous data-sets where regression to the mean does not account for the decline effect. If true, this can’t be accounted for by significant-effect-only publication bias. Some other selectivity bias must be operating. My chief suspect lies in the decision a scientist makes about whether or not to replicate a study, or whether to keep on gathering more data.

Would a scientist bother conducting a second experiment to replicate an initial one that *failed to find any effect*? Probably not. Would Schooler have bothered gathering as much data if his initial results hadn’t shown any precognition effect? Perhaps not. Or he may not have persisted as long as he did if his results were not positive in the short term. There are principled (Bayesian) methods for deciding when one has gathered “enough” data but most scientists don’t use them. Instead, we usually fix a sample size target in advance or use our own judgment in deciding when to stop. Having done one study, we’re more likely to attempt to replicate it if we’ve found the effect we were looking for or if we’ve discovered some new effect. This is a particular manifestation of what is known in the psychological literature as confirmation bias.

Why is this important? Let’s consider scientific replications generally. Would researchers be motivated to replicate a first-ever study that failed to find a significant effect for a new therapeutic drug? Probably not. Instead, it’s the study that shouts “Here’s a new wonder drug!” that begs to be replicated. After all, checking whether a therapeutic effect can be replicated fits squarely in the spirit of scientific skepticism and impartiality. Or does it? True impartiality would require also replicating “dud” studies such as a clinical trial of a candidate HIV vaccine that has failed to find evidence of its effectiveness.

In short, we have *positive-finding replication bias*: It is likely that scientists suffer from a bias in favor of replicating only those studies with statistically significant findings. This is just like rewarding only the top performers on an exam and then paying attention only to their subsequent performance. It invites dwindling effects and failures of replication due to regression to the mean. If scientists replicated only those studies that had null findings, then we would see regression to the mean in the opposite direction, i.e., the emergence of effects where none had previously been found.
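Both directions of this argument can be checked with a simulation. The sketch below rests on assumptions of my own: a modest true effect (d = 0.3), 20 cases per group, an approximate standard error of sqrt(2 / n), and an unbiased replication of every original study, so that we can compare what replications show for the originally significant studies versus the originally null ones.

```python
import math
import random
import statistics

def original_and_replication(true_d, n_per_group, n_studies, rng):
    """Simulate study pairs, split by whether the original was significant."""
    se = math.sqrt(2.0 / n_per_group)  # approximate standard error of d
    significant, null = [], []
    for _ in range(n_studies):
        original = rng.gauss(true_d, se)
        replication = rng.gauss(true_d, se)  # independent, same true effect
        # Significant here means significant in the predicted direction.
        bucket = significant if original / se > 1.96 else null
        bucket.append((original, replication))
    return significant, null

def mean_pair(pairs):
    """Mean original effect and mean replication effect for a set of pairs."""
    return (statistics.mean([o for o, _ in pairs]),
            statistics.mean([r for _, r in pairs]))

rng = random.Random(4)
significant, null = original_and_replication(0.3, 20, 4000, rng)
for label, group in (("significant originals", significant),
                     ("null originals", null)):
    orig_mean, rep_mean = mean_pair(group)
    print(f"{label}: original d = {orig_mean:.2f}, "
          f"replication d = {rep_mean:.2f}")
```

Replicating only the significant originals yields an apparent decline, while replicating only the null originals yields an apparent emergence of the effect, even though the true effect never changes.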

In Schooler’s situation there’s a related regression-to-the-mean pitfall awaiting even the researcher who gathers data with no preconceptions. Suppose we gather a large sample of data, randomly split it in half, and discover some statistically significant patterns or effects in the first half. Aha! New findings! But some of these are real and others are due to chance. Now we turn to the second half of our data and test for the same patterns *on the second half only*. Some will still be there but others won’t be, and on average the strength of our findings will be lessened. Statisticians refer to this as *model inflation*. What we’ve omitted to do is search for patterns in the second half of the data that we didn’t discover in the first half. Model inflation will happen time and time again to scientists when they discover new effects or patterns, despite the fact that their initial conclusions appear statistically solid.
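Here is a minimal sketch of that split-half scenario. The setup is hypothetical: 2,000 candidate patterns, of which about 20% carry a real effect (d = 0.4) and the rest are pure noise, with each half of the data giving an independent estimate whose standard error is approximated by sqrt(2 / n) for 50 cases per group.

```python
import math
import random
import statistics

def split_half_estimates(n_patterns, share_real, real_d, n_per_group_half, rng):
    """Return (first-half, second-half) effect estimates and their SE."""
    se = math.sqrt(2.0 / n_per_group_half)
    results = []
    for _ in range(n_patterns):
        true_d = real_d if rng.random() < share_real else 0.0
        first = rng.gauss(true_d, se)   # estimate from the first half
        second = rng.gauss(true_d, se)  # independent estimate, second half
        results.append((first, second))
    return results, se

rng = random.Random(3)
results, se = split_half_estimates(2000, 0.2, 0.4, 50, rng)

# "Discover" patterns in the first half, then re-test only those.
discovered = [(a, b) for a, b in results if a / se > 1.96]
mean_first = statistics.mean([a for a, _ in discovered])
mean_second = statistics.mean([b for _, b in discovered])
still_significant = sum(1 for _, b in discovered if b / se > 1.96)

print(f"{len(discovered)} patterns discovered in the first half")
print(f"mean effect, first half:  {mean_first:.2f}")
print(f"mean effect, second half: {mean_second:.2f}")
print(f"{still_significant} remain significant in the second half")
```

The second-half estimates of the discovered patterns are smaller on average because the discoveries were selected, in part, for having been lucky in the first half.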

Thus, we have regression to the mean working hand-in-glove with positive-finding replication bias, significance-only publication bias, and increased sample sizes over time, all militating in the same direction. Perhaps it’s no wonder that we’re observing dwindling-truth phenomena that appear to defy the law of averages.

Are there any remedies? To begin with, understanding the ways in which human biases and regression to the mean act in concert gives us the wherewithal to face the problem head-on. Publication bias and model inflation are well-known but often overlooked (especially the latter). I’ve added two speculative but plausible conjectures here (the larger sample-size over time and replication-bias effects) that merit further investigation. It remains to be seen whether the mysterious dwindling-truth phenomena can be accounted for by these four factors. I suspect there may be other causes yet to be detected.

Several remedies for publication bias have been suggested (e.g., Ioannidis 2005), including larger sample sizes, enhanced data-registration protocols, and more attention to false-positive versus false-negative rates for research designs. Matters then may hinge on whether scientific communities can muster the political will to provide incentives for counteracting our all-too-human publication and replication biases.
