ignorance and uncertainty

All about unknowns and uncertainties


Scientists on Trial: Risk Communication Becomes Riskier


Back in late May 2011, there were news stories of charges of manslaughter laid against six earthquake experts and a government advisor responsible for evaluating the threat of natural disasters in Italy, on grounds that they allegedly failed to give sufficient warning about the devastating L’Aquila earthquake in 2009. In addition, plaintiffs in a separate civil case are seeking damages in the order of €22.5 million (US$31.6 million). The first hearing of the criminal trial occurred on Tuesday the 20th of September, and the second session is scheduled for October 1st.

According to Judge Giuseppe Romano Gargarella, the defendants gave inexact, incomplete and contradictory information about whether smaller tremors in L’Aquila six months before the 6.3 magnitude quake on 6 April, which killed 308 people, were to be considered warning signs of the quake that eventuated. L’Aquila was largely flattened, and thousands of survivors lived in tent camps or temporary housing for months.

If convicted, the defendants face up to 15 years in jail and almost certainly will suffer career-ending consequences. While manslaughter charges for natural disasters have precedents in Italy, they have previously concerned breaches of building codes in quake-prone areas. Interestingly, no action has yet been taken against the engineers who designed the buildings that collapsed, or government officials responsible for enforcing building code compliance. However, there have been indications of lax building codes and the possibility of local corruption.

The trial has, naturally, outraged scientists and others sympathetic to the plight of the earthquake experts. An open letter by the Istituto Nazionale di Geofisica e Vulcanologia (National Institute of Geophysics and Volcanology) said the allegations were unfounded and amounted to “prosecuting scientists for failing to do something they cannot do yet — predict earthquakes”. The AAAS has presented a similar letter, which can be read here.

In pre-trial statements, the defence lawyers also have argued that it was impossible to predict earthquakes. “As we all know, quakes aren’t predictable,” said Marcello Melandri, defence lawyer for defendant Enzo Boschi, who was president of Italy’s National Institute of Geophysics and Volcanology. The implication is that because quakes cannot be predicted, the accusations that the commission’s scientists and civil protection experts should have warned that a major quake was imminent are baseless.

Unfortunately, the Istituto Nazionale di Geofisica e Vulcanologia, the AAAS, and the defence lawyers were missing the point. It seems that failure to predict quakes is not the substance of the accusations. Instead, it is poor communication of the risks, inappropriate reassurance of the local population and inadequate hazard assessment. Contrary to earlier reports, the prosecution apparently is not claiming the earthquake should have been predicted. Instead, their focus is on the nature of the risk messages and advice issued by the experts to the public.

Examples raised by the prosecution include a memo issued after a commission meeting on 31 March 2009 stating that a major quake was “improbable,” a statement to local media that six months of low-magnitude tremors was not unusual in the highly seismic region and did not mean a major quake would follow, and an apparent discounting of the notion that the public should be worried. Against this, defence lawyer Melandri has been reported saying that the panel “never said, ‘stay calm, there is no risk'”.

It is at this point that the issues become both complex (by their nature) and complicated (by people). Several commentators have pointed out that the scientists are distinguished experts, by way of asserting that they are unlikely to have erred in their judgement of the risks. But they are being accused of communicating “incomplete, imprecise, and contradictory information” to the public. As one of the civil parties to the lawsuit put it, “Either they didn’t know certain things, which is a problem, or they didn’t know how to communicate what they did know, which is also a problem.”

So, the experts’ scientific expertise is not on trial. Instead, it is their expertise in risk communication. As Stephen S. Hall’s excellent essay in Nature points out, regardless of the outcome this trial is likely to make many scientists more reluctant to engage with the public or the media about risk assessments of all kinds. The AAAS letter makes this point too. And regardless of which country you live in, it is unwise to think “Well, that’s Italy for you. It can’t happen here.” It most certainly can and probably will.

Matters are further complicated by the abnormal nature of the commission meeting on the 31st of March at a local government office in L’Aquila. Boschi claims that these proceedings normally are closed whereas this meeting was open to government officials, and he and the other scientists at the meeting did not realize that the officials’ agenda was to calm the public. The commission did not issue its usual formal statement, and the minutes of the meeting were not completed, until after the earthquake had occurred. Instead, two members of the commission, Franco Barberi and Bernardo De Bernardinis, along with the mayor and an official from Abruzzo’s civil-protection department, held a now (in)famous press conference after the meeting where they issued reassuring messages.

De Bernardinis, an expert on floods but not earthquakes, incorrectly stated that the numerous earthquakes of the swarm would lessen the risk of a larger earthquake by releasing stress. He also agreed with a journalist’s suggestion that residents enjoy a glass of wine instead of worrying about an impending quake.

The prosecution also is arguing that the commission should have reminded residents in L’Aquila of the fragility of many older buildings, advised them to make preparations for a quake, and reminded them of what to do in the event of a quake. This amounts to an accusation of a failure to perform a duty of care, a duty that many scientists providing risk assessments may dispute that they bear.

After all, telling the public what they should or should not do is a civil or governmental matter, not a scientific one. Thomas Jordan’s essay in New Scientist brings in this verdict: “I can see no merit in prosecuting public servants who were trying in good faith to protect the public under chaotic circumstances. With hindsight their failure to highlight the hazard may be regrettable, but the inactions of a stressed risk-advisory system can hardly be construed as criminal acts on the part of individual scientists.” As Jordan points out, there is a need to separate the role of science advisors from that of civil decision-makers who must weigh the benefits of protective actions against the costs of false alarms. This would seem to be a key issue that urgently needs to be worked through, given the need for scientific input into decisions about extreme hazards and events, both natural and human-caused.

Scientists generally are not trained in communication or in dealing with the media, and communication about risks is an especially tricky undertaking. I would venture to say that the prosecution, defence, judge, and journalists reporting on the trial will not be experts in risk communication either. The problems in risk communication are well known to psychologists and social scientists specializing in its study, but not to non-specialists. Moreover, these specialists will tell you that solutions to those problems are hard to come by.

For example, Otway and Wynne (1989) observed in a classic paper that an “effective” risk message has to simultaneously reassure by saying the risk is tolerable and panic will not help, and warn by stating what actions need to be taken should an emergency arise. They coined the term “reassurance arousal paradox” to describe this tradeoff (which of course is not a paradox, but a tradeoff). The appropriate balance is difficult to achieve, and is made even more so by the fact that not everyone responds in the same way to the same risk message.

It is also well known that laypeople do not think of risks in the same way as risk experts (for instance, laypeople tend to see “hazard” and “risk” as synonyms), nor do they rate risk severity in line with the product of probability and magnitude of consequence, nor do they understand probability—especially low probabilities. Given all of this, it will be interesting to see how the prosecution attempts to establish that the commission’s risk communications contained “incomplete, imprecise, and contradictory information.”

Scientists who try to communicate risks are aware of some of these issues, but usually (and understandably) uninformed about the psychology of risk perception (see, for instance, my posts here and here on communicating uncertainty about climate science). I’ll close with just one example. A recent International Commission on Earthquake Forecasting (ICEF) report argues that frequently updated hazard probabilities are the best way to communicate risk information to the public. Jordan, chair of the ICEF, recommends that “Seismic weather reports, if you will, should be put out on a daily basis.” Laudable as this prescription is, there are at least three problems with it.

Weather reports with probabilities of rain typically present probabilities close to neither 0 nor 1. Moreover, they usually are anchored on tenths (e.g., .2 or .6, but not precise numbers like .23162 or .62947). Most people have reasonable intuitions about mid-range probabilities such as .2 or .6. But earthquake forecasting deals in very low probabilities, as was the case in the lead-up to the L’Aquila event. Italian seismologists had estimated that the probability of a large earthquake in the next three days had increased from 1 in 200,000, before the earthquake swarm began, to 1 in 1,000 following the two large tremors the day before the quake.

The first problem arises from the small magnitude of these probabilities. Because people are limited in their ability to comprehend and evaluate extreme probabilities, highly unlikely events usually are either ignored or overweighted. The tendency to ignore low-probability events has been cited to account for the well-established phenomenon that homeowners tend to be under-insured against low probability hazards (e.g., earthquake, flood and hurricane damage) in areas prone to those hazards. On the other hand, the tendency to over-weight low-probability events has been used to explain the same people’s propensity to purchase lottery tickets. The point is that low-probability events either excite people out of proportion to their likelihood or fail to excite them altogether.

The second problem is in understanding the increase in risk from 1 in 200,000 to 1 in 1,000. Most people are readily able to comprehend the difference between mid-range probabilities such as an increase in the chance of rain from .2 to .6. However, they may not appreciate the magnitude of the difference between the two low probabilities in our example. For instance, an experimental study with jurors in mock trials found that although DNA evidence is typically expressed in terms of probability (specifically, the probability that the DNA sample could have come from a randomly selected person in the population), jurors were equally likely to convict on the basis of a probability of 1 in 1,000 as a probability of 1 in 1 billion. At the very least, the public would need some training in, and accustoming to, minuscule probabilities.
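The arithmetic behind that increase is worth making concrete. A quick sketch (the figures are the seismologists’ estimates quoted above):

```python
# Probability of a large quake in the next three days, per the
# estimates quoted above.
p_before = 1 / 200_000   # before the earthquake swarm began
p_after = 1 / 1_000      # after the two large tremors

# The relative risk rose roughly 200-fold...
print(round(p_after / p_before))   # 200

# ...yet the absolute probability remained tiny: a large quake was
# still far more likely not to happen than to happen in that window.
print(p_after)                     # 0.001
```

A 200-fold jump sounds alarming; a probability of .001 sounds negligible. Both descriptions are numerically faithful, which is precisely the communicator’s dilemma.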

All this leads us to the third problem. Otway and Wynne’s “reassurance arousal paradox” is exacerbated by risk communications about extremely low-probability hazards, no matter how carefully they are crafted. Recipients of such messages will be highly suggestible, especially when the stakes are high. So, what should the threshold probability be for determining when a “don’t ignore this” message is issued? It can’t be the imbecilic Dick Cheney zero-risk threshold for terrorism threats, but what should it be instead?

Note that this is a matter for policy-makers to decide, not scientists, even though scientific input regarding potential consequences of false alarms and false reassurances should be taken into account. Criminal trials and civil lawsuits punishing the bearers of false reassurances will drive risk communicators to lower their own alarm thresholds, thereby ensuring that they will sound false alarms increasingly often (see my post about making the “wrong” decision most of the time for the “right” reasons).

Risk communication regarding low-probability, high-stakes hazards is one of the most difficult kinds of communication to perform effectively, and most of its problems remain unsolved. The L’Aquila trial probably will have an inhibitory impact on scientists’ willingness to front the media or the public. But it may also stimulate scientists and decision-makers to work together for the resolution of these problems.


The Stapel Case and Data Fabrication


By now it’s all over the net (e.g., here) and international news media: Tilburg University sacked high-profile social psychologist Diederik Stapel, after he was outed as having faked data in his research. Stapel was director of the Tilburg Institute for Behavioral Economics Research, a successful researcher and fundraiser, and as a colleague expressed it, “the poster boy of Dutch social psychology.” He had more than 100 papers published, some in the flagship journals not just of psychology but of science generally (e.g., Science), and won prestigious awards for his research on social cognition and stereotyping.

Tilburg University Rector Philip Eijlander said that Stapel had admitted to using faked data, apparently after Eijlander confronted him with allegations by graduate student research assistants that his research conduct was fraudulent. The story goes that the assistants had identified evidence of data entry by Stapel via copy-and-paste.

Willem Levelt, psycholinguist and former president of the Royal Netherlands Academy of Arts and Sciences, is to lead a panel investigating the extent of the fraud. That extent could be very widespread indeed. In a press conference the Tilburg rector made it clear that Stapel’s fraudulent research practices may have spanned a number of years. All of his papers are now suspect, and the panel will advise on which will have to be retracted. The editors of the journals in which Stapel published are likewise investigating the details of his papers that appeared there. Then there are Stapel’s own students and research collaborators, whose data and careers may have been contaminated by his.

I feel sorry for my social psychological colleagues, who are reeling in shock and dismay. Some of my closest colleagues knew Stapel (one was a fellow graduate student with him), and none of them suspected him. Among those who knew him well and worked with him, Stapel apparently was respected as a researcher and trusted as a man of integrity. They are asking themselves how his cheating could have gone undetected for so long, and how such deeds could be prevented in the future. They fear its impact on public perception of their discipline and trust in scientific researchers generally.

An understandable knee-jerk reaction is to call for stricter regulation of scientific research, and alterations to the training of researchers. Mark Van Vugt and Anjana Ahuja’s blog post exemplifies this reaction, when they essentially accuse social psychologists of being more likely to engage in fraudulent research because some of them use deception of subjects in their experiments:

“…this means that junior social psychologists are being trained to deceive people and this might be a first violation of scientific integrity. It would be good to have a frank discussion about deception in our discipline. It is not being tolerated elsewhere so why should it be in our field.”

They make several recommendations for reform, including the declaration that “… ultimately it is through training our psychology students into doing ethically sound research that we can tackle scientific fraud. This is no easy feat.”

The most obvious problems with Van Vugt’s and Ahuja’s recommendations are, first, that there is no clear connection between using deception in research designs and faking data, and second, that many psychology departments already include research ethics in researcher education and training. Stapel isn’t ignorant of research ethics. But a deeper problem is that none of their recommendations and, thus far, very few of the comments I have seen about this or similar cases, address three of the main considerations in any criminal case: means, opportunity, and motive.

Let me speak to means and opportunity first. Attempts to more strictly regulate the conduct of scientific research are very unlikely to prevent data fakery, for the simple reason that it’s extremely easy to do in a manner that is extraordinarily difficult to detect. Many of us “fake data” on a regular basis when we run simulations. Indeed, simulating from the posterior distribution is part and parcel of Bayesian statistical inference. It would be (and probably has been) child’s play to add fake cases to one’s data by simulating from the posterior and then jittering them randomly to ensure that the false cases look like real data. Or, if you want to fake data from scratch, there is plenty of freely available code for randomly generating multivariate data with user-chosen probability distributions, means, standard deviations, and correlational structure. So, the means and opportunities are on hand for virtually all of us. They are the very same means that underpin a great deal of (honest) research. It is impossible to prevent data fraud by these means through conventional regulatory mechanisms.

Now let us turn to motive. The most obvious and comforting explanations of cheats like psychologists Stapel or Hauser, or plagiarists like statistician Wegman and political scientist Fischer, are those appealing to their personalities. This is the “X cheated because X is psychopathic” explanation. It’s comforting because it lets the rest of us off the hook (“I wouldn’t cheat because I’m not a psychopath”). Unfortunately this kind of explanation is very likely to be wrong. Most of us cheat on something somewhere along the line. Cheating is rife, for example, among undergraduate university students (among whom are our future researchers!), so psychopathy certainly cannot be the deciding factor there. What else could be the motivational culprit? How about the competitive pressures on researchers generated by the contemporary research culture?

Cognitive psychologist E.J. Wagenmakers (as quoted in Andrew Gelman’s thoughtful recent post) is among the few thus far who have addressed possible motivating factors inherent in the present-day research climate. He points out that social psychology has become very competitive, and

“high-impact publications are only possible for results that are really surprising. Unfortunately, most surprising hypotheses are wrong. That is, unless you test them against data you’ve created yourself. There is a slippery slope here though; although very few researchers will go as far as to make up their own data, many will “torture the data until they confess”, and forget to mention that the results were obtained by torture….”

I would add to E.J.’s observations the following points.

First, social psychology journals (and journals for other areas in psychology) exhibit a strong bias towards publishing only studies that have achieved a statistically significant result. This bias is widely believed in by researchers and their students. The obvious temptation arising from this is to ease an inconclusive finding into being conclusive by adding more “favorable” cases or making some of the unfavorable ones more favorable.

Second, and of course readers will recognize one of my hobby-horses here, the addiction in psychology to hypothesis-testing over parameter estimation amounts to an insistence that every study yield a conclusion or decision: Did the null hypothesis get rejected? The obvious remedy for this is to develop a publication climate that does not insist that each and every study be “conclusive,” but instead emphasizes the importance of a cumulative science built on multiple independent studies, careful parameter estimates and multiple tests of candidate theories. This adds an ethical and motivational rationale to calls for a shift to Bayesian methods in psychology.

Third, journal editors and reviewers routinely insist on more than one study per article. On the surface, this looks like what I’ve just asked for, a healthy insistence on independent replication. It isn’t, for two reasons. First, even if the multiple studies are replications, they aren’t independent because they come from the same authors and/or lab. Genuinely independent replicated studies would be published in separate papers written by non-overlapping sets of authors from separate labs. However, genuinely independent replication earns no kudos and therefore is uncommon (not just in psychology, either—other sciences suffer from this problem, including those that used to have a tradition of independent replication).

The second reason is that journal editors don’t merely insist on study replications, they also favor studies that come up with “consistent” rather than “inconsistent” findings (i.e., privileging “successful” replications over “failed” replications). By insisting on multiple studies that reproduce the original findings, journal editors are tempting researchers into corner-cutting or outright fraud in the name of ensuring that that first study’s findings actually get replicated. E.J.’s observation that surprising hypotheses are unlikely to be supported by data goes double (squared, actually) when it comes to replication—Support for a surprising hypothesis may occur once in a while, but it is unlikely to occur twice in a row. Again, remedies are obvious: Develop a publication climate which encourages or even insists on independent replication, that treats well-conducted “failed” replications identically to well-conducted “successful” ones, and which does not privilege “replications” from the same authors or lab of the original study.

None of this is meant to say I fall for cultural determinism—Most researchers face the pressures and motivations described above, but few cheat. So personality factors may also exert an influence, along with circumstances specific to those of us who give in to the temptations of cheating. Nevertheless if we want to prevent more Stapels, we’ll get farther by changing the research culture and its motivational effects than we will by exhorting researchers to be good or lecturing them about ethical principles of which they’re already well aware. And we’ll get much farther than we would in a futile attempt to place the collection and entry of every single datum under surveillance by some Stasi-for-scientists.

Written by michaelsmithson

September 14, 2011 at 9:36 am

Communicating about Uncertainty in Climate Change, Part II


In my previous post I attempted to provide an overview of the IPCC 2007 report’s approach to communicating about uncertainties regarding climate change and its impacts. This time I want to focus on how the report dealt with probabilistic uncertainty. It is this kind of uncertainty that the report treats most systematically. I mentioned in my previous post that Budescu et al.’s (2009) empirical investigation of how laypeople interpret verbal probability expressions (PEs, e.g., “very likely”) in the IPCC report revealed several problematic aspects, and a paper I have co-authored with Budescu’s team (Smithson, et al., 2011) yielded additional insights.

The approach adopted by the IPCC is one that has been used in other contexts, namely identifying probability intervals with verbal PEs. Their guidelines are as follows:
Virtually certain >99%; extremely likely >95%; very likely >90%; likely >66%; more likely than not >50%; about as likely as not 33% to 66%; unlikely <33%; very unlikely <10%; extremely unlikely <5%; exceptionally unlikely <1%.
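For concreteness, the guidelines can be restated as a lookup table. (A sketch; the names below are mine, and the intervals simply transcribe the scale above.)

```python
# IPCC AR4 likelihood scale: each verbal PE with its probability
# interval (lower, upper) on a 0-1 scale. Note the overlaps.
IPCC_SCALE = {
    "virtually certain":      (0.99, 1.00),
    "extremely likely":       (0.95, 1.00),
    "very likely":            (0.90, 1.00),
    "likely":                 (0.66, 1.00),
    "more likely than not":   (0.50, 1.00),
    "about as likely as not": (0.33, 0.66),
    "unlikely":               (0.00, 0.33),
    "very unlikely":          (0.00, 0.10),
    "extremely unlikely":     (0.00, 0.05),
    "exceptionally unlikely": (0.00, 0.01),
}

def consistent_with(pe: str, estimate: float) -> bool:
    """Is a numerical probability estimate inside the PE's interval?"""
    lo, hi = IPCC_SCALE[pe]
    return lo <= estimate <= hi

# The overlaps show up immediately: .95 fits both PEs.
print(consistent_with("likely", 0.95))       # True
print(consistent_with("very likely", 0.95))  # True
```

An estimate of .95 is thus “consistent” with both “likely” and “very likely,” which foreshadows the interpretive trouble discussed below.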

One unusual aspect of these guidelines is their overlapping intervals. For instance, “likely” takes the interval [.66,1] and thus contains the interval [.90,1] for “very likely,” and so on. The only interval that doesn’t overlap with the others is “about as likely as not.” Other interval-to-PE guidelines I am aware of use non-overlapping intervals. An early example is Sherman Kent’s attempt to standardize the meanings of verbal PEs in the American intelligence community.

Attempts to translate verbal PEs into numbers have a long and checkered history. Since the earliest days of probability theory, the legal profession has steadfastly refused to quantify its burdens of proof (“balance of probabilities” or “reasonable doubt”) despite the fact that they seem to explicitly refer to probabilities or at least degrees of belief. Weather forecasters debated the pros and cons of verbal versus numerical PEs for decades, with mixed results. A National Weather Service report on a 1997 survey of Juneau, Alaska residents found that although the rank-ordering of the mean numerical probabilities residents assigned to verbal PEs reasonably agreed with those assumed by the organization, the residents’ probabilities tended to be less extreme than the organization’s assignments. For instance, “likely” had a mean of 62.5%, whereas the organization’s assignment for this PE was 80-100%.

And thus we see a problem arising that has long been noted about individual differences in the interpretation of PEs but largely ignored when it comes to organizations. Since at least the 1960s, empirical studies have demonstrated that people vary widely in the numerical probabilities they associate with a verbal PE such as “likely.” It was this difficulty that doomed Sherman Kent’s attempt at standardization for intelligence analysts. Well, here we have the NWS associating “likely” with 80-100% whereas the IPCC assigns it 66-100%. A failure of organizations and agencies to agree on number-to-PE translations leaves the public with an impossible brief. I’m reminded of the introduction of the now widely-used cyclone (hurricane) category 1-5 scheme (higher numerals meaning more dangerous storms) at a time when zoning for cyclone danger where I was living also had a 1-5 numbering system that went in the opposite direction (higher numerals indicating safer zones).

Another interesting aspect is the frequency of the PEs in the report itself. There are a total of 63 PEs therein. “Likely” occurs 36 times (more than half), and “very likely” 17 times. The remaining 10 occurrences are “very unlikely” (5 times), “virtually certain” (twice), “more likely than not” (twice), and “extremely unlikely” (once). There is a clear bias towards fairly extreme positively-worded PEs, perhaps because much of the IPCC report’s content is oriented towards presenting what is known and largely agreed on about climate change by climate scientists. As we shall see, the bias towards positively-worded PEs (e.g., “likely” rather than “unlikely”) may have served the IPCC well, whether intentionally or not.

In Budescu et al.’s experiment, subjects were assigned to one of four conditions. Subjects in the control group were not given any guidelines for interpreting the PEs, as would be the case for readers unaware of the report’s guidelines. Subjects in a “translation” condition had access to the guidelines given by the IPCC, at any time during the experiment. Finally, subjects in two “verbal-numerical translation” conditions saw a range of numerical values next to each PE in each sentence. One verbal-numerical group was shown the IPCC intervals and the other was shown narrower intervals (with widths of 10% and 5%).

Subjects were asked to provide lower, upper and “best” estimates of the probabilities they associated with each PE. As might be expected, these figures were most likely to be consistent with the IPCC guidelines in the verbal-numerical translation conditions, less likely in the translation condition, and least likely in the control condition. They were also less likely to be IPCC-consistent the more extreme the PE was (e.g., less consistent for “very likely” than for “likely”). Consistency rates were generally low, and for the extremal PEs the deviations from the IPCC guidelines were regressive (i.e., subjects’ estimates were not extreme enough, thereby echoing the 1997 National Weather Service report findings).

One of the ironic claims by the Budescu group is that the IPCC 2007 report’s verbal probability expressions may convey excessive levels of imprecision and that some probabilities may be interpreted as less extreme than intended by the report authors. As I remarked in my earlier post, intervals do not distinguish between consensual imprecision and sharp disagreement. In the IPCC framework, the statement “The probability of event X is between .1 and .9” could mean “All experts regard this probability as being anywhere between .1 and .9” or “Some experts regard the probability as .1 and others as .9.” Budescu et al. realize this, but they also have this to say:

“However, we suspect that the variability in the interpretation of the forecasts exceeds the level of disagreement among the authors in many cases. Consider, for example, the statement that ‘‘average Northern Hemisphere temperatures during the second half of the 20th century were very likely higher than during any other 50-year period in the last 500 years’’ (IPCC, 2007, p. 8). It is hard to believe that the authors had in mind probabilities lower than 70%, yet this is how 25% of our subjects interpreted the term very likely!” (pg. 8).

One thing I’d noticed about the Budescu article was that their graphs suggested the variability in subjects’ estimates for negatively-worded PEs (e.g., “unlikely”) seemed greater than for positively worded PEs (e.g., “likely”). That is, subjects seemed to have less of a consensus about the meaning of the negatively-worded PEs. On reanalyzing their data, I focused on the six sentences that used the PE “very likely” or “very unlikely”. My statistical analyses of subjects’ lower, “best” and upper probability estimates revealed a less regressive mean and less dispersion for positive than for negative wording in all three estimates. Negative wording therefore resulted in more regressive estimates and less consensus regardless of experimental condition. You can see this in the box-plots below.


In this graph, the negative PEs’ estimates have been reverse-scored so that we can compare them directly with the positive PEs’ estimates. The “boxes” (the blue rectangles) contain the middle 50% of subjects’ estimates and these boxes are consistently longer for the negative PEs, regardless of experimental condition. The medians (i.e., the score below which 50% of the estimates fall) are the black dots, and these are fairly similar for positive and (reverse-scored) negative PEs. However, due to the negative PE boxes’ greater lengths, the mean estimates for the negative PEs end up being pulled further away from their positive PE counterparts.
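To spell out the reverse-scoring used in the graph: an estimate p for a negative PE such as “very unlikely” is converted to 1 − p, putting it on the same footing as estimates for “very likely.” A toy illustration with made-up numbers (not the study’s data):

```python
from statistics import median

# Made-up illustrative estimates (not the study's actual data).
very_likely = [0.80, 0.85, 0.90, 0.92, 0.95]      # positive PE
very_unlikely = [0.02, 0.05, 0.10, 0.25, 0.40]    # negative PE, wider spread

# Reverse-score the negative PE: an estimate p for "very unlikely"
# becomes 1 - p, directly comparable to "very likely" estimates.
reversed_unlikely = sorted(round(1 - p, 2) for p in very_unlikely)
print(reversed_unlikely)              # [0.6, 0.75, 0.9, 0.95, 0.98]

# Similar medians, but the negative PE is far more dispersed:
print(median(very_likely), median(reversed_unlikely))             # 0.9 0.9
print(round(max(very_likely) - min(very_likely), 2))              # 0.15
print(round(max(reversed_unlikely) - min(reversed_unlikely), 2))  # 0.38
```

This is the pattern in the box-plots: after reverse-scoring, the medians line up while the negative-PE distributions remain stretched out, pulling their means away from the positive-PE means.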

There’s another effect that we confirmed statistically but also is clear from the box-plots. The difference between the lower and upper estimates is, on average, greater for the negatively-worded PEs. One implication of this finding is that the impact of negative wording is greatest on the lower estimates—And these are the subjects’ translations of the very thresholds specified in the IPCC guidelines.

If anything, these results suggest the picture is even worse than Budescu et al.’s assessment implies. They noted that 25% of the subjects interpreted “very likely” as having a “best” probability below 70%. The boxplots show that in three of the four experimental conditions at least 25% of the subjects provided a lower probability of less than 50% for “very likely”. If we turn to “very unlikely” the picture is worse still. In three of the four experimental conditions about 25% of the subjects returned an upper probability for “very unlikely” greater than 80%!

So, it seems that negatively-worded PEs are best avoided where possible. This recommendation sounds simple, but it could open a can of syntactical worms. Consider the statement “It is very unlikely that the MOC will undergo a large abrupt transition during the 21st century.” Would it be accurate to equate it with “It is very likely that the MOC will not undergo a large abrupt transition during the 21st century”? Perhaps not, despite the IPCC guidelines’ insistence otherwise. Moreover, turning the PE positive entails turning the event into a negative. In principle, we could have a mixture of negatively- and positively-worded PEs and events (“It is (un)likely that A will (not) occur”). It is unclear at this point whether negative PEs or negative events are the more confusing, but inspection of the Budescu et al. data suggested that double-negatives were decidedly more confusing than any other combination.

As I write this, David Budescu is spearheading a multi-national study of laypeople’s interpretations of the IPCC probability expressions (I’ll be coordinating the Australian component). We’ll be able to compare these interpretations across languages and cultures. More anon!


Budescu, D.V., Broomell, S. and Por, H.-H. (2009) Improving the communication of uncertainty in the reports of the Intergovernmental panel on climate change. Psychological Science, 20, 299–308.

Intergovernmental Panel on Climate Change (2007). Summary for policymakers: Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Retrieved May 2010 from http://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-spm.pdf.

Smithson, M., Budescu, D.V., Broomell, S. and Por, H.-H. (2011) Never Say “Not:” Impact of Negative Wording in Probability Phrases on Imprecise Probability Judgments. Accepted for presentation at the Seventh International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, 25-28 July 2011.

Communicating about Uncertainty in Climate Change, Part I


The Intergovernmental Panel on Climate Change (IPCC) guidelines for their 2007 report stipulated how its contributors were to convey uncertainties regarding climate change scientific evidence, conclusions, and predictions. Budescu et al.’s (2009) empirical investigation of how laypeople interpret verbal probability expressions (e.g., “very likely”) in the IPCC report revealed several problematic aspects of those interpretations, and a paper I have co-authored with Budescu’s team (Smithson, et al., 2011) raises additional issues.

Recently the IPCC has amended their guidelines, partly in response to the Budescu paper. Granting a broad consensus among climate scientists that climate change is accelerating and that humans have been a causal factor therein, the issue of how best to represent and communicate uncertainties about climate change science nevertheless remains a live concern. I’ll focus on the issues around probability expressions in a subsequent post, but in this one I want to address the issue of communicating “uncertainty” in a broader sense.

Why does it matter? First, the public needs to know that climate change science actually has uncertainties. Otherwise, they could be misled into believing either that scientists have all the answers or that they suffer from unwarranted dogmatism. Likewise, policy makers, decision makers and planners need to know the magnitudes (where possible) and directions of these uncertainties. Thus, the IPCC is to be commended for bringing uncertainties to the fore in its 2007 report, and for attempting to establish standards for communicating them.

Second, the public needs to know what kinds of uncertainties are in the mix. This concern sits at the foundation of the first and second recommendations of the Budescu paper. Their first suggestion is to differentiate between the ambiguous or vague description of an event and the likelihood of its occurrence. The example the authors give is “It is very unlikely that the meridional overturning circulation will undergo a large abrupt transition during the 21st century” (emphasis added). The first italicized phrase expresses probabilistic uncertainty whereas the second embodies a vague description. People may have different interpretations of both phrases. They might disagree on what range of probabilities is referred to by “very unlikely” or on what is meant by a “large abrupt” change. Somewhat more worryingly, they might agree on how likely the “large abrupt” change is while failing to realize that they have different interpretations of that change in mind.

The crucial point here is that probability and vagueness are distinct kinds of uncertainty (see, e.g., Smithson, 1989). While the IPCC 2007 report is consistently explicit regarding probabilistic expressions, it only intermittently attends to matters of vagueness. For example, in the statement “It is likely that heat waves have become more frequent over most land areas” (IPCC 2007, pg. 30) the term “heat waves” remains undefined and the time-span is unspecified. In contrast, just below that statement is this one: “It is likely that the incidence of extreme high sea level3 has increased at a broad range of sites worldwide since 1975.” Footnote 3 then goes on to clarify “extreme high sea level” by the following: “Excluding tsunamis, which are not due to climate change. Extreme high sea level depends on average sea level and on regional weather systems. It is defined here as the highest 1% of hourly values of observed sea level at a station for a given reference period.”

The Budescu paper’s second recommendation is to specify the sources of uncertainty, such as whether these arise from disagreement among specialists, absence of data, or imprecise data. Distinguishing between uncertainty arising from disagreement and uncertainty arising from an imprecise but consensual assessment is especially important. In my experience, the former often is presented as if it is the latter. An interval for near-term ocean level increases of 0.2 to 0.8 metres might be the consensus among experts, but it could also represent two opposing camps, one estimating 0.2 metres and the other 0.8.

The IPCC report guidelines for reporting uncertainty do raise the issue of agreement: “Where uncertainty is assessed qualitatively, it is characterised by providing a relative sense of the amount and quality of evidence (that is, information from theory, observations or models indicating whether a belief or proposition is true or valid) and the degree of agreement (that is, the level of concurrence in the literature on a particular finding).” (IPCC 2007, pg. 27) The report then states that levels of agreement will be denoted by “high,” “medium,” and so on, while the amount of evidence will be expressed as “much,” “medium,” and so on.

As it turns out, the phrase “high agreement and much evidence” occurs seven times in the report and “high agreement and medium evidence” occurs twice. No other agreement phrases are used. These occurrences are almost entirely in the sections devoted to climate change mitigation and adaptation, as opposed to assessments of previous and future climate change. Typical examples are:
“There is high agreement and much evidence that with current climate change mitigation policies and related sustainable development practices, global GHG emissions will continue to grow over the next few decades.” (IPCC 2007, pg. 44) and
“There is high agreement and much evidence that all stabilisation levels assessed can be achieved by deployment of a portfolio of technologies that are either currently available or expected to be commercialised in coming decades, assuming appropriate and effective incentives are in place for development, acquisition, deployment and diffusion of technologies and addressing related barriers.” (IPCC 2007, pg. 68)

The IPCC guidelines for other kinds of expert assessments do not explicitly refer to disagreement: “Where uncertainty is assessed more quantitatively using expert judgement of the correctness of underlying data, models or analyses, then the following scale of confidence levels is used to express the assessed chance of a finding being correct: very high confidence at least 9 out of 10; high confidence about 8 out of 10; medium confidence about 5 out of 10; low confidence about 2 out of 10; and very low confidence less than 1 out of 10.” (IPCC 2007, pg. 27) A typical statement of this kind is “By 2080, an increase of 5 to 8% of arid and semi-arid land in Africa is projected under a range of climate scenarios (high confidence).” (IPCC 2007, pg. 50)

That said, some parts of the IPCC report do convey disagreeing projections or estimates, where the disagreements are among models and/or scenarios, especially in the section on near-term predictions of climate change and its impacts. For instance, on pg. 47 of the 2007 report the graph below charts mid-century global warming relative to 1980-99. The six stabilization categories are those described in the Fourth Assessment Report (AR4).


Although this graph effectively represents both imprecision and disagreement (or conflict), it slightly underplays both by truncating the scale at the right-hand side. The next figure shows how the graph would appear if the full range of categories V and VI were included. Both the apparent imprecision of V and VI and the extent of disagreement between VI and categories I-III are substantially greater once we have the full picture.


There are understandable motives for concealing or disguising some kinds of uncertainty, especially those that could be used by opponents to bolster their own positions. Chief among these is uncertainty arising from conflict. In a series of experiments Smithson (1999) demonstrated that people regard precise but disagreeing risk messages as more troubling than informatively equivalent imprecise but agreeing messages. Moreover, they regard the message sources as less credible and less trustworthy in the first case than in the second. In short, conflict is a worse kind of uncertainty than ambiguity or vagueness. Smithson (1999) labeled this phenomenon “conflict aversion.” Cabantous (2007) confirmed and extended those results by demonstrating that insurers would charge a higher premium for insurance against mishaps whose risk information was conflictive than if the risk information was merely ambiguous.

Conflict aversion creates a genuine communications dilemma for disagreeing experts. On the one hand, public revelation of their disagreement can result in a loss of credibility or trust in experts on all sides of the dispute. Laypeople have an intuitive heuristic that if the evidence for any hypothesis is uncertain, then equally able experts should have considered the same evidence and agreed that the truth-status of that hypothesis is uncertain. When Peter Collignon, professor of microbiology at The Australian National University, cast doubt on the net benefit of the Australian Fluvax program in 2010, he attracted opprobrium from colleagues and health authorities on grounds that he was undermining public trust in vaccines and the medical expertise behind them. On the other hand, concealing disagreements runs the risk of future public disclosure and an even greater erosion of trust (lying experts are regarded as worse than disagreeing ones). The problem of how to communicate uncertainties arising from disagreement and vagueness simultaneously and distinguishably has yet to be solved.


Budescu, D.V., Broomell, S. and Por, H.-H. (2009) Improving the communication of uncertainty in the reports of the Intergovernmental panel on climate change. Psychological Science, 20, 299–308.

Cabantous, L. (2007). Ambiguity aversion in the field of insurance: Insurers’ attitudes to imprecise and conflicting probability estimates. Theory and Decision, 62, 219–240.

Intergovernmental Panel on Climate Change (2007). Summary for policymakers: Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Retrieved May 2010 from http://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-spm.pdf.

Smithson, M. (1989). Ignorance and Uncertainty: Emerging Paradigms. Cognitive Science Series. New York: Springer Verlag.

Smithson, M. (1999). Conflict Aversion: Preference for Ambiguity vs. Conflict in Sources and Evidence. Organizational Behavior and Human Decision Processes, 79: 179-198.

Smithson, M., Budescu, D.V., Broomell, S. and Por, H.-H. (2011) Never Say “Not:” Impact of Negative Wording in Probability Phrases on Imprecise Probability Judgments. Accepted for presentation at the Seventh International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, 25-28 July 2011.

What are the Functions of Innumeracy?


Recently a colleague asked me for my views on the social and psychological functions of innumeracy. He aptly summarized the heart of the matter:

“I have long-standing research interests in mathematics anxiety and adult numeracy (or, more specifically, innumeracy, including in particular what I term the ‘adult numeracy conundrum’ – that is, that despite decades of investment in programs to raise adult numeracy rates little, if any, measurable improvements have been achieved. This has led me to now consider the social functions performed by this form of ignorance, as its persistence suggests the presence of underlying mechanisms that provide a more valuable pay-off than that offered by well-meaning educators…)”

This is an interesting deviation from the typical educator’s attack on innumeracy. “Innumeracy” apparently was coined by cognitive scientist Douglas Hofstadter but it was popularized by mathematician John Allen Paulos in his 1989 book, Innumeracy: Mathematical Illiteracy and its Consequences. Paulos’ book was a (IMO, deserved) bestseller and has gone through a second edition. Most educators’ attacks on innumeracy do what Paulos did: Elaborate the costs and dysfunctions of innumeracy, and ask what we can blame for it and how it can be overcome.

Paulos’ list of the consequences of innumeracy include:

  1. Inaccurate media reporting and inability of the public to detect such inaccuracies
  2. Financial mismanagement (e.g., of debts), especially regarding the misunderstanding of compound interest
  3. Loss of money on gambling, in particular caused by gambler’s fallacy
  4. Belief in pseudoscience
  5. Distorted assessments of risks
  6. Limited job prospects

These are bad consequences indeed, but mainly for the innumerate. Consequences 2 through 6 also are windfalls for those who exploit the innumerate. Banks, retailers, pyramid selling fraudsters, and many others either legitimately or illicitly take advantage of consequence 2. Casinos, bookies, online gambling agencies, investment salespeople and the like milk the punters of their funds on the strength of consequences 3 and 5. Peddlers of various religions, magical and pseudo-scientific beliefs batten on consequence 4, and of course numerous employers can keep the wages and benefits low for those trapped by consequence 6.

Of course, the fact that all these interests are served doesn’t imply that innumeracy is created and maintained by a vast conspiracy of bankers, retailers, casino owners, and astrologers. They’re just being shrewd and opportunistic. Nevertheless, these benefits do indicate that we should not expect the beneficiaries to be in the vanguard of a campaign to improve, say, public understanding of compound interest or probability.

Now let’s turn to Paulos’ accounts of the “whodunit” part of innumeracy: What creates and maintains it? A chief culprit is, you guessed it, poor mathematical education. My aforementioned colleague and I would agree: For the most part, mathematics is badly taught, especially at primary and secondary school levels. Paulos, commendably, doesn’t beat up the teachers. Instead, he identifies bad curricula and a lack of mathematical education in teacher training as root causes.

On the other hand, he does blame “us,” that is, the innumerate and even the numerate. The innumerate are castigated for demanding personal relevance and an absence of anxiety in their education. According to Paulos, personalizing the universe yields disinterest in (depersonalized?) mathematics and science generally, and an unhealthy gullibility for pseudosciences such as astrology and numerology. He seems to have skated onto thin ice here. He doesn’t present empirical evidence for his main claim, and there are plenty of examples throughout history of numerate or even mathematically sophisticated mystics (the Pythagoreans, for one).

Paulos also accuses a subset of the innumerate of laziness and lack of discipline, but the ignorance of the undisciplined would surely extend beyond innumeracy. If we want instances of apathy that actually sustain innumeracy, let’s focus on public institutions that could militate against it but don’t. There, we shall encounter social and political forces that help perpetuate innumeracy, not via any conspiracy or even direct benefits, but simply by self-reinforcing feedback loops.

As the Complete Review points out, “… the media isn’t much interested in combating innumeracy (think of how many people got fired after all the networks prematurely declared first Gore then Bush the winner in Florida in the 2000 American presidential election – none).” Media moguls and their editors are interested in selling stories, and probably will become interested in getting the numbers right only when the paying public starts objecting to numerical errors in the media. An innumerate public is unlikely to object, so the media and the public stagnate in a suboptimal but mutually reinforcing equilibrium.

Likewise, politicians don’t want a numerate electorate any more than they want a politically sophisticated one, so elected office-holders also are unlikely to lead the charge to combat innumeracy. Michael Moore, a member of the Australian Capital Territory Legislative Assembly for four terms, observes that governments usually avoid clear, measurable goals for which they can be held accountable (pg. 178, in a chapter he contributed to Gabriele Bammer’s and my book on uncertainty). Political uses of numbers are mainly rhetorical or for purposes of control. Again, we have a mutually reinforcing equilibrium: A largely innumerate public elects office-holders who are happy for the public to remain innumerate because that’s partly what got them elected.

I’ve encountered similar feedback-loops in academia, beginning with my experiences as a math graduate doing a PhD in a sociology department. The ideological stances taken by some departments of cultural studies, anthropology, and sociology position education for numeracy as aligned with so-called “positivist” research methods, against which they are opposed. The upshot is that courses with statistical or other numeracy content are devalued and students are discouraged from taking them. A subset of the innumerate graduates forms a succeeding generation of innumerate academics, and on it goes.

Meanwhile, Paulos blames the rest of us for perpetuating romantic stereotypes in which math and science are spoilers of the sublime, and therefore to be abhorred by anyone with artistic or spiritual sensibilities. So, he is simultaneously stereotyping the innumerate and railing against us for indulging another stereotype (No disrespect to Paulos; I’ve been caught doing this kind of thing often enough).

Lee Dembart, then of the Los Angeles Times, observed that “Paulos is very good at explaining all of this, though sometimes with a hectoring, bitter tone, for which he apologizes at the very end.” Unfortunately, hectoring people, focusing attention on their faults, or telling them they need to work harder “for their own good” seldom persuades them. I’ve taught basic statistics to students in the human sciences for many years. Many of these students dread a course in stats. They’re in it only because it’s a required course; telling them it’s for their own good isn’t going to cut any ice with them, and blaming them for finding statistics difficult or off-putting is a sure-fire way of turning them off entirely.

Now that we all have to be here, I propose to them, let’s see how we can make the best of it. I teach them how to lie with or abuse statistics so that they can gain a bit more power to detect when someone is trying to pull the proverbial wool over their eyes. This also opens the way to considering ethical and moral aspects of statistics. Then I try to link the (ab)uses of stats with important issues and debates in psychology. I let them in on some of psychology’s statistical malpractices (and there are plenty), so they can start detecting these for themselves and maybe even become convinced that they could do better. I also try to convey the view that data analysis is not self-automating; it requires human judgment and interpretive work.

Does my approach work? Judging from student evaluations, a fair amount of the time, but by no means always. To be sure, I get kudos for putting on a reasonably accessible, well-organized course and my tutors get very positive evaluations from the students in their tutorials. Nevertheless, there are some who, after the best efforts by me and my tutors, still say they don’t get it and don’t like it. And many of these reluctant students are not poor students; most have put in the work and some have obtained good marks. Part of their problem may well be cognitive style. There is a lot of evidence that it is difficult for the human mind to become intuitively comfortable with probability, so those who like intuitive understanding might find statistics and probability aversive.

It’s also possible that my examples and applications simply aren’t motivating enough for these students. Despite the pessimism I share with my colleague, I think there has been a detectable increase in basic statistical literacy both in the public and the media over the past 30 years. It is mainly due to unavoidably statistical aspects of issues that the public and media both deem important (e.g., medical advances or failures, political polls, environmental threats). Acquiring numeracy requires effort and that, in turn, takes motivation. Thank goodness I don’t have the job of persuading first-year undergraduates to voluntarily sign up for a basic statistics course.

Written by michaelsmithson

March 15, 2011 at 1:14 pm

Disappearing Truths or Vanishing Illusions?


Jonah Lehrer, in his provocative article in the New York Times last month (December 13th), drew together an impressive array of scientists, researchers and expert commentators remarking on a puzzling phenomenon. Many empirically established scientific truths apparently become less certain over time. What initially appear to be striking or even robust research findings seem to become diluted by failures to replicate them, dwindling effect-sizes even in successful replications, and/or the appearance of outright counter-examples. I won’t repeat much of Lehrer’s material here; I commend his article for its clarity and scope, even if it may be over-stating the case according to some.

Instead, I’d like to add what I hope is some helpful commentary and additional candidate explanations. Lehrer makes the following general claim:

“Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws. But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain.”

His article surveys three manifestations of the dwindling-truth phenomenon:

  1. Failure to replicate an initially striking study
  2. Diminishing effect-sizes (e.g., dwindling effectiveness of therapeutic drugs)
  3. Diminishing effect-sizes over time in a single experiment or study

Two explanations are then advanced to account for these manifestations. The first is publication bias, and the second is regression to the mean.

Publication bias refers to a long-standing problem in research that uses statistical tests to determine whether, for instance, an experimental medical treatment group’s results differ from those of a no-treatment control group. A statistically significant difference usually is taken to mean that the experimental treatment “worked,” whereas a non-significant difference often is interpreted (mistakenly) to mean that the treatment had no effect. There is evidence in a variety of disciplines that significant findings are more likely to be accepted for publication than non-significant ones, whence “publication bias.”

As Lehrer points out, a simple way to detect publication bias is the “funnel plot,” which is a graph of the size of an experimental effect plotted against a function of the sample size of the study (i.e., the reciprocal of the effect’s standard error or sometimes the reciprocal of its variance). As we move down the vertical axis, the effect-sizes should spread out symmetrically around their weighted mean if there is no publication bias. Deviations away from the mean will tend to be larger with smaller samples, but their direction does not depend on sample size. If there’s publication bias, the graph will be asymmetrical.
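For readers who like to see the machinery, here is a minimal simulation of how a significance filter distorts the picture. Everything in it is invented: a true effect of d = 0.9, groups of 10 to 200 subjects, and “publication” only of studies reaching two-tailed significance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_d = 0.9  # assumed true standardized effect (invented for illustration)

def one_study(n):
    """Simulate a two-group study with n per group; return Cohen's d and its SE."""
    treat = rng.normal(true_d, 1.0, n)
    ctrl = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d = (treat.mean() - ctrl.mean()) / pooled_sd
    se = np.sqrt(2 / n + d**2 / (4 * n))  # approximate SE of d
    return d, se

studies = [one_study(n) for n in rng.integers(10, 200, size=500)]
ds = np.array([d for d, _ in studies])
ses = np.array([se for _, se in studies])

# Funnel-plot coordinates: effect size on one axis, precision (1/SE) on
# the other. With no publication bias the cloud is symmetric about the
# mean at every level of precision; censoring non-significant results
# lops off the low side of the small-sample (low-precision) studies.
published = np.abs(ds / ses) > 1.96  # keep only "significant" studies
print(f"mean d, all studies:    {ds.mean():.2f}")
print(f"mean d, published only: {ds[published].mean():.2f}")
```

The censored collection over-estimates the average effect, which is exactly the asymmetry a funnel plot is designed to reveal.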

Here’s an example that I compiled several years ago for teaching purposes from Bond and Smith’s (1996) collection of 97 American studies of conformity using the Asch paradigm. The horizontal axis is the effect-size (Cohen’s d), which is a measure of how many standard deviations the experimental group mean differed from the control group mean. The vertical bar corresponds to the weighted average Cohen’s d (.916—an unweighted average gives 1.00). There’s little doubt that the conformity effect exists.


Nevertheless, there is some evidence of publication bias, as can be seen in the lower part of the graph, where six studies with small sample sizes are skewed over to the right because their effect-sizes are the biggest of the lot. Their presence suggests that there should exist another small handful of studies skewed over to the left, perhaps with effects in the opposite direction to the conformity hypothesis. It’s fairly likely that there are some unpublished conformity studies that failed to achieve significant results or, more worryingly, came out with “negative” results. Publication bias also suggests that the average effect-size is being slightly over-estimated because of the biased collection of studies.

How could any of this account for a decline in effect-sizes over time? Well, here’s another twist to publication bias that wasn’t mentioned in Lehrer’s article and may have been overlooked by his sources as well. Studies further in the past tend to have been based on smaller samples than more recent studies. This is due, among other things, to increased computing power, ease and speed of gathering data, and the sheer numbers of active researchers working in teams.

Bigger samples have greater statistical power to detect smaller effects than smaller samples do. Given large enough samples, even trivially tiny effects become statistically significant. Tiny though they may be, significant effects are publishable findings. Ceteris paribus, larger-sample studies are less susceptible to publication bias arising from the failure to obtain significant findings. They’ll identify the same large effects their predecessors did (if the effects really are large) but they’ll also identify smaller effects that were missed by the earlier smaller-sample studies. Therefore, it is entirely likely that more recent studies, on average with larger sample sizes, will show on average smaller effects.
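The arithmetic behind this is simple. For a two-group comparison with n subjects per group, the standard error of Cohen’s d is roughly sqrt(2/n), so the smallest effect that can reach two-tailed significance at p < .05 is about 1.96 times that:

```python
import math

# The smallest Cohen's d that reaches two-tailed p < .05 in a
# two-group comparison, using the large-sample approximation
# d_crit = 1.96 * SE(d), with SE(d) roughly sqrt(2/n) for n per group.
for n in [10, 25, 50, 100, 500]:
    d_crit = 1.96 * math.sqrt(2 / n)
    print(f"n = {n:4d} per group -> smallest significant d = {d_crit:.2f}")
    # e.g. n = 10 gives about 0.88; n = 100 gives about 0.28
```

So an early study with 10 subjects per group could only ever publish effects close to d = 0.9 or larger, while a later study with 100 per group can publish effects roughly a third that size, dragging the published average down.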

Let’s put this to an impromptu test with my compilation from Bond and Smith. The next two graphs show cumulative averages of sample size and effect size over time. These suggest some support for my hypothesis. Average sample sizes did increase from the mid-50’s to the mid-60’s, after which they levelled out. This increase corresponds to a decrease in (unweighted) average effect-size over the same period. But then there’s a hiccup, a jump upwards from 1965 to 1966. The reason for this is publication in 1966 of two studies with the largest effect-sizes (and small samples, N = 12 in each). It then takes several more years for the effect of those two studies to be washed out. Three of the other “deviant” top six effect-size studies were published in 1956-7 and the remaining one in 1970.



Now what about regression to the mean? Briefly, consider instructors or anyone else who rewards good performance and punishes poor performance. They will observe that the highest performers on one occasion (say, an exam) generally do not do as well on the second (despite having been rewarded) whereas the poorest performers generally do not do as badly on the second occasion (which the instructor may erroneously attribute to their having been punished). The upshot? Punishment appears to be more effective than reward.

However, these effects are not attributable to punishment being more effective than reward (indeed, a vast literature on behaviour modification techniques indicates the converse is true!). It is simply due to the fact that students’ performance on exams is not perfectly correlated, even in the same subject. Some of the good (poor) performers on the first exam had good (bad) luck. Next time around they regress back toward the class average, where they belong.
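A quick simulation makes the point. Assume (purely for illustration) that each exam score is a stable “ability” plus independent luck, calibrated so that two exams correlate about .6:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 10_000, 0.6  # students, exam-to-exam correlation (illustrative)

ability = rng.normal(0, 1, n)
noise_sd = np.sqrt(1 / r - 1)  # chosen so corr(exam1, exam2) is about r

exam1 = ability + rng.normal(0, noise_sd, n)
exam2 = ability + rng.normal(0, noise_sd, n)

top = exam1 > np.quantile(exam1, 0.9)     # the "rewarded" students
bottom = exam1 < np.quantile(exam1, 0.1)  # the "punished" students

print(f"top 10% on exam 1:    exam1 mean {exam1[top].mean():.2f}, "
      f"exam2 mean {exam2[top].mean():.2f}")
print(f"bottom 10% on exam 1: exam1 mean {exam1[bottom].mean():.2f}, "
      f"exam2 mean {exam2[bottom].mean():.2f}")
```

The top decile on the first exam scores noticeably lower on the second, and the bottom decile scores higher, even though nobody here was rewarded or punished: it’s pure selection on luck.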

Note that in order for regression to the mean to contribute to an apparent dwindling-truth phenomenon, it has to operate in conjunction with publication bias. For example, it could account for the declining trend in my effect-size graph after the 1965-6 hiccup. Nevertheless, considerations of regression to the mean for explaining the dwindling-truth phenomena probably haven’t gone quite far enough.

Seemingly the most puzzling manifestation of the dwindling-truth phenomenon is the diminution of an effect over the course of several studies performed by the same investigators or even during a single study, as in Jonathan Schooler’s attempts to replicate one of J.B. Rhine’s precognition experiments. Schooler claims to have seen numerous data-sets where regression to the mean does not account for the decline effect. If true, this can’t be accounted for by significant-effect-only publication bias. Some other selectivity bias must be operating. My chief suspect lies in the decision a scientist makes about whether or not to replicate a study, or whether to keep on gathering more data.

Would a scientist bother conducting a second experiment to replicate an initial one that failed to find any effect? Probably not. Would Schooler have bothered gathering as much data if his initial results hadn’t shown any precognition effect? Perhaps not. Or he may not have persisted as long as he did if his results were not positive in the short term. There are principled (Bayesian) methods for deciding when one has gathered “enough” data but most scientists don’t use them. Instead, we usually fix a sample size target in advance or use our own judgment in deciding when to stop. Having done one study, we’re more likely to attempt to replicate it if we’ve found the effect we were looking for or if we’ve discovered some new effect. This is a particular manifestation of what is known in the psychological literature as confirmation bias.

Why is this important? Let’s consider scientific replications generally. Would researchers be motivated to replicate a first-ever study that failed to find a significant effect for a new therapeutic drug? Probably not. Instead, it’s the study that shouts “Here’s a new wonder drug!” that begs to be replicated. After all, checking whether a therapeutic effect can be replicated fits squarely in the spirit of scientific skepticism and impartiality. Or does it? True impartiality would require also replicating “dud” studies such as a clinical trial of a candidate HIV vaccine that has failed to find evidence of its effectiveness.

In short, we have positive-finding replication bias: It is likely that scientists suffer from a bias in favor of replicating only those studies with statistically significant findings. This is just like rewarding only the top performers on an exam and then paying attention only to their subsequent performance. It invites dwindling effects and failures of replication due to regression to the mean. If scientists replicated only those studies that had null findings, then we would see regression to the mean in the opposite direction, i.e., the emergence of effects where none had previously been found.
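This selection effect is easy to demonstrate. In the sketch below (with illustrative numbers of my own choosing), every simulated study estimates the same modest true effect, but only the statistically significant originals attract replications, and the replications regress back toward the true value:

```python
import random

random.seed(3)

# Many labs estimate the same modest true effect; each observed effect
# is the true effect plus sampling noise (illustrative numbers).
true_effect = 0.2
se = 0.15                      # standard error of each study's estimate
studies = [true_effect + random.gauss(0, se) for _ in range(5000)]

# Only studies reaching z > 1.96 attract replication attempts.
significant = [d for d in studies if d / se > 1.96]
replications = [true_effect + random.gauss(0, se) for _ in significant]

mean_sig = sum(significant) / len(significant)
mean_rep = sum(replications) / len(replications)
print(f"mean effect, significant originals: {mean_sig:.3f}")
print(f"mean effect, their replications:    {mean_rep:.3f}")
```

The significant originals overstate the true effect, because chance had to favour them to clear the significance hurdle; their replications, free of that selection, cluster around the true value, so the effect appears to dwindle.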

In Schooler’s situation there’s a related regression-to-the-mean pitfall awaiting even the researcher who gathers data with no preconceptions. Suppose we gather a large sample of data, randomly split it in half, and discover some statistically significant patterns or effects in the first half. Aha! New findings! But some of these are real and others are due to chance. Now we turn to the second half of our data and test for the same patterns there. Some will still be present but others won’t be, and on average the strength of our findings will be lessened. Statisticians refer to this as model inflation. What we’ve omitted to do is search the second half of the data for patterns we didn’t discover in the first half. Model inflation will happen time and time again when scientists discover new effects or patterns, even when their initial conclusions appear statistically solid.
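A quick simulation makes the split-half pitfall concrete (the effect sizes and counts are arbitrary, illustrative choices): mix some real effects with many null ones, “discover” whatever reaches significance in the first half, and re-test only those discoveries in the second half:

```python
import random

random.seed(4)

# 400 candidate effects: 100 real (true size 0.5) and 300 null (illustrative).
true = [0.5] * 100 + [0.0] * 300
se = 0.2  # standard error of each estimate within a half-sample

half1 = [t + random.gauss(0, se) for t in true]
half2 = [t + random.gauss(0, se) for t in true]

# "Discover" whatever looks significant in the first half...
found = [i for i, d in enumerate(half1) if abs(d) / se > 1.96]

# ...then re-test exactly those effects in the second half.
mean_found_1 = sum(abs(half1[i]) for i in found) / len(found)
mean_found_2 = sum(abs(half2[i]) for i in found) / len(found)
print(f"discovered effects, half 1: {mean_found_1:.3f}")
print(f"same effects, half 2:       {mean_found_2:.3f}")
```

The discoveries’ average strength drops in the second half: the chance discoveries collapse toward zero, and even the real ones lose the favourable noise that helped them clear the significance threshold.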

Thus, we have regression to the mean working hand-in-glove with positive-finding replication bias, significance-only publication bias, and increased sample sizes over time, all militating in the same direction. Perhaps it’s no wonder that we’re observing dwindling-truth phenomena that appear to defy the law of averages.

Are there any remedies? To begin with, understanding the ways in which human biases and regression to the mean act in concert gives us the wherewithal to face the problem head-on. Publication bias and model inflation are well known but often overlooked (especially the latter). I’ve added two speculative but plausible conjectures here (the growth of sample sizes over time, and replication bias) that merit further investigation. It remains to be seen whether the mysterious dwindling-truth phenomena can be accounted for by these four factors. I suspect there may be other causes yet to be detected.

Several remedies for publication bias have been suggested (e.g., Ioannidis 2005), including larger sample sizes, enhanced data-registration protocols, and more attention to false-positive versus false-negative rates for research designs. Matters then may hinge on whether scientific communities can muster the political will to provide incentives for counteracting our all-too-human publication and replication biases.

Written by michaelsmithson

January 10, 2011 at 9:03 am

“Negative knowledge”: From Wicked Problems and Rude Surprises to Mathematics


It is one thing to know that we don’t know, but what about knowing that we can never know something? Karin Knorr-Cetina (1999) first used the term negative knowledge to refer to knowledge about the limits of knowledge. This is a type of meta-knowledge, and a special case of known unknowns. Philosophical interest in knowing what we don’t know dates back at least to Socrates, certainly long before Donald Rumsfeld’s prize-winning remark on the subject. Actually, Rumsfeld’s “unknown unknowns” were articulated in print much earlier by philosopher Ann Kerwin, whose 1993 paper appeared along with mine and others in a special issue of the journal Science Communication, an outcome of our symposium on “Ignorance in Science” at the AAAS meeting in Boston earlier that year. My 1989 coinage, meta-ignorance, is synonymous with unknown unknowns.

There are plenty of things we know that we cannot know (e.g., I cannot know my precise weight and height at the moment I write this), but why should negative knowledge be important? There are at least three reasons. First, negative knowledge tells us to put a brake on what would otherwise be a futile wild goose chase after certainty. Second, some things we cannot know we might consider important to know, and negative knowledge humbles us by highlighting our limitations. Third, negative knowledge about important matters may be contestable: we might disagree with others about it.

Let’s begin with the notion that negative knowledge instructs us to cease inquiry. On the face of it, this would seem a good thing: why waste effort and time on a question that you know cannot be answered? Peter Medawar (1967) famously coined the aphorism that science is the “art of the soluble.” A commonsensical inference follows: if a problem is not soluble then it isn’t a scientific problem, and so should be banished from scientific inquiry. Nevertheless, aside from the logical flaw in this inference, over-subscribing to this kind of negative-knowledge characterization of science exacts a steep price.

First, there is what philosopher Jerome Ravetz (in the same journal and symposium as Ann Kerwin’s paper) called ignorance of ignorance. By this phrase Ravetz meant something slightly different from meta-ignorance or unknown unknowns. He observed that conventional scientific training systematically shields students from problems outside the soluble. As a result, they remain unacquainted with those problems, i.e., ignorant about scientific ignorance itself. The same charge could be laid on many professions (e.g., engineering, law, medicine).

Second, by neglecting unsolvable problems scientists exclude themselves from any input into what people end up doing about those problems. Are there problem domains where negative knowledge defines the criteria for inclusion? Yes: wicked problems and rude surprises. The characteristics of wicked problems were identified in the classic 1973 paper by Rittel and Webber, and most of these referred to various kinds of negative knowledge. Thus, the very definition and scope of wicked problems are unresolvable; such problems have no definitive solutions; there are no ultimate tests of whether a solution works; every wicked problem is unique; and there are no opportunities to learn how to deal with them by trial-and-error. Claimants to the title of “wicked problem” include how to craft policy responses to climate change, how to combat terrorism, how to end poverty, and how to end war.

Rude surprises are not always wicked problems but nonetheless are, as Todd La Porte describes them in his 2005 paper, “unexpected, potentially overwhelming circumstances that are likely to deliver punishing blows to human life, to political or economic viability, and/or to environmental integrity” (pg. 2). Financial advisors and traders around the world no doubt saw the most recent global financial crisis as a rude surprise.

As Matthias Gross (2010) points out at the beginning of his absorbing book, “ignorance and surprise belong together.” So it should not be, well, surprising that in an uncertain world we are in for plenty of surprises. But why are we so unprepared for surprises? Two important reasons are confirmation bias and the Catch-All Underestimation Bias (CAUB). Confirmation bias is the tendency to be more interested in and pay more attention to information that is likely to confirm what we already know or believe. As Raymond Nickerson’s 1998 review sadly informs us, this tendency operates unconsciously even when we’re not trying to defend a position or bolster our self-esteem. The CAUB is a tendency to underestimate the likelihood that something we’ve never seen before will occur. The authors of the classic 1978 study first describing the CAUB pointed out that it’s an inescapable “out of sight, out of mind” phenomenon: after all, how can you have something in sight that has never occurred? And the final sting in the tail is that clever people and domain experts (e.g., scientists, professionals) suffer from both biases just as the rest of us do.

Now let’s move to the second major issue raised at the outset of this post: Not being able to know things we’d like to know. And let’s raise the stakes, from negative knowledge to negative meta-knowledge. Wouldn’t it be wonderful if we had a method of finding truths that was guaranteed not to steer us wrong? Possession of such a method would tame the wild seas of the unknown for us by providing the equivalent of an epistemic compass. Conversely, wouldn’t it be devastating if we found out that we never can obtain this method?

Early in the 20th century, mathematicians underwent the experience of expecting to find such a method and having their hopes dashed. They became among the first (and best) postmodernists. Their story has been told in numerous ways (even as a graphic novel), but for my money the best account is the late Morris Kline’s brilliant (1980) book, “Mathematics: The Loss of Certainty.” Here’s how Kline characterizes mathematicians’ views of their domain at the turn of the century:

“After many centuries of wandering through intellectual fog, by 1900 mathematicians had seemingly imparted to their subject the ideal structure… They had finally recognized the need for undefined terms; definitions were purged of vague or objectionable terms; the several branches were founded on rigorous axiomatic bases; and valid, rigorous, deductive proofs replaced intuitively or empirically based conclusions… mathematicians had cause to rejoice.” (pg. 197)

The tantalizing prospect before them was to establish the consistency and completeness of mathematical systems. Roughly speaking, consistency amounts to a guarantee of never running into paradoxes (well-formed mathematical propositions that nevertheless are provably both true and false) and completeness amounts to a guarantee of never running into undecidables (well-formed mathematical propositions whose truth or falsity cannot be proven). These guarantees would tame the unknown for mathematicians; a proper axiomatic foundation would ensure that any proposition derived from it would be provably true or false.

The famous 1931 paper by Kurt Gödel denied this paradise forever. He showed that if any mathematical theory adequate to deal with whole numbers is consistent, it will be incomplete. He also showed that the consistency of such a theory could not be established by the logical principles in use by several foundational schools of mathematics. So consistency would have to be determined by other methods and, if attained, its price would be incompleteness. But is there a way to ascertain which mathematical propositions are undecidable and which provable? Alan Turing’s 1936 paper on “computable numbers” (which also invented Turing machines!) showed that the answer is “no.” One consequence of these results is that instead of a foundational consensus there can be divergent schools of mathematics, each legitimate and selected as a matter of preference. Here we have definitive and severe negative knowledge in an area that, to most people even today, epitomizes certitude.

“Loss of certainty” themes dominated high-level discourse in various intellectual and professional domains throughout the 20th century. Physics is perhaps the best-known example, but such themes, and fascinating debates around them, can be found in many other disciplines. To give one example, historians Ann Curthoys and John Docker’s 2006 book “Is History Fiction?” begins by identifying three common responses to the title’s question: relativists who answer in the affirmative, foundationalists who insist that history is well-grounded in evidence after all, and a third (they claim, largest) puzzled group who asks “well, is it?” To give just one more, I’m a mathematical modeler in a discipline where various offspring of the “is psychology a science?” question are seriously debated. In particular, I and others (e.g., here and here) regard the jury as still out on whether there are (m)any quantifiable psychological attributes. Some such attributes can perhaps be rank-ordered, but quantified? Good question.

Are there limits to negative knowledge itself? In other words, is there such a thing as negative negative-knowledge? It turns out that there is, mainly in the Gödelian realm of self-referential statements. For example, we cannot believe that we currently hold a false belief; otherwise we’d be compelled to disbelieve it. There are also limits to the extent to which we can self-attribute erroneous belief formation. Philosophers Andy Egan and Adam Elga laid these out in their delightfully titled 2005 paper, “I Can’t Believe I’m Stupid.” According to them, I can believe that in some domains my way of forming beliefs goes wrong all of the time (e.g., my sense of direction is invariably wrong), but I can’t believe that every belief I form goes wrong, without undermining that very meta-belief.

Dealing with wicked problems and rude surprises requires input from multiple stakeholders, encompassing their perspectives, values, priorities, and (possibly non-scientific) ways of knowing. Likewise, there is no algorithm or sure-fire method for anticipating or forecasting rude surprises or Nassim Taleb’s “black swans.” These are exemplars of insoluble problems beyond the ken of science. But surely none of this implies that input from experts is useless or beside the point. So, are there ways of educating scientists, other experts, and professionals so that they will be less prone to Ravetz’s ignorance of ignorance? And what about the rest of us: are there ways we can combat confirmation bias and the CAUB? Are there effective methods for dealing with wicked problems or rude surprises? Ah, grounds for a future post!