ignorance and uncertainty

All about unknowns and uncertainties

Posts Tagged ‘Probability’

Statistical Significance On Trial

with 2 comments

There is a long-running love-hate relationship between the legal and statistical professions, and two vivid examples of this have surfaced in recent news stories, one situated in a court of appeal in London and the other in the U.S. Supreme Court. Briefly, the London judge ruled that Bayes’ theorem must not be used in evidence unless the underlying statistics are “firm;” while the U.S. Supreme Court unanimously ruled that a drug company’s non-disclosure of adverse side-effects cannot be justified by an appeal to the statistical non-significance of those effects. Each case, in its own way, shows why it is high time to find a way to establish an effective rapprochement between these two professions.

The Supreme Court decision has been applauded by statisticians, whereas the London decision has appalled statisticians of similar stripe. Both decisions require some unpacking to understand why statisticians would cheer one and boo the other, and why these are important decisions not only for both the statistical and legal professions but for other domains and disciplines whose practices hinge on legal and statistical codes and frameworks.

This post focuses on the Supreme Court decision. The culprit was a homoeopathic zinc-based medicine, Zicam, manufactured by Matrixx Initiatives, Inc. and advertised as a remedy for the common cold. Since 1999, Matrixx had ignored reports from users and doctors that Zicam caused some users to experience burning sensations or even to lose their sense of smell. When this story was aired by a doctor on Good Morning America in 2004, Matrixx’s stock price plummeted.

The company’s defense was that these side-effects were “not statistically significant.” In the ensuing fallout, Matrixx was faced with more than 200 lawsuits by Zicam users, but the case in point here is Siracusano v. Matrixx, in which Mr. Siracusano was suing on behalf of investors on grounds that they had been misled. After a few iterations through the American court system, the question that the Supreme Court ruled on was whether a claim of securities fraud is valid against a company that neglected to warn consumers about effects that had been found to be statistically non-significant. As Stephen Ziliak’s insightful essay points out, the decision will affect drug supply regulation, securities regulation, liability and the nature of adverse side-effects disclosed by drug companies. Ziliak writes with insider knowledge: he was one of the “friends of the court” providing expert advice on the case.

A key point in this dispute is whether statistical nonsignificance can be used to infer that a potential side-effect is, for practical purposes, no more likely to occur when using the medicine than when not. Among statisticians it is a commonplace that such inferences are illogical (and illegitimate). There are several reasons for this, but I’ll review just two here.

These reasons have to do with common misinterpretations of the measure of statistical significance. Suppose Matrixx had conducted a properly randomized double-blind experiment comparing Zicam-using subjects with those using an indistinguishable placebo, and observed the difference in side-effect rates between the two groups of subjects. One has to bear in mind that random assignment of subjects to one group or the other doesn’t guarantee equivalence between the groups. So, it’s possible that even if there really is no difference between Zicam and the placebo regarding the side-effect, a difference between the groups might occur by “luck of the draw.”

The indicator of statistical significance in this context would be the probability of observing a difference at least as large as the one found in the study if the hypothesis of no difference were true. If this probability is found to be very low (typically .05 or less) then the experimenters will reject the no-difference hypothesis on the grounds that the data they’ve observed would be very unlikely to occur if that hypothesis were true. They will then declare that there is a statistically significant difference between the Zicam and placebo groups. If this probability is not sufficiently low (i.e., greater than .05) the experimenters will decide not to reject the no-difference hypothesis and announce that the difference they found was statistically non-significant.
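To make that definition concrete, here is a minimal simulation sketch of how such a significance probability could be computed for a hypothetical trial of this kind. The group sizes and side-effect counts below are invented for illustration; they are not Zicam data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial results (NOT real Zicam figures): side-effect counts per group.
n_drug, n_placebo = 100, 100
cases_drug, cases_placebo = 9, 3
observed_diff = cases_drug / n_drug - cases_placebo / n_placebo

# Under the no-difference hypothesis, both groups share a single side-effect rate.
pooled_rate = (cases_drug + cases_placebo) / (n_drug + n_placebo)

# Simulate many replications of the trial under that hypothesis and record how often
# the simulated difference is at least as large as the one actually observed.
n_sims = 100_000
sim_drug = rng.binomial(n_drug, pooled_rate, n_sims) / n_drug
sim_placebo = rng.binomial(n_placebo, pooled_rate, n_sims) / n_placebo
p_value = np.mean(sim_drug - sim_placebo >= observed_diff)

print(f"observed difference = {observed_diff:.2f}, simulated p-value = {p_value:.3f}")
```

If that probability falls below the conventional .05 cutoff, the difference is declared statistically significant; otherwise it is declared non-significant, which, as the next two points argue, licenses neither belief in the no-difference hypothesis nor any claim about how probable that hypothesis is.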

So the first reason for concern is that Matrixx acted as if statistical nonsignificance entitles one to believe in the hypothesis of no difference. However, failing to reject the hypothesis of no difference doesn’t entitle one to believe in it. It’s still possible that a difference exists and the experiment failed to find it because it didn’t have enough subjects or because the experimenters were “unlucky.” Matrixx has plenty of company in committing this error; I know many seasoned researchers who do the same, and I’ve already canvassed the well-known bias in fields such as psychology against publishing experiments that fail to find significant effects.

The second problem arises from a common intuition that the probability of observing a difference at least as large as the one found in the study if the hypothesis of no difference were true tells us something about the inverse—the probability that the no-difference hypothesis is true if we find a difference at least as large as the one observed in our study, or, worse still, the probability that the no-difference hypothesis is true. However, the first probability on its own tells us nothing about the other two.

For a quick intuitive, if fanciful, example, let’s imagine randomly sampling one person from the world’s population, with the hypothesis that she will be Australian. On selecting our person, all that we know about her initially is that she speaks English.

There are about 750 million first- or second-language English speakers world-wide, and about 23 million Australians. Of the 23 million Australians, about 21 million fit the first- or second-language English description. Given that our person speaks English, how likely is it that we’ve found an Australian? The probability that we’ve found an Australian given that we’ve picked an English-speaker is 21/750 = .03. So there goes our hypothesis. However, had we picked an Australian (i.e., if our hypothesis were true), the probability that she speaks English would be 21/23 = .91.
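The same point can be put in terms of Bayes’ theorem, which links the two conditional probabilities. Here is a minimal sketch using the round figures above; the world-population figure is an assumption, and it cancels out of the calculation.

```python
# Round figures from the example, in millions of people.
english_speakers = 750
australians = 23
english_speaking_australians = 21
world_population = 7_000  # assumed; it cancels out of the Bayes calculation below

p_english_given_australian = english_speaking_australians / australians  # about .91
p_australian = australians / world_population
p_english = english_speakers / world_population

# Bayes' theorem: P(Australian | English) = P(English | Australian) * P(Australian) / P(English)
p_australian_given_english = p_english_given_australian * p_australian / p_english

print(f"P(English | Australian) = {p_english_given_australian:.2f}")  # 0.91
print(f"P(Australian | English) = {p_australian_given_english:.2f}")  # 0.03, i.e., 21/750
```

The two conditional probabilities differ by a factor of about thirty, and conflating them is exactly the error involved in reading a significance probability as the probability that the no-difference hypothesis is true.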

See also Ziliak and McCloskey’s 2008 book, which mounts a swinging demolition of the unquestioned application of statistical significance in a variety of domains.

Aside from the judgment about statistical nonsignificance, the most important stipulation of the Supreme Court’s decision is that “something more” is required before a drug company can justifiably decide to not disclose a drug’s potential side-effects. What should this “something more” be? This sounds as if it would need judgments about the “importance” of the side-effects, which could open multiple cans of worms (e.g., Which criteria for importance? According to what or whose standards?). Alternatively, why not simply require drug companies to report all occurrences of adverse side-effects and include the best current estimates of their rates among the population of users?

A slightly larger-picture view of the Matrixx defense resonates with something that I’ve observed in even the best and brightest of my students and colleagues (oh, and me too). And that is the hope that somehow probability or statistical theories will get us off the hook when it comes to making judgments and decisions in the face of uncertainty. It can’t and won’t, especially when it comes to matters of medical, clinical, personal, political, economic, moral, aesthetic, and all the other important kinds of importance.

Written by michaelsmithson

October 22, 2011 at 11:31 pm

Scientists on Trial: Risk Communication Becomes Riskier

with 5 comments

Back in late May 2011, there were news stories of charges of manslaughter laid against six earthquake experts and a government advisor responsible for evaluating the threat of natural disasters in Italy, on grounds that they allegedly failed to give sufficient warning about the devastating L’Aquila earthquake in 2009. In addition, plaintiffs in a separate civil case are seeking damages in the order of €22.5 million (US$31.6 million). The first hearing of the criminal trial occurred on Tuesday the 20th of September, and the second session is scheduled for October 1st.

According to Judge Giuseppe Romano Gargarella, the defendants gave inexact, incomplete and contradictory information about whether smaller tremors in L’Aquila six months before the 6.3 magnitude quake on 6 April, which killed 308 people, were to be considered warning signs of the quake that eventuated. L’Aquila was largely flattened, and thousands of survivors lived in tent camps or temporary housing for months.

If convicted, the defendants face up to 15 years in jail and almost certainly will suffer career-ending consequences. While manslaughter charges for natural disasters have precedents in Italy, they have previously concerned breaches of building codes in quake-prone areas. Interestingly, no action has yet been taken against the engineers who designed the buildings that collapsed, or government officials responsible for enforcing building code compliance. However, there have been indications of lax building codes and the possibility of local corruption.

The trial has, naturally, outraged scientists and others sympathetic to the plight of the earthquake experts. An open letter by the Istituto Nazionale di Geofisica e Vulcanologia (National Institute of Geophysics and Volcanology) said the allegations were unfounded and amounted to “prosecuting scientists for failing to do something they cannot do yet — predict earthquakes”. The AAAS has presented a similar letter, which can be read here.

In pre-trial statements, the defence lawyers also have argued that it was impossible to predict earthquakes. “As we all know, quakes aren’t predictable,” said Marcello Melandri, defence lawyer for defendant Enzo Boschi, who was president of Italy’s National Institute of Geophysics and Volcanology. The implication is that because quakes cannot be predicted, the accusations that the commission’s scientists and civil protection experts should have warned that a major quake was imminent are baseless.

Unfortunately, the Istituto Nazionale di Geofisica e Vulcanologia, the AAAS, and the defence lawyers were missing the point. It seems that failure to predict quakes is not the substance of the accusations. Instead, it is poor communication of the risks, inappropriate reassurance of the local population and inadequate hazard assessment. Contrary to earlier reports, the prosecution apparently is not claiming the earthquake should have been predicted. Instead, their focus is on the nature of the risk messages and advice issued by the experts to the public.

Examples raised by the prosecution include a memo issued after a commission meeting on 31 March 2009 stating that a major quake was “improbable,” a statement to local media that six months of low-magnitude tremors was not unusual in the highly seismic region and did not mean a major quake would follow, and an apparent discounting of the notion that the public should be worried. Against this, defence lawyer Melandri has been reported saying that the panel “never said, ‘stay calm, there is no risk’”.

It is at this point that the issues become both complex (by their nature) and complicated (by people). Several commentators have pointed out that the scientists are distinguished experts, by way of asserting that they are unlikely to have erred in their judgement of the risks. But they are being accused of communicating “incomplete, imprecise, and contradictory information” to the public. As one of the civil parties to the lawsuit put it, “Either they didn’t know certain things, which is a problem, or they didn’t know how to communicate what they did know, which is also a problem.”

So, the experts’ scientific expertise is not on trial. Instead, it is their expertise in risk communication. As Stephen S. Hall’s excellent essay in Nature points out, regardless of the outcome this trial is likely to make many scientists more reluctant to engage with the public or the media about risk assessments of all kinds. The AAAS letter makes this point too. And regardless of which country you live in, it is unwise to think “Well, that’s Italy for you. It can’t happen here.” It most certainly can and probably will.

Matters are further complicated by the abnormal nature of the commission meeting on the 31st of March at a local government office in L’Aquila. Boschi claims that these proceedings normally are closed whereas this meeting was open to government officials, and he and the other scientists at the meeting did not realize that the officials’ agenda was to calm the public. The commission did not issue its usual formal statement, and the minutes of the meeting were not completed, until after the earthquake had occurred. Instead, two members of the commission, Franco Barberi and Bernardo De Bernardinis, along with the mayor and an official from Abruzzo’s civil-protection department, held a now (in)famous press conference after the meeting where they issued reassuring messages.

De Bernardinis, an expert on floods but not earthquakes, incorrectly stated that the numerous earthquakes of the swarm would lessen the risk of a larger earthquake by releasing stress. He also agreed with a journalist’s suggestion that residents enjoy a glass of wine instead of worrying about an impending quake.

The prosecution also is arguing that the commission should have reminded residents in L’Aquila of the fragility of many older buildings, advised them to make preparations for a quake, and reminded them of what to do in the event of a quake. This amounts to an accusation of a failure to perform a duty of care, a duty that many scientists providing risk assessments may dispute that they bear.

After all, telling the public what they should or should not do is a civil or governmental matter, not a scientific one. Thomas Jordan’s essay in New Scientist brings in this verdict: “I can see no merit in prosecuting public servants who were trying in good faith to protect the public under chaotic circumstances. With hindsight their failure to highlight the hazard may be regrettable, but the inactions of a stressed risk-advisory system can hardly be construed as criminal acts on the part of individual scientists.” As Jordan points out, there is a need to separate the role of science advisors from that of civil decision-makers who must weigh the benefits of protective actions against the costs of false alarms. This would seem to be a key issue that urgently needs to be worked through, given the need for scientific input into decisions about extreme hazards and events, both natural and human-caused.

Scientists generally are not trained in communication or in dealing with the media, and communication about risks is an especially tricky undertaking. I would venture to say that the prosecution, defence, judge, and journalists reporting on the trial will not be experts in risk communication either. The problems in risk communication are well known to psychologists and social scientists specializing in its study, but not to non-specialists. Moreover, these specialists will tell you that solutions to those problems are hard to come by.

For example, Otway and Wynne (1989) observed in a classic paper that an “effective” risk message has to simultaneously reassure by saying the risk is tolerable and panic will not help, and warn by stating what actions need to be taken should an emergency arise. They coined the term “reassurance arousal paradox” to describe this tradeoff (which of course is not a paradox, but a tradeoff). The appropriate balance is difficult to achieve, and is made even more so by the fact that not everyone responds in the same way to the same risk message.

It is also well known that laypeople do not think of risks in the same way as risk experts (for instance, laypeople tend to see “hazard” and “risk” as synonyms), nor do they rate risk severity in line with the product of probability and magnitude of consequence, nor do they understand probability—especially low probabilities. Given all of this, it will be interesting to see how the prosecution attempts to establish that the commission’s risk communications contained “incomplete, imprecise, and contradictory information.”

Scientists who try to communicate risks are aware of some of these issues, but usually (and understandably) uninformed about the psychology of risk perception (see, for instance, my posts here and here on communicating uncertainty about climate science). I’ll close with just one example. A recent International Commission on Earthquake Forecasting (ICEF) report argues that frequently updated hazard probabilities are the best way to communicate risk information to the public. Jordan, chair of the ICEF, recommends that “Seismic weather reports, if you will, should be put out on a daily basis.” Laudable as this prescription is, there are at least three problems with it.

Weather reports typically present probabilities of rain that are neither close to 0 nor to 1. Moreover, they usually are anchored on tenths (e.g., .2 or .6, but not precise numbers like .23162 or .62947). Most people have reasonable intuitions about mid-range probabilities such as .2 or .6. But earthquake forecasting deals in very low probabilities, as was the case in the lead-up to the L’Aquila event. Italian seismologists had estimated that the probability of a large earthquake in the next three days had increased from 1 in 200,000 before the earthquake swarm began to 1 in 1,000 following the two large tremors the day before the quake.

The first problem arises from the small magnitude of these probabilities. Because people are limited in their ability to comprehend and evaluate extreme probabilities, highly unlikely events usually are either ignored or overweighted. The tendency to ignore low-probability events has been cited to account for the well-established phenomenon that homeowners tend to be under-insured against low probability hazards (e.g., earthquake, flood and hurricane damage) in areas prone to those hazards. On the other hand, the tendency to over-weight low-probability events has been used to explain the same people’s propensity to purchase lottery tickets. The point is that low-probability events either excite people out of proportion to their likelihood or fail to excite them altogether.

The second problem is in understanding the increase in risk from 1 in 200,000 to 1 in 1,000. Most people readily comprehend the difference between mid-range probabilities such as an increase in the chance of rain from .2 to .6. However, they may not appreciate the magnitude of the difference between the two low probabilities in our example. For instance, an experimental study with jurors in mock trials found that although DNA evidence is typically expressed in terms of probability (specifically, the probability that the DNA sample could have come from a randomly selected person in the population), jurors were equally likely to convict on the basis of a probability of 1 in 1,000 as on a probability of 1 in 1 billion. At the very least, the public would need some training and accustoming to minuscule probabilities.
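For concreteness, here is the arithmetic behind that comparison. The relative increase is huge while the absolute probability remains tiny, and that combination is precisely what people find hard to evaluate.

```python
p_before = 1 / 200_000  # before the earthquake swarm began
p_after = 1 / 1_000     # after the two large tremors

print(f"relative increase: {p_after / p_before:.0f}-fold")                    # 200-fold
print(f"absolute increase: {p_after - p_before:.6f}")                         # about 0.001
print(f"chance of no large quake in the next three days: {1 - p_after:.1%}")  # 99.9%
```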

All this leads us to the third problem. Otway and Wynne’s “reassurance arousal paradox” is exacerbated by risk communications about extremely low-probability hazards, no matter how carefully they are crafted. Recipients of such messages will be highly suggestible, especially when the stakes are high. So, what should the threshold probability be for determining when a “don’t ignore this” message is issued? It can’t be the imbecilic Dick Cheney zero-risk threshold for terrorism threats, but what should it be instead?

Note that this is a matter for policy-makers to decide, not scientists, even though scientific input regarding potential consequences of false alarms and false reassurances should be taken into account. Criminal trials and civil lawsuits punishing the bearers of false reassurances will drive risk communicators to lower their own alarm thresholds, thereby ensuring that they will sound false alarms increasingly often (see my post about making the “wrong” decision most of the time for the “right” reasons).

Risk communication regarding low-probability, high-stakes hazards is one of the most difficult kinds of communication to perform effectively, and most of its problems remain unsolved. The L’Aquila trial probably will have an inhibitory impact on scientists’ willingness to front the media or the public. But it may also stimulate scientists and decision-makers to work together for the resolution of these problems.

Can Greater Noise Yield Greater Accuracy?

with one comment

I started this post in Hong Kong airport, having just finished one conference and heading to Innsbruck for another. The Hong Kong meeting was on psychometrics and the Innsbruck conference was on imprecise probabilities (believe it or not, these topics actually do overlap). Anyhow, Annemarie Zand Scholten gave a neat paper at the math psych meeting in which she pointed out that, contrary to a strong intuition that most of us have, introducing and accounting for measurement error can actually sharpen up measurement. Briefly, the key idea is that an earlier “error-free” measurement model of, say, human comparisons between pairs of objects on some dimensional characteristic (e.g., length) could only enable researchers to recover the order of object length but not any quantitative information about how much longer people were perceiving one object to be than another.

I’ll paraphrase (and amend slightly) one of Annemarie’s illustrations of her thesis, to build intuition about how her argument works. In our perception lab, we present subjects with pairs of lines and ask them to tell us which line they think is the longer. One subject, Hawkeye Harriet, perfectly picks the longer of the two lines every time—regardless of how much longer one is than the other. Myopic Myra, on the other hand, has imperfect visual discrimination and thus sometimes gets it wrong. But she’s less likely to choose the wrong line if the two lines’ lengths considerably differ from one another. In short, Myra’s success-rate is positively correlated with the difference between the two line-lengths whereas Harriet’s uniformly 100% success rate clearly is not.

Is there a way that Myra’s success- and error-rates could tell us exactly how long each object is, relative to the others? Yes. Let pij be the probability that Myra picks the ith object as longer than the jth object, and pji = 1 – pij be the probability that Myra picks the jth object as longer than the ith object. If the ith object has length Li and the jth object has length Lj, then if pij/pji = Li/Lj, Myra’s choice-rates perfectly mimic the ratio of the ith and jth objects’ lengths. This neat relationship owes its nature to the fact that a characteristic such as length has an absolute zero, so we can meaningfully compare lengths by taking ratios.
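Here is a minimal simulation sketch of that idea. It assumes one particular choice rule for Myra that satisfies the ratio condition above, namely that she picks object i over object j with probability Li/(Li + Lj); this is just an illustrative model, not necessarily the one in Annemarie’s paper.

```python
import numpy as np

rng = np.random.default_rng(1)

lengths = np.array([2.0, 3.0, 6.0])  # true lengths, in arbitrary units
n_trials = 20_000                    # paired comparisons per pair of objects

# Assumed choice rule for Myra: P(pick i over j) = L_i / (L_i + L_j),
# which guarantees p_ij / p_ji = L_i / L_j.
for i in range(len(lengths)):
    for j in range(i + 1, len(lengths)):
        p_ij = lengths[i] / (lengths[i] + lengths[j])
        picks_i = rng.binomial(n_trials, p_ij)      # trials on which Myra picks object i
        est_ratio = picks_i / (n_trials - picks_i)  # estimate of p_ij / p_ji
        print(f"pair ({i}, {j}): estimated length ratio = {est_ratio:.3f}, "
              f"true ratio = {lengths[i] / lengths[j]:.3f}")

# Hawkeye Harriet's choice rates are 1 or 0 for every pair, so the same calculation
# would recover only the rank order of the lengths, not their ratios.
```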

How about temperature? This is slightly trickier, because if we’re using a popular scale such as Celsius or Fahrenheit then the zero-point of the scale isn’t absolute in the sense that length has an absolute zero (i.e., you can have Celsius and Fahrenheit readings below zero, and each scale’s zero-point differs from the other). Thus, 60 degrees Fahrenheit is not twice as warm as 30 degrees Fahrenheit. However, the differences between temperatures can be compared via ratios. For instance, 40 degrees F is twice as far from 20 degrees F as 10 degrees F is.

We just need a common “reference” object against which to compare each of the others. Suppose we’re asking Myra to choose which of a pair of objects is the warmer. Assuming that Myra’s choices are transitive, there will be an object she chooses less often than any of the others in all of the paired comparisons. Let’s refer to that object as the Jth object. Now suppose the ith object has temperature Ti, the jth object has temperature Tj, and the Jth object has temperature TJ, which is lower than both Ti and Tj. Then if Myra’s choice-rate ratio is
piJ/pjJ = (Ti – TJ)/(Tj – TJ),
she functions as a perfect measuring instrument for temperature comparisons between the ith and jth objects. Again, Hawkeye Harriet’s choice-rates will be piJ = 1 and pjJ = 1 no matter what Ti and Tj are, so her ratio always is 1.

If we didn’t know what the ratios of those lengths or temperature differences were, Myra would be a much better measuring instrument than Harriet even though Harriet never makes mistakes. Are there such situations? Yes, especially when it comes to measuring mental or psychological characteristics for which we have no direct access, such as subjective sensation, mood, or mental task difficulty.

Which of 10 noxious stimuli is the more aversive? Which of 12 musical rhythms makes you feel more joyous? Which of 20 types of puzzle is the more difficult? In paired comparisons between each possible pair of stimuli, rhythms or puzzles, Hawkeye Harriet will pick what for her is the correct member of each pair every time, so all we’ll get from her is the rank-order of the stimuli, rhythms and puzzles. Myopic Myra will less reliably and less accurately choose what for her is the correct member of each pair, but her choice-rates will be correlated with how dissimilar the members of each pair are. We’ll recover much more precise information about the underlying structure of the stimulus set from error-prone Myra.

Annemarie’s point about measurement is somewhat related to another fascinating phenomenon known as stochastic resonance. Briefly paraphrasing the Wikipedia entry for stochastic resonance (SR), SR occurs when a measurement or signal-detecting system’s signal-to-noise ratio increases when a moderate amount of noise is added to the incoming signal or to the system itself. SR usually is observed either in bistable or sub-threshold systems. Too little noise results in the system being insufficiently sensitive to the signal; too much noise overwhelms the signal. Evidence for SR has been found in several species, including humans. For example, a 1996 paper in Nature reported a demonstration that subjects asked to detect a sub-threshold impulse via mechanical stimulation of a fingertip maximized the percentage of correct detections when the signal was mixed with a moderate level of noise. One way of thinking about the optimized version of Myopic Myra as a measurement instrument is to model her as a “noisy discriminator,” with her error-rate induced by an optimal random noise-generator mixed with an otherwise error-free discriminating mechanism.
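Here is a minimal sketch of the stochastic resonance effect with a toy sub-threshold detector: a weak sine-wave signal sits below a fixed detection threshold, and the correlation between the detector’s output and the signal is highest at a moderate noise level. The signal, threshold, and noise levels are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

t = np.linspace(0, 20 * np.pi, 5_000)
signal = 0.8 * np.sin(t)  # sub-threshold: never exceeds the detection threshold on its own
threshold = 1.0

for noise_sd in [0.0, 0.1, 0.5, 1.0, 3.0, 10.0]:
    noisy_input = signal + rng.normal(0.0, noise_sd, size=t.size)
    output = (noisy_input > threshold).astype(float)  # crude threshold detector
    # With no noise the detector never fires, so the correlation is undefined; report 0.
    corr = 0.0 if output.std() == 0 else np.corrcoef(signal, output)[0, 1]
    print(f"noise sd = {noise_sd:>4.1f}: signal-output correlation = {corr:.3f}")
```

Too little noise and the detector never fires; too much and its output is essentially random. At a moderate noise level the sub-threshold peaks of the signal are pushed over the threshold often enough for the output to track the signal, which is the signature of stochastic resonance.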

Written by michaelsmithson

August 14, 2011 at 10:47 am

Writing about “Agnotology, Ignorance and Uncertainty”

with 2 comments

From time to time I receive invitations to contribute to various “encyclopedias.” Recent examples include an entry on “confidence intervals” in the International Encyclopedia of Statistical Science (Springer, 2010) and an entry on “uncertainty” in the Encyclopedia of Human Behavior (Elsevier, 1994, 2012). The latter link goes to the first (1994) edition; the second edition is due out in 2012. I’ve duly updated and revised my 1994 entry for the 2012 edition.

Having been raised by a librarian (my mother worked in the Seattle Public Library for 23 years), I’m a believer in the value of good reference works. So, generally I’m willing to accept invitations to contribute to them. These days there is a niche market even for non-digital works of this kind, and of course the net has led to numerous hybrid versions.

Despite the fact that such invitations are regarded as markers of professional esteem, they don’t count for much in the university system where I work because they aren’t original research publications. Same goes for textbooks. Thus, for my younger academic colleagues, writing encyclopedia entries or, worse still, writing textbooks actually can harm their careers. They understandably avoid doing so, which leaves it to older academics like me.

Some of these encyclopedias have interesting moments on the world stage. The International Encyclopedia of Statistical Science has been said to have set a record for the number of countries involved (105, via the 619 contributing authors). Its editors were nominated for the 2011 Nobel Peace Prize, apparently the first time any statisticians had received this honor. Meanwhile, V.S. Ramachandran, editor of the Encyclopedia of Human Behavior, was selected by Time Magazine as one of the world’s most influential people of 2011.

However, I digress. The Sage Encyclopedia of Philosophy and the Social Sciences is an intriguing proposal for a reference work that bridges these two intellectual cultures. I regard this aim as laudable, and I’m fortunate insofar as the areas where I work have a tradition of dialogs linking philosophers and social scientists. So, I was delighted to be asked to provide an entry on “agnotology, ignorance and uncertainty”. There is, however, a bit of a catch.

The guidelines for contributors state that “Entries should be written at a level appropriate for students who do not have an extensive background either in philosophy or the social sciences and for academics from other disciplines… it is essential that a reader versed in philosophy only or mostly, or alternatively, in social sciences, should gain by reading entries that aim at expanding their knowledge of concepts and theories as these have developed in the complementary area.” All of this is supposed to be achieved for a treatment of “agnotology, ignorance and uncertainty” in just 1,000 words, with a short list of “further readings” at the end. All of my posts in this blog thus far exceed 1,000 words (gulp). Can I be sufficiently concise without butchering or omitting crucial content?

Here’s my first draft (word count: 1,018). See what you think.

AGNOTOLOGY, IGNORANCE AND UNCERTAINTY

“Agnotology” is the study of ignorance (from the Greek “agnosis”). “Ignorance,” “uncertainty,” and related terms refer variously to the absence of knowledge, doubt, and false belief. This topic has a long history in Western philosophy, rooted in the Socratic tradition. It has a considerably shorter and, until recently, sporadic treatment in the human sciences. This entry focuses on relatively recent developments within and exchanges between both domains.

A key starting-point is that anyone attributing ignorance cannot avoid making claims to know something about who is ignorant of what: A is ignorant from B’s viewpoint if A fails to agree with or show awareness of ideas which B defines as actually or potentially valid. A and B can be identical, so that A self-attributes ignorance. Numerous scholars thereby have noted the distinction between conscious ignorance (known unknowns, learned ignorance) and meta-ignorance (unknown unknowns, ignorance squared).

The topic has been beset with terminological difficulties, due to the scarcity and negative cast of terms referring to unknowns. Several scholars have constructed typologies of unknowns, in attempts to make explicit their most important properties. Smithson’s book, Ignorance and Uncertainty: Emerging Paradigms, pointed out the distinction between being ignorant of something and ignoring something, the latter being akin to treating something as irrelevant or taboo. Knorr-Cetina coined the term “negative knowledge” to describe knowledge about the limits of the knowable. Various authors have tried to distinguish reducible from irreducible unknowns.

Two fundamental concerns have been at the forefront of philosophical and social scientific approaches to unknowns. The first of these is judgment, learning and decision making in the absence of complete information. Prescriptive frameworks advise how this ought to be done, and descriptive frameworks describe how humans (or other species) do so. A dominant prescriptive framework since the second half of the 20th century is subjective expected utility theory (SEU), whose central tenet is that decisional outcomes are to be evaluated by their expected utility, i.e., the product of their probability and their utility (e.g., monetary value, although utility may be based on subjective appraisals). According to SEU, a rational decision maker chooses the option that maximizes her/his expected utility. Several descriptive theories in psychology and behavioral economics (e.g., Prospect Theory and Rank-Dependent Expected Utility Theory) have amended SEU to render it more descriptively accurate while retaining some of its “rational” properties.

The second concern is the nature and genesis of unknowns. While many scholars have treated unknowns as arising from limits to human experience and cognitive capacity, increasing attention has been paid recently to the thesis that unknowns are socially constructed, many of them intentionally so. Smithson’s 1989 book was among the earliest to take up the thesis that unknowns are socially constructed. Related work includes Robert Proctor’s 1995 Cancer Wars and Ulrich Beck’s 1992 Risk Society. Early in the 21st century this thesis has become more mainstream. Indeed, the 2008 edited volume bearing “agnotology” in its title focuses on how culture, politics, and social dynamics shape what people do not know.

Philosophers and social scientists alike have debated whether there are different kinds of unknowns. This issue is important because if there is only one kind then only one prescriptive decisional framework is necessary and it also may be the case that humans have evolved one dominant way of making decisions with unknowns. On the other hand, different kinds of unknowns may require distinct methods for dealing with them.

In philosophy and mathematics the dominant formal framework for dealing with unknowns has been one or another theory of probability. However, Max Black’s ground-breaking 1937 paper proposed that vagueness and ambiguity are distinguishable from each other, from probability, and also from what he called “generality.” The 1960’s and 70’s saw a proliferation of mathematical and philosophical frameworks purporting to encompass non-probabilistic unknowns, such as fuzzy set theory, rough sets, fuzzy logic, belief functions, and imprecise probabilities. Debates continue to this day over whether any of these alternatives are necessary, whether all unknowns can be reduced to some form of probability, and whether there are rational accounts of how to deal with non-probabilistic unknowns. The chief contenders currently include generalized probability frameworks (including imprecise probabilities, credal sets, belief functions), robust Bayesian techniques, and hybrid fuzzy logic techniques.

In the social sciences, during the early 1920’s Keynes distinguished between evidentiary “strength” and “weight,” while Knight similarly separated “risk” (probabilities are known precisely) from “uncertainty” (probabilities are not known). Ellsberg’s classic 1961 experiments demonstrated that people’s choices can be influenced by how imprecisely probabilities are known (i.e., “ambiguity”), and his results have been replicated and extended by numerous studies. Smithson’s 1989 book proposed a taxonomy of unknowns and his 1999 experiments showed that choices also are influenced by uncertainty arising from conflict (disagreeing evidence from equally credible sources); those results also have been replicated.

More recent empirical research on how humans process unknowns has utilized brain imaging methods. Several studies have suggested that Knightian uncertainty (ambiguity) and risk differentially activate the ventral systems that evaluate potential rewards (the so-called “reward center”) and the prefrontal and parietal regions, with the latter two becoming more active under ambiguity. Other kinds of unknowns have yet to be widely studied in this fashion but research on them is emerging. Nevertheless, the evidence thus far suggests that the human brain treats unknowns as if there are different kinds.

Finally, there are continuing debates regarding whether different kinds of unknowns should be incorporated in prescriptive decision making frameworks and, if so, how a rational agent should deal with them. There are several decisional frameworks incorporating ambiguity or imprecision, some of which date back to the mid-20th century, and recently at least one incorporating conflict as well. The most common recommendation for decision making under ambiguity amounts to a type of worst-case analysis. For instance, given a lower and upper estimate of the probability of event E, the usual advice is to use the lower probability for evaluating bets on E occurring but to use the upper probability for bets against E. However, the general issue of what constitutes rational behavior under non-probabilistic uncertainties such as ambiguity, fuzziness or conflict remains unresolved.

Further Readings

Bammer, G. and Smithson, M. (Eds.), (2008). Uncertainty and Risk: Multidisciplinary Perspectives. London: Earthscan.

Beck, U. (1999). World Risk Society. Oxford: Polity Press.

Black, M. (1937). Vagueness: An exercise in logical analysis. Philosophy of Science, 4, 427-455.

Gardenfors, P. and Sahlin, N.-E. (Eds.), (1988). Decision, Probability, and Utility: Selected Readings. Cambridge, UK: Cambridge University Press.

Proctor, R. and Schiebinger, L. (Eds.), (2008). Agnotology: The Making and Unmaking of Ignorance. Stanford, CA: Stanford University Press.

Smithson, M. (1989). Ignorance and Uncertainty: Emerging Paradigms. Cognitive Science Series. New York: Springer Verlag.

Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. London: Chapman and Hall.

You Can Never Plan the Future by the Past

with 2 comments

The title of this post is, of course, a famous quotation from Edmund Burke. This is a personal account of an attempt to find an appropriate substitute for such a plan. My siblings and I persuaded our parents that the best option for financing their long-term in-home care is via a reverse mortgage. At first glance, the problem seems fairly well-structured: Choose the best reverse mortgage setup for my elderly parents. After all, this is the kind of problem for which economists and actuaries claim to have appropriate methods.

There are two viable strategies for utilizing the loan from a reverse mortgage: Take out a line of credit from which my parents can draw as they wish, or a tenured (fixed) schedule of monthly payments to their nominated savings account. The line of credit (LOC) option’s main attraction is its flexibility. However, the LOC runs out when the equity in my parents’ property is exhausted, whereas the tenured payments (TP) continue as long as they live in their home. So if either of them is sufficiently long-lived then the TP could be the safer option. On the other hand, the LOC may be more robust against unexpected expenses (e.g., medical emergencies or house repairs). Of course, one can opt for a mixture of TP and LOC.

So, this sounds like a standard optimization problem: What’s the optimal mix of TP and LOC? Here we run into the first hurdle: “Optimal” by what criteria? One criterion is to maximize the expected remaining equity in the property. This criterion might be appealing to their offspring, but it doesn’t do my parents much good. Another criterion that should appeal to my parents is maximizing the expected funds available to them. Fortunately, my siblings and I are more concerned for our parents’ welfare than what we’d get from the equity, so we’re happy to go with the second criterion. Nevertheless, it’s worth noting that this issue poses a deeper problem in general—How would a family with interests in both criteria come up with an appropriate weighting for each of them, especially if family members disagreed on the importance of these criteria?

Meanwhile, having settled on an optimization criterion, the next step would seem to be computing the expected payout to my parents for various mixtures of TP and LOC. But wait a minute. Surely we also should be worried about the possibility that some financial exigency could exhaust their funds altogether. So, we could arguably consider a third criterion: Minimizing the probability of their running out of funds. So now we encounter a second hurdle: How do we weigh up maximizing expected payout to our parents against the likelihood that their funds could run out? It might seem as if maximizing payout would also minimize that probability, but this is not necessarily so. A strategy that maximized expected payout could also increase the variability of the available funds over time so that the probability of ruin is increased.

Then there are the unknowns: how long our parents might live, what expenses they might incur (e.g., medical or in-home care), inflation, the behaviour of the LIBOR index that determines the interest rate on what is drawn down from the mortgage, and appreciation or depreciation of the property value. It is possible to come up with plausible-looking models for each of these by using standard statistical tools, and that’s exactly what I did.

I pulled down life-expectancy tables for American men and women born when my parents were born, more than two decades of monthly data on inflation in the USA, a similar amount of monthly data on the LIBOR, and likewise for real-estate values in the area where my parents live. I fitted several “lifetime” distributions to the relevant parts of the life-expectancy tables to model the probability of my parents living 1, 2, 3, … years longer given that they have survived to their mid-80’s, and arrived at a model that fitted the data very well. I modeled the inflation, LIBOR and real-estate data with standard time-series (ARIMA) models whose squared correlations with the data were .91, .98, and .91 respectively—all very good fits.
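Here is a minimal sketch of that kind of model fitting, assuming pandas, scipy, and statsmodels are available. The file names, column names, and ARIMA order are placeholders for illustration, not the actual data or specifications I used.

```python
import pandas as pd
from scipy import stats
from statsmodels.tsa.arima.model import ARIMA

# Placeholder inputs: years lived beyond the mid-80s (from a life table) and a
# monthly LIBOR series. File and column names are hypothetical.
remaining_years = pd.read_csv("lifetable_beyond_85.csv")["years_remaining"]
libor = pd.read_csv("libor_monthly.csv", index_col="date", parse_dates=True)["rate"]

# Fit a Weibull "lifetime" distribution to the remaining-lifetime data.
shape, loc, scale = stats.weibull_min.fit(remaining_years, floc=0)
print(f"Weibull fit: shape = {shape:.2f}, scale = {scale:.2f}")

# Fit an ARIMA model to the LIBOR series (the order here is purely illustrative).
arima_fit = ARIMA(libor, order=(1, 1, 1)).fit()

# Squared correlation between fitted and observed values, as a rough fit measure.
r_squared = libor.corr(arima_fit.fittedvalues) ** 2
print(f"squared correlation with the data: {r_squared:.2f}")
```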

Finally, my brothers and sisters-in-law obtained the necessary information from my mother regarding our parents’ expenses in the recent past, their income from pensions and so on, and we made some reasonable forecasts of additional expenses that we can foresee in the near term. The transition in this post from “I” to “we” is crucial. This was very much a joint effort. In particular, my youngest brother’s sister-in-law made most of the running on determining the ins and outs of reverse mortgages. She has a terrifically analytical intelligence, and we were able to cross-check one another’s perceptions, intuitions, and calculations.

Armed with all of this information and well-fitted models, it would seem that all we should need to do is run a large enough batch of simulations of the future for each reverse-mortgage scenario under consideration to get reliable estimates of expected payout, expected equity, the probability of ruin, and so on. The inflation model would simulate fluctuations in expenses, the LIBOR model would do so for the interest-rates, the real-estate model for the property value, and the life-expectancy model for how long our parents would live.

But there are at least two flaws in my approach. First, it assumes that my parents’ life-spans can best be estimated by considering them as if they are randomly chosen from the population of American men and women born when they were born who have survived to their mid-80’s. Should I take additional characteristics about them into account and base my estimates on only those who share those characteristics as well as their nation and birth-year? What about diet, or body-mass index, or various aspects of their medical histories? This issue is known as the reference-class problem, and it bedevils every school of statistical inference.

What did I do about this? I fudged my life-expectancy model to be “conservative,” i.e., so that it assumes my parents have a somewhat longer life-span than the original model suggests. In short, I tweaked my model as a risk-averse agent would—The longer my parents live, the greater the risk that they will run short of funds.

The second flaw in my approach is more fundamental. It assumes that the future is going to be just like the past. And before anyone says anything, yes, I’ve read Taleb’s The Black Swan (and was aware of most of the material he covered before reading his book), and yes, I’m aware of most criticisms that have been raised against the kind of models I’ve constructed. The most problematic assumption in my models is what is called stationarity, i.e., that the process driving the ups and downs of, say, the LIBOR index has stable characteristics. There were clear indications that the real-estate market fluctuations in the area where my parents live do not resemble a stationary process, and therefore I should not trust my ARIMA model very much despite its high correlation with the data.
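One simple, admittedly partial, check on that assumption is a unit-root test such as the augmented Dickey-Fuller test; failing to reject a unit root is a warning sign of non-stationarity, though passing the test does not establish that the process has stable characteristics. A sketch, again with placeholder file and column names:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Placeholder file and column names for a local monthly house-price index.
prices = pd.read_csv("house_prices_monthly.csv", index_col="date", parse_dates=True)["index"]

# Augmented Dickey-Fuller test: the null hypothesis is a unit root (non-stationarity).
adf_stat, p_value, *rest = adfuller(prices.dropna())
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
```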

Let me also point out the difference between my approach and the materials provided to us by potential lenders and the HUD counsellor. Their scenarios and forecasts are one-shot spreadsheets that don’t simulate my parents’ expenses, the impact of inflation, or fluctuations in real-estate markets. Indeed, the standard assumption about the latter in their spreadsheets is a constant appreciation in property value of 4% per year.

My simulations are literally equivalent to 10,000 spreadsheets for each scenario, each spreadsheet an appropriate random sample from an uncertain future, and capable of being tweaked to include possibilities such as substantial real-estate downturns. I also incorporated random “shock” expenditures on the order of $5-$75K to see how vulnerable each scenario was to unexpected expenses.
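Here is a drastically simplified sketch of that kind of simulation. Every number in it (credit limit, payments, pension income, expenses, inflation, shock sizes, lifetime model) is a hypothetical placeholder, and the dynamics are far cruder than the fitted models described above; the point is only to show how a probability of running out of funds can be estimated by simulating many futures per scenario.

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_of_ruin(loc_share, n_runs=10_000, max_months=240):
    """Toy reverse-mortgage simulation: estimate the probability of running out of funds.

    loc_share: fraction of the loan taken as a line of credit; the remainder
               funds a fixed monthly tenured payment. All figures are hypothetical.
    """
    ruins = 0
    for _ in range(n_runs):
        credit_left = 300_000 * loc_share           # hypothetical LOC limit
        tenured_payment = 1_500 * (1 - loc_share)   # hypothetical monthly tenured payment
        cash = 20_000                               # hypothetical starting savings
        pension = 3_000                             # hypothetical monthly pension income
        expenses = 4_000                            # hypothetical monthly expenses
        months = min(max_months, int(rng.exponential(96)))  # crude remaining-lifetime model
        for _ in range(months):
            expenses *= 1 + rng.normal(0.0025, 0.002)        # noisy inflation
            shock = rng.uniform(5_000, 75_000) if rng.random() < 0.005 else 0.0
            cash += pension + tenured_payment - expenses - shock
            if cash < 0:                             # cover any shortfall from the LOC
                draw = min(-cash, credit_left)
                credit_left -= draw
                cash += draw
            if cash < 0:                             # nothing left to draw on: ruin
                ruins += 1
                break
    return ruins / n_runs

for loc_share in (0.0, 0.5, 1.0):
    print(f"LOC share = {loc_share:.1f}: estimated P(ruin) = {prob_of_ruin(loc_share):.3f}")
```

Each pass through the inner loop plays the role of one row of one of those spreadsheets, and each scenario gets 10,000 such simulated futures.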

The upshot of all this was that the mix of LOC and TP had a substantial effect on the probability of running out of money, but not a large impact on expected balance or equity (the other factors had large impacts on those). So at least we could home in on a robust mix of LOC and TP, one that would have a lower risk of running out of money than others. This criterion became the primary driver in our choice. We also can monitor how our parents’ situation evolves and revise the mix if necessary.

What about maximizing expected utility? Or optimizing in any sense of the term? No, and no. The deep unknowns inherent even in this relatively well-structured problem make those unattainable goals. What can we do instead? Taleb’s advice is to pay attention to consequences instead of probabilities. This is known as “dominance reasoning.” If option A yields better outcomes than option B no matter what the probabilities of those outcomes are, choose option A. But life often isn’t that simple. We can’t do that here because the comparative outcomes of alternative mixtures of LOC and TP depend on probabilities.

Instead, we have ended up closer to the “bounded rationality” that Herbert Simon wrote about. We can’t claim to have optimized, but we do have robustness and corrigibility on our side, two important criteria for good decision making under ignorance (described in my recent post on that topic). Perhaps most importantly, the simulations gave us insights none of our intuitions could, into how variable the future can be and the consequences of that variability. Sir Edmund was right. We can’t plan the future by the past. But sometimes we can chart a steerable course into that future armed with a few clues from the past to give us an honest check on our intuitions, and a generous measure of scepticism about relying too much on those clues.

Communicating about Uncertainty in Climate Change, Part II

with 5 comments

In my previous post I attempted to provide an overview of the IPCC 2007 report’s approach to communicating about uncertainties regarding climate change and its impacts. This time I want to focus on how the report dealt with probabilistic uncertainty. It is this kind of uncertainty that the report treats most systematically. I mentioned in my previous post that Budescu et al.’s (2009) empirical investigation of how laypeople interpret verbal probability expressions (PEs, e.g., “very likely”) in the IPCC report revealed several problematic aspects, and a paper I have co-authored with Budescu’s team (Smithson, et al., 2011) yielded additional insights.

The approach adopted by the IPCC is one that has been used in other contexts, namely identifying probability intervals with verbal PEs. Their guidelines are as follows:
Virtually certain >99%; extremely likely >95%; very likely >90%; likely >66%; more likely than not > 50%; about as likely as not 33% to 66%; unlikely <33%; very unlikely <10%; extremely unlikely <5%; exceptionally unlikely <1%.

One unusual aspect of these guidelines is their overlapping intervals. For instance, “likely” takes the interval [.66,1] and thus contains the interval [.90,1] for “very likely,” and so on. The only interval that doesn’t overlap with others is “as likely as not.” Other interval-to-PE guidelines I am aware of use non-overlapping intervals. An early example is Sherman Kent’s attempt to standardize the meanings of verbal PEs in the American intelligence community.
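To make the overlap concrete, here is a small sketch that encodes the guideline intervals and lists which expressions are consistent with a given numerical probability (for simplicity, the guideline boundaries are treated as inclusive).

```python
# IPCC (2007) guideline intervals for verbal probability expressions,
# encoded as (lower, upper) bounds on the probability, in percent.
IPCC_PE = {
    "virtually certain": (99, 100),
    "extremely likely": (95, 100),
    "very likely": (90, 100),
    "likely": (66, 100),
    "more likely than not": (50, 100),
    "about as likely as not": (33, 66),
    "unlikely": (0, 33),
    "very unlikely": (0, 10),
    "extremely unlikely": (0, 5),
    "exceptionally unlikely": (0, 1),
}

def consistent_expressions(probability_percent):
    """Return every expression whose guideline interval contains the given probability."""
    return [pe for pe, (lo, hi) in IPCC_PE.items() if lo <= probability_percent <= hi]

print(consistent_expressions(92))  # ['very likely', 'likely', 'more likely than not']
print(consistent_expressions(7))   # ['unlikely', 'very unlikely']
```

A probability of 92% is consistent with three different expressions, which is the flip side of the overlap: the same number can legitimately be described at several levels of strength.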

Attempts to translate verbal PEs into numbers have a long and checkered history. Since the earliest days of probability theory, the legal profession has steadfastly refused to quantify its burdens of proof (“balance of probabilities” or “reasonable doubt”) despite the fact that they seem to explicitly refer to probabilities or at least degrees of belief. Weather forecasters debated the pros and cons of verbal versus numerical PEs for decades, with mixed results. A National Weather Service report on a 1997 survey of Juneau, Alaska residents found that although the rank-ordering of the mean numerical probabilities residents assigned to verbal PE’s reasonably agreed with those assumed by the organization, the residents’ probabilities tended to be less extreme than the organization’s assignments. For instance, “likely” had a mean of 62.5% whereas the organization’s assignments for this PE were 80-100%.

And thus we see a problem arising that has been long noted about individual differences in the interpretation of PEs but largely ignored when it comes to organizations. Since at least the 1960’s empirical studies have demonstrated that people vary widely in the numerical probabilities they associate with a verbal PE such as “likely.” It was this difficulty that doomed Sherman Kent’s attempt at standardization for intelligence analysts. Well, here we have the NWS associating it with 80-100% whereas the IPCC assigns it 66-100%. A failure of organizations and agencies to agree on number-to-PE translations leaves the public with an impossible brief. I’m reminded of the introduction of the now widely-used cyclone (hurricane) category 1-5 scheme (higher numerals meaning more dangerous storms) at a time when zoning for cyclone danger where I was living also had a 1-5 numbering system that went in the opposite direction (higher numerals indicating safer zones).

Another interesting aspect is the frequency of the PEs in the report itself. There are a total of 63 PEs therein. “Likely” occurs 36 times (more than half), and “very likely” 17 times. The remaining 10 occurrences are “very unlikely” (5 times), “virtually certain” (twice), “more likely than not” (twice), and “extremely unlikely” (once). There is a clear bias towards fairly extreme positively-worded PEs, perhaps because much of the IPCC report’s content is oriented towards presenting what is known and largely agreed on about climate change by climate scientists. As we shall see, the bias towards positively-worded PEs (e.g., “likely” rather than “unlikely”) may have served the IPCC well, whether intentionally or not.

In Budescu et al.’s experiment, subjects were assigned to one of four conditions. Subjects in the control group were not given any guidelines for interpreting the PEs, as would be the case for readers unaware of the report’s guidelines. Subjects in a “translation” condition had access to the guidelines given by the IPCC, at any time during the experiment. Finally, subjects in two “verbal-numerical translation” conditions saw a range of numerical values next to each PE in each sentence. One verbal-numerical group was shown the IPCC intervals and the other was shown narrower intervals (with widths of 10% and 5%).

Subjects were asked to provide lower, upper and “best” estimates of the probabilities they associated with each PE. As might be expected, these figures were most likely to be consistent with the IPCC guidelines in the verbal-numerical translation conditions, less likely in the translation condition, and least likely in the control condition. They were also less likely to be IPCC-consistent the more extreme the PE was (e.g., less consistent for “very likely” than for “likely”). Consistency rates were generally low, and for the extremal PEs the deviations from the IPCC guidelines were regressive (i.e., subjects’ estimates were not extreme enough, thereby echoing the 1997 National Weather Service report findings).

One of the ironic claims by the Budescu group is that the IPCC 2007 report’s verbal probability expressions may convey excessive levels of imprecision and that some probabilities may be interpreted as less extreme than intended by the report authors. As I remarked in my earlier post, intervals do not distinguish between consensual imprecision and sharp disagreement. In the IPCC framework, the statement “The probability of event X is between .1 and .9” could mean “All experts regard this probability as being anywhere between .1 and .9” or “Some experts regard the probability as .1 and others as .9.” Budescu et al. realize this, but they also have this to say:

“However, we suspect that the variability in the interpretation of the forecasts exceeds the level of disagreement among the authors in many cases. Consider, for example, the statement that ‘‘average Northern Hemisphere temperatures during the second half of the 20th century were very likely higher than during any other 50-year period in the last 500 years’’ (IPCC, 2007, p. 8). It is hard to believe that the authors had in mind probabilities lower than 70%, yet this is how 25% of our subjects interpreted the term very likely!” (pg. 8).

One thing I’d noticed about the Budescu article was that their graphs suggested the variability in subjects’ estimates for negatively-worded PEs (e.g., “unlikely”) seemed greater than for positively worded PEs (e.g., “likely”). That is, subjects seemed to have less of a consensus about the meaning of the negatively-worded PEs. On reanalyzing their data, I focused on the six sentences that used the PE “very likely” or “very unlikely”. My statistical analyses of subjects’ lower, “best” and upper probability estimates revealed a less regressive mean and less dispersion for positive than for negative wording in all three estimates. Negative wording therefore resulted in more regressive estimates and less consensus regardless of experimental condition. You can see this in the box-plots below.

[Box plots of subjects’ lower, “best,” and upper probability estimates for positive and reverse-scored negative probability expressions, by experimental condition.]

In this graph, the negative PEs’ estimates have been reverse-scored so that we can compare them directly with the positive PEs’ estimates. The “boxes” (the blue rectangles) contain the middle 50% of subjects’ estimates and these boxes are consistently longer for the negative PEs, regardless of experimental condition. The medians (i.e., the score below which 50% of the estimates fall) are the black dots, and these are fairly similar for positive and (reverse-scored) negative PEs. However, due to the negative PE boxes’ greater lengths, the mean estimates for the negative PEs end up being pulled further away from their positive PE counterparts.

There’s another effect that we confirmed statistically but also is clear from the box-plots. The difference between the lower and upper estimates is, on average, greater for the negatively-worded PEs. One implication of this finding is that the impact of negative wording is greatest on the lower estimates—And these are the subjects’ translations of the very thresholds specified in the IPCC guidelines.

If anything, these results suggest the picture is worse even than Budescu et al.’s assessment. They noted that 25% of the subjects interpreted “very likely” as having a “best” probability below 70%. The boxplots show that in three of the four experimental conditions at least 25% of the subjects provided a lower probability of less than 50% for “very likely”. If we turn to “very unlikely” the picture is worse still. In three of the four experimental conditions about 25% of the subjects returned an upper probability for “very unlikely” greater than 80%!

So, it seems that negatively-worded PEs are best avoided where possible. This recommendation sounds simple, but it could open a can of syntactical worms. Consider the statement "It is very unlikely that the MOC (meridional overturning circulation) will undergo a large abrupt transition during the 21st century." Would it be accurate to equate it with "It is very likely that the MOC will not undergo a large abrupt transition during the 21st century"? Perhaps not, despite the IPCC guidelines' insistence otherwise. Moreover, turning the PE positive entails turning the event into a negative. In principle, we could have a mixture of negatively- and positively-worded PEs and events ("It is (un)likely that A will (not) occur"). It is unclear at this point whether negative PEs or negative events are the more confusing, but inspection of the Budescu et al. data suggested that double-negatives were decidedly more confusing than any other combination.

As I write this, David Budescu is spearheading a multi-national study of laypeople’s interpretations of the IPCC probability expressions (I’ll be coordinating the Australian component). We’ll be able to compare these interpretations across languages and cultures. More anon!

References

Budescu, D.V., Broomell, S. and Por, H.-H. (2009) Improving the communication of uncertainty in the reports of the Intergovernmental Panel on Climate Change. Psychological Science, 20, 299–308.

Intergovernmental Panel on Climate Change (2007). Summary for policymakers: Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Retrieved May 2010 from http://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-spm.pdf.

Smithson, M., Budescu, D.V., Broomell, S. and Por, H.-H. (2011) Never Say “Not:” Impact of Negative Wording in Probability Phrases on Imprecise Probability Judgments. Accepted for presentation at the Seventh International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, 25-28 July 2011.

Communicating about Uncertainty in Climate Change, Part I

with 5 comments

The Intergovernmental Panel on Climate Change (IPCC) guidelines for their 2007 report stipulated how its contributors were to convey uncertainties regarding climate change scientific evidence, conclusions, and predictions. Budescu et al.’s (2009) empirical investigation of how laypeople interpret verbal probability expressions (e.g., “very likely”) in the IPCC report revealed several problematic aspects of those interpretations, and a paper I have co-authored with Budescu’s team (Smithson, et al., 2011) raises additional issues.

Recently the IPCC has amended their guidelines, partly in response to the Budescu paper. Granting a broad consensus among climate scientists that climate change is accelerating and that humans have been a causal factor therein, the issue of how best to represent and communicate uncertainties about climate change science nevertheless remains a live concern. I’ll focus on the issues around probability expressions in a subsequent post, but in this one I want to address the issue of communicating “uncertainty” in a broader sense.

Why does it matter? First, the public needs to know that climate change science actually has uncertainties. Otherwise, people could be misled into believing either that scientists have all the answers or that scientists suffer from unwarranted dogmatism. Likewise, policy makers, decision makers and planners need to know the magnitudes (where possible) and directions of these uncertainties. Thus, the IPCC is to be commended for bringing uncertainties to the fore in its 2007 report, and for attempting to establish standards for communicating them.

Second, the public needs to know what kinds of uncertainties are in the mix. This concern sits at the foundation of the first and second recommendations of the Budescu paper. Their first suggestion is to differentiate between the ambiguous or vague description of an event and the likelihood of its occurrence. The example the authors give is "It is very unlikely that the meridional overturning circulation will undergo a large abrupt transition during the 21st century" (emphasis added). The first italicized phrase expresses probabilistic uncertainty whereas the second embodies a vague description. People may have different interpretations of both phrases. They might disagree on what range of probabilities is referred to by "very unlikely" or on what is meant by a "large abrupt" change. Somewhat more worryingly, they might agree on how likely the "large abrupt" change is while failing to realize that they have different interpretations of that change in mind.

The crucial point here is that probability and vagueness are distinct kinds of uncertainty (see, e.g., Smithson, 1989). While the IPCC 2007 report is consistently explicit regarding probabilistic expressions, it only intermittently attends to matters of vagueness. For example, in the statement “It is likely that heat waves have become more frequent over most land areas” (IPCC 2007, pg. 30) the term “heat waves” remains undefined and the time-span is unspecified. In contrast, just below that statement is this one: “It is likely that the incidence of extreme high sea level3 has increased at a broad range of sites worldwide since 1975.” Footnote 3 then goes on to clarify “extreme high sea level” by the following: “Excluding tsunamis, which are not due to climate change. Extreme high sea level depends on average sea level and on regional weather systems. It is defined here as the highest 1% of hourly values of observed sea level at a station for a given reference period.”

The Budescu paper’s second recommendation is to specify the sources of uncertainty, such as whether these arise from disagreement among specialists, absence of data, or imprecise data. Distinguishing between uncertainty arising from disagreement and uncertainty arising from an imprecise but consensual assessment is especially important. In my experience, the former often is presented as if it is the latter. An interval for near-term ocean level increases of 0.2 to 0.8 metres might be the consensus among experts, but it could also represent two opposing camps, one estimating 0.2 metres and the other 0.8.

The IPCC report guidelines for reporting uncertainty do raise the issue of agreement: "Where uncertainty is assessed qualitatively, it is characterised by providing a relative sense of the amount and quality of evidence (that is, information from theory, observations or models indicating whether a belief or proposition is true or valid) and the degree of agreement (that is, the level of concurrence in the literature on a particular finding)." (IPCC 2007, pg. 27) The report then states that levels of agreement will be denoted by "high," "medium," and so on, while the amount of evidence will be expressed as "much," "medium," and so on.

As it turns out, the phrase “high agreement and much evidence” occurs seven times in the report and “high agreement and medium evidence” occurs twice. No other agreement phrases are used. These occurrences are almost entirely in the sections devoted to climate change mitigation and adaptation, as opposed to assessments of previous and future climate change. Typical examples are:
“There is high agreement and much evidence that with current climate change mitigation policies and related sustainable development practices, global GHG emissions will continue to grow over the next few decades.” (IPCC 2007, pg. 44) and
“There is high agreement and much evidence that all stabilisation levels assessed can be achieved by deployment of a portfolio of technologies that are either currently available or expected to be commercialised in coming decades, assuming appropriate and effective incentives are in place for development, acquisition, deployment and diffusion of technologies and addressing related barriers.” (IPCC 2007, pg. 68)

The IPCC guidelines for other kinds of expert assessments do not explicitly refer to disagreement: "Where uncertainty is assessed more quantitatively using expert judgement of the correctness of underlying data, models or analyses, then the following scale of confidence levels is used to express the assessed chance of a finding being correct: very high confidence at least 9 out of 10; high confidence about 8 out of 10; medium confidence about 5 out of 10; low confidence about 2 out of 10; and very low confidence less than 1 out of 10." (IPCC 2007, pg. 27) A typical statement of this kind is "By 2080, an increase of 5 to 8% of arid and semi-arid land in Africa is projected under a range of climate scenarios (high confidence)." (IPCC 2007, pg. 50)

That said, some parts of the IPCC report do convey disagreeing projections or estimates, where the disagreements are among models and/or scenarios, especially in the section on near-term predictions of climate change and its impacts. For instance, on pg. 47 of the 2007 report the graph below charts mid-century global warming relative to 1980-99. The six stabilization categories are those described in the Fourth Assessment Report (AR4).

[Figure: projected mid-century global warming relative to 1980-99 for the six AR4 stabilization categories (IPCC 2007, pg. 47), with the ranges for categories V and VI truncated at the right-hand side]

Although this graph effectively represents both imprecision and disagreement (or conflict), it slightly underplays both by truncating the scale at the right-hand side. The next figure shows how the graph would appear if the full range of categories V and VI were included. Both the apparent imprecision of V and VI and the extent of disagreement between VI and categories I-III are substantially greater once we have the full picture.

[Figure: the same chart redrawn with the full ranges of categories V and VI included]

There are understandable motives for concealing or disguising some kinds of uncertainty, especially those that could be used by opponents to bolster their own positions. Chief among these is uncertainty arising from conflict. In a series of experiments Smithson (1999) demonstrated that people regard precise but disagreeing risk messages as more troubling than informatively equivalent imprecise but agreeing messages. Moreover, they regard the message sources as less credible and less trustworthy in the first case than in the second. In short, conflict is a worse kind of uncertainty than ambiguity or vagueness. Smithson (1999) labeled this phenomenon “conflict aversion.” Cabantous (2007) confirmed and extended those results by demonstrating that insurers would charge a higher premium for insurance against mishaps whose risk information was conflictive than if the risk information was merely ambiguous.

Conflict aversion creates a genuine communications dilemma for disagreeing experts. On the one hand, public revelation of their disagreement can result in a loss of credibility or trust in experts on all sides of the dispute. Laypeople have an intuitive heuristic that if the evidence for any hypothesis is uncertain, then equally able experts should have considered the same evidence and agreed that the truth-status of that hypothesis is uncertain. When Peter Collignon, professor of microbiology at The Australian National University, cast doubt on the net benefit of the Australian Fluvax program in 2010, he attracted opprobrium from colleagues and health authorities on grounds that he was undermining public trust in vaccines and the medical expertise behind them. On the other hand, concealing disagreements runs the risk of future public disclosure and an even greater erosion of trust (lying experts are regarded as worse than disagreeing ones). The problem of how to communicate uncertainties arising from disagreement and vagueness simultaneously and distinguishably has yet to be solved.

References

Budescu, D.V., Broomell, S. and Por, H.-H. (2009) Improving the communication of uncertainty in the reports of the Intergovernmental Panel on Climate Change. Psychological Science, 20, 299–308.

Cabantous, L. (2007). Ambiguity aversion in the field of insurance: Insurers’ attitudes to imprecise and conflicting probability estimates. Theory and Decision, 62, 219–240.

Intergovernmental Panel on Climate Change (2007). Summary for policymakers: Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Retrieved May 2010 from http://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-spm.pdf.

Smithson, M. (1989). Ignorance and Uncertainty: Emerging Paradigms. Cognitive Science Series. New York: Springer Verlag.

Smithson, M. (1999). Conflict Aversion: Preference for Ambiguity vs. Conflict in Sources and Evidence. Organizational Behavior and Human Decision Processes, 79, 179–198.

Smithson, M., Budescu, D.V., Broomell, S. and Por, H.-H. (2011) Never Say “Not:” Impact of Negative Wording in Probability Phrases on Imprecise Probability Judgments. Accepted for presentation at the Seventh International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, 25-28 July 2011.

Making the Wrong Decision for the Right Reasons

with 2 comments

There seems to be a widespread intuition that if we use a well-reasoned, evidence-based approach to making decisions under uncertainty then we’ll make the right decision most of the time. Sure, we’ll make some bad calls but the majority of the time we’ll get it right. Or will we?

Here’s an example from law enforcement. Suppose you’re the commanding officer in a local police jurisdiction, and you have to decide how to allocate resources to a missing person case. A worst-case scenario is that the missing person ends up a homicide. Although police are required to treat all missing persons cases seriously, as most do not involve foul play it would be grossly inefficient to treat all missing persons as potential homicides. So, if the missing person isn’t found within 24 hours, you’ll undertake a risk analysis, considering issues such as whether the circumstances are suspicious or out of character, or there is evidence of the commission of a crime.

What would be your best approach to this risk analysis, and how likely would you be to come to the right decision? A landmark UK study examined 32,705 cases of missing persons in the UK between 2000 and 2002, and determined that 0.6 percent were found dead, although not necessarily victims of homicide (Newiss, 2006). This is a very low percentage, and it turns out to be the source of a major headache for you as the commander responsible for deciding what resources to allocate to your case.

You have years of experience, wisdom handed down from seasoned investigators who came before you, and you've read the relevant literature. You know that where a missing person is found to have been a victim of foul play, risk factors include age and sex, involvement in prostitution, last being seen in a public place, and an absence of a history of suicide attempts or mental health problems.

So, you're going to make a decision whether to allocate more resources to a missing persons case investigation based on some diagnostic criteria which I'll denote by D. The criteria included in D are indicators that the missing person may have died. There are four commonly used criteria for evaluating how good D is:

Sensitivity = P(D present | death)
Specificity = P(D absent | no death)
Positive predictive value = P(death | D present)
Negative predictive value = P(no death | D absent)

The expressions on the right hand side of these equations are conditional probabilities. For instance, P(D present | death) is the probability that D is present given that the person has died. Sensitivity and specificity measure the ability of the model to detect the occurrence or absence (respectively) of deaths. The predictive values, on the other hand, tell us the probability of making a correct diagnosis (death versus no death) based on D.

Now, suppose D has a sensitivity of .99 and specificity of .99 (far better than can be obtained from the otherwise worthwhile predictors identified by Newiss). The next table shows how well D would perform in distinguishing between cases ending in death and cases not involving death.

            D present    D absent     Total
Dead              196           2       198     Sensitivity .99 (error rate .01)
Alive             325       32182     32507     Specificity .99 (error rate .01)
Total             521       32184     32705

Positive predictive value = 196/521 = .3762     Negative predictive value = 32182/32184 = .999938

Because sensitivity is .99, D misses only .01*198 ≈ 2 cases involving deaths, and correctly detects the remaining 196. Likewise, because specificity is .99, D is falsely present in .01*32507 ≈ 325 cases that do not involve death. That is, there are 325 missing persons with D present who will be found to be alive. But 325 is large compared to the number of correctly identified deaths (196). So positive predictive value is poor: P(death|D present) = 196/(196 + 325) = .376. The rate of incorrect positive diagnosis therefore is 1 – .376 = .624. If you, as commander, decided to allocate more resources to cases where D is present you could expect to be wrong about 62% of the time.
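
The arithmetic above is just Bayes' theorem applied to a 2x2 table. Here is a minimal sketch that reproduces the numbers; the function name and its layout are mine, while the counts and accuracy figures come from the table above.

```python
def predictive_values(sensitivity, specificity, n_dead, n_alive):
    """Return (positive predictive value, negative predictive value)
    for a diagnostic indicator D applied to n_dead + n_alive cases."""
    true_pos  = sensitivity * n_dead          # dead, D present
    false_neg = (1 - sensitivity) * n_dead    # dead, D absent
    true_neg  = specificity * n_alive         # alive, D absent
    false_pos = (1 - specificity) * n_alive   # alive, D present
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

ppv, npv = predictive_values(0.99, 0.99, n_dead=198, n_alive=32507)
print(round(ppv, 4), round(npv, 6))   # about .3762 and .999938, matching the table (up to rounding)
```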

Can these uncertainties be reduced? An obvious and frequently recommended remedy is further investigation into factors that may predict the likelihood of a missing person ending up dead and, conditional on death, being a homicide victim. These investigations could be combined with survival analysis of the kind employed by Newiss, to determine whether there is a relationship between the length of time a person has gone missing and the likelihood that the person ends up dead.

But how effective can we expect these remedies to be? Note that improving sensitivity would have only a negligible effect on positive predictive value. To get to the point where positive predictive value was an even-money bet (.5) would require specificity to be .994. To move positive predictive value to .9 would require specificity to be .9993. Thus the test would have to be incredibly accurate to avoid devoting considerable resources to investigations where they are not warranted.
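
Rearranging the positive-predictive-value formula gives the specificity needed to hit any target value. A small sketch of that rearrangement follows; the function is my own shorthand, and the base rate is taken from the 198-out-of-32,705 figure above.

```python
def specificity_needed(target_ppv, sensitivity, prevalence):
    """Specificity required to reach a target positive predictive value, from
    PPV = sens*prev / (sens*prev + (1 - spec)*(1 - prev)) solved for spec."""
    false_pos_rate = (sensitivity * prevalence * (1 - target_ppv)
                      / (target_ppv * (1 - prevalence)))
    return 1 - false_pos_rate

prev = 198 / 32705   # base rate of deaths among missing persons
print(round(specificity_needed(0.5, 0.99, prev), 4))   # about .994
print(round(specificity_needed(0.9, 0.99, prev), 4))   # about .9993
```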

These are unachievable standards. Police will inevitably face a considerable error-rate in making resource allocation decisions regarding missing persons cases. Of course, this does not imply that improving predictions of homicide in missing persons cases is futile, but simply tells us not to expect such improvements to raise the probability of a correct decision to a desirable level.

Mind you, it isn't all gloom and doom. If we consider the false negative problem (e.g., a Britt Lapthorne outcome) it may be possible to obtain a reasonably high negative predictive value without unrealistically accurate predictors. In our unrealistic scenario (with sensitivity and specificity both at .99), the negative predictive value is .99994. If sensitivity and specificity both were .5 (i.e., coin-toss levels) then negative predictive value would still be about 16,253/16,352 = .994. You, as commander, are very unlikely to end up with a Britt Lapthorne case in which you stand accused of having failed to treat the investigation with due diligence. Instead, you are very likely to be chastised by higher-ups and perhaps the media for "wasting" money and resources on cases where the missing person turned up alive and well.

There is an analogous problem in preventative medical testing, where the disorder to be detected occurs at a low rate in the population. For example, pregnant women may wish to test for the possibility that their unborn baby has Downs Syndrome. According to an Australian government health assessment document released in 2002, when used as a single modality, the standard screening by measurement of Nuchal Translucency in the first trimester has a detection rate for Downs of approximately 73%-82% at a false positive rate of 5%-8%. Additional ultrasound cues can further increase detection rates for Downs to more than 95%.

The next table shows the most optimistic scenario according to those figures, i.e., sensitivity and specificity of 95%. At the time, about 12.8 per 10,000 births yielded a baby with Downs, so I’ve included that rate in the table. Downs Syndrome, thankfully, is rare. The result, as you can see, is a positive predictive value of just 2.38%. Given a test result that says the baby has Downs, the probability that it really does have Downs is about 2.4 chances in 100. If these procedures were widely used, there would be many needlessly upset pregnant women—about 97.6% of those whose combined tests came back positive.

            Positive    Negative     Total
Downs            122           6       128     Sensitivity .95 (error rate .05)
Normal          4994       94878     99872     Specificity .95 (error rate .05)
Total           5116       94884    100000

Positive predictive value = 122/5116 = .0238     Negative predictive value = 94878/94884 = .9999

In July last year there was a furore over a study published in the Journal of the American Medical Association. The study found that of 2176 participants free of HIV infection who received a vaccine product, 908 tested positive even though they had been exposed to the vaccine, not (of course) the virus. That’s a false positive rate of about 41.7%. Now, suppose a successful vaccine is developed but it also has this reactivity problem. In any Western country where the rate of HIV infection is low, the combination of a large proportion of the population being vaccinated and tested could be a major disaster. This is not to say that an HIV vaccine would be a bad idea; the point is that it could play havoc with HIV detection.

The chief difference between the medical preventative testing quandary and the police commander's problem is that the negative consequences of the wrong diagnosis fall on the patient instead of the decision maker. Yet this issue is seldom aired in public debates regarding medical testing. Perhaps understandably, the bulk of medical research effort in this domain goes into devising more accurate tests. But hang on: in the Downs test scenario, even with a sensitivity rate of 100% the specificity would have to be 99.87% to raise the positive predictive value to a mere 50%. For a positive predictive value of 90%? Specificity would have to be about 99.99%, a crazily impossible target. Realistically, the tests will never be accurate enough to avoid the problem posed by low positive predictive values for rare disorders.

What can a decision maker do? A final point to all this is that in settings where you’re doomed to a high decisional error-rate despite using the best available methods, it may be better to direct your energies toward handling the flak instead of persisting in a futile quest for unattainably accurate predictors or diagnostic cues. The chief difficulty may be educating your clientele, constituency, or bosses that it really is possible to be making the best possible decisions and still getting them wrong most of the time.

Written by michaelsmithson

May 8, 2011 at 2:55 pm

I Can’t Believe What I Teach

with 2 comments

For the past 34 years I’ve been compelled to teach a framework that I’ve long known is flawed. A better framework exists and has been available for some time. Moreover, I haven’t been forced to do this by any tyrannical regime or under threats of great harm to me if I teach this alternative instead. And it gets worse: I’m not the only one. Thousands of other university instructors have been doing the same all over the world.

I teach statistical methods in a psychology department. I’ve taught courses ranging from introductory undergraduate through graduate levels, and I’m in charge of that part of my department’s curriculum. So, what’s the problem—Why haven’t I abandoned the flawed framework for its superior alternative?

Without getting into technicalities, let’s call the flawed framework the “Neyman-Pearson” approach and the alternative the “Bayes” approach. My statistical background was formed as I completed an undergraduate degree in mathematics during 1968-72. My first courses in probability and statistics were Neyman-Pearson and I picked up the rudiments of Bayes toward the end of my degree. At the time I thought these were simply two valid alternative ways of understanding probability.

Several years later I was a newly-minted university lecturer teaching introductory statistics to fearful and sometimes reluctant students in the social sciences. The statistical methods used in the social science research were Neyman-Pearson, so of course I taught Neyman-Pearson. Students, after all, need to learn to read the literature of their discipline.

Gradually, and through some of my research into uncertainty, I became aware of the severe problems besetting the Neyman-Pearson framework. I found that there was a lengthy history of devastating criticisms raised against Neyman-Pearson even within the social sciences, criticisms that had been ignored by practising researchers and gatekeepers to research publication.

However, while the Bayesian approach may have been conceptually superior, in the late ‘70’s through early ‘80’s it suffered from mathematical and computational impracticalities. It provided few usable methods for dealing with complex problems. Disciplines such as psychology were held in thrall to Neyman-Pearson by a combination of convention and the practical requirements of complex research designs. If I wanted to provide students or, for that matter, colleagues who came to me for advice, with effective statistical tools for serious research then usually Neyman-Pearson techniques were all I could offer.

But what to do about teaching? No university instructor takes a formal oath to teach the truth, the whole truth, and nothing but the truth; but for those of us who’ve been called to teach it feels as though we do. I was sailing perilously close to committing Moore’s Paradox in the classroom (“I assert Neyman-Pearson but I don’t believe it”).

I tried slipping in bits and pieces alerting students to problematic aspects of Neyman-Pearson and the existence of the Bayesian alternative. These efforts may have assuaged my conscience but they did not have much impact, with one important exception. The more intellectually proactive students did seem to catch on to the idea that theories of probability and statistics are just that—Theories, not god-given commandments.

Then Bayes got a shot in the arm. In the mid-80’s some powerful computational techniques were adapted and developed that enabled this framework to fight at the same weight as Neyman-Pearson and even better it. These techniques sail under the banner of Markov chain Monte Carlo methods, and by the mid-90’s software was available (free!) to implement them. The stage was set for the Bayesian revolution. I began to dream of writing a Bayesian introductory statistics textbook for psychology students that would set the discipline free and launch the next generation of researchers.

It didn’t happen that way. Psychology was still deeply mired in Neyman-Pearson and, in fact, in a particularly restrictive version of it. I’ll spare you the details other than saying that it focused, for instance, on whether the researcher could reject the claim that an experimental effect was nonexistent. I couldn’t interest my colleagues in learning Bayesian techniques, let alone undergraduate students.

By the late ‘90’s a critical mass of authoritative researchers convinced the American Psychological Association to form a task-force to reform statistical practice, but this reform really amounted to shifting from the restrictive Neyman-Pearson orientation to a more liberal one that embraced estimating how big an experimental effect is and setting a “confidence interval” around it.

It wasn't the Bayesian revolution, but I leapt onto this initiative because both reforms were a long stride closer to the Bayesian framework and would still enable students to read the older Neyman-Pearson dominated research literature. So, I didn't write a Bayesian textbook after all. My 2000 introductory textbook was, so far as I'm aware, one of the first to teach introductory statistics to psychology students from a confidence interval viewpoint. It was generally well received by fellow reformers, and I got a contract to write a kind of researcher's confidence interval handbook that came out in 2003. The confidence interval reform in psychology was under way, and I'd booked a seat on the juggernaut.

Market-wise, my textbook flopped. I’m not singing the blues about this, nor do I claim sour grapes. For whatever reasons, my book just didn’t take the market by storm. Shortly after it came out, a colleague mentioned to me that he’d been at a UK conference with a symposium on statistics teaching where one of the speakers proclaimed my book the “best in the world” for explaining confidence intervals and statistical power. But when my colleague asked if the speaker was using it in the classroom he replied that he was writing his own. And so better-selling introductory textbooks continued to appear. A few of them referred to the statistical reforms supposedly happening in psychology but the majority did not. Most of them are the nth edition of a well-established book that has long been selling well to its set of long-serving instructors and their students.

My 2003 handbook fared rather better. I had put some software resources for computing confidence intervals on a webpage and these got a lot of use. These, and my handbook, got picked up by researchers and their graduate students. Several years on, the stuff my scripts did started to appear in mainstream commercial statistics packages. It seemed that this reform was occurring mainly at the advanced undergraduate, graduate and researcher levels. Introductory undergraduate statistical education in psychology remained (and still remains) largely untouched by it.

Meanwhile, what of the Bayesian movement? In this decade, graduate-level social science oriented Bayesian textbooks began to appear. I recently reviewed several of them and have just sent off an invited review of another. In my earlier review I concluded that the market still lacked an accessible graduate-level treatment oriented towards psychology, a gap that may have been filled by the book I’ve just finished reviewing.

Have I tried teaching Bayesian methods? Yes, but thus far only in graduate-level workshops, and on my own time (i.e., not as part of the official curriculum). I’ll be doing so again in the second half of this year, hoping to recruit some of my colleagues as well as graduate students. Next year I’ll probably introduce a module on Bayes for our 4th-year (Honours) students.

It’s early days, however, and we remain far from being able to revamp the entire curriculum. Bayesian techniques still rarely appear in the mainstream research literature in psychology, and so students still need to learn Neyman-Pearson to read that literature with a knowledgably critical eye. A sea-change may be happening, but it’s going to take years (possibly even decades).

Will I try writing a Bayesian textbook? I already know from experience that writing a textbook is a lot of time and hard work, often for little reward. Moreover, in many universities (including mine) writing a textbook counts for nothing. It doesn’t bring research money, it usually doesn’t enhance the university’s (or the author’s) scholarly reputation, it isn’t one of the university’s “performance indicators,” and it seldom brings much income to the author. The typical university attitude towards textbooks is as if the stork brings them. Writing a textbook, therefore, has to be motivated mainly by a passion for teaching. So I’m thinking about it…

Exploiting Randomness

with 3 comments

Books such as Nassim Nicholas Taleb's Fooled by Randomness and the psychological literature on our mental foibles such as the gambler's fallacy warn us to beware of randomness. Well and good, but randomness actually is one of the most domesticated kinds of uncertainty. In fact, it is one form of uncertainty we can and do exploit.

One obvious way randomness can be exploited is in designing scientific experiments. To experimentally compare, say, two different fertilizers for use in growing broad beans, an ideal would be to somehow ensure that the bean seedlings exposed to one fertilizer were identical in all ways to the bean seedlings exposed to the other fertilizer. That isn’t possible in any practical sense. Instead, we can randomly assign each seedling to receive one or the other fertilizer. We won’t end up with two identical groups of seedlings, but the differences between those groups will have occurred by chance. If their subsequent growth-rates differ by more than we would reasonably expect by chance alone, then we can infer that one fertilizer is likely to have been more effective than the other.
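
As an illustration of "differ by more than we would reasonably expect by chance alone", here is a minimal permutation-test sketch. The growth rates, group sizes and number of re-randomizations are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical growth rates for seedlings randomly assigned to two fertilizers.
fert_a = np.array([2.1, 2.4, 1.9, 2.6, 2.3, 2.5])
fert_b = np.array([1.8, 2.0, 1.7, 2.2, 1.9, 2.1])
observed = fert_a.mean() - fert_b.mean()

# Re-randomize the group labels many times to see how large a difference
# "luck of the draw" alone tends to produce.
pooled = np.concatenate([fert_a, fert_b])
n_perms = 10_000
count = 0
for _ in range(n_perms):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:6].mean() - shuffled[6:].mean()
    if abs(diff) >= abs(observed):
        count += 1
print(f"observed difference = {observed:.2f}, permutation p ~ {count / n_perms:.3f}")
```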

Another commonplace exploitation of randomness is random sampling, which is used in all sorts of applications from quality-control engineering to marketing surveys. By randomly sampling a specific percentage of manufactured components coming off the production line, a quality-control analyst can decide whether a batch should be scrapped or not. By randomly sampling from a population of consumers, a marketing researcher can estimate the percentage of that population who prefer a particular brand of a consumer item, and also calculate how likely that estimate is to be within 1% of the true percentage at the time.
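
The "how likely is the estimate to be within 1% of the true percentage" calculation is a standard normal-approximation exercise. Here is a sketch under assumed values: the true preference rate and sample size are invented for illustration.

```python
from math import sqrt, erf

def prob_within(margin, p, n):
    """Normal-approximation probability that a sample proportion from a
    random sample of size n lands within `margin` of the true proportion p."""
    se = sqrt(p * (1 - p) / n)
    z = margin / se
    return erf(z / sqrt(2))   # P(|Z| <= z) for a standard normal Z

# e.g., a true brand preference of 30% and a random sample of 10,000 consumers
print(round(prob_within(0.01, 0.30, 10_000), 3))   # about 0.971
```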

There is a less well-known use for randomness, one that in some respects is quite counter-intuitive. We can exploit randomness to improve our chances of making the right decision. The story begins with Tom Cover’s 1987 chapter which presents what Dov Samet and his co-authors recognized in their 2002 paper as a solution to a switching decision that has been at the root of various puzzles and paradoxes.

Probably the most famous of these is the “two envelope” problem. You’re a contestant in a game show, and the host offers you a choice between two envelopes, each containing a cheque of a specific value. The host explains that one of the cheques is for a greater amount than the other, and offers you the opportunity to toss a fair coin to select one envelope to open. After that, she says, you may choose either to retain the envelope you’ve selected or exchange it for the other. You toss the coin, open the selected envelope, and see the value of the cheque therein. Of course, you don’t know the value of the other cheque, so regardless of which way you choose, you have a probability of ½ of ending up with the larger cheque. There’s an appealing but fallacious argument that says you should switch, but we’re not going to go into that here.

Cover presents a remarkable decisional algorithm whereby you can make that probability exceed ½.

  1. Having chosen your envelope via the coin-toss, use a random number generator to provide you with a number anywhere between zero and some value you know to be greater than the largest cheque’s value.
  2. If this number is larger than the value of the cheque you’ve seen, exchange envelopes.
  3. If not, keep the envelope you’ve been given.
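
Before walking through the proof, here is a minimal simulation sketch of the algorithm. The cheque amounts (20 and 70) and the upper bound of 100 are arbitrary choices for illustration.

```python
import random

def cover_trial(x_small, x_large, upper=100.0):
    """One play of the two-envelope game using Cover's switching rule.
    Returns True if you end up with the larger cheque."""
    y = random.choice([x_small, x_large])   # the fair coin picks your envelope
    z = random.uniform(0, upper)            # the random threshold
    keep = (z <= y)
    final = y if keep else (x_large if y == x_small else x_small)
    return final == x_large

trials = 100_000
wins = sum(cover_trial(20, 70) for _ in range(trials))
print(wins / trials)   # close to 0.5 + (70 - 20)/200 = 0.75, well above 1/2
```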

Here’s a “reasonable person’s proof” that this works (for more rigorous and general proofs, see Robert Snapp’s 2005 treatment or Samet et al., 2002). I’ll take the role of the game-show contestant and you can be the host. Suppose $X1 and $X2 are the amounts in the two envelopes. You have provided the envelopes and so you know that X1, say, is larger than X2. You’ve also told me that these amounts are less than $100 (the specific range doesn’t matter). You toss a fair coin, and if it lands Heads you give me the envelope containing X1 whereas if it lands Tails you give me the one containing X2. I open my envelope and see the amount there. Let’s call my amount Y. All I know at this point is that the probability that Y = X1 is ½ and so is the probability that Y = X2.

I now use a random number generator to produce a number between 0 and 100. Let’s call this number Z. Cover’s algorithm says I should switch envelopes if Z is larger than Y and I should retain my envelope if Z is less than or equal to Y. The claim is that my chance of ending up with the envelope containing X1 is greater than ½.

As the picture below illustrates, the probability that my randomly generated Z has landed at X2 or below is X2/100, and the probability that Z has landed at X1 or below is X1/100. Likewise, the probability that Z has exceeded X2 is 1 – X2/100, and the probability that Z has exceeded X1 is 1 – X1/100.

[Figure: a number line from 0 to 100 marking X2 and X1, showing the probabilities that Z falls at or below each value (X2/100 and X1/100) and above each value (1 – X2/100 and 1 – X1/100)]

The proof now needs four steps to complete it:

  1. If Y = X1 then I'll make the right decision if I decide to keep my envelope, i.e., if Z is less than or equal to X1, and my probability of doing so is X1/100.
  2. If Y = X2 then I'll make the right decision if I decide to exchange my envelope, i.e., if Z is greater than X2, and my probability of doing so is 1 – X2/100.
  3. The probability that Y = X1 is ½ and the probability that Y = X2 also is ½. So my total probability of ending up with the envelope containing X1 is
    ½ of X1/100, which is X1/200, plus ½ of 1 – X2/100, which is ½ – X2/200.
    That works out to ½ + X1/200 – X2/200.
  4. But X1 is larger than X2, so X1/200 – X2/200 must be larger than 0.
    Therefore, ½ + X1/200 – X2/200 is larger than ½.

Fine, you might say, but could this party trick ever help us in a real-world decision? Yes, it could. Suppose you’re the director of a medical clinic with a tight budget in a desperate race against time to mount a campaign against a disease outbreak in your region. You have two treatments available to you but the research literature doesn’t tell you which one is better than the other. You have time and resources to test only one of those treatments before deciding which one to adopt for your campaign.

Toss a fair coin, letting it decide which treatment you test. The resulting cure-rate from the chosen treatment will be some number, Y, between 0% and 100%. The structure of your decisional situation now is identical to the two-envelope setup described above. Use a random number generator to generate a number, Z, between 0 and 100. If Z is less than or equal to Y use your chosen treatment for your campaign. If Z is greater than Y use the other treatment instead. Your chance of having chosen the treatment that would have yielded the higher cure-rate under your test conditions will be larger than ½ and you'll be able to defend your decision if you're held accountable to any constituency or stakeholders.

In fact, there are ways whereby you may be able to do even better than this in a real-world situation. One is by shortening the range, if you know that the cure-rate is not going to exceed some limit, say L, below 100%. The reason this helps is that the advantage term X1/(2L) – X2/(2L) is greater than X1/200 – X2/200. The largest this advantage can get, as L shrinks toward X1 itself, is half of 1 – X2/X1. Another way, as Snapp (2005) points out, is by knowing the probability distribution generating X1 and X2. Knowing that distribution boosts your probability of being correct to ¾.

However, before we rush off to use Cover’s algorithm for all kinds of decisions, let’s consider its limitations. Returning to the disease outbreak scenario, suppose you have good reasons to suspect that one treatment (Ta, say) is better than the other (Tb). You could just go with Ta and defend your decision by pointing out that, according to your evidence the probability that Ta actually is better than Tb is greater than ½. Let’s denote this probability by P.

A reasonable question is whether you could do better than P by using Cover’s algorithm.  Here’s my claim:

  • If you test Ta or Tb and use the Cover algorithm to decide whether to use it for your campaign or switch to the other treatment, your probability of having chosen the treatment that would have given you the best test-result cure rate will converge to the Cover algorithm’s probability of a correct choice. This may or may not be greater than P (remember, P is greater than ½).

This time, let X1 denote the higher cure rate and X2 denote the lower cure-rate you would have got, depending on whether the treatment you tested was the better or the worse.

  1. If the cure rate for Ta is X1 then you'll make the right decision if you decide to use Ta, i.e., if Z is less than or equal to X1, and your probability of doing so is X1/100.
  2. If the cure rate for Ta is X2 then you'll make the right decision if you decide to use Tb, i.e., if Z is greater than X2, and your probability of doing so is 1 – X2/100.
  3. We began by supposing the probability that the cure rate for Ta is X1 is P, which is greater than ½. The probability that the cure rate for Ta is X2 is 1 – P, which is less than ½.   So your total probability of ending up with the treatment whose cure rate is X1 is
    P*X1/100 + (1 – P)*(1 – X2/100).
    The question we want to address is when this probability is greater than P, i.e.,
    P*X1/100 + (1 – P)*(1 – X2/100) > P.
    It turns out that a rearrangement of this inequality gives us a clue.
  4. First, we subtract P*X1/100 from both sides to get
    (1 – P)*( 1 – X2/100) > P – P*X1/100.
  5. Now, we divide both sides of this inequality by 1 – P to get
    (1 – X2/100) > P*(1 – X1/100)/(1 – P),
    and then divide both sides by (1 – X1/100) to get
    (1 – X2/100)/(1 – X1/100) > P/(1 – P).

We can now see that the Cover algorithm improves on P only when the values of X2 and X1 make the ratio (1 – X2/100)/(1 – X1/100) larger than the odds P/(1 – P). If P = .6, say, then P/(1 – P) = .6/.4 = 1.5. Thus, for example, if X2 = 40% and X1 = 70% then (1 – X2/100)/(1 – X1/100) = .6/.3 = 2.0 and the Cover algorithm will improve your chances of making the right choice. However, if X2 = 40% and X1 = 60% then the algorithm offers no improvement on P, and if we increase X2 above 40% the algorithm will return a lower probability than P. So, if you already have strong evidence that one alternative is better than the other then don't bother using the Cover algorithm.
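
The condition in step 5 is easy to check directly. Here is a small sketch that reproduces the worked examples above; the function name is mine, and P, X1 and X2 are the quantities already defined in the derivation.

```python
def cover_beats_prior(p, x1, x2, upper=100.0):
    """True when the Cover algorithm's probability of picking the better
    treatment, p*x1/upper + (1 - p)*(1 - x2/upper), exceeds the prior p."""
    cover_prob = p * x1 / upper + (1 - p) * (1 - x2 / upper)
    return cover_prob > p

print(cover_beats_prior(0.6, x1=70, x2=40))   # True: ratio 2.0 exceeds odds 1.5
print(cover_beats_prior(0.6, x1=60, x2=40))   # False: ratio 1.5 offers no improvement
```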

Nevertheless, by exploiting randomness we've ended up with a decisional guide that applies to real-world situations. If you are undecided about which of two alternatives is superior but can test only one of them, test one and use Cover's algorithm to decide which to adopt. You'll end up with a higher probability of making the right decision than you would by tossing a coin.

 

Written by michaelsmithson

March 21, 2011 at 9:52 am
