## Posts Tagged ‘**Bayesian probability**’

## Statistical Significance On Trial

There is a long-running love-hate relationship between the legal and statistical professions, and two vivid examples of this have surfaced in recent news stories, one situated in a court of appeal in London and the other in the U.S. Supreme Court. Briefly, the London judge ruled that Bayes’ theorem must not be used in evidence unless the underlying statistics are “firm;” while the U.S. Supreme Court unanimously ruled that a drug company’s non-disclosure of adverse side-effects cannot be justified by an appeal to the statistical non-significance of those effects. Each case, in its own way, shows why it is high time to find a way to establish an effective rapprochement between these two professions.

The Supreme Court decision has been applauded by statisticians, whereas the London decision has appalled statisticians of similar stripe. Both decisions require some unpacking to understand why statisticians would cheer one and boo the other, and why these are important decisions not only for both the statistical and legal professions but for other domains and disciplines whose practices hinge on legal and statistical codes and frameworks.

This post focuses on the Supreme Court decision. The culprit was a homoeopathic zinc-based medicine, Zicam, manufactured by Matrixx Initivatives, Inc. and advertised as a remedy for the common cold. Matrixx ignored reports from users and doctors since 1999 that Zicam caused some users to experience burning sensations or even to lose the sense of smell. When this story was aired by a doctor on Good Morning America in 2004, Matrixx stock price plummeted.

The company’s defense was that these side-effects were “not statistically significant.” In the ensuing fallout, Matrixx was faced with more than 200 lawsuits by Zicam users, but the case in point here is Siracusano vs Matrixx, in which Mr. Siracusano was suing on behalf of investors on grounds that they had been misled. After a few iterations through the American court system, the question that the Supreme Court ruled on was whether a claim of securities fraud is valid against a company that neglected to warn consumers about effects that had been found to be statistically non-significant. As insider-knowledgeable Stephen Ziliak’s insightful essay points out, the decision will affect drug supply regulation, securities regulation, liability and the nature of adverse side-effects disclosed by drug companies. Ziliak was one of the “friends of the court” providing expert advice on the case.

A key point in this dispute is whether statistical nonsignificance can be used to infer that a potential side-effect is, for practical purposes, no more likely to occur when using the medicine than when not. Among statisticians it is a commonplace that such inferences are illogical (and illegitimate). There are several reasons for this, but I’ll review just two here.

These reasons have to do with common misinterpretations of the measure of statistical significance. Suppose Matrixx had conducted a properly randomized double-blind experiment comparing Zicam-using subjects with those using an indistinguishable placebo, and observed the difference in side-effect rates between the two groups of subjects. One has to bear in mind that random assignment of subjects to one group or the other doesn’t guarantee equivalence between the groups. So, it’s possible that even if there really is no difference between Zicam and the placebo regarding the side-effect, a difference between the groups might occur by “luck of the draw.”

The indicator of statistical significance in this context would be the probability of observing a difference at least as large as the one found in the study if the hypothesis of no difference were true. If this probability is found to be very low (typically .05 or less) then the experimenters will reject the no-difference hypothesis on the grounds that the data they’ve observed would be very unlikely to occur if that hypothesis were true. They will then declare that there is a statistically significant difference between the Zicam and placebo groups. If this probability is not sufficiently low (i.e., greater than .05) the experimenters will decide not to reject the no-difference hypothesis and announce that the difference they found was statistically non-significant.

So the first reason for concern is that Matrixx acted as if statistical nonsignificance entitles one to believe in the hypothesis of no-difference. However, failing to reject the hypothesis of no difference doesn’t entitle one to believe in it. It’s still possible that a difference might exist and the experiment failed to find it because it didn’t have enough subjects or because the experimenters were “unlucky.” Matrixx has plenty of company in committing this error; I know plenty of seasoned researchers who do the same, and I’ve already canvassed the well-known bias in fields such as psychology not to publish experiments that failed to find significant effects.

The second problem arises from a common intuition that the probability of observing a difference at least as large as the one found in the study if the hypothesis of no difference were true tells us something about the inverse—the probability that the no-difference hypothesis is true if we find a difference at least as large as the one observed in our study, or, worse still, the probability that the no-difference hypothesis is true. However, the first probability on its own tells us nothing about the other two.

For a quick intuitive, if fanciful, example let’s imagine randomly sampling one person from the world’s population and our hypothesis is that s/he will be Australian. On randomly selecting our person, all that we know about her initially is that she speaks English.

There are about 750 million first-or second-language English speakers world-wide, and about 23 million Australians. Of the 23 million Australians, about 21 million of them fit the first- or second-language English description. Given that our person speaks English, how likely is it that we’ve found an Australian? The probability that we’ve found an Australian given that we’ve picked an English-speaker is 21/750 = .03. So there goes our hypothesis. However, had we picked an Australian (i.e., given that our hypothesis were true), the probability that s/he speaks English is 21/23 = .91.

See also Ziliak and McCloskey’s 2008 book, which mounts a swinging demolition of the unquestioned application of statistical significance in a variety of domains.

Aside from the judgment about statistical nonsignificance, the most important stipulation of the Supreme Court’s decision is that “something more” is required before a drug company can justifiably decide to not disclose a drug’s potential side-effects. What should this “something more” be? This sounds as if it would need judgments about the “importance” of the side-effects, which could open multiple cans of worms (e.g., Which criteria for importance? According to what or whose standards?). Alternatively, why not simply require drug companies to report all occurrences of adverse side-effects and include the best current estimates of their rates among the population of users?

A slightly larger-picture view of the Matrixx defense resonates with something that I’ve observed in even the best and brightest of my students and colleagues (oh, and me too). And that is the hope that somehow probability or statistical theories will get us off the hook when it comes to making judgments and decisions in the face of uncertainty. It can’t and won’t, especially when it comes to matters of medical, clinical, personal, political, economic, moral, aesthetic, and all the other important kinds of importance.

## Are There Different Kinds of Uncertainties? Probability, Ambiguity and Conflict

The domain where I work is a crossroads connecting cognitive and social psychology, behavioral economics, management science, some aspects of applied mathematics and analytical philosophy. Some of the traffic there carries some interesting debates about the nature of uncertainty. Is it all one thing? For instance, can every version of uncertainty be reduced to some form of probability? Or are there different kinds? Was Keynes right to distinguish between “strength” and “weight” of evidence?

Why is this question important? Making decisions under uncertainty in a partially learnable world is one of the most important adaptive challenges facing humans and other species. If there are distinct (and consequential) kinds, then we may need distinct ways of coping with them. Each kind also may have its own special uses. Some kinds may be specific to an historical epoch or to a culture.

On the other hand, if there really is only one kind of uncertainty then a rational agent should have only one (consistent) method of dealing with it. Moreover, such an agent should not be distracted or influenced by any other “uncertainty” that is inconsequential.

A particular one-kind view dominated accounts of rationality in economics, psychology and related fields in the decades following World War II. The rational agent was a Bayesian probabilist, a decision-maker whose criterion for best option was the option that yielded the greatest expected return. This rational agent maximizes expected outcomes. Would you rather be given $1,000,000 right now for sure, or offered a fair coin-toss in which Heads nets you $2,400,000 and Tails leaves you nothing? The rational agent would go for the coin-toss because the expected return is 1/2 times $2,400,000, i.e., $1,200,000, which exceeds $1,000,000. A host of experimental evidence says that most of us would choose the $1M instead.

One of the Bayesian rationalist’s earliest serious challenges came from a 1961 paper by Daniel Ellsberg (Yes, *that* Daniel Ellsberg). Ellsberg set up some ingenious experiments demonstrating that people are influenced by another kind of uncertainty besides probability: Ambiguity. Ambiguous probabilities are imprecise (e.g., a probability we can only say is somewhere between 1/3 and 2/3). Although a few precursors such as Frank Knight (1921) had explored some implications for decision makers when probabilities aren’t precisely known, Ellsberg brought this matter to the attention of behavioral economists and cognitive psychologists. His goal was to determine whether the distinction between ambiguous and precise probabilities has any “behavioral significance.”

Here’s an example of what he meant by that phrase. Suppose you have to choose between drawing one ball from one of two urns, A or B. Each urn contains 90 balls, with the offer of a prize of $10 if the first ball drawn is either Red or Yellow. In Urn A, 30 of the balls are Red and the remaining 60 are either Black or Yellow but the number of Yellow and Black balls has been randomly selected by a computer program, and we don’t know what it is. Urn B contains 30 Red, 30 Yellow, and 30 Black balls. If you prefer Urn B, you’re manifesting an aversion to not knowing precisely how many Yellow balls there are in Urn A– i.e., *ambiguity aversion*. In experiments of this kind, most people choose Urn B.

But now consider a choice between drawing a ball from one of two bags, each containing a thousand balls with a number on each ball, and a prize of $1000 if your ball has the number 500 on it. In Bag A, the balls were numbered consecutively from 1 to 1000, so we know there is exactly one ball with the number 500 on it. Bag B, however, contains a thousand balls with numbers randomly chosen from 1 to 1000. If you prefer Bag B, then you may have reasoned that there is a chance of more than one ball in there with the number 500. If so, then you are demonstrating *ambiguity seeking*. Indeed, most people choose Bag B.

Now, Bayesian rational agents would not have a preference in either of the scenarios just described. They would find the expected return in Urns A and B to be identical, and likewise with Bags A and B. So, our responses to these scenarios indicate that we are reacting to the ambiguity of probabilities as if ambiguity is a consequential kind of uncertainty distinct from probability.

Ellsberg’s most persuasive experiment (the one many writers describe in connection with “Ellsberg’s Paradox“) is as follows. Suppose we have an urn with 90 balls, of which 30 are Red and 60 are either Black or Yellow (but the proportions of each are unknown). If asked to choose between gambles A and B as shown in the upper part of the table below (i.e., betting on Red versus betting on Black), most people prefer A.

30 | 60 | ||

Urn | Red | Black | Yellow |

A | $100 | $0 | $0 |

B | $0 | $100 | $0 |

30 | 60 | ||

Urn | Red | Black | Yellow |

C | $100 | $0 | $100 |

D | $0 | $100 | $100 |

However, when asked to choose between gambles C and D, most people prefer D. People preferring A to B and D to C are violating a principle of rationality (often called the “Sure-Thing Principle”) because the (C,D) pair simply adds $100 for drawing a Yellow ball to the (A,B) pair. If rational agents prefer A to B, they should also prefer C to D. Ellsberg demonstrated that ambiguity influences decisions in a way that is incompatible with standard versions of rationality.

But is it irrational to be influenced by ambiguity? It isn’t hard to find arguments for why a “reasonable” (if not rigorously rational) person would regard ambiguity as important. Greater ambiguity could imply greater variability in outcomes. Variability can have a downside. I tell students in my introductory statistics course to think about someone who can’t swim considering whether to wade across a river. All they know is that the average depth of the river is 2 feet. If the river’s depth doesn’t vary much then all is well, but great variability could be fatal. Likewise, financial advisors routinely ask clients to consider how comfortable they are with “volatile” versus “conservative” investments. As many an investor has found out the hard way, high average returns in the long run are no guarantee against ruin in the short term.

So, a prudent agent could regard ambiguity as a tell-tale marker of high variability and decide accordingly. It all comes down to whether the agent restricts their measure of good decisions to expected return (i.e., the mean) or admits a second criterion (variability) into the mix. Several models of risk behavior in humans and other animals incorporate variability. A well-known animal foraging model, the energy budget rule, predicts that an animal will prefer foraging options with smaller variance when the expected energy intake exceeds its requirements. However, it will prefer greater variance when the expected energy intake is below survival level because greater variability conveys a greater chance of obtaining the intake needed for survival.

There are some fascinating counter-arguments against all this, and counter-counter-arguments too. I’ll revisit these in a future post.

I was intrigued by Ellsberg’s paper and the literature following on from it, and I’d also read William Empson’s classic 1930 book on seven types of ambiguity. It occurred to me that some forms of ambiguity are analogous to internal conflict. We speak of being “in two minds” when we’re undecided about something, and we often simulate arguments back and forth in our own heads. So perhaps ambiguity isn’t just a marker for variability. It could indicate conflict as well. I decided to run some experiments to see whether people respond to conflicting evidence as they do to ambiguity, but also pitting conflict against ambiguity.

Imagine that you’re serving on a jury, and an important aspect of the prosecution’s case turns on the color of the vehicle seen leaving the scene of a nighttime crime. Here are two possible scenarios involving eyewitness testimony on this matter:

A) One eyewitness says the color was blue, and another says it was green.

B) Both eyewitnesses say the color was either blue or green.

Which testimony would you rather receive? In which do the eyewitnesses sound more credible? If you’re like most people, you’d choose (B) for both questions, despite the fact that from a purely informational standpoint there is no difference between them.

Evidence from this and other choice experiments in my 1999 paper suggests that in general, we are “conflict averse” in two senses:

- Conflicting messages from two equally believable sources are dispreferred in general to two informatively equivalent, ambiguous, but agreeing messages from the same sources; and
- Sharply conflicting sources are perceived as less credible than ambiguous agreeing sources.

The first effect goes further than simply deciding what we’d rather hear. It turns out that we will choose options for ourselves and others on the basis of *conflict aversion*. I found this in experiments asking people to make decisions about eyewitness testimony, risk messages, dieting programs, and hiring and firing employees. Nearly a decade later, Laure Cabantous (then at the Université Toulouse) confirmed my results in a 2007 paper demonstrating that insurers would charge a higher premium for insurance against mishaps whose risk information was conflictive than if the risk information was merely ambiguous.

Many of us have an intuition that if there’s equal evidence for and against a proposition, then both kinds of evidence should be accessed by all knowledgeable, impartial experts in that domain. If laboratory A produces 10 studies supporting an hypothesis (H, say) and laboratory B produces 10 studies disconfirming H, something seems fishy. The collection of 20 studies will seem more trustworthy if laboratories A and B each produce 5 studies supporting H and 5 disconfirming it, even though there are still 10 studies for and 10 studies against H. We expect experts to agree with one another, even if their overall message is vague. Being confronted with sharply disagreeing experts sets off alarms for most of us because it suggests that the experts may not be impartial.

The finding that people attribute greater credibility to agreeing-ambiguous sources than conflicting-precise sources poses strategic difficulties for the sources themselves. Communicators embroiled in a controversy face a decision equivalent to a Prisoners’ Dilemma, when considering communications strategies while knowing that other communicators hold views contrary to their own. If they “soften up” by conceding a point or two to the opposition they might win credibility points from their audience, but their opponents could play them for suckers by not conceding anything in return. On the other hand, if everyone concedes nothing then they all could lose credibility.

None of this is taken into account by the one-kind view of uncertainty. A probability “somewhere between 0 and 1” is treated identically to a probability of “exactly 1/2,” and sharply disagreeing experts are regarded as the same as ambiguous agreeing experts. But you and I know differently.

###### Related Articles

- Ambiguity is Another Reason to Mitigate Climate Change (wallstreetpit.com)