ignorance and uncertainty

All about unknowns and uncertainties

A Few (More) Myths about “Big Data”

Following on from Kate Crawford’s recent and excellent elaboration of six myths about “big data”, I should like to add four more that highlight important issues about such data that can misguide us if we ignore them or are ignorant of them.

Myth 7: Big data are precise.

As with analyses of almost any other kind of data, big data analyses largely consist of estimates. Often these estimates are based on sample data rather than population data, and the samples may not be representative of their referent populations (as Crawford points out, but see also Myth 8). Moreover, big data are even less likely than “ordinary” data to be free of recording errors or deliberate falsification.

Even when the samples are good and the sample data are accurately recorded, estimates still are merely estimates, and the most common mistake decision makers and other stakeholders make about estimates is treating them as if they are precise or exact. In a 1990 paper I referred to this as the fallacy of false precision. Estimates always are imprecise, and ignoring how imprecise they are is equivalent to ignoring how wrong they could be. Major polling companies gradually learned to report confidence intervals or error-rates along with their estimates and to take these seriously, but most government departments apparently have yet to grasp this obvious truth.
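To make the point concrete, here is a minimal sketch in Python of the margin of error that a reported poll estimate carries with it; the poll figures (52% support, 1,000 respondents) are made up purely for illustration:

```python
import math

def moe95(p_hat, n):
    """Approximate 95% margin of error for an estimated proportion."""
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

# A poll reporting "52% support" from 1,000 respondents is really saying
# "somewhere between roughly 49% and 55%".
p_hat, n = 0.52, 1000
half_width = moe95(p_hat, n)
print(f"Estimate: {p_hat:.0%}, 95% CI: "
      f"[{p_hat - half_width:.1%}, {p_hat + half_width:.1%}]")
```

Treating the headline 52% as exact amounts to throwing away the three-percentage-point band of imprecision that comes with it.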

Why might estimate error be a greater problem for big data than for “ordinary” data? There are at least two reasons. First, it is likely to be more difficult to verify the integrity or veracity of big data simply because they are integrated from numerous sources. Second, if big datasets are constructed from multiple sources, each consisting of an estimate with its own imprecision, then these imprecisions may propagate. To give a brief illustration, if estimate X has variance x², estimate Y has variance y², X and Y are independent of one another, and our “big” dataset consists of adding X + Y to get Z, then the variance of Z will be x² + y².
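A quick simulation makes the propagation concrete; the means and variances below are arbitrary made-up values, chosen only to show that the imprecision in Z is the sum of the component imprecisions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent estimates, each with its own imprecision
# (means and variances are arbitrary, for illustration only).
var_x, var_y = 4.0, 9.0
X = rng.normal(loc=100.0, scale=np.sqrt(var_x), size=1_000_000)
Y = rng.normal(loc=250.0, scale=np.sqrt(var_y), size=1_000_000)

# Combining the two sources by adding them propagates both imprecisions.
Z = X + Y
print(f"var(X) + var(Y) = {var_x + var_y:.2f}")
print(f"empirical var(Z) = {Z.var():.2f}")  # close to 13, not to 4 or 9
```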

Myth 8: Big data are accurate.

There are two senses in which big data may be inaccurate, over and above random variability (i.e., sampling error): biases and measurement confounds. Economic indicators such as unemployment rates, inflation, and GDP are biased in most countries, largely because of “shadow” (off-the-books) economic activity. There is little evidence that economic policy makers pay any attention to such distortions when using these indicators to inform policy.

Measurement confounds are a somewhat more subtle issue, but the main idea is that data may not measure what we think they measure, because they are influenced by extraneous factors. Economic indicators are, again, good examples, but there are plenty of others (don’t get me started on the idiotic bibliometrics and other KPIs that are imposed on us academics in the name of “performance” assessment). Web analytics experts are just beginning to face up to this problem. For instance, webpage dwell times are influenced not just by how interested the visitor is in the content of a page, but may also reflect such things as how difficult the content is to understand, the visitor’s attention span, or the fact that they left their browsing device to do something else and returned much later. As in Myth 7, bias and measurement confounds may be compounded in big data to a greater extent than in small data, simply because big data often combine multiple measures.

Myth 9: Big data are stable.

Data often are not recorded just once, but re-recorded as better information becomes available or as errors are discovered. In a recent Wall Street Journal article, economist Samuel Rines presented several illustrations of how unstable economic indicator estimates are in the U.S. For example, he observed that in November 2012 the first official estimate of the net employment increase was 146,000 new jobs. By the third revision that number had increased by about 69% to 247,000. In another instance, he pointed out that annual U.S. GDP estimates typically are revised several times, and often substantially, as the year slides into the past.

Again, there is little evidence that people crafting policy or making decisions based on these numbers take their inherent instability into account. One may protest that decisions often must be made before “final” revisions can be completed. However, where past revisions have been recorded, the degree of instability in these indicators should not be difficult to estimate, and it could be taken into account, at the very least, in worst- and best-case scenario generation.
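As a rough sketch of what that might look like (the revision history below is entirely invented for illustration), one could treat the historical ratios of final to first estimates as an error distribution and use their spread to bracket the latest preliminary figure:

```python
# Hypothetical (first estimate, final revised estimate) pairs for some
# monthly indicator -- invented numbers, purely for illustration.
history = [(150, 230), (160, 150), (200, 215), (120, 135), (180, 165)]
ratios = [final / first for first, final in history]

latest_first_estimate = 170
lo, hi = min(ratios), max(ratios)
print(f"Historical revision ratios: {lo:.2f} to {hi:.2f}")
print(f"Scenario range for the latest preliminary figure: "
      f"{latest_first_estimate * lo:.0f} to {latest_first_estimate * hi:.0f}")
```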

Myth 10: We have adequate computing power and techniques to analyse big data.

Analysing big data is a computationally intensive undertaking, and at least some worthwhile analytical goals are beyond our reach, in terms of computing power and even, in some cases, techniques. I’ll give just one example. Suppose we want to model the total dwell time per session of a typical user who is browsing the web. The number of items on which the user dwells is a random variable, and so is the amount of dwell time for each item. The total dwell time, then, is what is called a “randomly stopped sum”. The expression for the probability distribution of a randomly stopped sum doesn’t have a closed form (it’s an infinite sum), so it can’t be modelled via conventional statistical estimation techniques (least-squares or maximum likelihood). Instead, there are two viable approaches: simulation and Bayesian hierarchical MCMC. I’m writing a paper on this topic, and from my own experience I can declare that either technique would require a super-computer for datasets of the kind dealt with, e.g., by NRS PADD.
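The simulation route is easy to sketch. Here is a minimal Python illustration, with the component distributions (a Poisson number of items per session and lognormal dwell times per item) chosen purely for illustration rather than taken from any real web-analytics data:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_session_dwell(n_sessions, mean_items=8, mu=3.0, sigma=1.0):
    """Simulate total dwell time per session as a randomly stopped sum:
    a random number of items, each with its own random dwell time."""
    totals = np.empty(n_sessions)
    for i in range(n_sessions):
        n_items = rng.poisson(mean_items)                # random stopping count
        dwells = rng.lognormal(mu, sigma, size=n_items)  # dwell time per item (seconds)
        totals[i] = dwells.sum()
    return totals

totals = simulate_session_dwell(100_000)
print(f"mean total dwell: {totals.mean():.0f} s, "
      f"95th percentile: {np.percentile(totals, 95):.0f} s")
```

The empirical distribution of the simulated totals then stands in for the closed-form expression that isn’t available; scaling this up to millions of sessions and richer models is where the computational burden bites.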

Written by michaelsmithson

July 26, 2013 at 6:57 am

A Book Ate My Blog

Sad, but true—I’ve been off blogging since late 2011 because I was writing a book (with Ed Merkle) under a contract with Chapman and Hall. Writing the book took up the time I could devote to writing blog posts. Anyhow, I’m back in the blogosphere, and for those unusual people out there who have interests in multivariate statistical modelling, you can find out about the book here or here.

Written by michaelsmithson

July 26, 2013 at 6:30 am

I Can’t Believe What I Teach

For the past 34 years I’ve been compelled to teach a framework that I’ve long known is flawed. A better framework exists and has been available for some time. Moreover, I haven’t been forced to do this by any tyrannical regime or under threats of great harm to me if I teach this alternative instead. And it gets worse: I’m not the only one. Thousands of other university instructors have been doing the same all over the world.

I teach statistical methods in a psychology department. I’ve taught courses ranging from introductory undergraduate through graduate levels, and I’m in charge of that part of my department’s curriculum. So, what’s the problem—why haven’t I abandoned the flawed framework for its superior alternative?

Without getting into technicalities, let’s call the flawed framework the “Neyman-Pearson” approach and the alternative the “Bayes” approach. My statistical background was formed as I completed an undergraduate degree in mathematics during 1968-72. My first courses in probability and statistics were Neyman-Pearson and I picked up the rudiments of Bayes toward the end of my degree. At the time I thought these were simply two valid alternative ways of understanding probability.

Several years later I was a newly minted university lecturer teaching introductory statistics to fearful and sometimes reluctant students in the social sciences. The statistical methods used in social science research were Neyman-Pearson, so of course I taught Neyman-Pearson. Students, after all, need to learn to read the literature of their discipline.

Gradually, and through some of my research into uncertainty, I became aware of the severe problems besetting the Neyman-Pearson framework. I found that there was a lengthy history of devastating criticisms raised against Neyman-Pearson even within the social sciences, criticisms that had been ignored by practising researchers and gatekeepers to research publication.

However, while the Bayesian approach may have been conceptually superior, in the late ’70s through early ’80s it suffered from mathematical and computational impracticalities. It provided few usable methods for dealing with complex problems. Disciplines such as psychology were held in thrall to Neyman-Pearson by a combination of convention and the practical requirements of complex research designs. If I wanted to provide students or, for that matter, colleagues who came to me for advice, with effective statistical tools for serious research, then usually Neyman-Pearson techniques were all I could offer.

But what to do about teaching? No university instructor takes a formal oath to teach the truth, the whole truth, and nothing but the truth; but for those of us who’ve been called to teach it feels as though we do. I was sailing perilously close to committing Moore’s Paradox in the classroom (“I assert Neyman-Pearson but I don’t believe it”).

I tried slipping in bits and pieces alerting students to problematic aspects of Neyman-Pearson and the existence of the Bayesian alternative. These efforts may have assuaged my conscience but they did not have much impact, with one important exception. The more intellectually proactive students did seem to catch on to the idea that theories of probability and statistics are just that—theories, not god-given commandments.

Then Bayes got a shot in the arm. In the mid-’80s some powerful computational techniques were adapted and developed that enabled this framework to fight at the same weight as Neyman-Pearson and even better it. These techniques sail under the banner of Markov chain Monte Carlo methods, and by the mid-’90s software was available (free!) to implement them. The stage was set for the Bayesian revolution. I began to dream of writing a Bayesian introductory statistics textbook for psychology students that would set the discipline free and launch the next generation of researchers.

It didn’t happen that way. Psychology was still deeply mired in Neyman-Pearson and, in fact, in a particularly restrictive version of it. I’ll spare you the details other than saying that it focused, for instance, on whether the researcher could reject the claim that an experimental effect was nonexistent. I couldn’t interest my colleagues in learning Bayesian techniques, let alone undergraduate students.

By the late ’90s a critical mass of authoritative researchers convinced the American Psychological Association to form a task force to reform statistical practice, but this reform really amounted to shifting from the restrictive Neyman-Pearson orientation to a more liberal one that embraced estimating how big an experimental effect is and setting a “confidence interval” around it.

It wasn’t the Bayesian revolution, but I leapt onto this initiative because both reforms were a long stride closer to the Bayesian framework and would still enable students to read the older Neyman-Pearson dominated research literature. So, I didn’t write a Bayesian textbook after all. My 2000 introductory textbook was, so far as I’m aware, one of the first to teach introductory statistics to psychology students from a confidence interval viewpoint. It was generally received well by fellow reformers, and I got a contract to write a kind of researcher’s confidence interval handbook that came out in 2003. The confidence interval reform in psychology was under weigh, and I’d booked a seat on the juggernaut.

Market-wise, my textbook flopped. I’m not singing the blues about this, nor do I claim sour grapes. For whatever reasons, my book just didn’t take the market by storm. Shortly after it came out, a colleague mentioned to me that he’d been at a UK conference with a symposium on statistics teaching where one of the speakers proclaimed my book the “best in the world” for explaining confidence intervals and statistical power. But when my colleague asked if the speaker was using it in the classroom he replied that he was writing his own. And so better-selling introductory textbooks continued to appear. A few of them referred to the statistical reforms supposedly happening in psychology but the majority did not. Most of them are the nth edition of a well-established book that has long been selling well to its set of long-serving instructors and their students.

My 2003 handbook fared rather better. I had put some software resources for computing confidence intervals on a webpage and these got a lot of use. These, and my handbook, got picked up by researchers and their graduate students. Several years on, the stuff my scripts did started to appear in mainstream commercial statistics packages. It seemed that this reform was occurring mainly at the advanced undergraduate, graduate and researcher levels. Introductory undergraduate statistical education in psychology remained (and still remains) largely untouched by it.

Meanwhile, what of the Bayesian movement? In this decade, graduate-level social science oriented Bayesian textbooks began to appear. I recently reviewed several of them and have just sent off an invited review of another. In my earlier review I concluded that the market still lacked an accessible graduate-level treatment oriented towards psychology, a gap that may have been filled by the book I’ve just finished reviewing.

Have I tried teaching Bayesian methods? Yes, but thus far only in graduate-level workshops, and on my own time (i.e., not as part of the official curriculum). I’ll be doing so again in the second half of this year, hoping to recruit some of my colleagues as well as graduate students. Next year I’ll probably introduce a module on Bayes for our 4th-year (Honours) students.

It’s early days, however, and we remain far from being able to revamp the entire curriculum. Bayesian techniques still rarely appear in the mainstream research literature in psychology, and so students still need to learn Neyman-Pearson to read that literature with a knowledgably critical eye. A sea-change may be happening, but it’s going to take years (possibly even decades).

Will I try writing a Bayesian textbook? I already know from experience that writing a textbook is a lot of time and hard work, often for little reward. Moreover, in many universities (including mine) writing a textbook counts for nothing. It doesn’t bring research money, it usually doesn’t enhance the university’s (or the author’s) scholarly reputation, it isn’t one of the university’s “performance indicators,” and it seldom brings much income to the author. The typical university attitude towards textbooks is as if the stork brings them. Writing a textbook, therefore, has to be motivated mainly by a passion for teaching. So I’m thinking about it…
