A Few (More) Myths about “Big Data”
Following on from Kate Crawford’s recent and excellent elaboration of six myths about “big data”, I should like to add four more that highlight important issues about such data that can misguide us if we ignore them or are ignorant of them.
Myth 7: Big data are precise.
As with analyses of almost any other kind of data, big data analyses largely consist of estimates. Often these estimates are based on sample data rather than population data, and the samples may not be representative of their referent populations (as Crawford points out, but also see Myth 8). Moreover, big data are even less likely than “ordinary” data to be free of recording errors or deliberate falsification.
Even when the samples are good and the sample data are accurately recorded, estimates still are merely estimates, and the most common mistake decision makers and other stakeholders make about estimates is treating them as if they were precise or exact. In a 1990 paper I referred to this as the fallacy of false precision. Estimates always are imprecise, and ignoring how imprecise they are is equivalent to ignoring how wrong they could be. Major polling companies gradually learned to report confidence intervals or error rates along with their estimates and to take these seriously, but most government departments apparently have yet to grasp this obvious truth.
Why might estimate error be a greater problem for big data than for “ordinary” data? There are at least two reasons. First, it is likely to be more difficult to verify the integrity or veracity of big data simply because they are integrated from numerous sources. Second, if big datasets are constructed from multiple sources, each consisting of an estimate with its own imprecision, then these imprecisions may propagate. To give a brief illustration, if estimate X has variance x², estimate Y has variance y², X and Y are independent of one another, and our “big” dataset consists of adding X + Y to get Z, then the variance of Z will be x² + y², which is larger than the variance of either component.
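The variance illustration above can be checked numerically. What follows is only a minimal sketch, assuming both estimates have normally distributed errors; the standard deviations 3.0 and 4.0, the function name and the sample size are my own illustrative choices, not figures from any real dataset.

```python
import random
import statistics


def simulate_sum_variance(sd_x, sd_y, n=200_000, seed=0):
    """Draw independent errors for X and Y and return the sample
    variances of X, Y and Z = X + Y."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sd_x) for _ in range(n)]
    ys = [rng.gauss(0.0, sd_y) for _ in range(n)]
    zs = [x + y for x, y in zip(xs, ys)]
    return (
        statistics.pvariance(xs),
        statistics.pvariance(ys),
        statistics.pvariance(zs),
    )


# Assumed (illustrative) standard deviations: x = 3.0 and y = 4.0,
# so theory predicts var(Z) = 3.0**2 + 4.0**2 = 25.0.
var_x, var_y, var_z = simulate_sum_variance(sd_x=3.0, sd_y=4.0)
print(f"var(X) ~ {var_x:.2f}, var(Y) ~ {var_y:.2f}, var(Z) ~ {var_z:.2f}")
```

The simulated variance of Z should land close to the sum of the other two. If X and Y were correlated, a covariance term would enter as well, so the additive rule holds only under the independence assumption stated above.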
Myth 8: Big data are accurate.
There are two senses in which big data may be inaccurate, in addition to random variability (i.e., sampling error): biases and measurement confounds. Economic indicators of such things as unemployment rates, inflation, or GDP in most countries are biased. The bias stems from the “shadow” (off the books) economic activity in most countries, which the official figures do not capture. There is little evidence that economic policy makers in most countries pay any attention to such distortions when using economic indicators to inform policies.
Measurement confounds are a somewhat more subtle issue, but the main idea is that data may not measure what we think they are measuring because they are influenced by extraneous factors. Economic indicators are, again, good examples but there are plenty of others (don’t get me started on the idiotic bibliometrics and other KPIs that are imposed on us academics in the name of “performance” assessment). Web analytics experts are just beginning to face up to this problem. For instance, webpage dwell times are not just influenced by how interested the visitor is in the content of a webpage, but may also reflect such things as how difficult the contents are to understand, the visitor’s attention span, or the fact that they left their browsing device to do something else and then returned much later. As in Myth 7, bias and measurement confounds may be compounded in big data to a greater extent than they are in small data, simply because big data often combine multiple measures.
Myth 9: Big data are stable.
Data often are not recorded just once, but re-recorded as better information becomes available or as errors are discovered. In a recent Wall Street Journal article, economist Samuel Rines presented several illustrations of how unstable economic indicator estimates are in the U.S. For example, he observed that in November 2012 the first official estimate of net employment increase was 146,000 new jobs. By the third revision that number had increased by about 69% to 247,000. In another instance, he pointed out that American GDP annual estimates each year typically are revised several times, and often substantially, as the year slides into the past.
Again, there is little evidence that people crafting policy or making decisions based on these numbers take their inherent instability into account. One may protest that often decisions must be made before “final” revisions can be completed. However, where such revisions in the past have been recorded, the degree of instability in these indicators should not be difficult to estimate. These could be taken into account, at the very least, in worst- and best-case scenario generation.
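To make the suggestion about recorded revisions concrete, here is a rough sketch of how a revision history could feed worst- and best-case scenarios. Only the first pair of numbers (146,000 revised to 247,000) comes from the example above; the other pairs, the new first estimate of 200,000 and both function names are hypothetical placeholders invented for illustration. Taking the minimum and maximum past ratio is just one simple choice among many.

```python
def revision_ratios(history):
    """Return final/first ratios for (first_estimate, final_estimate) pairs."""
    return [final / first for first, final in history]


def scenario_range(first_estimate, ratios):
    """Scale a new first estimate by the smallest and largest past ratio."""
    return first_estimate * min(ratios), first_estimate * max(ratios)


history = [
    (146_000, 247_000),  # November 2012 jobs figure quoted in the text
    (150_000, 135_000),  # HYPOTHETICAL placeholder, not a real figure
    (100_000, 110_000),  # HYPOTHETICAL placeholder, not a real figure
]

ratios = revision_ratios(history)
# HYPOTHETICAL new first estimate of 200,000 jobs.
worst_case, best_case = scenario_range(200_000, ratios)
print(f"first estimate 200,000; plausible range {worst_case:,.0f} to {best_case:,.0f}")
```

With a longer real revision history one could replace the minimum and maximum with quantiles of the ratios, which would be less sensitive to a single extreme revision.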
Myth 10: We have adequate computing power and techniques to analyse big data.
Analysing big data is a computationally intense undertaking, and at least some worthwhile analytical goals are beyond our reach, in terms of computing power and even, in some cases, techniques. I’ll give just one example. Suppose we want to model the total dwell time per session of a typical user who is browsing the web. The number of items on which the user dwells is a random variable, and so is the amount of dwell time for each item. The total dwell time, then, is what is called a “randomly stopped sum”. The expression for the probability distribution of a randomly stopped sum doesn’t have a closed form (it’s an infinite sum), so it can’t be modelled via conventional statistical estimation techniques (least-squares or maximum likelihood). Instead, there are two viable approaches: simulation and Bayesian hierarchical MCMC. I’m writing a paper on this topic, and from my own experience I can declare that either technique would require a supercomputer for datasets of the kind dealt with, e.g., by NRS PADD.
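For readers unfamiliar with randomly stopped sums, the simulation approach can be sketched at toy scale. This is not the method from the paper mentioned above, merely an illustration under assumptions of my own: the number of items per session is taken to be Poisson with mean 5, each dwell time is exponential with mean 30 seconds, and the count is independent of the dwell times. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def sample_poisson(rng, mean_items):
    """Draw a Poisson count with Knuth's multiplication method
    (adequate for small means such as the one assumed here)."""
    threshold = math.exp(-mean_items)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_total_dwell(rng, mean_items, mean_dwell_seconds):
    """One session: a random number of items, each with a random dwell time."""
    n_items = sample_poisson(rng, mean_items)
    return sum(rng.expovariate(1.0 / mean_dwell_seconds) for _ in range(n_items))


rng = random.Random(42)
# ASSUMED parameters: 5 items per session on average, 30 seconds per item.
totals = [simulate_total_dwell(rng, 5.0, 30.0) for _ in range(100_000)]

mean_total = statistics.fmean(totals)
p95 = statistics.quantiles(totals, n=20)[-1]  # 95th percentile cut point
print(f"mean total dwell ~ {mean_total:.1f} s, 95th percentile ~ {p95:.1f} s")
```

Under these assumptions the simulated mean should sit near 5 × 30 = 150 seconds, which is a handy sanity check. Simulating sessions with known parameters is cheap; the expensive part is the reverse problem of estimating those parameters from a very large dataset, which is where the computing demands described above arise.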