In the era of Big Data, the increased volume of data analysis inevitably entails an upsurge in bad analysis. The author argues that consumers should be extra vigilant when interpreting and relying on data analysis and encourages readers to improve their critical assessment of data analysis to optimize decision making.
In Numbersense, Kaiser Fung, a professional statistician and adjunct statistics professor at New York University, explores a variety of analytical issues, ranging from deciding when it makes economic sense for a merchant to accept a Groupon deal to evaluating performance in fantasy sports leagues. His jumping-off point is “Big Data,” a buzzword in the high-tech industry since around 2010 that refers to the enormous amounts of data now available for analysis. Fung believes that we should care about Big Data not because of the proliferation of data but because of the increased volume of data analysis, which inevitably entails an upsurge in bad analysis. Because we are all consumers of data analysis, he urges us to become smarter, more discerning consumers in this data-rich world.
Fung coins the term “numbersense,” which is the one quality he desires most in a data analyst. Numbersense adds a critical third dimension to the two other essential traits—technical ability and business thinking. It is an intangible ability that Fung defines as the noise that sounds in your head when you observe bad data or bad analysis. It also encompasses the desire and persistence to get closer to the truth.
According to Fung, numbersense is difficult to teach in a traditional classroom setting. It cannot be automated, and textbook examples do not transfer well to the real world. The surest ways to enhance numbersense are through direct practice or by learning from others.
The book includes several chapters inspired by news items from the past five years in which people backed their claims with data. Fung demonstrates how he tested these assertions by, among other things, asking incisive questions, using quantitative reasoning, and analyzing relevant data. Topics explored in the book include the link between mortality and obesity, the accuracy of government inflation and unemployment data, and how universities can game the school ranking process.
Numbersense advises readers to examine the counterfactual when analyzing data. The counterfactual represents “what could have been” and is a fundamental construct in statistics.
Fung provides an example in which the commercial value of unlicensed software (often referred to as “piracy losses”) was overstated. To evaluate the impact of piracy properly, one must imagine the counterfactual—that is, what the world would look like if software could not be pirated. Many users of pirated software, especially those living in poor countries, simply would do without the software if piracy were somehow eradicated. Thus, not every dollar’s worth of unlicensed software translates into a direct loss to the industry.
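The piracy argument can be put as a short calculation. The sketch below is illustrative only: the function name, the dollar figure, and the share of users who would buy a license are hypothetical assumptions, not numbers from the book.

```python
def counterfactual_piracy_loss(unlicensed_value, would_buy_share):
    """Estimate the revenue actually lost to piracy.

    unlicensed_value: commercial value of all unlicensed copies (hypothetical).
    would_buy_share: assumed fraction of pirating users who would pay for a
        license if piracy were impossible (hypothetical, between 0 and 1).
    """
    if not 0.0 <= would_buy_share <= 1.0:
        raise ValueError("would_buy_share must be between 0 and 1")
    # Only the users who would have bought a license represent a real loss;
    # the rest would simply do without the software.
    return unlicensed_value * would_buy_share


# Hypothetical inputs: $1,000,000 of unlicensed software, 25% would have paid.
headline_figure = 1_000_000
realistic_loss = counterfactual_piracy_loss(headline_figure, 0.25)
print(f"Headline 'loss': {headline_figure}, counterfactual loss: {realistic_loss}")
```

The headline figure equals the true loss only under the assumption that every user of pirated software would otherwise have bought a license, that is, a share of 1.0.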
The industry most commonly associated with Big Data is online marketing. E-commerce websites generate massive volumes of data daily. Big Data’s existence is the basis for claims that online marketing and advertising are more measurable and accountable than traditional marketing and advertising. According to the author, experts in this fledgling field frequently fail the counterfactual test.
For example, a Louisville, Kentucky, restaurant sold 800 coupons through Groupon, an online marketer that e-mails consumers offers of sizable discounts on a variety of goods and services. The restaurant wanted to increase its profits by picking up new customers. Fung focuses on the counterfactual, in which the restaurant sold no coupons. Key to his analysis is the “newbies-to-free-riders” ratio. Some Groupon buyers are newbies, who have never visited the restaurant, whereas others are free riders, who dine there regularly. Free riders plan to dine at the restaurant and are willing to pay full fare but are able to reduce the cost by using a coupon intended for new patrons. By exploring the counterfactual argument, Fung concludes that this restaurant would have been significantly more profitable if it had not sold the coupons in the first place, even though at first blush, the coupon strategy appears to have been successful.
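A minimal sketch of the newbies-to-free-riders reasoning follows. Only the figure of 800 coupons comes from the example above; the function name, the share of newbies, and the per-visit margins are hypothetical assumptions chosen for illustration, and this is not presented as Fung's own calculation.

```python
def coupon_profit_change(coupons_sold, newbie_share, full_price_margin, coupon_margin):
    """Profit with coupons minus profit in the counterfactual of no coupons.

    newbie_share: assumed fraction of coupon buyers who are new customers.
    full_price_margin: assumed profit on a visit paid at full price.
    coupon_margin: assumed profit on a visit paid with a coupon.
    """
    newbies = coupons_sold * newbie_share
    free_riders = coupons_sold - newbies
    # Newbies would not have come without the coupon, so their entire
    # coupon margin is incremental profit.
    gain_from_newbies = newbies * coupon_margin
    # Free riders would have paid full price anyway, so each coupon they use
    # costs the restaurant the difference between the two margins.
    loss_from_free_riders = free_riders * (full_price_margin - coupon_margin)
    return gain_from_newbies - loss_from_free_riders


# 800 coupons (from the example); all other inputs are hypothetical.
change = coupon_profit_change(800, newbie_share=0.25,
                              full_price_margin=20.0, coupon_margin=2.0)
print(f"Profit change versus selling no coupons: {change}")
```

Under these assumed inputs the promotion lowers profit even though 800 coupons were sold, because the margin given up on regular diners outweighs the margin gained from new ones. With a high enough share of newbies, the sign flips.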
Fung cautions against confusing correlation with causation and provides examples in his chapter on obesity and mortality. He cites a study that revealed an above-average risk of stroke for obese men. Observational data alone cannot explain this result, and several theories that blame abdominal fat deposits are unproven. Weight is a marker of potential ill health, but it is not a direct cause of diabetes or heart disease. In this case, the bridge from cause to effect is built on theory. More generally, it is important to recognize which part of an analysis rests on data and which part is strictly theory. It is not uncommon for evidence of correlation to feed “causation creep” in the public mind, by which correlation erroneously bleeds into causation.
Ultimately, Fung argues that consumers need to be extra vigilant when interpreting and relying on data analysis because of, among other things, data mining, overconfidence, and confirmation bias on the part of researchers and companies. It is also prudent to assess the incentive framework of the researcher or institution providing the data analysis to detect any potential bias. Fung appropriately encourages readers to improve their critical assessment of data analysis to optimize their decision making in the era of Big Data.
—M.K.B.