A new study shows that using big data to predict the future isn't as easy as it looks—and that raises questions about how Internet companies gather and use information
Big data: as buzzwords go, it’s inescapable. Gigantic corporations like SAS and IBM tout their big data analytics, while experts promise that big data—our exponentially growing ability to collect and analyze information about anything at all—will transform everything from business to sports to cooking. Big data was—no surprise—one of the major themes coming out of this month’s SXSW Interactive conference. It’s inescapable.
One of the most conspicuous examples of big data in action is Google’s data-aggregating tool Google Flu Trends (GFT). The program is designed to provide real-time monitoring of flu cases around the world based on Google searches that match terms for flu-related activity. Here’s how Google explains it:
We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for “flu” is actually sick, but a pattern emerges when all the flu-related search queries are added together. We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
Seems like a perfect use of the 500 million plus Google searches made each day. There’s a reason GFT became the symbol of big data in action, in books like Kenneth Cukier and Viktor Mayer-Schonberger’s Big Data: A Revolution That Will Transform How We Live, Work and Think. But there’s just one problem: as a new article in Science shows, when you compare its results to the real world, GFT doesn’t really work.
GFT overestimated the prevalence of flu in the 2012-2013 and 2011-2012 seasons by more than 50%. From August 2011 to September 2013, GFT over-predicted the prevalence of the flu in 100 out of 108 weeks. During the peak flu season last winter, GFT would have had us believe that 11% of the U.S. had influenza, nearly double the CDC numbers of 6%. If you wanted to project current flu prevalence, you would have done much better basing your models on 3-week-old data on cases from the CDC than you would have using GFT’s sophisticated big data methods. “It’s a Dewey beats Truman moment for big data,” says David Lazer, a professor of computer science and politics at Northeastern University and one of the authors of the Science article.
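To make the size of that miss concrete, here is a small Python sketch that works only from the figures quoted above. The function and constant names are illustrative, not anything taken from Google or the Science authors.

```python
# Figures quoted in the article; nothing here comes from the underlying datasets.
GFT_PEAK_PCT = 11.0   # GFT's implied share of the U.S. with flu at peak, in percent
CDC_PEAK_PCT = 6.0    # CDC figure for the same period, in percent
WEEKS_OVER = 100      # weeks in which GFT over-predicted
WEEKS_TOTAL = 108     # weeks compared (August 2011 to September 2013)


def overestimate_ratio(predicted: float, observed: float) -> float:
    """How many times larger the prediction is than the observed value."""
    return predicted / observed


def relative_error_pct(predicted: float, observed: float) -> float:
    """Overshoot expressed as a percentage of the observed value."""
    return (predicted - observed) / observed * 100.0


def share_pct(part: int, whole: int) -> float:
    """Part of a whole, as a percentage."""
    return part / whole * 100.0


if __name__ == "__main__":
    ratio = overestimate_ratio(GFT_PEAK_PCT, CDC_PEAK_PCT)
    overshoot = relative_error_pct(GFT_PEAK_PCT, CDC_PEAK_PCT)
    weeks = share_pct(WEEKS_OVER, WEEKS_TOTAL)
    print(f"GFT peak estimate was {ratio:.2f}x the CDC figure")
    print(f"That is an overshoot of about {overshoot:.0f}%")
    print(f"GFT ran high in {weeks:.1f}% of the weeks compared")
```

On these numbers the peak estimate comes out at roughly 1.8 times the CDC figure, and GFT ran high in about 93% of the weeks compared, which is what "nearly double" and "100 out of 108" amount to.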
Just as the editors of the Chicago Tribune believed they could predict the winner of the close 1948 presidential election (they were wrong), Google believed that its big data methods alone were capable of producing a more accurate picture of real-time flu trends than old methods of prediction from past data. That’s a form of “automated arrogance,” or big data hubris, and it can be seen in a lot of the hype around big data today. Just because companies like Google can amass an astounding amount of information about the world doesn’t mean they’re always capable of processing that information to produce an accurate picture of what’s going on, especially if it turns out they’re gathering the wrong information. Not only did the search terms picked by GFT often fail to reflect incidences of actual illness, leading it to repeatedly overestimate just how sick the American public was, but GFT also completely missed unexpected events like the nonseasonal 2009 H1N1-A flu pandemic. “A number of associations in the model were really problematic,” says Lazer. “It was doomed to fail.”
Nor did it help that GFT was dependent on Google’s top-secret and always changing search algorithm. Google modifies its search algorithm to provide more accurate results, but also to increase advertising revenue. Recommended searches, based on what other users have searched, can throw off the results for flu trends. While GFT assumes that the relative search volume for different flu terms is based in reality (the more of us are sick, the more of us will search for info about flu as we sniffle above our keyboards), in fact Google itself alters search behavior through that ever-shifting algorithm. If the data isn’t reflecting the world, how can it predict what will happen?
GFT and other big data methods can be useful, but only if they’re paired with what the Science researchers call “small data”—traditional forms of information collection. Put the two together, and you can get an excellent model of the world as it actually is. Of course, if big data is really just one tool of many, not an all-purpose path to omniscience, that would puncture the hype just a bit. You won’t get a SXSW panel with that kind of modesty.
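One way to picture "putting the two together" is a weighted blend of a search-based estimate and lagged official counts. The Python sketch below is only an illustration under stated assumptions: the weekly numbers are invented toy values, the function names are hypothetical, and a single least-squares weight is a stand-in for whatever model the Science researchers actually used.

```python
def best_blend_weight(search_est, lagged_official, truth):
    """Least-squares weight w for the blend w*search + (1 - w)*lagged.

    Minimizing the squared error in w gives
    w = sum((s - l) * (y - l)) / sum((s - l)**2), clamped here to [0, 1].
    """
    num = sum((s - l) * (y - l)
              for s, l, y in zip(search_est, lagged_official, truth))
    den = sum((s - l) ** 2 for s, l in zip(search_est, lagged_official))
    if den == 0:
        return 0.0  # the two sources agree everywhere, so the weight is moot
    return min(1.0, max(0.0, num / den))


def blend(search_est, lagged_official, w):
    """Combine the two sources week by week using weight w on the search estimate."""
    return [w * s + (1 - w) * l for s, l in zip(search_est, lagged_official)]


def mean_abs_error(pred, truth):
    """Average absolute gap between predictions and the true values."""
    return sum(abs(p - y) for p, y in zip(pred, truth)) / len(truth)


# Invented toy data (percent of population with flu, by week). Not real figures.
SEARCH_EST = [4.0, 6.0, 9.0, 11.0]      # hypothetical search-based estimate, runs high
LAGGED_OFFICIAL = [2.0, 3.0, 4.0, 5.0]  # hypothetical weeks-old official counts, run low
TRUTH = [3.0, 4.0, 5.0, 6.0]            # hypothetical true prevalence

if __name__ == "__main__":
    w = best_blend_weight(SEARCH_EST, LAGGED_OFFICIAL, TRUTH)
    combined = blend(SEARCH_EST, LAGGED_OFFICIAL, w)
    # Errors are in-sample: the weight is fitted and scored on the same toy weeks.
    print(f"weight on search estimate: {w:.3f}")
    print(f"search-only error:  {mean_abs_error(SEARCH_EST, TRUTH):.3f}")
    print(f"lagged-only error:  {mean_abs_error(LAGGED_OFFICIAL, TRUTH):.3f}")
    print(f"blended error:      {mean_abs_error(combined, TRUTH):.3f}")
```

In this toy setup the blend beats either source alone because one runs high and the other runs low. Real data would need the weight fitted on past weeks and checked on later ones.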
A bigger concern, though, is that much of the data being gathered in “big data,” along with the formulas used to analyze it, is controlled by private companies that can be positively opaque. Google has never made the search terms used in GFT public, and there’s no way for researchers to replicate how GFT works. There’s Google Correlate, which allows anyone to find search patterns that purport to map real-life trends, but as the Science researchers wryly note: “Clicking the link titled ‘match the pattern of actual flu activity (this is how we built Google Flu Trends!)’ will not, ironically, produce a replication of the GFT search terms.” Even in the academic papers on GFT written by Google researchers, there’s no clear contact information, other than a generic Google email address. (Academic papers almost always contain direct contact information for lead authors.)
At its best, science is an open, cooperative and cumulative effort. If companies like Google keep their big data to themselves, they’ll miss out on the chance to improve their models, and make big data worthy of the hype. “To harness the research community, they need to be more transparent,” says Lazer. “The models for collaboration around big data haven’t been built.” It’s scary enough to think that private companies are gathering endless amounts of data on us. It’d be even worse if the conclusions they reach from that data aren’t even right.