How Big Data, cloud computing, Amazon and poll quants won the U.S. election

By: Gregory P. Bufithis, Esq.   Founder/CEO, The Cloud and E-Discovery

15 November 2012 – As Daniel Honan of Big Think pointed out, in politics, just as in baseball, there are winners and losers in a data-driven world. The losers in baseball, for instance, are the overrated prospects who will never be drafted because data analysis has a way of finding them out early on in their careers. In politics, the biggest loser will be the horse race pundit, the guy who spins the polls to reinforce one side’s belief that it is winning when it’s actually losing. Sometimes this is done for partisan reasons, in the hope of creating “momentum,” and sometimes it is done to create a more compelling media narrative.

This was indeed a choice election, and the choice was between following entertainment journalism or data-based journalism. As Andrew Beaujon has pointed out, entertainment is fun, and math is hard. Well, math won.

Data analysis at its best

It is a fascinating area of data analysis. As part of my neuroinformatics degree program, I recently had the chance to sit down with some data scientists from Greenplum, the Big Data/analytics division of EMC: polymaths who seem to know all things data. It was simply a brilliant start-to-finish discussion of a poll/election statistical analysis, starting with the setting and the substantive questions of interest, moving through data collection, modeling, data analysis, auxiliary analyses, validation (model checking), and presentation and distribution of results, and including important practical details, such as time constraints.

For more than a year, a group of what the reporters described as “math geeks and data wizards” worked long days creating elaborate statistical models to determine which voters were likely to vote for the president. And they were not just working in broad terms — young people, single women, Asians, et al. They were identifying individuals and providing data to field workers who went to the doors of those people and persuaded them to vote.

Right now, in the U.S., the man of the hour is statistician and New York Times blogger Nate Silver, a chap I profiled earlier this week in my personal blog (click here). Silver has been a bit of a lightning rod, as one would expect in a hotly contested election. But his results are astounding. In 2008 he made forecasting elections “nerd-cool”. Yes, that guy. The guy who predicted in 2008 the winner of 49 of 50 states … and all 35 Senate races. And he hit it perfectly again this year. A nice piece about Silver from the Wired magazine blog can be found with a click here, plus a good intro piece on how Silver’s analysis works with a click here. But for an excellent, detailed analysis about Silver from David Smith’s blog Revolutions just click here.

And Sam Wang, the Princeton neuroscientist and part-time election forecaster profiled in the Chronicle article I noted in my blog, hit it perfectly, too. Wang was also 50 for 50. And … his prediction of the percentage of the popular vote going to each candidate was dead on: Obama 51.1, Romney 48.9. Oh, and he was also 10 for 10 in U.S. Senate races.

Oh, and then there is Drew Linzer, who posted on his website in June 2012 that the election would be won with a result of 332 votes for Obama, 206 for Romney. Over the months that followed, that prediction didn’t change, even as new information came in. After Florida was fully counted the final electoral count came in … 332 votes for Obama, 206 for Mitt Romney.

The campaigns

And let’s not forget the campaigns themselves. Obama’s campaign ran an extremely sophisticated and relentless digital operation that threw out the rule book and took no assumption for granted. The team had elite and, for tech, senior talent — by which I mean that most of them were in their 30s — from Twitter, Google, Facebook, Craigslist, Quora, and some of Chicago’s own software companies such as Orbitz and Threadless. For two very good pieces on how the Obama campaign excelled at data analysis and how it all worked click here and click here.

Obama’s campaign employed dozens of “data crunchers” who analyzed information collected over two years. Their work helped the campaign raise $1 billion, remade the process of targeting TV ads and created detailed models of swing-state voters that could be used to increase the effectiveness of everything from phone calls and door knocks to direct mailings and social media. It helped the campaigners to predict behavior and results and, in turn, to change the content of the campaign to drive better results.

Remember the contest that offered dinner with George Clooney? Ever wonder why it was with him and not someone else? Results collected from big data analysis showed that women aged 45-49 from the West Coast are likely to spend money on a chance to win a dinner with George Clooney.  Obama’s campaigners looked for a celebrity with a similar profile that would drive similar objectives.

One of the most intriguing bits: most of the money raised came from voters, and it was raised by email. The data crunchers developed metric-driven email campaigns. Testing different versions, analyzing results, changing content, resending and so on were a big part of the success. By the way, Michelle Obama’s emails were the most successful ones.
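
The test-analyze-revise loop described above can be illustrated with a standard two-proportion z-test, which asks whether one email version really converts better than another or whether the gap could be chance. This is a minimal sketch and not the campaign's actual tooling: the function name is hypothetical and the send and donation counts are invented for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_* is the number of recipients who donated, n_* the number emailed.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions perform equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: version A and version B each sent to 10,000 people.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the better-performing version is worth sending to the rest of the list; a large one suggests the test needs more recipients before any conclusion is drawn.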

Sasha Issenberg, author of a new book, The Victory Lab, talks about how “Mr and Mrs Sixpack” can be sent different advertising.  The scene could be in Tampa, or Santa Barbara, or Chicago. Mr and Mrs Sixpack are relaxing after dinner with their iPads. Each is looking at the same news website, but each will be shown different political advertising. He sees something about naval bases, from the Romney camp; she sees a post about the president’s environmental record.

This is the new trick. Behind this year’s digital campaigns — whether through e-mail, social networks, apps or web advertising — lies an enormous body of data that have been integrated for the first time. The campaigns were able to link online and offline data. Voter-registration files were merged with vast quantities of bought consumer data, on top of which come bought or acquired e-mails, mobile and landline numbers, as well as data gathered through canvassing, phone banks and social-media pages. The campaigns were also making use of cookies, the crumbs of data people leave behind when they browse the net.
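
As a rough illustration of what linking a voter-registration file to bought consumer data involves, here is a minimal sketch that joins two record sets on a crude name-plus-ZIP key. Everything in it is an assumption made for illustration: the field names, the sample records and the matching rule are invented, and real record linkage has to cope with misspellings, moves and duplicates that this ignores.

```python
from typing import Dict, List, Tuple


def normalize_key(name: str, zip_code: str) -> Tuple[str, str]:
    """Build a crude match key: lowercased name plus 5-digit ZIP."""
    return (" ".join(name.lower().split()), zip_code.strip()[:5])


def merge_records(
    voter_file: List[Dict[str, str]],
    consumer_data: List[Dict[str, str]],
) -> List[Dict[str, object]]:
    """Left-join consumer attributes onto voter records by match key."""
    consumer_by_key = {
        normalize_key(row["name"], row["zip"]): row for row in consumer_data
    }
    merged: List[Dict[str, object]] = []
    for voter in voter_file:
        key = normalize_key(voter["name"], voter["zip"])
        extra = consumer_by_key.get(key, {})
        # Voter-file fields take precedence when both sources share a field
        combined: Dict[str, object] = {**extra, **voter}
        combined["matched"] = key in consumer_by_key
        merged.append(combined)
    return merged


# Invented sample records, for illustration only.
voters = [
    {"name": "Pat  Example", "zip": "33602", "party": "unaffiliated"},
    {"name": "Sam Sample", "zip": "60601", "party": "unaffiliated"},
]
consumers = [
    {"name": "pat example", "zip": "33602-1234", "magazine": "outdoors"},
]
for record in merge_records(voters, consumers):
    print(record)
```

Once the records are linked, each voter row carries both registration details and consumer attributes, which is what makes the individual-level targeting described above possible.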

The Romney campaign struggled to keep up in this digital arms race and resorted to old-fashioned conservative rhetoric posted on blogs and broadcast on sympathetic TV and radio stations. That attracted headlines and roused followers, but Romney’s YouTube channel attracted only 23,700 subscribers and 26 million page views (Obama’s had 275 million page views). For a good review of how the Romney campaign simply failed at it click here. Social media sites were flooded with comments on the Romney campaign’s “Big Orca fail” (Orca was the Romney campaign’s voter-database operation, so named because orcas are the natural predators of narwhals, after which the Obama campaign’s operation was named). The advantages of incumbency can’t be dismissed, but the GOP’s data denialists were crippling the party. The ill-considered, ignorant backlash against Nate Silver (and the many, many other poll-watchers who had Obama’s probability of winning in a similar, 70+ percent range) revealed a party-wide willingness to double down on bad hands.

Romney ran on the competence that only a lifetime of private-sector, data-driven, consultant-turnaround experience can bring. The flipside to that is plenty of people in the private sector have been saddled with the computer systems that high-priced consultants leave in their wake. They don’t always work. But yikes; if you have been reading the stories in the LA Times, Huffington Post and Ars Technica:

– the login IDs and passwords provided to statewide volunteers were incorrect, and barred them from accessing the app

– the campaign didn’t even release the app until election day. They didn’t even release its 60 pages of documentation until the night before…so nearly 40,000 people needed to get up to speed on ORCA at the moment they needed to actually start logging data with it

– they were given a URL for ORCA that pointed to a nonexistent http:// address, instead of the correct https:// one

– they originally had a load balancer and a bunch of app servers, but for some reason couldn’t get it to work properly

– part of the issue was Orca’s architecture. While 11 backend database servers had been provisioned for the system—probably running on virtual machines—the “mobile” piece of Orca was a Web application supported by a single Web server and a single application server. Rather than a set of servers in the cloud, all the servers were in Boston at the Garden or a data center nearby.

Cloud computing and Amazon rule the roost

And cloud computing? A major participant. Amazon Web Services (AWS) was very much behind the scenes of the Obama campaign, something that AWS has been touting now that the presidential election is over. As AWS said in its blog: “To set the stage, imagine setting up the technology infrastructure needed to power a dynamic, billion-dollar organization under strict time limits using volunteer labor, with traffic peaking for just one day, and then shutting everything down shortly thereafter. The words ‘mission critical’ definitely apply here. With the opportunity to lead the United States as the prize, the stakes were high.”

The Obama campaign’s technology team built, deployed, ran, and scaled up their applications on AWS. After the election, they backed it all up to Amazon S3 and scaled way, way down.  The campaign used AWS to avoid an IT investment that would have run into the tens of millions of dollars. Along the way they built and ran more than 200 applications on AWS, scaled to support millions of users. One of these apps, the campaign call tool, supported 7,000 concurrent users and placed over two million calls on the last four days of the campaign.
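
To put the call-tool figures in perspective, the quoted numbers can be turned into an average call rate. The sketch below uses only the two figures given above; the variable names are mine, and because the text says "over two million", the result is a lower bound on the average, not a measurement of peak load.

```python
# Figures quoted in the paragraph above
total_calls = 2_000_000   # "over two million calls", so treat as a lower bound
days = 4                  # "the last four days of the campaign"

seconds = days * 24 * 60 * 60
avg_calls_per_second = total_calls / seconds

print(f"window: {seconds:,} seconds")
print(f"average rate: at least {avg_calls_per_second:.1f} calls per second")
```

That works out to roughly six calls placed every second, around the clock, for four days. Real traffic would have been far burstier than this average, which is the kind of load where scaling capacity up and then back down pays off.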

Databases were the key. A database running on Amazon RDS served as the primary registry of voter file information. This database integrated data from a number of sources, including donor information from the finance team, in order to give the campaign managers a dynamic, fully-integrated view of what was going on. Alongside this database, an analytics system ran on EC2 Cluster Compute instances (cc2.8xl). Another database cluster ran LevelDB on a trio of High-Memory Quadruple Extra Large instances.

This array of databases allowed campaign workers to target and segment prospective voters, shift marketing resources based on near real-time feedback on the effectiveness of certain ads, and drive a donation system that collected over one billion dollars (making it the 30th largest ecommerce site in the world).

These were very complex applications, comparable in scope and complexity to those seen in the largest enterprises and data-rich startups. For example: they had massive data modeling using Vertica and Elastic MapReduce; multi-channel media management via TV, print, web, mobile, radio and email using dynamic production, targeting, retargeting, and multi-variant testing like you’d find in a sophisticated digital media agency; social coordination and collaboration of volunteers, donors, and supporters; massive transaction processing; voter abuse prevention and protection, including capture of incoming incidents and dispatch of volunteers; a rich information delivery system for campaign news, polls, information on the issues, voter registration, and more.

The applications made use of virtually every AWS service including EC2, Route 53, SQS, DynamoDB, SES, RDS, VPC, EBS Provisioned IOPS, SNS, ElastiCache, Elastic Load Balancing, Auto Scaling, and CloudFront. They also took advantage of Solution Architects and AWS Premium Support.

Much of my information is coming from AWS via its blogs and Twitter chatter. Plus I received an invitation to AWS re:Invent. And although the event is sold out you can still watch the keynote addresses by signing up (click here). As part of this event they have assembled an impressive “Big Data and the US Presidential Campaign” panel. You’ll get to hear from the people who built and ran the applications described above: Miles Ward (AWS Solution Architect), Harper Reed (Obama Campaign CTO), Leo Zhadanovsky (Director of Systems Engineering at the DNC), JP Schneider (DevOps Engineer, Obama for America) plus many more.

And some startling tech trends

Mobile phones offered the big new opportunity this time, as over half of voters now have smartphones. About 10% of donations were sent via text or mobile app. This time round the election was a properly social affair, with both sides engaged in a full range of platforms such as Facebook, Twitter, YouTube, Tumblr and Instagram.

Social networks can reach voters who do not watch live television and have mobile phones rather than landlines. They can also establish links with people who tend not to vote by targeting their more politically motivated friends and families. It all seemed to be working: of the American internet users polled in September and October, more than a quarter said social media had influenced their political opinions.

Conventional politics has been turned on its head

For campaign professionals, this is all a major leap. Politics long has been ruled by truisms, conventional wisdom and intuition, with millions spent based on a murky mix of polling and focus groups. The shift to data-driven decision-making has been gradual and steady, becoming increasingly sophisticated as political parties amass more information about individual voters through traditional means, such as polls, and new ones, such as data mining. The result was obvious on election night. While on Fox News Karl Rove was sputtering and fuming, insisting that the result in Ohio could not be true because it did not match his own expectations, and while on the same cable network Dick Morris was flabbergasted that voting patterns had not returned to the norm of 2004, the young math nerds in Chicago were watching state after state fall exactly as they had predicted.

What all this means is pretty simple. Pollsters need to update their methodologies. Campaigns will get even more digitally savvy. Data analysis is growing ever smarter. New methods of data analysis will continue to emerge, greatly improving our ability to predict the future and to analyze the electorate.

If you want to get a handle on this material, take some time and read Silver’s The Signal and the Noise: Why So Many Predictions Fail - but Some Don’t to get a feel for sabermetric analysis. Yes, the same stuff Billy Beane, of Moneyball fame, used to make the small-market Oakland A’s competitive with mega-market teams like the New York Yankees.

No doubt Republicans will be recruiting teams of young mathematicians and data miners to help them emulate what the Obama crew has done. Meanwhile, a whole generation of political Svengalis may shortly find their status diminished to the level of tarot card flippers and palm readers.

So the quants and their statistical models were right, while the political pundits and their guts were wrong.

The victory of mathematics over bloviation has been resounding.
