Who Swears the Most? How Foursquare Used Hadoop to Find Out
We told you who swears the most in their code, but what about in the real world? Foursquare, the location check-in service, has used its rather large dataset to graph the “rudest” places in the English-speaking world — Manchester, U.K. takes top honors.
While the results should be taken with a grain of salt (after all, the swearing is limited to Foursquare users and there’s no hint of what constitutes a swear word), the methods Foursquare used to get the data make a great intro to the world of Apache Hadoop and Apache Hive.
Hadoop is an open-source MapReduce framework: a way of processing huge datasets spread across large server clusters (or grids). MapReduce was originally introduced by Google (which has very large datasets to work with), but such frameworks have since grown beyond Google, and their usefulness isn’t limited to large companies with massive databases.
In fact, with Amazon’s Elastic MapReduce just about anyone can easily and cheaply run their own Hadoop jobs and process vast amounts of data, much as Google does with its own MapReduce system.
Because counting words is generally considered the canonical example of what makes a MapReduce framework useful, Foursquare’s blog post offers a good overview of how you can use MapReduce to mine anything from large text documents to user-contributed data like the check-in snippets Foursquare is processing.
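To make the idea concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python, run in a single process rather than on a Hadoop cluster. It is not Foursquare’s code: the word list, the sample check-ins and every function name are made up for illustration, and a real job would leave the shuffle step to the framework.

```python
from collections import defaultdict

# Hypothetical word list and check-in snippets, invented for illustration.
FLAGGED_WORDS = {"darn", "heck"}

CHECKINS = [
    ("Manchester", "Darn, the train is late again"),
    ("Manchester", "heck of a match, darn good"),
    ("London", "Lovely coffee here"),
    ("London", "Heck, it is raining"),
]


def map_checkin(city, text):
    """Map step: emit (city, 1) for every flagged word in one check-in."""
    for word in text.lower().split():
        if word.strip(".,!?") in FLAGGED_WORDS:
            yield city, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(city, values):
    """Reduce step: sum the emitted counts for one city."""
    return city, sum(values)


def run_job(checkins):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = (pair for city, text in checkins for pair in map_checkin(city, text))
    grouped = shuffle(mapped)
    return dict(reduce_counts(city, values) for city, values in grouped.items())


if __name__ == "__main__":
    print(run_job(CHECKINS))
```

Because each check-in is mapped independently and each city is reduced independently, both steps can be spread across many machines, which is what lets the same pattern scale to a dataset the size of Foursquare’s.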
Foursquare’s server setup is specific to them, but there’s one key element that’s worth bearing in mind — store your Hadoop data well away from your production system. MapReduce doesn’t work at the speed of the web and you don’t want it dragging your site down.
In Foursquare’s case that means using Amazon’s Elastic MapReduce plus a simple Ruby on Rails server. The result is, as Foursquare Engineer Matthew Rathbone puts it, “a powerful (and cheap) data analysis tool.”
If you’re new to MapReduce and functional programming in general, read through the Foursquare post for an overview of how MapReduce is useful, then check out the Hadoop site, as well as this overview video from Cloudera.