Archive for the ‘Databases’ Category

Behind the Scenes at Instagram: Tools for Building Reliable Web Services

In case you missed it, yesterday Facebook acquired Instagram, a photo-sharing service with some 30 million users and hundreds of millions of images on its servers.

The reported sale price of one billion dollars no doubt has many developers dreaming of riches, but how do you build a service and scale it to the size and success of Instagram? At least part of the answer lies in choosing your tools wisely.

Fortunately for outside developers, Instagram’s devs have been documenting the tools they used all along. The company’s engineering blog outlined its development stack last year and has further detailed how it uses several of the tools it’s chosen.

Instagram uses an interesting mashup of tried-and-true technologies alongside more cutting-edge tools, mixing SQL databases with NoSQL tools like Redis, and choosing to host its traditional Ubuntu servers in Amazon’s cloud.

In a blog post last year Instagram outlined its core principles when it comes to choosing tools, writing, “keep it very simple, don’t reinvent the wheel [and] go with proven and solid technologies when you can.”

In other words, go with the boring stuff that just works.

For Instagram that means a Django-based stack that runs on Ubuntu 11.04 servers and uses PostgreSQL for storage. There are several additional layers for load balancing, push notifications, queues and other tasks, but overwhelmingly Instagram’s stack consists of stolid, proven tools.

Among the newer stuff is Instagram’s use of Redis to store hundreds of millions of key-value pairs for fast feeds, and Gunicorn instead of Apache as a web server.

All in all it’s a very impressive setup that has, thus far, helped Instagram avoid the downtime that has plagued many similar services hit with the same kind of exponential growth. (Twitter, I’m looking at you.) For more details on how Instagram looks behind the scenes and which tools the company uses, be sure to check out the blog post as well as the archives.

File Under: Databases, Web Services

OpenStreetBlock Gives Geodata the Human Touch

Location-based web services are all the rage right now, but for most of us the actual geographic location isn’t very interesting: do you know where “40.737813,-73.997887” is off the top of your head? No? How about “West 14th Street bet. 6th Ave. and 7th Ave?”

For the geographic web to become useful, geodata has to be converted into something humans actually understand. Enter OpenStreetBlock.

OpenStreetBlock is a new web service that takes geographic coordinates (latitude/longitude pairs) and turns them into an actual city block description. The result is textual information which, in many cases, will be even more meaningful to your users than the ubiquitous pin on a map.

If you’d like to play around with a sampling of data from New York, head over to OpenStreetBlock and try out the New York demos.

If you’ve ever wanted to build your own version of EveryBlock — which pinpoints events, news stories and public data at the city-block level — OpenStreetBlock will go a long way toward getting you there. So long as you can pull geo coordinates out of your source data, OpenStreetBlock can turn that into more meaningful information.

Under the hood OpenStreetBlock relies on OpenStreetMap data and uses PHP in conjunction with a geographic database to turn your coordinates into block descriptions.
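To make the idea concrete, here is a minimal Python sketch of the coordinate-to-block lookup. It is not OpenStreetBlock’s code (that is PHP backed by a geographic database); it simply finds the nearest street segment in a small in-memory list and formats a description. The `Block` class, the function names and the sample coordinates are all hypothetical and only approximate, chosen for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6371000.0


@dataclass
class Block:
    """One street segment between two cross streets (hypothetical model)."""
    street: str
    start: tuple[float, float]  # (lat, lon) at the first cross street
    end: tuple[float, float]    # (lat, lon) at the second cross street
    cross_start: str
    cross_end: str


def _to_xy(lat, lon, ref_lat):
    # Equirectangular approximation: good enough over a few city blocks.
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def distance_to_block(lat, lon, block):
    """Approximate distance in meters from a point to a street segment."""
    px, py = _to_xy(lat, lon, lat)
    ax, ay = _to_xy(block.start[0], block.start[1], lat)
    bx, by = _to_xy(block.end[0], block.end[1], lat)
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        t = 0.0
    else:
        # Project the point onto the segment and clamp to its endpoints.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def describe_block(lat, lon, blocks):
    """Return a human-readable description of the nearest block."""
    nearest = min(blocks, key=lambda b: distance_to_block(lat, lon, b))
    return f"{nearest.street} bet. {nearest.cross_start} and {nearest.cross_end}"


# Made-up, approximate segments near the coordinates quoted in the article.
SAMPLE_BLOCKS = [
    Block("West 13th Street", (40.7366, -73.9972), (40.7374, -74.0003), "6th Ave.", "7th Ave."),
    Block("West 14th Street", (40.7375, -73.9967), (40.7383, -73.9999), "6th Ave.", "7th Ave."),
    Block("West 15th Street", (40.7382, -73.9963), (40.7390, -73.9994), "6th Ave.", "7th Ave."),
]

if __name__ == "__main__":
    print(describe_block(40.737813, -73.997887, SAMPLE_BLOCKS))
```

A real deployment would push the nearest-segment search into the spatial database rather than looping over a list, which is presumably why OpenStreetBlock leans on PostGIS.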

As cool as OpenStreetBlock is, getting it up and running on your own site will require a bit of work. Luckily, there are some good tutorials available that will walk you through the process of installing and setting up many of the prerequisites like PostgreSQL and PostGIS (I’ll assume you already have an Apache server with PHP installed).

To get started with OpenStreetBlock, grab the code from GitHub. The next thing you’ll need is a PostgreSQL database with all the PostGIS tools installed. Luckily those are also prerequisites for GeoDjango, so head over to the GeoDjango installation page, skip the Django-specific parts and just follow the Postgres and PostGIS installation instructions.

Next you’ll need to download Osmosis and Osm2pgsql to convert OpenStreetMap data into something Postgres can handle. Head over to OpenStreetMap, zoom into an area you’d like to query with OpenStreetBlock and then choose “export.” Select the OpenStreetMap XML Data option and save the file.
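If you’re curious what that exported file contains before you import it, the Python sketch below reads a tiny inline sample in the OpenStreetMap XML layout (`node`, `way`, `nd` and `tag` elements) and lists the named streets with their coordinates. It is not a substitute for Osmosis or Osm2pgsql, and the ids, coordinates and the `named_streets` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hand-written sample in the OSM XML layout; ids and coordinates are made up.
SAMPLE_OSM = """
<osm version="0.6">
  <node id="1" lat="40.7375" lon="-73.9967"/>
  <node id="2" lat="40.7383" lon="-73.9999"/>
  <node id="3" lat="40.7380" lon="-73.9980"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="West 14th Street"/>
  </way>
  <way id="11">
    <nd ref="3"/>
    <tag k="building" v="yes"/>
  </way>
</osm>
"""


def named_streets(osm_xml):
    """Return (name, [(lat, lon), ...]) for every way tagged as a named highway."""
    root = ET.fromstring(osm_xml)
    # Index nodes by id so ways can look up their coordinates.
    nodes = {
        n.get("id"): (float(n.get("lat")), float(n.get("lon")))
        for n in root.iter("node")
    }
    streets = []
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" in tags and "name" in tags:
            coords = [
                nodes[nd.get("ref")]
                for nd in way.findall("nd")
                if nd.get("ref") in nodes
            ]
            streets.append((tags["name"], coords))
    return streets


if __name__ == "__main__":
    for name, coords in named_streets(SAMPLE_OSM):
        print(name, coords)
```

Streets are stored as ways that reference nodes by id, which is why a conversion step is needed before Postgres can answer geographic queries against them.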

From there you can check out the guide to importing the OpenStreetMap XML data in the OpenStreetBlock README.

File Under: Databases

Open Data’s Access Problem, and How to Solve it

The recent Gov 2.0 summit in Washington D.C. saw several promising new announcements which will help government agencies share code and best practices for making public data available to developers.

The idea behind new projects like the FCC’s new developer tools and the Civic Commons is that by giving developers access to data previously stored in dusty filing cabinets, they can create tools to give ordinary citizens greater access to that data.

Unfortunately, not every open data project leads to good things. If open data is made available on the web, it is critical that it be accompanied by some effort to ensure everyone can access it.

We’ve seen an explosion in creative hacks that use this newly available data to provide excellent online resources. Public data sites like EveryBlock and contests like the Sunlight Foundation’s Design for America have highlighted some of the amazing ways open data can make our lives better. Whether it’s finding out crime stats, real estate values, health hazards and business license statuses in your neighborhood, or visualizing how the government is spending your tax dollars through innovative maps, open data and what you can do with it is the current hotness among web developers.

Most of the benefits are close to home — in the U.S., just about everyone has access to online government resources thanks to web-enabled computers in free public libraries.

But extend that argument to the rest of the world and the number of people that really have access to the data drops significantly. If you don’t have an easy way to get online, you can’t benefit from open data.

Michael Gurstein, Executive Director of the Center for Community Informatics Research, recently highlighted some of the problems with open data accessibility.

Gurstein points out a number of assumptions about open data that are often overlooked by those most enthusiastic about making such data publicly available.

Worse, he shows how such data can be used against you.

File Under: Databases, Other

Big Data in the Deep Freeze: John Jacobsen of IceCube

John Jacobsen works for the IceCube telescope project, the world’s largest neutrino detector, located at the South Pole. The project’s mission is to search for the radioactive sub-atomic particles that have been generated by violent astrophysical events: “exploding stars, gamma ray bursts, and cataclysmic phenomena involving black holes and neutron stars,” according to the project website.

Jacobsen is one of the people in charge of handling the massive amounts of data collected by IceCube. In the video, shot this week at the O’Reilly OSCON 2010 conference in Portland, Oregon, John explains how they collect a terabyte of raw data per hour, then send everything to IceCube’s remote research and backup facilities using a finicky satellite hook-up.

Antarctica is one of the least accommodating places on Earth to perform scientific research with computers. It’s the driest spot on the planet — atmospheric humidity hovers around zero — and bursts of static electricity threaten the integrity of IceCube’s data stores. The lack of humidity causes the server clusters’ cooling systems to break down. And if something fails, a spare might take six months to arrive.

File Under: Databases, Visual Design

Sunlight Labs Offering $5K for Best Government Data Mashups


Artists, web developers and data visualization geniuses, here’s a chance to strut your stuff, serve your country and win some serious money in the process.

Sunlight Foundation, a non-profit organization that provides tools to make government data more transparent, has announced a new contest called Design for America. Billed as a “design and data visualization extravaganza,” the contest encourages the public to create and publish data visualizations that help make complex government data easier for people to digest and interact with.

There are several different categories open for submission, including: visualizations of data that shows how the stimulus money is being spent, visualizations showing how a bill becomes a law, a redesign of a .gov website, and a redesign of any government form. Top prize in each category is a cool $5,000.

Creations can be in any form — a website, a game, a poster, a sculpture, whatever — though we suspect most of the entries will be either posters or interactive Flash graphics.

The contest is being run by Sunlight Labs, the skunkworks wing of the larger Sunlight Foundation. The Sunlight group spends most of its energy collecting government data, organizing it into publicly accessible databases, then creating tools that make it easier for ordinary people to access that data. The non-profit works with organizations like OpenCongress, MapLight and FollowTheMoney. Sunlight also maintains a list of APIs developers can use to access the data.

The Design for America contest encourages participants to sift through the vast datasets available from all of these organizations, as well as the datasets maintained by Sunlight Foundation and any raw government data that’s available. As the Sunlight Labs blog says, the goal of the contest is to “tell interesting stories” that go beyond what can be an overwhelming amount of unfiltered data.

Visualizations can be in any medium, not just the web, so if you’re a video or infographic specialist, you can still enter the contest. The main criteria for judging are the visual quality of the artwork and how well the underlying information is conveyed.
