Variations on the traditional web stack, such as Ruby on Rails, provide a huge initial boost to productivity but face a challenging proposition when it comes to growing a business. These frameworks are easy to prototype with, but they really are a scalability nightmare — It’s not their fault. The choke point is the relational databases they rely on, such as MySQL. It won’t be long before websites that need to scale will see the light and move to a distributed back end like Hadoop.
In our current RDBMS-dependent web stacks, scalability problems tend to hit the hardest at the database level. For applications with just a handful of common use cases that access a lot of the same data, distributed in-memory caches, such as memcached, provide some relief. However, for interactive applications that hope to reliably scale and support vast amounts of IO, the traditional RDBMS setup isn’t going to cut it. Unlike small applications that can fit their most active data into memory, applications that sit on top of massive stores of shared content require a distributed solution if they hope to survive the long tail usage pattern commonly found on content-rich sites.
MySQL and other similar RDBMS’s tend to choke on this type of access pattern. The typical solution to scaling up your vanilla RDBMS is to boot up memcached and later do some sharding and cloning. This comes with a host of fun extras you get to deal with, such as which db is cloning which other db, which engine to use for clones vs masters, whether or not to pair masters, read strategies, write schedules, query batching, connection pooling, syncing data through streaming binary logs, denormalizing schemas, restoring fallen db’s, rebooting whenever you need to add a box… the list goes on. If you’re facing scalability issues at the db level, it may be because the relational model was never meant to be scaled to the magnitude of data commonly encountered in web apps today.

Hadoop blows relational databases out of the water.
Hadoop is a distributed file system based on Google‘s MapReduce architecture. Its reason for existing is to scale durable storage into the petabyte range and beyond. Data can be replicated and distributed among thousands of nodes automatically. You don’t have to craft individual failover tactics for clones. Just set your replication factor and go. As a bonus, you can use long-running parallel jobs to generate statistics, reports, and other nifty data distillates. The only problem with using Hadoop as a sole back end for web work is that your queries will take ages to complete. This is where HBase saves the day.
HBase is a column-oriented data store based on Google’s BigTable. It provides low latency column and range management on top of Hadoop. This combo provides something analogous to “sharding” and “cloning,” which are currently hot topics in the pragmatic RDBMS world, but are already old news if you look at what’s happening in the distributed space.
Here are some fun Google Trend charts that show the decreasing popularity of MySQL and the increasing popularity of Hadoop.


I have to admit that MySQL and other RDBMS’s have stratospherically more market share than Hadoop, but like any investment, it’s the future you should be considering. The industry is trending towards distributed systems, and Hadoop is a major player. If it were a startup, VC’s would be clamoring over a chance at first round equity. Any sane individual should be keen on gaining a stake in this budding sector ASAP.
Over the weekend, I’ll be investigating the possibility of hooking Hadoop + HBase up to Django, the exceptionally well-documented Python web framework that’s gaining ground on the current industry poster child, Ruby on Rails.
( I’ll also be job hunting. If you’re hiring, here’s my resume. )
Hi,
Did you get time to investigate the possibility of hooking Hadoop + Hbase up to Django ?
I’m currently building a small Hadoop cluster composed of Virtual Machines and plan to use Hbase for python db access. Thus I’m very interested in feedback w/ such usage of Hadoop. These days, lots of people look at django + couchdb but Hadoop is certainly more mature than couchdb.
Keep the good work
Hi Angus,
Currently, I’m wrestling with Django + AppEngine, but I will return to Django + Hadoop in the future.
Kevin — any progress with Django + Hadoop? My team here is doing some investigating as to changing platforms and a framework that can grow into Hadoop support is ideal. Thanks!
Hadoop is definitely on the watch list as it matures. You also face similar complexity when dealing with Hadoop/Hbase/Hive/HDFS (like setting up, breaking things down into tasks). But for many many applications, MySQL (or RDBMS) ain’t going anywhere. I see smart companies use both for different portions of their operations. Unless Hadoop can do real-time, low latency yet in distributed farms effortlessly, there is no clear winner now, or ever. Maybe the trend on real-time search (Twitter, FB) might be able to speed this up.
@Kevin C – I’m actually hoping that Hadoop will come out of large scale offline analysis and into the real-time world sometime.
@Son Nguyen – Nice observation. I do hope real time search will entice the Hadoopers to get a real time layer into the system.