Hadoop, the future of web back ends 

Variations on the traditional web stack, such as Ruby on Rails, provide a huge initial boost to productivity but face a challenging proposition when it comes to growing a business. These frameworks are easy to prototype with, but they really are a scalability nightmare — It’s not their fault. The choke point is the relational databases they rely on, such as MySQL. It won’t be long before websites that need to scale will see the light and move to a distributed back end like Hadoop.

In our current RDBMS-dependent web stacks, scalability problems tend to hit the hardest at the database level. For applications with just a handful of common use cases that access a lot of the same data, distributed in-memory caches, such as memcached, provide some relief. However, for interactive applications that hope to reliably scale and support vast amounts of IO, the traditional RDBMS setup isn’t going to cut it. Unlike small applications that can fit their most active data into memory, applications that sit on top of massive stores of shared content require a distributed solution if they hope to survive the long tail usage pattern commonly found on content-rich sites.

MySQL and other similar RDBMS’s tend to choke on this type of access pattern. The typical solution to scaling up your vanilla RDBMS is to boot up memcached and later do some sharding and cloning. This comes with a host of fun extras you get to deal with, such as which db is cloning which other db, which engine to use for clones vs masters, whether or not to pair masters, read strategies, write schedules, query batching, connection pooling, syncing data through streaming binary logs, denormalizing schemas, restoring fallen db’s, rebooting whenever you need to add a box… the list goes on. If you’re facing scalability issues at the db level, it may be because the relational model was never meant to be scaled to the magnitude of data commonly encountered in web apps today.

hadoop_fish2

Hadoop blows relational databases out of the water.

Hadoop is a distributed file system based on Google‘s MapReduce architecture. Its reason for existing is to scale durable storage into the petabyte range and beyond. Data can be replicated and distributed among thousands of nodes automatically. You don’t have to craft individual failover tactics for clones. Just set your replication factor and go. As a bonus, you can use long-running parallel jobs to generate statistics, reports, and other nifty data distillates. The only problem with using Hadoop as a sole back end for web work is that your queries will take ages to complete. This is where HBase saves the day.

HBase is a column-oriented data store based on Google’s BigTable. It provides low latency column and range management on top of Hadoop. This combo provides something analogous to “sharding” and “cloning,” which are currently hot topics in the pragmatic RDBMS world, but are already old news if you look at what’s happening in the distributed space.

Here are some fun Google Trend charts that show the decreasing popularity of MySQL and the increasing popularity of Hadoop.

Yes, Hadoop's market share is relatively miniscule.

It's growing!

I have to admit that MySQL and other RDBMS’s have stratospherically more market share than Hadoop, but like any investment, it’s the future you should be considering. The industry is trending towards distributed systems, and Hadoop is a major player. If it were a startup, VC’s would be clamoring over a chance at first round equity. Any sane individual should be keen on gaining a stake in this budding sector ASAP.

Over the weekend, I’ll be investigating the possibility of hooking Hadoop + HBase up to Django, the exceptionally well-documented Python web framework that’s gaining ground on the current industry poster child, Ruby on Rails.

( I’ll also be job hunting. If you’re hiring, here’s my resume. )