<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kevin Chiu &#187; web</title>
	<atom:link href="http://kevinchiu.org/archives/tag/web/feed" rel="self" type="application/rss+xml" />
	<link>http://kevinchiu.org</link>
	<description>Things are only impossible until they&#039;re not.</description>
	<lastBuildDate>Fri, 27 Jan 2012 08:19:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Hadoop, the future of web back ends</title>
		<link>http://kevinchiu.org/archives/hadoop-the-future-of-web-back-ends</link>
		<comments>http://kevinchiu.org/archives/hadoop-the-future-of-web-back-ends#comments</comments>
		<pubDate>Sun, 18 Jan 2009 14:42:23 +0000</pubDate>
		<dc:creator>Kevin Chiu</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[ruby on rails]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://kevinchiu.org/blog/?p=866</guid>
		<description><![CDATA[Variations on the traditional web stack, such as Ruby on Rails, provide a huge initial boost to productivity but face a challenging proposition when it comes to growing a business. These frameworks are easy to prototype with, but they really are a scalability nightmare &#8212; It&#8217;s not their fault. The choke point is the relational [...]]]></description>
			<content:encoded><![CDATA[<p>Variations on the traditional web stack, such as <a href="http://rubyonrails.com">Ruby on Rails</a>, provide a huge initial boost to productivity but face a challenging proposition when it comes to growing a business. These frameworks are easy to prototype with, but they really are a scalability nightmare &#8212; <em>It&#8217;s not their fault.</em> The choke point is the relational databases they rely on, such as <a href="http://www.mysql.com/">MySQL</a>. It won&#8217;t be long before websites that need to scale will see the light and move to a distributed back end like <a href="http://hadoop.apache.org/core/">Hadoop</a>.</p>
<p>In our current RDBMS-dependent web stacks, scalability problems tend to hit the hardest at the database level. For applications with just a handful of common use cases that access a lot of the same data, distributed in-memory caches, such as <a href="http://code.google.com/p/memcached/">memcached</a>, provide some relief. However, for interactive applications that hope to reliably scale and support vast amounts of IO, the traditional RDBMS setup isn&#8217;t going to cut it. Unlike small applications that can fit their most active data into memory, applications that sit on top of massive stores of shared content require a distributed solution if they hope to survive the <a href="http://en.wikipedia.org/wiki/The_Long_Tail">long tail</a> usage pattern commonly found on content-rich sites.</p>
<p>MySQL and other similar RDBMSs tend to choke on this type of access pattern. The typical solution to scaling up your vanilla RDBMS is to boot up memcached and later do some sharding and cloning (replication). This comes with a host of fun extras you get to deal with, such as which db is cloning which other db, which engine to use for clones vs masters, whether or not to pair masters, read strategies, write schedules, query batching, connection pooling, syncing data through streaming binary logs, denormalizing schemas, restoring fallen dbs, rebooting whenever you need to add a box&#8230; the list goes on. If you&#8217;re facing scalability issues at the db level, it may be because the relational model was never meant to be scaled to the magnitude of data commonly encountered in web apps today.</p>
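<p>To make the sharding pain concrete, here is a minimal Python sketch of modulo-based shard routing. The key format, the shard counts, and the helper names <code>shard_for</code> and <code>moved_keys</code> are all made up for illustration; this is an assumption about one common approach, not a description of any particular MySQL setup.</p>

```python
import hashlib

def shard_for(key, shard_count):
    """Pick a shard index for a key using a stable hash and modulo routing."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def moved_keys(keys, old_count, new_count):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys if shard_for(k, old_count) != shard_for(k, new_count)]

# Hypothetical keys; any string identifiers would do.
keys = ["user:%d" % i for i in range(1000)]
moved = moved_keys(keys, 4, 5)
print("%d of %d keys move when going from 4 to 5 shards" % (len(moved), len(keys)))
```

<p>With plain modulo routing, most keys land on a different shard as soon as you add a box, which is one reason adding capacity to a sharded RDBMS tends to mean data migrations.</p>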
<p><img src="http://kevinchiu.org/blog/wp-content/uploads/2009/01/hadoop_fish2.jpg" alt="hadoop_fish2" title="hadoop_fish2" width="530" height="299" class="aligncenter size-full wp-image-901" /></p>
<p><strong>Hadoop blows relational databases out of the water.</strong></p>
<p><a href="http://hadoop.apache.org/core/">Hadoop</a> is an open source framework modeled on infrastructure published by <a href="http://google.com">Google</a>. It pairs a distributed file system (HDFS, which follows the design of the Google File System) with an implementation of Google&#8217;s <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a> programming model. Its reason for existing is to scale durable storage into the petabyte range and beyond. Data can be replicated and distributed among thousands of nodes automatically. You don&#8217;t have to craft individual failover tactics for clones. Just set your replication factor and go. As a bonus, you can use long-running parallel jobs to generate statistics, reports, and other <a href="http://en.wikipedia.org/wiki/PageRank">nifty data distillates</a>. The only problem with using Hadoop as a sole back end for web work is that it is built for batch throughput rather than quick lookups, so your queries will take ages to complete. This is where HBase saves the day.</p>
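<p>Before getting to HBase, here is roughly what one of those parallel jobs looks like in miniature. This is a single-process Python sketch of the map, shuffle, and reduce steps, not Hadoop&#8217;s actual API; the log format and the function names are assumptions made up for the example.</p>

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record: one hit per URL."""
    user, url = record.split()
    yield url, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over a list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical access log lines in "user url" form.
log = ["alice /home", "bob /home", "alice /about"]
hits = run_job(log)
print(hits)
```

<p>In a real cluster the map and reduce calls run on many nodes at once, each working on its own slice of the data. That is why these jobs scale well but are not something to run per page request.</p>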
<p><a href="http://hadoop.apache.org/hbase/">HBase</a> is a column-oriented data store based on Google&#8217;s <a href="http://labs.google.com/papers/bigtable.html">Bigtable</a>. It provides low-latency reads and writes of individual rows and columns, plus range scans over sorted row keys, on top of Hadoop. This combo provides something analogous to &#8220;sharding&#8221; and &#8220;cloning,&#8221; which are currently hot topics in the pragmatic RDBMS world, but are already old news if you look at what&#8217;s happening in the distributed space.</p>
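<p>The data model is easier to see in code. Below is a toy, in-memory Python stand-in for a table with sorted row keys and named columns. It is a sketch of the idea only: the <code>ToyTable</code> class, its methods, and the sample rows are hypothetical and are not the HBase client API.</p>

```python
import bisect

class ToyTable:
    """In-memory stand-in for a sorted, column-oriented table (not HBase itself)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Store one cell, creating the row in sorted position if needed."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Return one cell, or None when the row or column is missing."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        """Yield (row key, columns) from start_key up to, but excluding, stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

# Made-up sample rows for illustration.
table = ToyTable()
table.put("user:0001", "info:name", "example user")
table.put("post:0002", "content:title", "second example post")
table.put("post:0001", "content:title", "first example post")
table.put("post:0001", "meta:comments", 8)

# ";" sorts immediately after ":", so this range covers every "post:" key.
posts = [key for key, columns in table.scan("post:", "post;")]
print(posts)
```

<p>Because row keys are kept sorted, a prefix or range scan touches one contiguous slice of the table. A store built this way can split that sorted key space into ranges and spread them across machines, which is the part that resembles sharding.</p>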
<p>Here are some fun Google Trends charts that show the decreasing popularity of MySQL and the increasing popularity of Hadoop.</p>
<p><img src="http://kevinchiu.org/blog/wp-content/uploads/2009/01/picture-4.png" alt="Yes, Hadoop's market share is relatively minuscule." title="picture-4" width="584" height="290" class="aligncenter size-full wp-image-878" /></p>
<p><img src="http://kevinchiu.org/blog/wp-content/uploads/2009/01/picture-1.png" alt="It's growing!" title="picture-1" width="585" height="290" class="alignnone size-full wp-image-872" /></p>
<p>I have to admit that MySQL and other RDBMSs have stratospherically more market share than Hadoop, but as with any investment, it&#8217;s the future you should be considering. The industry is trending towards distributed systems, and Hadoop is a major player. If it were a <a href="http://cloudera.com">startup</a>, VCs would be clamoring for a chance at first round equity. Any sane individual should be keen on gaining a stake in this budding sector ASAP.</p>
<p>Over the weekend, I&#8217;ll be investigating the possibility of hooking Hadoop + HBase up to <a href="http://djangoproject.com">Django</a>, the exceptionally well-documented <a href="http://www.python.org/">Python</a> web framework that&#8217;s <a href="http://kevinchiu.org/blog/archives/tipping-point">gaining ground</a> on the current industry poster child, Ruby on Rails.</p>
<p>( I&#8217;ll also be job hunting. If you&#8217;re hiring, here&#8217;s my <a href="http://kevinchiu.org/cv.pdf">resume</a>. )</p>]]></content:encoded>
			<wfw:commentRss>http://kevinchiu.org/archives/hadoop-the-future-of-web-back-ends/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

