Sunday, April 29, 2012

Isilon and Hadoop



     I've been testing an Isilon in the lab (you might catch on that I like scale-out storage architectures and IP-based storage).  It has been working great, and the performance is pretty good for a 5-node system over NFS.  I have to give the Isilon guys props on the way they do load balancing for connections.  At first the idea of delegating a DNS zone to the Isilon seemed very strange, but after doing it and seeing it in action, they get a gold star.  I'm using this storage right now to power a vCloud environment.  The VMware integration is still in progress, and in some cases the lack of VAAI and VADP is really annoying.  When running behind vCloud 1.5, which uses linked clones, it's not as noticeable in the day-to-day operations of deploying and deleting virtual machines.
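     For anyone curious, the delegation itself is only a couple of records in the parent zone.  Here's a rough sketch, where the zone name isilon.example.com and the SmartConnect service IP 10.0.0.10 are placeholders for your environment:

    ; in the parent zone (example.com): hand the isilon subdomain
    ; to the cluster's SmartConnect service, with a glue A record
    isilon      IN NS   sc.isilon.example.com.
    sc.isilon   IN A    10.0.0.10

Clients then mount against a name in the delegated zone, and the cluster answers each lookup with the IP of whichever node it wants to balance that connection to.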
    While I was testing this, the code update that enables HDFS support was released (so of course we had to try that, right?).  The update process on an Isilon is a nice experience that pretty much just works: it updates one node at a time and ensures data remains online and available while it happens.  Isilon's approach to Hadoop seems very... non-Hadoopish at first glance.  They offer HDFS support, but you have to supply the Hadoop compute nodes, so you separate compute and storage, which goes against a fundamental architecture decision of traditional Hadoop.  In this case I think the trade-off may be well worth it because of the benefits Isilon brings to the party.

     One of the downsides to traditional Hadoop is that a lot of thought has to be put into how to place data for redundancy, and the name node for HDFS is NOT redundant.  The Isilon solves these problems with its architecture, and it also allows processing of data that was written to the Isilon over a different protocol without a second import step.  In our case we had filesystems shared via NFS and wrote several Apache server logs to the Isilon over NFS.  We then processed that data over HDFS on the Hadoop nodes, and the output of the Hadoop jobs could be accessed via NFS again.  As an example, we processed the Apache web server logs to show the breakdown of iPhone/iPad devices hitting the site based on the user agent string.  The code took about 30 minutes to write and ran across 1.6 billion log entries in about 4 minutes on our setup.  The report showed the breakdown by hour of the day, iPad versus iPhone, the top 10 most-requested URLs, and so on.  Really we could report on whatever we wanted, but this was an easy first test case.  Since the Hadoop output is also accessible via NFS, it can be served up by a traditional web server back to the marketing department who requested the report.  With a slight modification to our Apache log rotation to copy files to the NFS mount (or by writing the Apache logs directly to the Isilon), you could have the report generated pretty much automatically.  This is part of what makes Isilon different from other HDFS implementations.  Being able to keep one copy of the data and access it via multiple protocols is a very powerful thing, and it makes the whole workflow easy.  The other advantage is that you get Isilon's snapshots (and even replication).  For many people that may not matter for Hadoop data, but this is really about adding Hadoop capability to existing data. 
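     To give a flavor of the job, here's a minimal sketch of a mapper along the lines of what we ran.  This is a reconstruction, not our actual code, and the parsing assumes the standard Apache combined log format:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits ("HH<tab>device", 1) for each log line from an iOS device.
    public class DeviceByHourMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();

            // The timestamp looks like [29/Apr/2012:14:03:27 -0400];
            // the hour field starts 13 characters past the bracket.
            int ts = line.indexOf('[');
            if (ts < 0 || ts + 15 > line.length()) return;
            String hour = line.substring(ts + 13, ts + 15);

            // Crude device detection from the user agent string.
            String device;
            if (line.contains("iPad"))        device = "iPad";
            else if (line.contains("iPhone")) device = "iPhone";
            else return;

            context.write(new Text(hour + "\t" + device), ONE);
        }
    }

The reducer is just a count, so Hadoop's stock IntSumReducer does the job, and sorting the output of a second quick job gives you the top-10 URL list.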
 
   In order to create this setup, I used RPMs from the GreenplumHD distribution of Hadoop, and the Java distribution came from java.com.  I had 8 servers running Red Hat Enterprise Linux 6.1 for this experiment.  After installing the RPMs, I created a Hadoop user and we were ready to start processing data.  The entire install, from updating the Isilon code to installing the RPMs, creating accounts, etc., took 48 minutes to get everything installed and configured.  (This process is now even faster for me after building a yum repo for all the software.)
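For reference, the per-node prep boiled down to a few commands like the following; the exact package file names here are placeholders, since they vary by GreenplumHD and JDK release:

    # install Java (ours came from java.com) and the GreenplumHD RPMs
    rpm -ivh jdk-6u31-linux-x64.rpm    # placeholder JDK version
    rpm -ivh hadoop-*.rpm              # GreenplumHD packages

    # account the Hadoop daemons and jobs run as
    useradd hadoop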
As far as configuration goes, the only file we had to edit for the Isilon integration was core-site.xml.  Set the fs.default.name value to the SmartConnect address of the Isilon and you are done.
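That stanza ends up looking something like this; the SmartConnect zone name is a placeholder for your own, and 8020 is the standard HDFS NameNode port, which the Isilon answers on:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- point HDFS clients at the Isilon SmartConnect zone -->
        <value>hdfs://hdfs.isilon.example.com:8020</value>
      </property>
    </configuration>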
We also ended up tuning the number of TaskTracker task slots in mapred-site.xml and setting mapred.local.dir to a local working directory for Hadoop.  (I hope to have a more technical deep-dive blog on this later, as I'm still revising some of this process.)
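The stanzas we touched look like the following; the slot counts and the path are illustrative values for our boxes, not recommendations:

    <configuration>
      <property>
        <!-- map slots per TaskTracker node -->
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>
      </property>
      <property>
        <!-- reduce slots per TaskTracker node -->
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
      <property>
        <!-- local disk for intermediate map output -->
        <name>mapred.local.dir</name>
        <value>/data/hadoop/local</value>
      </property>
    </configuration>

mapred.local.dir matters here because the intermediate map output still spills to local disk on the compute nodes, even though the HDFS data itself lives on the Isilon.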

  The only warning I would give about this approach is that I'm pretty sure this setup could also be used to stress-test network switches in addition to processing data.   We had 10Gbit networking in the lab and everything was on the same LAN segment.  With some workloads we did push an aggregate of 10Gbit of data from the Isilon to the compute nodes.  So if you go down this route, keep your compute nodes on the same switch as the Isilon nodes and this should scale pretty far.  Also, I don't know that I would buy an Isilon strictly for Hadoop.  It is a relatively expensive platform compared to a traditional Hadoop installation, but if you already have an Isilon, or you are looking at one for other use cases like cloud storage, then this is certainly something you might want to add to the toolkit.  I can't overstate the simplicity of setup and ongoing operations of Isilon and Hadoop.   It really lets you spend time analyzing your data rather than managing the infrastructure.