Thursday, March 27, 2014

Software Defined Storage

In order to be fully buzzword compliant, I'm going to talk a bit about software defined storage.
In fairness, I think this should be looked at more as software defined persistence.

I've spent some time running HP's StoreVirtual VSA.  There is a LOT that is compelling about this platform.  I have been able to use their documented APIs to automate deploying an 8 node cluster with full storage auto tiering between SSDs and HDDs.  I can automatically deploy my ESX hosts with PXE boot, deploy the VSA cluster and license it, then configure and present LUNs and format them as VMware datastores, all as part of a scripted process, and have a VMware cluster up and ready for VMs.

This whole process takes less than 1 hour to go from bare metal to a fully redundant 8 node cluster, but the best part is that I really don't have to do anything manual other than kick off the VSA deployment script once the 8 nodes are finished PXE booting.
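To make the flow concrete, here is a rough sketch of that orchestration in Python.  The node names, cluster VIP, and helper commands (deploy-vsa, create-cluster, create-volume, and so on) are hypothetical placeholders, not HP's actual CLI or API calls; the real script drives their documented interfaces.

```python
# Rough sketch of the deployment flow described above -- not HP's actual API.
# The helper names and hostnames are hypothetical placeholders for whatever
# your VSA / vSphere tooling exposes.

import subprocess
import time

NODES = [f"vsa-node{i:02d}.lab.local" for i in range(1, 9)]   # assumed hostnames
VIP = "10.0.0.50"                                             # assumed cluster VIP

def wait_for_pxe_boot(nodes):
    """Block until every freshly PXE-booted host answers on the network."""
    for node in nodes:
        while subprocess.call(["ping", "-c", "1", "-W", "2", node],
                              stdout=subprocess.DEVNULL) != 0:
            time.sleep(30)

def run_vsa_cli(*args):
    """Placeholder for calling the vendor's documented CLI/REST interface."""
    print("would call:", " ".join(args))

def deploy_cluster():
    wait_for_pxe_boot(NODES)
    # 1. Deploy and license one VSA appliance per host.
    for node in NODES:
        run_vsa_cli("deploy-vsa", "--host", node)
        run_vsa_cli("license-vsa", "--host", node)
    # 2. Form the cluster with auto tiering enabled, fronted by a virtual IP.
    run_vsa_cli("create-cluster", "--vip", VIP, "--nodes", ",".join(NODES))
    # 3. Carve out LUNs, present them to the hosts, format as VMFS datastores.
    for i in range(4):
        run_vsa_cli("create-volume", "--name", f"datastore{i:02d}", "--size", "2TB")
        run_vsa_cli("assign-volume", "--name", f"datastore{i:02d}",
                    "--hosts", ",".join(NODES))

if __name__ == "__main__":
    deploy_cluster()
```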

I think the potential for this kind of technology is awesome.  At the same time, I don't want to limit this category to just iSCSI LUNs, NAS, or traditional storage technologies.  While SCSI is a protocol to access block storage, I could make an argument that your SQL database is just as much software defined storage as an iSCSI block device, an NFS datastore, or an HDFS filesystem.  It just has a different access protocol, different exposed services, and different service levels.

What I like about the idea of software defined persistence is that the same physical cluster of servers with local disk that runs a VSA also runs Hadoop and a Cassandra cluster.

I didn't have to buy separate hardware for one versus the other.  I still have a capacity planning effort for each storage layer, but the conversation has shifted from separate infrastructures that must be scaled and managed independently at purchase time to how we divide the resources we have.  If things change, we can always readjust the ratios of resources.

So in essence, software defined storage gives me storage flexibility.  The cost models are more linear compared to large storage arrays.  I understand this scares some of the traditional storage vendors.  If it scares you as a vendor, that's probably a good thing; it means you are paying attention.  I would just say be open to changing your business model, because if you don't, you run the risk of being disrupted completely.  As a consumer of IT, I have my own business priorities, and if you can help me with those, that is great.  If you start holding me back from accomplishing my business priorities because you are protecting your revenue stream, then you move from someone helping me succeed to someone getting in the way.  That is what opens you up to disruption more than being attacked on a pure technology front.




Tuesday, February 25, 2014

This neglected blog

So it has been a while since I have posted anything.  I actually have several things as drafts that I have never gotten around to finishing because I want them to be useful and meaningful (and contain enough information that I can use them instead of having to recall everything in my brain).

So this post is as much about getting back into the habit of posting regularly as anything.

A few hints at what has been going on in the lab over the last year.

Software Defined Storage -  by this I mean software implemented storage.  I almost want to call this a software defined persistence layer instead, but that is not so buzzword compliant.  I've got several experiments here to share.  This has been fun.

OpenStack -  I'm not sure I'm ready to talk about this yet.  There are things I like, things I don't like and a fair amount I either don't have an opinion on or haven't really tested.

Big Data -  I've been playing with VMware's BDE.  I found a few things I would consider bugs, but overall it is working well and was very easy to set up.

Flash Cache - OK, so I have been playing around with using software as a flash cache in a VMware environment.  I've tested several vendors in this space, all with differing results.  Overall this is a very interesting space to watch develop.

Hybrid Cloud - last year involved two trips to Asia for me, all centered around how to transform to a hybrid cloud model.  It is really amazing what I have learned in this area in terms of the reality of a hybrid cloud versus the marketing material.

So those are the things I intend to write about.  I am also hoping to finish some of the draft posts, which are much more about performance tuning and troubleshooting.



Wednesday, May 1, 2013

Can your cloud bring the rain?




There is a large amount of press, hype, and debate about cloud computing. There are arguments about which cloud platform is better: vCloud Director, OpenStack, CloudStack, etc. I've seen very passionate debates on buy versus build with relation to cloud, and many people are left trapped in a fog of rhetoric trying to sort it all out. It reminds me of the early days of Linux and the distribution wars over which distribution is better. Instead of joining in, I wanted to offer a slightly different perspective on the problem. I'm less interested in what kind of cloud it is; I'm more interested to know if your cloud can bring the rain!

In order to explain what I mean, let's first step back in time to my first SAN. We had a project I was working on that needed storage and several Solaris servers (this was probably in 2001 or so). I recommended a SAN even though our group was VERY skilled at adding additional local drives and using volume managers, and even doing host clustering with shared SCSI enclosures (and we had NO SAN knowledge, much less expertise). After weeks of reading, vendor meetings, design sessions, learning about fiber cabling, zoning, masking, and configuring HBAs, and some long nights dealing with bugs, the moment of truth arrived. I had a new LUN available on my host that I now needed to partition, add to the volume manager, and make a file system on, just like I had always done. I had a directory I could store files on.

It was quite possibly the most anticlimactic moment of my IT career. Months' worth of work gave me exactly the same result as sticking a hard drive into the server. I took refuge in the fact that my LUN was faster than a regular disk (but I could have done that with software RAID). It was more flexible because I didn't have to use standard disk sizes (but my volume manager did that for me already). It was more reliable (because having multipathing software, HBAs, switches, controllers, and disks that all had new ways to manage and monitor meant there were tons of new things to break).

Do I regret having my SAN? Absolutely NOT! When I built my first 3 and 4 node clusters I smiled. I looked at the cost savings from not stranding storage all over the place and not dealing with lots of SCSI disk cabinets with only a few drives, or cabinets that were full when I needed to add drives (not to mention never dealing with high voltage and low voltage differential SCSI EVER again). Allocating storage from home in my slippers was a nice bonus. One of the real wins, though, was when a developer called and you could hear the desperation in his voice. The production instance of a database had died and there was no ETA on a fix. Could we repurpose a test system? Within 30 minutes the old test LUNs were unmasked, new LUNs created, data was being restored, and clients were being redirected to the new database. After two days things were fixed on the production system and life was good. He was ready for the "you'll be down for a week or two in order to rebuild the test system" speech, and I basically said we just need to reboot and present the old LUNs and you are back to where you were, with no rebuilding or reloading. I can do it while you are at lunch.

The very long point here is that it wasn't about the SAN; it's about the fact that the SAN made me think differently about how we offered services, and that let us solve real world problems in ways we just couldn't before. I see a very parallel lesson with the cloud.

After spending months researching and building an internal cloud in the lab, rethinking IP management and assignment, account management, and sacred rules I had about DNS management, I had a nice little system that could automatically provision IaaS. Click a button and you had a Linux server with accounts and an IP address, and I could log into it with a DNS name. It was great! Then it hit me. The same feeling I had after provisioning my first LUN on the SAN. It's just another server...so what?

The problem was not the cloud I had built or the technology I built it with. The problem was I didn't have a service or platform worthy of building a cloud for. I'm not saying there isn't some benefit (it did make my life easier and more automated, and that is always good), but something was missing. Even as I moved on to automate a specific platform, like a web server that deployed with some content, something was still missing.

“There is nothing quite so useless, as doing with great efficiency, something that should not be done at all.”
― Peter F. Drucker

In the end, the problem seems to be that building a cloud to deliver the IT services you are using today (assuming you do not have a cloud) will probably miss a great opportunity to rethink the services that you offer and what they can provide. If your cloud provides this great elasticity and the web or app server platforms you are offering do not evolve to match, you may actually be doing yourself a disservice. How do you manage content in a web farm that can constantly grow and shrink in terms of servers? Deploying content on the template isn't really a good solution, because you will be in constant template management mode. A shared backend may be good, but how do you make sure that's not a single point of failure or a bottleneck? If you are integrating with a load balancer so that your shiny new web layer can scale, are you locking yourself into your old way of thinking and ensuring that your platform only works in one type of environment and couldn't help you at something like AWS should you choose to leverage their service? If you start at AWS, are you going to leverage their pre-built services and APIs so that you are stuck in AWS? How do you wrap up all those things and offer a web platform that is simple to use, scales up and down, has current content, and isn't constrained by every process the IT organization has ever invented?
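One simple pattern for the content question, as a rough sketch: have every web server pull the current content bundle at first boot (and optionally on a timer) instead of baking content into the template. The bundle URL and web root below are made-up examples, not a recommendation of any particular tool.

```python
# Minimal sketch of boot-time content sync, so new web servers in an elastic
# farm come up with current content without constant template management.
# The bundle URL and web root are hypothetical placeholders.

import io
import tarfile
import urllib.request

CONTENT_BUNDLE = "http://artifacts.example.com/web-content/latest.tar.gz"  # assumed
WEB_ROOT = "/var/www/html"

def sync_content():
    """Download the latest content bundle and unpack it over the web root."""
    with urllib.request.urlopen(CONTENT_BUNDLE) as resp:
        bundle = io.BytesIO(resp.read())
    with tarfile.open(fileobj=bundle, mode="r:gz") as tar:
        tar.extractall(WEB_ROOT)

if __name__ == "__main__":
    # Typically wired into first boot (cloud-init, rc.local, etc.) and/or a
    # periodic timer, so scale-out instances never serve stale content.
    sync_content()
```

The point isn't this particular mechanism; it's that the platform, not the template, owns the content so the web layer can grow and shrink freely.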

It's also interesting to me to see how some people are solving these problems. While we kind of standardized on 3-tier architectures to run websites, cloud applications and platforms are different and scale differently, and there are a lot of options out there. I find it useful to watch AWS extend the services that it offers and maybe take some lessons for your private cloud.
Initially they offered virtual machines and some storage via S3. Now they are offering redundant databases across AWS zones to help you keep your business available even when they have a disaster. They are also offering things like DynamoDB to give you a scale-out NoSQL database. The platforms and services they offer have evolved greatly because, in the end, these are the services that transform how we write applications and how we do business. I also see people not using these, opting instead to build their own versions so that they are not locked into AWS and can be redundant across cloud providers.

One other view seems to be that if your cloud doesn't offer those kinds of services (redundant databases, scalable storage, application platforms that are as scalable as your cloud will enable them to be), the applications will evolve and deploy that type of functionality strictly as part of application design. Some of these projects (Apache Cassandra) didn't come along to compete with Amazon's services; they came about because Amazon didn't offer those services and that is what applications and businesses needed. Once that happens, your cloud might be stuck striving to be lower cost and faster as your only main selling points, versus the potential of providing a game changing way of computing for your organization. So this is what I mean when I ask if your cloud can bring the rain. Can you deliver capabilities with your cloud that truly transform your business, or are you leveraging your cloud to deploy business as usual faster? Are you altering your platforms in parallel with building out your cloud so they can take advantage of your new capabilities?

It's a journey, and there are crawl, walk, run phases to all this. I'm not saying don't build a cloud until you can offer everything AWS has and maybe more. But the debates about cloud technologies and which cloud platforms are better seem to be IT people getting lost in IT instead of focusing on how IT can enable the business to achieve things we could not offer before. Rain is the stuff that makes things grow. Make sure you are building a cloud that can bring the rain.












Wednesday, November 14, 2012

DL380 Gen8




Every few years we get a new generation of Intel based servers according to their tick-tock model. This server generation represents the tick for Intel, which means we get new features in the chipsets.

I've been spending a bit of time in the lab exploring the options on the DL380. My main goal is to further reduce the power draw of the server, and I'm happy to say I have an 8 core / 128GB config that idles at 80 watts. It tops out around 200 watts. That means this server runs less at maximum than a G6 with the same core count and memory would idle at. This is Moore's Law in action. I'm not building special purpose servers like Facebook or Google, which are great for what they are doing; this is just careful selection of components from what is available and keeping an eye on the cost metrics. A 100GB flash drive is roughly the same cost as a 450GB SAS drive, but it's 5 watts less per drive. Also, 16GB low voltage DIMMs are basically the same $/GB as 8GB DIMMs, and using eight 16GB DIMMs instead of sixteen 8GB DIMMs gives you great power savings. Probably the most significant change is that 8 cores used to require 2 processors, so choosing a single 8 core processor drops the power as well. If you are dealing with software licenses that charge by core, then this also helps keep cost flat, while per-socket licenses would actually go down.
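To put rough numbers on what those watts are worth, here is a quick back-of-the-envelope calculation. The wattages come from the figures above; the $0.10/kWh electricity rate is just an assumption for illustration.

```python
# Back-of-the-envelope math for the power figures above: an 80 W idle on the
# Gen8 config vs. a G6-class box idling around 200+ W, and ~5 W saved per
# flash drive vs. a 450GB SAS drive. The electricity rate is an assumption.

HOURS_PER_YEAR = 24 * 365
RATE_PER_KWH = 0.10                      # assumed $/kWh

def annual_cost(watts):
    """Yearly electricity cost of a constant draw at the assumed rate."""
    return watts * HOURS_PER_YEAR / 1000 * RATE_PER_KWH

gen8_idle, older_idle = 80, 200          # watts, per the text (G6 idles above 200)
print(f"Gen8 idle:   ~${annual_cost(gen8_idle):.0f}/yr")
print(f"Older idle:  ~${annual_cost(older_idle):.0f}/yr")
print(f"Per flash drive (5 W saved): ~${annual_cost(5):.2f}/yr")
```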


Probably the biggest change in this server generation is that the PCIe controller is now built into the processor. While this is overall a nice thing, it is also something to understand if you use lots of PCIe cards in your servers. The DL380 Gen8 can have 6 PCIe slots, but that requires that both processor sockets be populated.

The onboard RAID has also had a major upgrade to be able to handle several SSD drives on the system. I've run 50,000 IOPS with one setup and never had issues. There are also some exciting things coming soon with a new firmware that HP previewed at HP Discover this year that makes this a very compelling hardware platform for big data (but that's another post). One other curious thing is that with RAID 5 (3+1) on SSDs, we pretty much don't see a performance hit, at least at the low end of processing. I can write 19,000 IOPS with a single thread, though that number doesn't really scale with multiple threads. With mirroring, the limit was just over 6,000 IOPS, so it looks like RAID 5 write performance scales roughly linearly with the number of data drives. I have to dig into that more, because something about it goes against most everything I have learned in 20 years of system administration.
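For reference, the numbers above come from a single-threaded 4K random-write test; I won't reproduce the exact tooling here, but a minimal Linux-only Python sketch of that kind of measurement looks like this (the target path, file size, and duration are made-up values).

```python
# Rough sketch of a single-threaded 4K random-write IOPS test (Linux only).
# O_DIRECT bypasses the page cache so the array, not RAM, is being measured.
# Path, file size, and duration are hypothetical example values.

import mmap
import os
import random
import time

PATH = "/mnt/ssd-raid5/iops-test.bin"   # assumed mount point on the test array
BLOCK = 4096
FILE_SIZE = 1 << 30                     # 1 GiB test file
DURATION = 10                           # seconds

# O_DIRECT requires an aligned buffer; anonymous mmap memory is page-aligned.
buf = mmap.mmap(-1, BLOCK)
buf.write(os.urandom(BLOCK))

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.ftruncate(fd, FILE_SIZE)

blocks = FILE_SIZE // BLOCK
ios = 0
deadline = time.time() + DURATION
while time.time() < deadline:
    # Seek to a random 4K-aligned offset and issue one synchronous write.
    os.lseek(fd, random.randrange(blocks) * BLOCK, os.SEEK_SET)
    os.write(fd, buf)
    ios += 1
os.close(fd)

print(f"~{ios / DURATION:.0f} IOPS (single thread, 4K random writes)")
```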

A few things have been painful. We've seen a return of firmware bugs, and compatibility issues with 10Gb networking that were related to cable length and 10Gb switch firmware.  If you know what the picture below is from, then you might have an idea of how painful it has been.  I refer to it as the Eye of Sauron of your 10GbE network (and it's good when it's open).  But all those issues are being worked out in the lab and it's looking great going forward.

My only gripes with the server are that HP redesigned the rail kits, which are much harder to rack without help than they used to be, and they moved the power button to a weird spot on the ear of the server.  Other than those items, it's another solid product in the DL380 line.