
Disasters & Disaster Recovery in the Cloud

This past weekend, Amazon EC2 experienced a power outage that brought down servers for about seven hours.  Amazon has experienced a number of outages over the last few years–not surprising given the size of their operations.  However, this makes it clear how important disaster recovery and high availability will be as more services are deployed into the cloud, and also suggests that achieving the highest level of reliability may require utilizing redundant services from multiple cloud providers.

This is something I’ve been thinking about quite a bit lately, and in fact just a few days ago I was happy to learn that our paper, Disaster Recovery as a Cloud Service: Economic Benefits & Deployment Challenges, has been accepted into this year’s Workshop on Hot Topics in Cloud Computing (HotCloud 2010).  In our paper, we survey why we think cloud computing platforms are going to become increasingly popular for providing cheap disaster recovery services.

Clouds can be used to provide a variety of backup mechanisms, ranging from cold replicas that are periodically synchronized up to hot standbys that are always in sync and can take over as soon as a failure is detected.  In practice, we think that a middle class of warm replicas is where the cloud can provide the greatest benefit.  A warm replica could be implemented as an EC2 VM that is not always running, but whose disk (an EBS volume) is kept regularly up to date by a replication manager VM.  This replication manager can handle synchronizing the disk state for a large number of applications, but the customer will not have to pay for the active VM costs of those applications until a failure actually occurs and the VMs are booted up.
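
To make the cost argument concrete, here is a rough Python sketch comparing hot standbys with warm replicas. The function name and every price in it are made up for illustration; they are not real EC2 or EBS rates and not the figures from our paper.

```python
def monthly_dr_cost(num_apps, vm_hourly, storage_monthly, manager_hourly,
                    failure_hours=0.0, hours_per_month=730):
    """Return (hot_cost, warm_cost) per month for num_apps applications.

    All prices are hypothetical inputs supplied by the caller.
    """
    # Hot standby: every application keeps a VM running all month, plus storage.
    hot = num_apps * (vm_hourly * hours_per_month + storage_monthly)
    # Warm replica: one shared replication manager VM runs all month, each
    # application pays for its disk, and application VMs are only billed
    # for the hours they actually run after a failure.
    warm = (manager_hourly * hours_per_month
            + num_apps * storage_monthly
            + num_apps * vm_hourly * failure_hours)
    return hot, warm


# Hypothetical example: 10 applications, no failures this month.
hot, warm = monthly_dr_cost(num_apps=10, vm_hourly=0.10,
                            storage_monthly=5.0, manager_hourly=0.10)
print(f"hot standby: ${hot:.2f}/month, warm replica: ${warm:.2f}/month")
```

The gap widens as more applications share one replication manager, which is the point of the warm replica design.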

Check the paper for all the details, including a cost analysis of providing DR for various application types.

Recent Entries

Stalking Your Fellow Researchers

At times it seems like the Internet was designed for stalkers–people are increasingly publicizing their personal details on websites like Facebook or Foursquare. While this means that I can get the latest updates from my friends about what they had for lunch, it does not generally help me get the latest news from researchers I’m interested in. RSS (Really Simple Syndication) can help solve this problem by giving you a feed of updates for a given webpage. Unfortunately, most personal websites (that aren’t blogs) don’t support RSS. Luckily, there are some handy tools out there which we can use to turn any website into an RSS feed so that you’ll be notified when it is updated.

The first thing you need is an RSS feed reader–a desktop or web application that will aggregate the  updates from the websites you are watching.  I use Google Reader, but there are quite a few out there.

Next, browse to the website that you are interested in tracking, e.g. the publications page of a researcher in your field (my favorite example).  Copy down the URL of the page and then visit Page2Rss.com.  Paste in the URL of interest, and they will magically produce an RSS feed for you that is updated every time the page is changed.  It will automatically send you only the changed portion of the page, a very handy feature if you are using this on a long publication page and only care about the most recent changes. Another option, if you don’t like RSS, is to use ChangeDetection.com, which will email you whenever a page is modified (I haven’t tried this yet).
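
If you are curious how "send only the changed part" might work, here is a minimal Python sketch using the standard difflib module. I don't know how Page2Rss or ChangeDetection actually implement this; the function name and the sample page text are hypothetical, and a real tool would first fetch and save snapshots of the page (e.g. with urllib.request).

```python
import difflib


def added_lines(old_text, new_text):
    """Return the lines that appear in new_text but not in old_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes inserted lines with "+ "; strip the two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical snapshots of a publications page, taken a week apart.
last_week = "Paper A (2009)\nPaper B (2008)"
today = "Paper C (2010)\nPaper A (2009)\nPaper B (2008)"

for line in added_lines(last_week, today):
    print("New on the page:", line)
```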

Of course many websites do provide their own RSS feeds, particularly blog or news sites.  A number of CS publications also provide useful RSS feeds, such as the IEEE and ACM.

The Singularity

A nice comic on the imminent “singularity”–when technology will revolutionize our lives, bodies, and societies. I’m still a bit skeptical…


Job Application Resources

I’m not going to be applying for jobs until next year, but recently I’ve been helping a few friends with their own applications which has gotten me interested in the subject. To learn a bit more, I went to the Chronicle of Higher Education’s website earlier today.  I’ve heard my dad (a recently retired sociology professor) refer to the Chronicle many times, but this was the first time I’ve actually read any of it.  I found quite a bit of interesting information, which I will link to here.  This is largely for my own benefit a year from now when I need the info, but hopefully some others will find it useful as well.

  • First Time on the Market – a collection of articles on interviews, teaching statements, and generally what to expect
  • How to Write a Teaching Statement – luckily I got some practice with this when taking a course on teaching in scientific disciplines last year, but otherwise it can be a tricky piece to write when you are coming from a graduate program that focuses almost entirely on research
  • Facing the Truth – an interesting piece on the chances of new grads applying to four-year teaching colleges without any teaching experience.

Clustered Bar Graphs in Mac OS X

I use gnuplot for most of my graphing needs, but using it for complicated bar charts has always been a pain. Fortunately, there is a very handy clustered/stacked bar chart generator which wraps gnuplot in a nice Perl script to add some extra features. I’d used it previously under Linux without any problems, but to work on a Mac you first need to set up gnuplot (which can be a pain), and you also need the fig2dev utility to actually produce the final output files. Luckily, I found a copy of it compiled for OS X on the jfig webpage, and although it has a warning from 2006 that it may not work on Intel Macs, it works fine on mine. This will let you make EPS/PDF versions of your graphs, which work nicely in LaTeX documents.

Setting up Gnuplot on a Mac

I wrote these directions down over a year ago, so they could be a bit out of date. I’d like a permanent record though since some of the steps are a bit tricky…

Gnuplot is used for making graphs. If you try to compile it normally you will get some errors. Here is how to make it work:

  1. Download and install aquaterm – this is a program which will handle the actual plotting graphics for gnuplot.
  2. Download the source code for gnuplot – I am using 4.2.3.
  3. Extract the source code somewhere (double click the file in Finder or use “tar xzf FILENAME” from a terminal).
  4. Open a terminal and change to the extracted source directory.
  5. Configure the source code distribution by running: ./configure --with-readline=builtin (you must use the --with-readline flag because Mac OS X comes with a bad version of this library). More details here.
  6. Build the source code by running make
  7. Install the resulting package by running make install
  8. You are done!

You can test it out by running gnuplot at the terminal and then typing plot sin(x)

Grad Students Officially Obsolete

Robot Scientist 'Adam' at Aberystwyth University

Adam: The first robotic scientist

So much for job security as an academic researcher… soon we’ll all be replaced with giant robots and monkeys on typewriters…

Scientific Publishing

Pretty good comic…

(not written by me)


Improving Data Center Resource Management, Deployment, and Availability with Virtualization

That’s the title of my thesis proposal, which attempts to cram all the work I’ve done over the past four years into just a few words. In the end, I’m pretty happy with the result–I’ve been able to tie together the various projects I’ve worked on to show how virtualization provides powerful new techniques for deploying applications, more efficiently managing resources, and providing high reliability in large data centers.

If you are interested, you can read the full version, or look through my slides.  It should make for absolutely thrilling bed time reading.

Below is the executive summary of what I’ve worked on.

Note for non-CS friends/parents/relatives: First, you need to know that virtualization refers to technology that lets you “slice up” a single computer into multiple virtual computers, also known as virtual machines or VMs. The reason you would do this is that a single modern server is very powerful, and often it makes sense to run multiple applications on the server at the same time.  By splitting the server into virtual machines, each of these applications can be tricked into thinking that it has complete control over the machine, even though really they are all sharing its resources. The virtualization layer makes sure that if one VM crashes it won’t impact any of the others, plus you can do nifty things like dynamically change the memory or CPU resources each VM gets, or even “migrate” a VM from one server to another without impacting its running applications.  In my work, I look at the benefits that using virtualization provides, as well as the new challenges that it causes when managing large numbers of servers.

Deployment

I start by looking at the deployment challenges of transitioning to a virtual environment and figuring out where to place VMs. This is an interesting area because virtualization can provide great benefits such as improved server consolidation, but also adds new challenges in the form of virtualization overheads.

MOVE (Modeling Overheads of Virtual Environments)

When you first consider transitioning from running applications natively to using virtual machines, it is important to understand how application resource requirements will change due to the overheads incurred by the virtualization layer. The MOVE project is designed to help predict these resource changes by building a regression model that relates the native and virtual platforms. This was work that I started during an internship at HP Labs in the summer of 2007, working with Lucy Cherkasova.
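
As a toy illustration of the idea of relating native and virtual resource usage with a regression model, here is a single-variable least-squares fit in Python. This is a simplified sketch, not the model used in MOVE, and the CPU utilization numbers are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, native_value):
    """Predict virtualized usage from a native measurement."""
    slope, intercept = model
    return slope * native_value + intercept


# Invented benchmark data: CPU % measured natively and inside a VM.
native_cpu = [10, 20, 30, 40]
virtual_cpu = [20, 35, 50, 65]

model = fit_line(native_cpu, virtual_cpu)
print("Predicted virtual CPU for 25% native load:", predict(model, 25))
```

The fitted model can then be applied to traces from an application that has only ever run natively.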

Memory Buddies – Guiding VM placement with memory information

Once you know your resource requirements, you need to figure out where to put each of your virtual machines.  A typical data center might have thousands of servers and tens of thousands of applications–which server should each application run on in order to make the most efficient system?  The Memory Buddies project tries to place virtual machines in order to maximize the amount of memory sharing that can be achieved: if VMs are running similar operating systems or applications, then the virtualization layer can keep a single shared copy of each duplicated memory page. As a result of this sharing, there is more free memory available for other VMs, allowing many more applications to be consolidated onto a smaller number of servers. In order to make this practical in a data center with many thousands of VMs, we propose an efficient fingerprinting technique that uses Bloom filters to quickly compare virtual machine memory contents.
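
Here is a small Python sketch of the fingerprinting idea: hash each memory page, insert the hashes into a Bloom filter, and compare two VMs by looking at how many set bits their filters have in common. The class, the filter size, the hash choice, and the use of common bits as a rough similarity proxy are all assumptions for illustration, not the exact scheme from the paper.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def fingerprint(pages):
    """Summarize a VM's memory pages (bytes objects) as a Bloom filter."""
    bloom = BloomFilter()
    for page in pages:
        bloom.add(hashlib.sha256(page).digest())
    return bloom


def common_bits(a, b):
    """Count bits set in both filters: a rough proxy for shared pages."""
    return bin(a.bits & b.bits).count("1")


# Stand-in "pages": two VMs share kernel and libc pages, differ in app pages.
vm_a = fingerprint([b"kernel", b"libc", b"app-a"])
vm_b = fingerprint([b"kernel", b"libc", b"app-b"])
print("bits in common:", common_bits(vm_a, vm_b))
```

Comparing two fixed-size filters is much cheaper than comparing full lists of page hashes, which is what makes this workable across thousands of VMs.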

Resource Management

Making data centers more efficient is a key concern throughout all of my work.  Virtualization’s greatest benefit comes in the promise of improved server utilization, leading to lower hardware costs and decreased energy consumption.

Sandpiper – automated VM load balancing

Alright, now we’ve figured out initial resource allocations and placements for all of our virtual machines, but those initial decisions may not be sufficient (or efficient) if an application’s workload changes over time. Sandpiper is a system which monitors the resource utilization and performance of a set of VMs and dynamically adjusts their resources or migrates them between hosts in order to prevent servers from becoming overloaded. This was the first project I worked on when I came to grad school, and now there are several commercial products out there doing similar things. We recently revised and extended this paper for a journal.
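
To give a flavor of the monitor-and-migrate loop, here is a deliberately simple greedy sketch in Python: find hosts over a load threshold and move their biggest VM to the least loaded host that has room. Sandpiper's real algorithms are more involved; the function name, the 75% threshold, and the host data are hypothetical.

```python
def plan_migrations(hosts, threshold=0.75):
    """Plan VM moves so that no host stays above the load threshold.

    hosts maps host name -> {vm name: load as a fraction of host capacity}.
    Returns a list of (vm, source_host, destination_host) tuples.
    """
    hosts = {h: dict(vms) for h, vms in hosts.items()}  # work on a copy
    moves = []

    def load(host):
        return sum(hosts[host].values())

    # Visit the most heavily loaded hosts first.
    for src in sorted(hosts, key=load, reverse=True):
        while load(src) > threshold and hosts[src]:
            vm = max(hosts[src], key=hosts[src].get)  # largest VM on src
            candidates = [h for h in hosts
                          if h != src and load(h) + hosts[src][vm] <= threshold]
            if not candidates:
                break  # nowhere to put it; give up on this host
            dst = min(candidates, key=load)  # least loaded host with room
            hosts[dst][vm] = hosts[src].pop(vm)
            moves.append((vm, src, dst))
    return moves


cluster = {
    "h1": {"web": 0.5, "db": 0.4},  # overloaded: 0.9 total
    "h2": {"cache": 0.2},
    "h3": {"batch": 0.1},
}
print(plan_migrations(cluster))
```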

Reliability

High performance systems are only useful if they are reliable. The remaining work for my thesis uses virtualization to decrease the cost of highly available and fault-tolerant systems.

ZZ: Cheap Practical Byzantine Fault Tolerance

Byzantine Fault Tolerance (BFT) is a way of providing very strong reliability guarantees, even in the face of malicious users or application components.  BFT systems provide this protection by replicating the software and having multiple nodes process each request.  Unfortunately, BFT has a very high cost because each application request must be executed 2f+1 times in order to handle f simultaneous faults (e.g. to handle one malicious server, a total of three servers are required). In ZZ, we try to reduce this cost to only f+1 executions by keeping an additional f sleeping VM replicas which are only woken up after a fault is detected. Our work focuses on how these sleeping VMs can be quickly brought online after a fault occurs, allowing the system to provide the same level of protection at a lower overall cost.
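
The replica arithmetic from the paragraph above, as a tiny Python helper (the function and key names are just for illustration):

```python
def execution_replicas(f):
    """Replica counts needed to tolerate f simultaneous faults."""
    return {
        "conventional_active": 2 * f + 1,  # every request executed 2f+1 times
        "zz_active": f + 1,                # executions in the fault-free case
        "zz_sleeping": f,                  # woken only after a fault is detected
    }


for f in (1, 2, 3):
    print(f, execution_replicas(f))
```

Note that the total number of replicas is still 2f+1 in both cases; the savings come from f of them being asleep most of the time.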

CloudNet: Wide Area Resource Management and Availability

My most recent work was started while at AT&T in Fall 2008, and looks at how VPNs can be combined with cloud computing platforms to make data center resources appear seamlessly connected to an enterprise’s existing infrastructure. We are further exploring this area to see how we can provide disaster recovery services so that if a data center becomes unavailable, the critical applications running within it can transparently fail over to servers at a different data center.

Usenix ATC 09 Awards & Keynotes

Best Paper Awards

The first best paper award went to Grzegorz Miłoś for Satori: Enlightened Page Sharing. I’m a big fan of memory sharing between virtual machines, so I’m glad to see some recognition for this type of work. I talked with Irfan Ahmad from VMware after the talk and I have to agree with his view that the real benefit of this type of system is not in attempting to free up memory for other VMs, but in reducing I/O latency since fewer blocks need to be read from disk.

Next up was Tolerating File-System Mistakes with EnvyFS, which I haven’t read yet, but now I’ll have to take a look.

Keynote

The keynote was by James Hamilton from Amazon Web Services. He made some pretty interesting points about how enterprise costs are so different from services costs. I’m sure that in time, enterprises will do the best they can to get closer to the services model by eliminating their “people” costs (sorry folks!) and trying to build larger scale homogeneous systems. The talk did a great job of providing the high level view of where the real problems are in big data centers. I also recommend his blog for anyone not already following it; it contains a wealth of interesting data (plus some good ideas). The slides from his talk are available on this page.

  • Enterprise’s main cost is people
    • Often about 100 servers per admin
    • Have many different apps, each with relatively small scale -> difficult to automate
  • Services world’s main cost is hardware
    • >1,000:1 server:admin
    • Don’t look at raw performance, look at work done per dollar, or work per joule.
  • Data Center monthly costs (3 year amortization for servers, 15 year for infrastructure)
    • 15MW data center ~$200M
    • Servers 50%
    • Power & Cooling infrastructure 25%
    • Power 22%
    • Other infrastructure 3%
    • Even combined, power and power infrastructure is less than 50%
    • but server costs are decreasing, while energy is not…
    • So current headlines are wrong, but still correct
    • Take away: server cost is still very high, so it makes sense to USE servers if at all possible. Turning them off only saves on the energy costs, which are relatively small (22%)
  • PUE = (total facility power) / (IT equipment power)
    • 1.7 PUE is a “normal” new Data center (wastes 0.7 watts per 1 watt used by servers)
    • But PUE can be deceiving since it counts things like server fans as useful energy
    • tPUE is new metric that just counts useful server energy – see blog post for more info
    • Key points:
      • Server efficiency and utilization is a good thing to improve
      • Cooling waste is unreasonably high
      • Power distribution waste isn’t too bad
      • When provisioning energy, don’t assume DC will have all servers running at peak load. “Oversell” power, and then shed load to other data centers if somehow all servers simultaneously ramp up to full load.
  • Temperature
    • Most people run at 81 degrees Fahrenheit, but systems can handle much higher (Dell 95, Rackable 104)
    • Raising temp within the data center can save lots of money (especially if it is cool outside)
  • Resource Consumption Shaping
    • Apply resource optimization across entire data center
    • Move work from peaks into valleys since costs are based on peaks
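
The PUE definition and the cost breakdown from my notes above can be checked with a few lines of Python. The function names and the 1,700 W / 1,000 W example are my own; the percentages are the ones from the talk.

```python
def pue(total_facility_watts, it_equipment_watts):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_watts <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_watts / it_equipment_watts


def overhead_per_it_watt(pue_value):
    """Watts spent on cooling, distribution, etc. per watt delivered to IT gear."""
    return pue_value - 1.0


# Monthly cost shares (percent) as noted from the keynote.
COST_SHARE = {
    "servers": 50,
    "power_and_cooling_infrastructure": 25,
    "power": 22,
    "other_infrastructure": 3,
}

example = pue(1700, 1000)  # a facility drawing 1,700 W to run 1,000 W of IT load
print("PUE:", example, "overhead per IT watt:", round(overhead_per_it_watt(example), 2))

power_total = COST_SHARE["power"] + COST_SHARE["power_and_cooling_infrastructure"]
print("Power plus power infrastructure share:", power_total, "%")
```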

My only disappointment from the keynote? His hair wasn’t as big as I’d imagined ;)