Aug 13 11

Problems at Scale

by Charles Hooper

Over on HackerNews, saturn wrote that:

Cloud computing scales the efficiencies, yes. It also scales the problems.

This is exactly right. Problems in simple architectures are relatively easy to solve. In fact, I’d go as far as to say that we’ve probably solved them in all of the traditional archetypes, both in theory and in practice.

On the other hand, complex architectures lead to exponentially more difficult problems. There are probably lots of problems in these various complex architectures that we don’t even know exist yet. And then there are those problems that we do know about that we think will only occur in very rare (or even “impossible”) circumstances so they get considerably less attention devoted to them.

Those of us who have careers, jobs, and hobbies in an engineering discipline need to remember this when we make decisions about the design of a new or existing system. Just because we can’t see the underlying platform, because it’s been abstracted away from us, doesn’t mean that it doesn’t exist. For example, design flaws in the Elastic Block Store system contributed to much of the recent AWS downtime. If you think you should be hosted on the cloud, use it, but take the time to understand the systems under the hood.

Aug 10 11

Amazon’s Relational Database Service (RDS) – The Black Box From Hell

by Charles Hooper

One morning I woke up early and checked my email. My plan was to check that my inbox was empty for some peace of mind and then go back to bed for a few more hours (I love Sundays). But that isn’t what happened. Instead, upon opening my inbox I was alerted that one of a client’s database servers was offline. I snapped out of my haze and immediately got to work.

This particular database server was an RDS instance. RDS, or Relational Database Service, is an Amazon-provided MySQL (or Oracle) server that runs on top of the EC2 platform. The advantages of this service are that backups are performed automatically (complete with point-in-time recovery), snapshots are supported, the instances can be resized with more or less RAM/CPU/storage through the AWS console, and a whole bunch of other stuff (“maintenance”) is supposed to be performed for you automatically.

The disadvantages don’t make themselves apparent until you need to debug or troubleshoot a performance or availability issue. While CloudWatch metrics are included as part of the RDS package, knowing how much CPU, RAM, or storage space you’re using is only a very small part of knowing what your database instance is actually doing.

Prior to attempting recovery, the first thing I did was check the CloudWatch metrics. CloudWatch seems to have trouble reporting its data when the system is under duress, because there were periods with data and periods without. The next thing I did was check the RDS event log. Don’t get excited: the RDS event log is not a UI wrapped around system logs, it’s just a couple of entries here and there covering whatever Amazon RDS decides to publish. The last entry in the event log was a backup job that had started several hours before and never finished. These typically take only one to two minutes on this instance, so I knew something was wrong.

I didn’t want to waste time trying to troubleshoot while the database was down, so I moved immediately to recovery and rebooted the instance through the AWS console. It’s like Charles McPhail says, “Respond, Restore, Resolve.” After a full 20 to 30 minutes the database server began accepting connections again, but the instance never left the “REBOOTING” state when it should have transitioned to “STARTED”. With the instance stuck in the “REBOOTING” state, my only option was to recover from a previous backup, as the rest of the functionality is disabled unless the instance is in the “STARTED” state.

To make matters worse, the various components in our infrastructure kept connecting to this database server, which made it impossible to find out what was going on. The max connection limit was reached and I was no longer able to log in to view the process list or analyze the status variables.

At this point, I decided my only course of action was to spin up a new instance from a previous backup. I made this request through the AWS console and, two to three hours later, my new instance was finally up and running. About half an hour before that, the old instance was transitioned into a “FAILED” state and shut down. When your instance is in the “FAILED” state, you cannot restart it. Your only option is to restore from backup. In my case, it took several hours for AWS to declare the instance failed and several more hours to restore the backup. I did not know that “FAILED” was even a possible state and had no idea that AWS could just kill an instance like that. To top it all off, Amazon sent a very nice email to the owner of the account (my client, the CEO) explaining that we had been using an unsupported storage engine all this time.

As it turns out, I missed the note in the RDS User Guide that says that MyISAM is not supported, particularly when it comes to data recovery. While I understand why RDS made this decision (MyISAM gets corrupted easily and is sometimes hard to repair), I felt misled and uninformed about the support of the storage engines. Yes, the note is in the RDS User Guide. However, it is not mentioned anywhere on the main page about RDS, nor is it in the RDS FAQs (where the string “MyISAM” appears only once).

A few weeks have gone by and we have taken steps to avoid and reduce the damage from these types of outages in the future. However, we still occasionally receive an alert where an RDS instance stops accepting connections for one to two minutes at a time and all the event log has to say is that the instance has been “recovered.” Recovered from what exactly? What did you do to it? Why does this keep happening? How do we make it stop?

In summary, I’ll probably never know because on RDS you do not have access to the underlying OS. This means:

  • You do not have access to the OS process list
  • You do not have access to things like top, htop, iostat, or dstat
  • You do not have access to the process list if the MySQL process isn’t accepting connections
  • You do not have access to any system logs

If you just need a quick and dirty MySQL server and you almost never want to worry about the status of your backups, go ahead and use RDS. However, if you’re concerned about reliability (that you control), being able to effectively troubleshoot problems, and knowing the state of your underlying OS, RDS is not right for you.

Jun 30 11

A Couple of Python Snippets

by Charles Hooper

I haven’t updated in a while but I decided to drop a couple of gists in here and call it a post. These snippets are incredibly simple and I don’t expect to “wow” anybody, but I was asked for them recently and am posting them here.

Group words by their first letter in Python
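
The embedded gist is not reproduced here, so what follows is only a minimal sketch of one way to do it, not necessarily the original snippet. The function name group_by_first_letter is made up for illustration, and lowercasing the key and skipping empty strings are my assumptions.

```python
from collections import defaultdict


def group_by_first_letter(words):
    """Map each first letter (lowercased) to the words that start with it."""
    groups = defaultdict(list)
    for word in words:
        if not word:
            # Assumption: empty strings have no first letter, so skip them.
            continue
        groups[word[0].lower()].append(word)
    # Return a plain dict so missing keys raise KeyError as usual.
    return dict(groups)


if __name__ == "__main__":
    print(group_by_first_letter(["apple", "avocado", "Banana", "cherry"]))
```

Words keep their original case and input order inside each group; only the grouping key is lowercased.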

Merging list of lists in Python using reduce
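
This gist is also missing from the page, so here is a small sketch of the general technique rather than the original code. The name merge_lists is hypothetical. On Python 3, reduce lives in functools; on Python 2 it was a builtin.

```python
from functools import reduce
import operator


def merge_lists(list_of_lists):
    """Flatten one level of nesting by concatenating the inner lists."""
    # The initial value [] makes an empty outer list return [] instead of
    # raising TypeError, and keeps the input lists unmodified.
    return reduce(operator.concat, list_of_lists, [])


if __name__ == "__main__":
    print(merge_lists([[1, 2], [3], [], [4, 5]]))
```

Each concatenation step builds a new list, so this is fine for small inputs but does more copying than necessary on very long ones; itertools.chain.from_iterable is the usual alternative there.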

Jun 7 11

Common Single Point of Failure: People

by Charles Hooper

Yesterday, when I arrived at my other job on my school’s help desk, I found out that my supervisor was not coming into work at all. This is OK; I enjoy the autonomy of working unsupervised. However, at this particular university’s help desk, my supervisor is the only person who can reset security profile information on student accounts. She is also the only person who assigns work orders to the technicians that work here. I’ll spare you the details, but probably 80-90% of our workload on any given day gets passed through this one person.

This is a serious problem. By passing tasks through a single person with no backup we are guaranteeing the collapse of our support system. I’ve seen this at other gigs and I bet you have, too.

Maybe it’s the “one guy” who has access to the firewall or router. Or maybe there’s that one person who knows how to configure a particular piece of software or solve a specific problem. Truthfully, you’re probably that guy and don’t even realize it. Ever get work-related phone calls (or worse: called in) during your “time off?” Red flag.

All of these conditions are single points of failure (SPoFs). Too often, we sysadmins, developers, and engineers only think of SPoFs in terms of hardware and software. But if we look at what actually makes up the entire information system (hardware, software, data, procedures, and people), we see that we’re part of it too. This hoarding of knowledge often results in a failure of the system itself and very frequently makes existing failures worse.

Example

A customer-facing database server stops responding. You’re not really familiar with what database(s) it serves but customers are complaining that it’s down or very slow. There’s another guy that normally handles this system but he’s out of town and completely unreachable. You want to diagnose but you don’t even know how to access the system. Do you blindly reboot (risking data loss and corruption)? Sit and wait it out? Learn how to summon your co-worker’s spirit?

One very real situation occurred when I worked at a small Internet Service Provider. A very big client of ours called and said that a very large portion of their network was down (we managed it, too). Did I have the credentials to the router in question? No. Did the client? No. Who did? That guy did, the one who is usually too busy running around to return calls (incidentally, the owner). He did finally return our cries for help… 3 hours later. Was the problem difficult to solve? No. In fact, it was fixed within minutes of receiving the proper credentials. (Funny story, one of their on-staff techs plugged a network camera into the network and accidentally assigned their router’s address as the camera’s IP :)) Sure, this mistake was dumb, but did this client need to suffer degraded availability for these 3 hours? Absolutely not.

Solution

The obvious, and perhaps only, solution to this problem is to make as much of your knowledge available as possible. The more knowledge you offload from your brain, the better and more efficient the system becomes. I know to some this might seem a little counter-productive. After all, having this knowledge is job security…right?

No, absolutely not. Holding company knowledge hostage should never be how you ensure your job security (that’s a myth anyways).

With that being said, please don’t spend all your energy and effort on documentation only to abandon the effort a month later. I was speaking to a friend of mine earlier when he mentioned that he very often comes across company wikis that contain outdated information and haven’t even been logged into in 6 months.

Allow me to reiterate: do not go on documentation sprees. Document everything when you do it and share that information when you do it. Regularly. Constantly. If you wait until you have a lot of information to document, then you will probably become overwhelmed and just not do it. When I was in the Air Force, we had a saying:

The job ain’t over till the paperwork is done.

Simply put, add documentation into your regular workflow. The investment is small and the returns are great.

May 29 11

Controlling Django Apps with an Init Script

by Charles Hooper

If you’re reading this, you probably already know that an init script is a specific style of script that allows you to control daemon processes. In particular, they are used to start processes at boot and terminate them at shutdown. What follows is an example script I use to control one of my Django+FastCGI projects. This particular example was written for Ubuntu and Debian but could probably be modified for RedHat/CentOS or other distros.

Please refer to your distro’s documentation on how to install and activate init scripts (hint: see /etc/init.d/ and the man page for update-rc.d if on Debian or Ubuntu).
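
The script itself is not reproduced in this copy of the post. As a rough illustration only, here is the PID-file bookkeeping that the status and stop actions of an init script usually boil down to, sketched in Python instead of the shell that Debian and Ubuntu init scripts are normally written in. It assumes a POSIX system and a daemon that writes its PID to a file when it starts. The function names are hypothetical, and the start action (launching the Django process) is deliberately left out.

```python
import os
import signal


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def status(pidfile):
    """Report 'running' or 'stopped' based on the PID file."""
    return "running" if is_running(read_pid(pidfile)) else "stopped"


def stop(pidfile):
    """Send SIGTERM to the recorded process; return True if a signal was sent."""
    pid = read_pid(pidfile)
    if not is_running(pid):
        return False
    os.kill(pid, signal.SIGTERM)
    return True
```

A stale PID file (one whose process has already exited) reports as stopped, which is the case a hand-rolled script most often gets wrong.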