One morning I woke up early and checked my email. My plan was to check that my
inbox was empty for some peace of mind and then go back to bed for a few more
hours (I love Sundays). But that isn’t what happened. Instead, upon opening my
inbox I was alerted that one of a client’s database servers was offline. I
snapped out of my haze and immediately got to work.
This particular database server was a RDS instance. RDS, or Relational
Database Service, is an Amazon-provided MySQL (or Oracle) server that runs
on top of the EC2 platform. The advantages to this service are that backups are
performed automatically (complete with point-in-time recovery,) snapshots are
supported, the instances can be resized with more or less RAM/CPU/storage
through the AWS console, and a whole bunch of other stuff (“maintenance”) is
supposed to be performed for you automatically.
The disadvantages don’t make themselves apparent until you need to debug or
troubleshoot a performance or availability issue. While CloudWatch metrics are
included as part of the RDS package, knowing how much CPU, RAM, or storage
space you’re using is only a very small part of knowing what your database
instance is actually doing.
Prior to attempting recovery, the first thing I did was to check the
CloudWatch metrics. CloudWatch seems to have trouble reporting its data
when the system is under durress because there were periods where there was
data and there were periods where there wasn’t. The next thing I did was check
the RDS event logs. Don’t get excited, the RDS event log is not a UI wrapped
around system logs, it’s just a couple of entries here and there on what Amazon
RDS decides to publish. The last entry in the event log was a backup job that
started several hours before and never finished. These typically only take one
to two minutes to finish on this instance so I knew something was wrong.
I didn’t want to waste time trying to troubleshoot while the database was down
so I instead moved immediately to recovery and rebooted the instance through
the AWS console. It’s like Charles McPhail says, “Respond, Restore,
Resolve.” After about a whole 20 to 30 minutes the database server began
accepting connections again but the instance was never taken out of the
“REBOOTING” state when it should have transitioned to “STARTED”. With the
instance in the “REBOOTING” state, my only option now was to recover from a
previous backup as the rest of the functionality is disabled unless the
instance is in a “STARTED” state.
To make matters worse, the various components in our infrastructure were
connecting to this database server and were making it impossible to find out
what’s going on. The max connection limit was reached and I was no longer able
to login and view the process list or analyze the status variables.
At this point, I decided my only course of action was to spin up a new instance
from a previous backup. I made this request through the AWS console and, two to
three hours later, my new instance was finally up and running. About a half an
hour prior to this, the old instance was transitioned into a “FAILED” state and
shut down. When your instance is in the “FAILED” state, you cannot restart it.
Your only option is to restore from backup. In my case, it took several hours
for AWS to declare the instance as failed and it took several hours to restore
the backup. I did not know that the “FAILED” state was even a possible state
and had no idea that AWS could just kill an instance like that. To top it all
off, Amazon sent a very nice email to the owner of the account (my client the
CEO) explaining that we’ve been using an unsupported storage engine all this
As it turns out, I missed the note in the RDS User Guide that says that
MyISAM is not supported, particularly when it comes to data recovery. While I
understand why RDS made this decision (MyISAM gets corrupted easily and is not
easy to repair sometimes,) I felt misled and uninformed about the support of
the storage engines. Yes, the note is in the RDS User Guide, however, it is not
mentioned anywhere in the main page about RDS nor is it in the RDS FAQs
(where the string “MyISAM” only appears once).
A few weeks have gone by and we have taken steps to avoid and reduce the damage
from these types of outages in the future. However, we still occasionally
receive an alert where an RDS instance stops accepting connections for one to
two minutes at a time and all the event log has to say is that the instance has
been “recovered.” Recovered from what exactly? What did you do to it? Why does
this keep happening? How do we make it stop?
In summary, I’ll probably never know because on RDS you do not have access to
the underlying OS. This means:
- You do not have access to the OS process list
- You do not have access to things like top, htop, iostat, or dstat
- You do not have access to the process list if the MySQL process isn’t accepting connections
- You do not have access to any system logs
If you just need a quick and dirty MySQL server and you almost never want to
worry about the status of your backups, go ahead and use RDS. However, if
you’re concerned about reliability (that you control,) being able to
effectively troubleshoot problems, and knowing the state of your underlying OS,
RDS is not right for you.