Charles Hooper

Thoughts and projects from a site reliability engineer

Give It a Rest

Dying zucchini (courgette)

Yesterday I walked out to my yard to find that all of my zucchini (courgette) plants are dying. It was cold enough the night before to kill them all at once and there really isn’t any saving them.

We got greedy at first and wanted to reclaim the wasted space, until we realized that our garden has been running strong for about 9 months now, since Erica and I planted our first seeds of the season in March of last year. Now we’ve decided to give it a rest so we can add some new compost to fortify the soil.

For me, the dying zucchini plants serve as a reminder that nature intends for us to rest and will find a way to see that we do (sometimes permanently). Plants drop seed, decompose, and ultimately end up feeding their offspring. Animals need to eat and sleep.

Caesar not sleeping

It’s a new year and you likely just had some time off that wasn’t very restful. If you’re like me, you were probably busy traveling, shopping, worrying about money, or the holidays may even open up some new wounds for you. Don’t accept this as your vacation for the year.

Your real vacation should be about rest and relaxation. You can travel or you can stay home, but the bottom line is that you should be relaxing. Sleep in some days if you’d like. When you’re awake, maybe read a book, go for a walk, or take up that hobby you’ve been thinking about. On my last stay-cation, I tried out archery for the first time and have been hooked ever since.

So start planning your time off now. Trust me, it’s worth it.

First archery target

On Slack and Upkeep

A term I hear often in the context of engineering and project management is “slack.” It’s often used to refer to a magical pool of time from which all of a service’s upkeep, including maintenance and operations, is supposed to come. This is wrong, though. Here’s why:

  • That’s not what slack is for

  • Mismanaged slack is equivalent to non-existent slack

What is it then?

I subscribe to the definition in Tom DeMarco’s Slack, which is “the degree of freedom required to effect change.”

Slack is something you maintain so that your team stays responsive and adaptable; it is not “extra time” or “maintenance time.” If you treat slack as either of those, you are effectively allocating that time and thus eliminating your slack pool. Signs that you or your team may be guilty of this:

  • You don’t make explicit allocations of time to operations or maintenance upkeep

  • You don’t “have enough time” to properly operate or maintain your services

  • You can’t solve problems or complete remediation items identified by your organization’s problem management program

So I should do nothing then?

Well, no. At least some of your slack needs to be spent idle, though. Remember that the concept of slack is rooted in queueing theory. There’s a well-known relationship between utilization and response time, and it is far from linear: the more highly utilized your team is, the disproportionately longer your response time becomes! You can see it for yourself below:

Relationship between utilization and response time

We can tell by looking at this graph that responsiveness falls apart at about 70% utilization, which means you should keep at least 30% of your time unallocated.
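That shape comes straight from queueing theory: in the simplest single-server (M/M/1) model, expected response time scales as 1/(1 − utilization). A toy sketch of the curve (the model is a simplification; a team isn’t a single queue, but the shape holds):

```python
# In an M/M/1 queue, expected response time relative to a nearly idle
# server is 1 / (1 - utilization). Watch it blow up near full utilization.
def relative_response_time(utilization):
    """Response time relative to a nearly idle team."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

for u in (0.5, 0.7, 0.9, 0.95):
    print(f"{u:.0%} utilized -> {relative_response_time(u):.1f}x the response time")
```

At 70% utilization you are already more than 3x slower to respond than an idle team, and at 95% you are 20x slower.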

Unallocated? Why can’t I just devote 30% of my time to upkeep?

Because upkeep, the maintenance and operations of your service, is a required activity. Entropy means that, left unmaintained, your service will degrade over time. This entropy is accelerated if your service is experiencing growth. Your databases will bloat, your latency will increase, your 99.99% success rate will fall to 99.9% (or worse), your service will become difficult to add features to, and eventually your users will go somewhere else.

Instead of thinking about it like this:

Wrong way to manage slack

Think about it like this:

Right way to manage slack

In this model, you explicitly allocate time to upkeep and maintain a slack pool.

How much time should I spend on upkeep versus product and feature work?

I don’t have a good guideline for you, sorry. You’ll need to determine this based on your organization’s or team’s goals and any SLAs you may have.

For example, if you’re operating a service with a service-level objective of meeting a 99.99% success rate (0.01% error rate) then you need to allocate more time to upkeep than a service targeting a 99.9% success rate, generally speaking.
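To get a concrete sense of the gap between those two targets, here’s the error-budget arithmetic (a rough sketch that treats every error as downtime and ignores leap years):

```python
# Downtime budget implied by a success-rate SLO, assuming errors show up
# as full downtime (a simplification).
def downtime_budget_minutes_per_year(slo):
    minutes_per_year = 365 * 24 * 60
    return (1 - slo) * minutes_per_year

print(f"99.99%: {downtime_budget_minutes_per_year(0.9999):.1f} min/year")
print(f"99.9%:  {downtime_budget_minutes_per_year(0.999):.1f} min/year")
```

A 99.99% target leaves you roughly 53 minutes of error budget per year versus roughly 526 minutes at 99.9%, an order of magnitude less room for deferred upkeep.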

Note that this will change and vary over time. If you’re already deep in technical debt, your upkeep allocation will need to be much higher to pay off some of your principal. Once you’ve done that, you’ll probably be able to meet your goals with a much lower allocation later on.

Call to action

I urge everyone to start thinking about slack and upkeep this way. Take a close look at your team’s goals and commitments and explicitly allocate time for reaching those goals. Doing so will allow your team to properly maintain the services which it operates while also being very responsive.

Troubleshooting ELBs With Elbping

Troubleshooting ELBs can be pretty painful at times because they are largely a black box. There aren’t many metrics available, and the ones that do exist are aggregated across all of the nodes of an ELB. This can be troublesome at times, for example when only a subset of an ELB’s nodes are degraded.

ELB Properties

ELBs have some interesting properties. For instance:

  • ELBs are made up of 1 or more nodes
  • These nodes are published as A records for the ELB name
  • These nodes can fail, or be shut down, and connections will not be closed gracefully
  • It often requires a good relationship with Amazon support ($$$) to get someone to dig into ELB problems

NOTE: Another interesting, though slightly less pertinent, property is that ELBs were not designed to handle sudden spikes of traffic. They typically require 15 minutes of heavy traffic before they will scale up, though they can be pre-warmed on request via a support ticket.

Troubleshooting ELBs (manually)

Update: Since writing this blog post, AWS has migrated all ELBs to use Route 53 for DNS. In addition, all ELBs now have an all.$elb_name record that will return the full list of nodes for the ELB, so you can get the full list of nodes by querying that record with dig. Route 53 is also able to return up to 4KB of data while still using UDP, so the +tcp flag may not be necessary.

Knowing this, you can do a little bit of troubleshooting on your own. First, resolve the ELB name to a list of nodes (as A records):

$ dig +tcp ANY $elb_name

The +tcp flag is suggested because your ELB could have too many records to fit inside of a single UDP packet. You also need to perform an ANY query because Amazon’s nameservers will only return a subset of the nodes otherwise. Running this command will give you output that looks something like this (trimmed for brevity):

;; ANSWER SECTION:
60 IN SOA 1376719867 3600 900 7776000 60
600 IN NS
60 IN A
60 IN A

Now, for each of the A records, use e.g. curl to test a connection to the ELB. Of course, you also want to isolate your test to just the ELB without connecting to your backends. One final property and little-known fact about ELBs:

  • The maximum size of the request method (verb) that can be sent through an ELB is 127 characters. Any larger and the ELB will reply with an HTTP 405 - Method not allowed.

This means we can take advantage of this behavior to test only that the ELB itself is responding:

$ curl -X $(python -c 'print "A" * 128') -i http://ip.of.individual.node
HTTP/1.1 405 METHOD_NOT_ALLOWED
Content-Length: 0
Connection: Close

If you see HTTP/1.1 405 METHOD_NOT_ALLOWED then the ELB is responding successfully. You might also want to adjust curl’s timeouts to values that are acceptable to you.
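If you’d rather script the probe than loop over curl by hand, here’s a rough Python sketch of the same oversized-verb trick (the function name and defaults are mine, not part of any tool):

```python
import http.client
import socket

def ping_elb_node(host, port=80, timeout=5.0):
    """Send an oversized HTTP method to a single ELB node.

    An ELB answers an over-long (>127 character) method itself with a 405,
    so getting any status back means the node is responding.
    Returns the status code, or None if the connection fails or times out.
    """
    method = "A" * 128  # one byte past the ELB's 127-character method limit
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request(method, "/")
        status = conn.getresponse().status
        conn.close()
        return status
    except (OSError, socket.timeout, http.client.HTTPException):
        return None
```

Run it against each A record you resolved; a 405 means that node is healthy, while None points at a dead or unreachable node.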

Troubleshooting ELBs using elbping

Of course, doing this can get pretty tedious so I’ve built a tool to automate this called elbping. It’s available as a ruby gem, so if you have rubygems then you can install it by simply doing:

$ gem install elbping

Now you can run:

$ elbping -c 4
Response from code=405 time=210 ms
Response from code=405 time=189 ms
Response from code=405 time=191 ms
Response from code=405 time=188 ms
Response from code=405 time=190 ms
Response from code=405 time=192 ms
Response from code=405 time=187 ms
Response from code=405 time=189 ms
--- statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 187/163/210 ms
--- statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 188/189/192 ms
--- total statistics ---
8 requests, 8 responses, 0% loss
min/avg/max = 188/189/192 ms

Remember, if you see code=405 then that means that the ELB is responding.

Next Steps

Whichever method you choose, you will at least know if your ELB’s nodes are responding or not. Armed with this knowledge, you can either turn your focus to troubleshooting other parts of your stack or be able to make a pretty reasonable case to AWS that something is wrong.

Hope this helps!

My DEF CON 21 Experience

I’ve just returned from DEF CON this year and wanted to share my experience. I’ve only been to DEF CON one other time, which I believe was DEF CON 16. During DEF CON 16, I mostly stuck to the hallway track and, to be perfectly honest, didn’t get a lot out of it as I mostly hung out with coworkers.

This time around I went with my good friend Japhy and no one else.


We flew in separately on Thursday and stayed in the Bellagio. We initially chose the Bellagio because it was cheaper and we didn’t think a 15-minute walk every day was going to be a big deal. As it turns out, the walk itself was fine (even with the 98F weather) but it meant we were effectively separated from the conference for most of the day. I think the next time I go I would like to stay in the same hotel as the conference.


Thursday was my day of travel. The flight was late leaving SFO but this isn’t unusual, as planes to/from SFO are pretty much never on time, it seems. Blame the fog.

Anyways, I arrived mid-afternoon and just hung out around the Bellagio since Japhy wasn’t in yet. I ate some pho, drank some good bourbon, and played some video poker. Eventually, Japhy arrived and we grabbed a beer together before turning in.


Friday morning we woke up and went and got our badges. They were pretty sweet looking and I was curious about the crypto challenge. There was apparently a talk where the badges were explained but I missed that, so I mostly chatted with random people about them and compared notes and hypotheses. My badge, the Ace of Phones, translated to “in the real order the”. There was also an XOR gate on it but I never got far enough to know what it was for.

Badges aside, Friday is the day that I went to the most talks.

The first talk I went to was about Offensive Forensics. The speaker asserted that an attacker could use many of the same techniques that would be used by a forensics investigator during their attack. For example, an attacker could easily recover and steal files that were previously deleted. The talk was good but I felt that the speaker spent too much time trying to convince the audience that it was a good idea; personally, I and everyone I’ve talked to agreed up front that it was a great idea.

After leaving this talk I ended up catching the tail end of Business Logic Flaws In Mobile Operators Services. I wish I saw more of this, but the speaker more or less explained that many mobile operator services have big flaws in their business logic (just like the title, eh?) such as relying on Caller ID for authentication. He also gave a live demo of an (unnamed) customer service line that, instead of disconnecting you on the third entry of an invalid PIN, actually grants you access.

Next I caught the end of Evil DoS Attacks and Strong Defenses where Matt Prince (CEO of CloudFlare) described some very large DDoS attacks and what they looked like. Someone afterwards also showed a variety of online banking websites where the “logout” button doesn’t actually do anything, leaving users vulnerable.

Immediately following that session, two guys got up and gave their talk on Kill ‘em All — DDoS Protection Total Annihilation!. I enjoyed the format of the talk, where the speakers would describe DDoS protection techniques and then how to bypass them. The bottom line is: a) look like a real client, b) perform whatever handshakes are necessary (a lot of DDoS mitigators rely on odd protocol behaviors), c) use the OS TCP/IP stack when possible (see (a) and (b)), d) do what it takes to bypass any front-end caches, and e) try to keep your attack threshold just below where anyone will notice you.

At night, there were a bunch of DEF CON parties. At some point the fire alarm went off a few times. A voice came over the intercom shortly after stating that they weren’t sure why their alarm system entered test mode but that “the cause was being investigated.” Later, it happened again and the hallway strobes for the fire alarm stayed on, adding kind of a cool effect to the party. Hmm.


On Saturday I only saw two talks.

  1. Wireless village - In the wireless village I listened to a Q&A session by a pen tester whose expertise was in wireless assessments. My favorite quote from this talk was:

    Q: When you do these wireless assessments, is your goal just to get onto the network or do you look at wireless devices, such as printers, as well?

    A: I pulled 700 bank accounts from a financial institution 6 weeks ago [during a pen test]. We like printers.

  2. Skytalks - One of the skytalks I saw the first half of was about “big data”, the techniques used in analyzing this data, their weaknesses, and how you could use these techniques to stay below the radar so to speak. It was interesting but rather abstract and I’m not totally certain how to apply that in practice.

For the rest of the day, I brought my laptop and just kind of tinkered with stuff.


I flew home early Sunday morning so I didn’t do anything on this day.

Why I Moved to San Francisco

It’s been three months since I first moved to San Francisco, so I decided I should share why I moved here in the first place. The primary reasons why I moved to San Francisco are for my career and to be around more like-minded people.

Career-wise, what made San Francisco appealing to me is the number and diversity of employment opportunities. In Connecticut, if you want to work with “technology” then you work for one of the many insurance companies headquartered there or an agency of some kind. In addition to available opportunities, there is also more parity between the bay area job market and my skill set and experience. For example, one job search site today reports 184 results for “Python” in the entire state of Connecticut, while there are nearly 2,700 results for the San Francisco Bay Area. At one point in my life, I was told I was wasting my life messing around with GNU/Linux and other Open Source software. Things would have been a little better if I had moved two hours away to either Boston or New York, but if I’m going to move then I might as well get better weather out of it, too.

I also moved to San Francisco to be around more like-minded people. Things that interest me (besides gardening and home brewing) are startups and tech. There were a few groups around my old location that were interesting, but they typically required an hour-long drive to attend their events. Oftentimes, the groups failed early due to a lack of participation (including the hackerspace I founded, but that story is for another day).

Thoughts so far

As I mentioned, I’ve now been here for three months. My thoughts so far are:

  • The place really is quite small. Especially in tech. Everyone seems to know everyone, which can be fun socially, but you need to watch what you say when you’re talking shop.

  • The Silicon Valley/SF Bay tech isolation chamber is real (and so is the echo chamber). Companies sometimes seem huge when you’re in the bay, but if you talk to anyone from outside of the area, they’re like “Who?”

  • San Francisco’s neighborhoods are really awesome. SF is divided into a bunch of small neighborhoods, each with their own unique attributes. There really is a place for everyone.


I moved to San Francisco because I thought it would be good for my career and because I thought I would meet more like-minded people. This has certainly proved to be the case. What I was not expecting but have experienced so far is how small and isolated the SF tech scene actually is.

My Personal Kaizen

Kaizen is Japanese for “good change” or “improvement” and is frequently used within the context of lean manufacturing. Today, though, I’m going to talk about a few things I want to improve for myself, both personally and professionally.

Hard Skills

I’m fascinated by new technology and new ways of doing work, so polishing up some of my so-called “hard skills” comes relatively naturally to me. Things like learning a new programming language or a new tool require only time, which I’m more than willing and able to invest. Without further ado, here are the hard skills I’d like to improve:

  • Become proficient in Go. I find Go very appealing as a language, specifically for systems-level uses.
  • Become proficient in Ruby. A lot of the software I’m responsible for maintaining at Heroku is written in Ruby.
  • Become proficient in a functional programming language, such as Erlang. I’ve been interested in learning a functional programming language for a while. I decided on Erlang after hearing a talk from Chris Meiklejohn (of Basho) about Riak and watching several talks about Erlang and Erlang/OTP.

Soft Skills

Addressing soft skills is something that’s a little more difficult for me, but something that I think is important. These are the soft skills I’d like to improve:

  • Become better at empathy. Sometimes when listening to someone, it’s easy to jump to conclusions about what they are trying to say or how they came to that point. What I have been working on, however, is understanding why, in particular, they feel the way they do.
  • Become a better listener. I have been working on this for a while, but one area I would still like to improve is learning to ask the right questions. I’m always impressed when someone follows up something I said with an engaging question, and I would love to gain that ability.
  • Become more well-spoken. I’d like to be a better speaker, whether in public or in private conversation.

Catching Up


  • Moved to San Francisco
  • Took time off to decompress
  • Joined team at Heroku
  • Loving it


It’s been a while since I’ve posted, but I wanted to let you know that I’m still alive! Things were very hectic due to a few life changes, but much of the dust has settled now and I’m excited to talk about what those changes are.

But first, some history.

In 2011, I was a junior at university studying Business Information Systems. I had some prior systems engineering and operations experience and was paying my tuition by performing contract work for a company that doesn’t exist today. While at this company, I worked very closely with a developer and the two of us were responsible for running this company’s infrastructure. Neither of us had the time or the willpower to be bothered by operational tasks, so we automated everything. Even though our environment was much smaller (somewhere between 30-50 instances in EC2) than many others, we had a lot of things that many companies are lacking:

  • New services were in configuration management
  • New deployments were almost entirely automated
  • We had good visibility with a variety of metrics being reported to Ganglia and Graphite
  • We even had a reasonable nagios configuration

Around this time, the term devops was being tossed around on Twitter on a more frequent basis. I can remember actually having a Google Alert for the term to email me when there were new blog posts about it, back when that alert wasn’t spammy. I have a love/hate relationship with the term, and in December 2011 I wrote my blog post Concurrent Engineering: The Foundation of DevOps in which I argued that the ideas behind devops weren’t new, but were perhaps old ideas from business that had recently been independently re-discovered.

The blog post wasn’t very popular, but Solomon Hykes (CEO of dotCloud) managed to see it and, thinking we had very similar ideas about devops, invited me to interview for a position on their newly formed Site Reliability Engineering team. I got the job at dotCloud, and up until April of this year that’s where I stayed.

In mid-April, I resigned from my position at dotCloud and moved to San Francisco. There were a number of reasons for the resignation but chief among them was that I needed some time to decompress and all of my paid time-off had been used up following my youngest brother’s car accident. This ended up being an awesome decision because it gave me my much-needed decompression time and I was able to explore my new city.

I took about four weeks off before I put any real effort towards a job search. The move to San Francisco and the job search alone could fill two entirely-too-verbose blog posts but the end result was that I moved here safely and joined the team at Heroku!

Fast forward to today and I’ve just finished my third week at Heroku. It’s an amazing experience, a great team, and an awesome culture that is encapsulated in the following quote:

I like our culture. We welcome failure into our house and then kick its teeth in.

I’ll write more about this another time. For now, thank you for listening to my story.

— Charles

Intro to Operations: Metrics Collection

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

Collecting metrics is another area that many early stage startups seem to overlook even though it is probably one of the most important things they can do. By metrics collection, I am referring to the gathering and storing of various metrics at several different levels. As John Allspaw identifies them in Web Operations: Keeping the Data on Time, they are:

  • High-level business and application metrics (e.g. user sign-ups)
  • Feature-specific application-level metrics (e.g. widgets processed)
  • Systems and service-level metrics (e.g. server load or database queries per second)

You’ll note that there are two levels of “application-level” metrics. The higher-level application metrics are mostly those that can be tied to business objectives, while the other category of application metrics are generally more feature specific.
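To illustrate the three levels, here’s a toy in-memory recorder (the metric names are made up, and a real system would ship these to Ganglia, Graphite, or a hosted service rather than keep them in a dict):

```python
import time
from collections import defaultdict

class MetricStore:
    """Toy in-memory time-series store, for illustration only."""

    def __init__(self):
        self.series = defaultdict(list)  # metric name -> [(timestamp, value), ...]

    def record(self, name, value, ts=None):
        self.series[name].append((time.time() if ts is None else ts, value))

    def latest(self, name):
        return self.series[name][-1][1]

m = MetricStore()
m.record("business.user_signups", 1)        # high-level business metric
m.record("feature.widgets_processed", 42)   # feature-specific application metric
m.record("system.db.queries_per_sec", 310)  # systems/service-level metric
```

The point is less the storage mechanics and more the habit: every level of the stack gets a named, timestamped series you can graph and alert on later.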

Benefits incurred by collecting these metrics are plentiful. For one, having quick access to these metrics is helpful during troubleshooting and incident response. For example, I was once hired under contract to look into why a certain company’s API was unreliable for the previous few months. At least once per day, this company’s API would time out and not respond to client requests. After enabling basic metrics collection for the servers and services used by the API, it very quickly became obvious that the database servers were reaching their connection limits which was preventing the API from retrieving records from the database. Not only was this problem identified very quickly, but later on we were able to look back at our metrics data to assess how close to our limits we were getting.

Another benefit is that you can integrate the metrics into your Availability monitoring system to be alerted when metrics surpass some threshold or change significantly. Not only that, but analyzing these metrics will allow you to manage your capacity intelligently and build a business case to justify infrastructure expenditures. Finally, analyzing these metrics will also give you insight into your application, how it’s used, and your business.

How you go about collecting and storing these metrics is up to you. Many engineers might be tempted to build their own solution; however, there are many open source and third party software packages that you may find helpful. Key considerations when choosing which package or packages to use are:

  • The ability to add new, custom metrics
  • Configurable resolution/storage trade-off
  • Integration with availability monitoring and alerting systems
  • Graphing/visualization

If your startup doesn’t have any metrics then you should start collecting them now. The visualization will help you in the short run and the historical data will help you in the long run.

Intro to Operations: Availability Monitoring and Alerting

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

Another area I’ve seen a lot of early stage startups lacking in is availability monitoring and alerting. The essence of availability monitoring and alerting is being notified when your service is not working as expected, including when it’s simply down, isn’t meeting formal or informal SLAs (e.g., it’s too slow), or certain functionality is broken.

What I typically see is that some effort was made to set up this type of monitoring at some point but it was never maintained. Symptoms include poor monitoring coverage (servers missing from the config, service monitoring nearly non-existent), large numbers of false positives and negatives, unactionable alerts, and alerts that go ignored because of the previous issues.

The impact on the business includes not knowing when your service is down and finding out that your service is broken from your customers. Finding out that your service is down from your customers is not only embarrassing, but it also shakes their confidence in you, affects your reputation, and may even lead to lost revenue.

The good news is that it doesn’t have to be this way. When availability monitoring is set up properly, maintained, and you and your employees agree to approach alerts a specific way, you will be able to reap a variety of benefits. Here’s what I recommend:

  1. First, collaborate with your employees to define who is in the pager rotation and the escalation policies. Ask yourself: What happens when the on call engineer is overwhelmed and needs backup? What happens when the engineer goes on vacation?

  2. Next, take inventory of what services you rely on and define an internal SLA for them. This does not have to be a super formal process, but this inventory and SLA will be helpful for deciding what thresholds to set in your monitoring to avoid false positives. Try to see the big picture and think about everything such as:

    • Servers,
    • Self-managed supporting services like web servers, databases, email services,
    • Application functionality and features - one strategy I like is exposing a “health check” service that can be checked by the monitoring agent,
    • Third party services like remote APIs.

Your inventory and SLA definition is a living document; remember to keep it up to date!
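The “health check” service mentioned in step 2 can be tiny. Here’s a sketch using only Python’s standard library; the /health path and the database check are placeholders for your app’s real dependencies and framework:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    """Placeholder dependency check; swap in e.g. a `SELECT 1` in real use."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        checks = {"database": check_database()}
        healthy = all(checks.values())
        body = json.dumps({"healthy": healthy, "checks": checks}).encode()
        # 200 when everything passes, 503 otherwise, so the monitoring
        # agent only needs to look at the status code.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve it: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Point your monitoring agent at the endpoint and alert on anything that isn’t a 200.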

  3. Then set up whatever monitoring package you or your engineers decided to use (self-hosted or third party), such as nagios, Zenoss, Pingdom, or CopperEgg, and have your monitoring configured for those services. If you’re really good, you’ll check your configuration into its own source control repository. If you go the self-hosted route, it may also be worth having your monitoring server monitored externally. Who’s watching the watcher, indeed.

  4. Think about integrating your monitoring with a pager service such as PagerDuty. Services like PagerDuty allow you to input your pager rotation and then define good rules for how to contact the on call engineer and when to escalate should the engineer be unavailable.

  5. With improved monitoring and alerting in place, you may want to think about giving certain customers “911” access. At a previous company I worked at, we had a secret email address our big customers could hit which would open a support ticket and then page the on call engineer with the ticket number. If you decide to go this route, however, you’ll want to train your customers when it’s appropriate to use this power and how to use it most effectively.

  6. Adjust alerts and fix problems as you get paged for them. Don’t care that a particular API goes down during a known maintenance window? Schedule the notification policy accordingly.

  7. Finally, continue maintaining your inventory and monitoring service’s configuration. For extra benefit, consider tracking your organization’s Mean Time To Respond (how long it took for an engineer to acknowledge that something is wrong), your Mean Time To Recover (how long it took the engineer to resolve the issue, including the Mean Time To Respond), your Mean Time Between Failures (self-explanatory, I hope), and Percent Availability (what percent of time your service is functional in a given period of time).
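The metrics in that last step are just arithmetic over your incident log. A sketch with made-up numbers (assuming, for simplicity, that recovery time counts as downtime):

```python
# (minutes to acknowledge, minutes to fully resolve) per incident;
# the resolve figure includes the acknowledgement time.
incidents = [(5, 45), (2, 20), (10, 95)]

mean_time_to_respond = sum(ack for ack, _ in incidents) / len(incidents)
mean_time_to_recover = sum(rec for _, rec in incidents) / len(incidents)

period_minutes = 30 * 24 * 60  # a 30-day month
downtime = sum(rec for _, rec in incidents)
percent_availability = 100 * (1 - downtime / period_minutes)

print(f"MTT Respond:  {mean_time_to_respond:.1f} min")
print(f"MTT Recover:  {mean_time_to_recover:.1f} min")
print(f"Availability: {percent_availability:.2f}%")
```

Even a spreadsheet version of this gives you a trend line to judge whether your monitoring and on-call process is actually improving.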

This concludes the management and non-ops introduction to operations; I hope you find this helpful.

Intro to Operations: Configuration Management

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

One of the areas I’ve witnessed early stage startups lacking in is configuration management. Configuration management is the process of standardizing and enforcing configurations. In other words, configuration management is about deciding on a specific configuration of services for various roles and then applying these configurations in practice. Typically, these configurations are written as manifests in a (domain-specific) language specific to the configuration management software being used, such as puppet, chef, cfengine, or salt stack.

There are many benefits to configuration management. For one, configuration management allows developers to spend more time working on the product and less time deploying new services. This is because configuration is now automated and faster as a result. In addition, environments are standardized and therefore less time is spent troubleshooting or diagnosing edge cases in different environments. Finally, when coupled with source control management, the proper use of configuration management can be used to track and audit what has changed over time and who changed it.

In many of these early stage startups, there is either very little configuration management performed at all, or configuration management exists as a series of shell scripts cobbled together to do some post-hardware setup. If you’re lucky, there exists a document somewhere that describes when and how to run these scripts to deploy new services.

The way configuration management works is that engineers create a collection of files that define how the system should be configured. This collection of files is typically called a manifest. Then, once physical or virtual hardware has been provisioned, one of these manifests is applied to the new host. During application, the configuration management software will interpret the new configuration, install software packages, manage users and credentials, alter config files, manage file permissions, run arbitrary commands, and so on. Once the manifest is fully applied, the new host should be fully configured and ready to be used! In some environments, however, there may be a post-provisioning step where additional work is performed afterwards, such as checking out application code from a source control repository.
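To make the apply step concrete, here’s a toy sketch of that convergence loop (the manifest schema and helper names are invented for illustration; real tools like puppet and chef express this declaratively and handle far more resource types):

```python
# A miniature "manifest": desired packages and file contents for a role.
manifest = {
    "packages": ["nginx", "postgresql"],
    "files": {"/etc/motd": "Managed by configuration management\n"},
}

def apply_manifest(manifest, installed, filesystem):
    """Converge `installed` (a set of package names) and `filesystem`
    (a path -> contents dict) toward the manifest; return the changes made."""
    changes = []
    for pkg in manifest["packages"]:
        if pkg not in installed:
            installed.add(pkg)          # stand-in for the package manager
            changes.append(f"install {pkg}")
    for path, content in manifest["files"].items():
        if filesystem.get(path) != content:
            filesystem[path] = content  # stand-in for writing the file
            changes.append(f"write {path}")
    return changes
```

Note that applying the same manifest twice makes no further changes; this idempotence is what lets configuration management tools run repeatedly and safely enforce the desired state.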

If you’re not using configuration management already then you should start now because, frankly, it’s never too early. Starting configuration management now will not only save your first hired ops/systems engineer from having to work backwards to write these manifests later, but will also yield benefits (such as your developers spending more time shipping value-added code) that will outweigh the initial learning curve.