Charles Hooper

Thoughts and projects from a hacker and engineer

Briefly: Operator Requirements

On any given day, plenty of people are discussing user requirements and prioritizing their work based on them. There’s an oft-underrepresented group of users, however: your operators. Typically, the things your operators need are buried in your project’s list of “non-functional requirements,” if they’re captured at all.

In this brief, I would like to provide you with a de facto set of “operator requirements” for your project. This list is likely incomplete and I’m discovering more every day. I may update this post from time to time to add things or clarify them as I journey towards understanding.

An application that satisfies these requirements will be more scalable, easier to operate, and likely have a lower Mean Time To Recovery than an application that does not.

  1. In general you should strive to adhere to 12factor if you’re building a web application. 12factor creates a clean contract between your application and the operating system, enables simpler deployments, and results in applications that are mostly horizontally scalable by default. If you cannot adhere to 12factor, then I would challenge you to borrow as much of it as you can before discounting the whole 12factor methodology.

  2. Your application should log generously and follow logging best practices (see the logs post below).

  3. Your application should also emit metrics that give operators a clear picture of what the system is doing.

  4. Your application’s services should have health checks. The health checks should return HTTP 2xx or 3xx when the service is healthy and HTTP 5xx when it is not. The response body should contain an explanation or identifier that will allow the operator to determine why the health check failed to aid in incident recovery.

  5. Your application should use unique request IDs and add them to its logging context (see logging).

  6. Your application should support credential rotation. Any given secret, whether it’s a password, API key, SSL private key, or otherwise, should be changeable with minimal disruption to the service. This should be exercised often to ensure it works as designed.

  7. Your application should provide operators with toggles or feature flags — parameters that allow the operators or the system itself to turn off bits of functionality when the system is degraded.

  8. Your application should put external resources behind circuit breakers. Circuit breakers allow your app to continue operating (albeit in a degraded state) when an external resource is unavailable instead of taking your application offline. A minimal sketch follows this list.

  9. Your application should be disposable and restartable; it should be restartable (on the same instance or a new one) after a crash, and it should crash in an automatically recoverable state. If a crash is not automatically recoverable, the application should scream! In addition, your application should gracefully complete existing work such as HTTP requests or jobs picked up from a task queue. For long-running jobs, your application should be able to abandon the work so another worker or node can pick it up.
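To make the circuit-breaker requirement concrete, here is a minimal sketch in Python. It is illustrative only, not a production implementation (libraries such as pybreaker cover the hard parts), and the threshold and reset values are arbitrary assumptions:

import time

class CircuitBreaker:
    # Fail fast after repeated failures instead of hammering a dead resource.
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold      # consecutive failures before opening
        self.reset_after = reset_after  # seconds to stay open before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()       # open: fail fast, serve degraded result
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0               # success closes the breaker
        return result

Wrapping a call to an external resource then looks like breaker.call(lambda: fetch_recommendations(user), fallback=lambda: []), where fetch_recommendations is a hypothetical call to a flaky dependency and the empty list is the degraded behavior.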

These are just a start, but they should be imported into your project’s requirements and prioritized with maintainability in mind. By doing so, your application will be more scalable, easier to operate, and have a lower Mean Time To Recovery than applications that don’t satisfy these requirements.

Do you feel like I missed anything? What else would you recommend?

Briefly: Health Checks

Health checks are specially defined endpoints or routes in your application that allow external monitors to determine the health of your web application. They are so important to production health that I consider them the “13th factor” in 12factor.

If an application is healthy, it will return an HTTP 2xx or 3xx status code; when it is not, it will return an HTTP 5xx status code.

This type of output allows load balancers to remove unhealthy instances from rotation, but it can also be used to alert an operator or even automatically replace the instance.

In order to implement proper health checks, your application’s health checks should (a minimal sketch follows this list):

  1. Return an HTTP 2xx or 3xx status code when healthy

  2. Return an HTTP 5xx status code when not healthy

  3. Include the reason why the check failed in the response body

  4. Log the requests and their results along with Request IDs

  5. Not have any side effects

  6. Be lightweight and fast
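Here is a minimal sketch of such an endpoint using only Python’s standard library. The check_database helper is a hypothetical stand-in for whatever dependency checks make sense for your application:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Hypothetical dependency check: return (True, "") when healthy,
    # (False, "reason") when not. Keep it cheap and side-effect free.
    return True, ""

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy, reason = check_database()
        status = 200 if healthy else 503
        # The body explains *why* the check failed, to aid incident recovery.
        body = json.dumps({"status": "ok" if healthy else "unhealthy",
                           "reason": reason}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), HealthHandler).serve_forever()

Point your load balancer or external monitor at /health and treat anything other than a 2xx/3xx as unhealthy.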

If you implement health checks in your application following this advice, you’ll have a more resilient, monitorable, and manageable application.

How about you all? Is there anything you would add?

Briefly: Logs

Recently I was asked by another engineer what information I expect to be able to find in logs. For this, I mostly agree with Splunk’s best practices, but I have some additional advice to offer (and I’ll end up regurgitating some of Splunk’s recommendations along the way).

  1. Your logs should be human readable. This means logging in text (no binary logging) and in a format that can be read by angry humans. Splunk recommends key-value pairs (e.g. at=response code=200 bytes=1024) since it makes Splunking easy, but I don’t have a strong enough opinion to evangelize that. Some folks advocate for logging in JSON but I don’t actually find JSON to be very readable.

    Edit: Someone pointed out to me that this isn’t ideal when you have a large volume of logs. They preferred sending JSON logs to a service like Elasticsearch, though I think sending key-value pairs to Splunk is also reasonable at some scale.

  2. Every log line should include a timestamp. The timestamp should be human readable and in a standard format such as RFC 3339/ISO 8601. Finally, even though the above specs include a timezone offset, timestamps should be stated in UTC time whenever possible.

  3. Every log line should include a unique identifier for the work being performed. In web applications and APIs, for example, this would be a request ID. The combination of a unique ID and timestamp allows developers and operators to trace the execution of a single work unit.

  4. More is more. While I don’t particularly enjoy reading logs, I have always been happier when an application logs more information than I need than when it doesn’t log enough. Be verbose and log everything.

  5. Make understanding the code path of a work unit easy. This means logging file names, class names, function or method names, and so on. When sensible, include the arguments to these things as well.

  6. Use one line per event. Multi-line events are bad because they are difficult to grep or Splunk. Keep everything on one log line but feel free to log additional events. An exception to this rule might be tracebacks (see what I did there?)

  7. Log to stdout if you’re following 12factor otherwise log to syslog. Do not write your own log files! By writing your own log files, you are either taking log rotation off the table or signing yourself up to support exciting requirements like re-opening logs on SIGHUP (let’s not go there).

  8. Last but not least: Don’t write your own logging library! Chances are there already exists a well thought-out and standard library available in your application’s language or framework. Please use it!
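To tie several of these recommendations together (UTC timestamps, key-value pairs, request IDs, one event per line, logging to stdout, and a standard library), here is a short sketch using Python’s stdlib logging module. The request_id plumbing is simplified for illustration:

import logging
import sys
import time
import uuid

# RFC 3339-style UTC timestamps plus key-value pairs, one event per line.
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03dZ request_id=%(request_id)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime            # state timestamps in UTC

handler = logging.StreamHandler(sys.stdout)  # 12factor: log to stdout
handler.setFormatter(formatter)

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Tag every event from one unit of work with the same request ID.
request_id = uuid.uuid4().hex
logger.info("at=response code=200 bytes=1024", extra={"request_id": request_id})

In a real application the request ID would come from an inbound header (or be generated at the edge) and be attached with a logging.Filter rather than passed by hand.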

So those are my recommendations about logs. What else would you recommend?

I Have a New Job at Truss!

Two weeks ago I started a new job at Truss after leaving Heroku two months ago.

Working at Heroku was an amazing experience in many ways. I achieved the highest level of work-life balance so far in my life, I had great coaches, and I solved a lot of challenging and interesting problems.

But it’s time to move on, so after a month and a half of downtime I’ve joined Truss as an operations engineer.

I joined Truss for a number of reasons:

  1. I wanted to consult again; consultants are given more ownership of the problems they are tasked with solving and there’s always something new to do

  2. I believe there is a ton of opportunity for infrastructure consulting and engineering, both in government and in private industry

  3. I wanted to work with the folks on this team in particular

Thanks to all the folks who made my time at Heroku awesome and the folks who have been most welcoming at Truss. I’m already enjoying working together!

Personal Archery FAQ

When people learn that I’m a traditional archer, they tend to ask me a number of questions. I thought this might be a fun blog post, so here we go!

How far do I shoot?

I don’t shoot very far! Because I’m currently focused on form (more on that below), I am only shooting from ten to twenty meters.

How accurate am I?

I’m not accurate at all, because I’m currently focused on precision. Precision is mostly a matter of form and consistency, which is why I don’t have to “shoot” (“loose”? “let”?) from very far. I can spend an entire session at 10 meters and get very rapid feedback on this, so I do. (Note: I’m being pedantic about precision vs accuracy; explanation in the image below.)

"Precision vs Accuracy"

A variation of this question is whether I’ve ever “robin hooded” an arrow. Robin hooding is when you shoot an arrow into the back of another arrow that’s already in the target.

Not yet - arrow damage

My answer today is “not yet!”

Have you seen the video of that Danish guy? Lars something? It’s really amazing!

Ah yes, Lars Andersen. He’s done a couple of videos, but one in particular has gotten a lot of attention recently.

The videos are pretty amazing to watch and highly entertaining, but you won’t find me doing those things any time soon. I’m perfectly content right here.

"Archery point of view"

Where do I go?

There’s an archery range in Golden Gate Park which is free to use. Nearby is the San Francisco Archery Pro Shop which has good lessons and cheap all-day rentals.

How did I get started?

I had been interested in archery for several years but it wasn’t until a couple of months ago that I finally tried it.

I went to a local shop and took my first lesson. It was so much fun that I took another one and, when I tried to sign up for my third, my instructor suggested that I make some time to practice all of the things I’d just learned on my own.

So I rented the equipment a few times and eventually just bought my own.

What did you buy?

I bought a pretty basic recurve bow and everything else I needed:

  • Six arrows

  • A quiver

  • An arm guard and glove

  • A bow stringer

  • Target pins

  • And a carrying bag

All in all, it cost me about $350.

What Am I Supposed to Do With All of This Honey?!

Two gallons of honey

For Christmas this year, one of the wonderful gifts I received was this two-gallon bucket full of fresh, local honey. My friend, the gift giver, suggested brewing some mead, but mead is too sweet for my tastes. So what should I do with it instead?

I thought I’d brew some beer with it, but two gallons of pure honey (or whatever that translates to in weight) is a lot of honey and, since I’m currently only doing one-gallon batches of homebrewed beer, it’ll take quite a bit of homebrew to get through it all.

A quick search for honey recipes yields some interesting ideas as well, with the top contender being these sweet and sour glazed pork chops.

So what do you think? What should I do with all this honey?


Give It a Rest

Dying zucchini (courgette)

Yesterday I walked out to my yard to find that all of my zucchini (courgette) plants are dying. It was cold enough the night before to kill them all at once and there really isn’t any saving them.

We got greedy at first and wanted to replant the newly empty space, until we realized that our garden had been running strong for about nine months, since Erica and I planted our first seeds of the season in March of last year. Now we’ve decided to give it a rest so we can add some new compost to fortify the soil.

For me, the dying zucchini plants serve as a reminder that nature intends for us to rest and will find a way to see that we do (sometimes permanently). Plants drop seed, decompose, and ultimately end up feeding their offspring. Animals need to eat and sleep.

Caesar not sleeping

It’s a new year and you likely just had some time off that wasn’t very restful. If you’re like me, you were probably busy traveling, shopping, worrying about money, or the holidays may even open up some new wounds for you. Don’t accept this as your vacation for the year.

Your real vacation should be about rest and relaxation. You can travel or you can stay home, but the bottom line is that you should be relaxing. Sleep in some days if you’d like. When you’re awake, maybe read a book, go for a walk, or take up that hobby you’ve been thinking about. On my last stay-cation, I tried out archery for the first time and have been hooked ever since.

So start planning your time off now. Trust me, it’s worth it.

First archery target

On Slack and Upkeep

A term I hear often in the context of engineering and project management is “slack.” It’s often used to refer to a magical pool of time that all of a service’s upkeep, including maintenance and operations, is going to come out of. This is wrong though. Here’s why:

  • That’s not what slack is for

  • Mismanaged slack is equivalent to non-existent slack

What is it then?

I subscribe to the definition in Tom DeMarco’s Slack, which is “the degree of freedom required to effect change.”

Slack is something you maintain so that your team is responsive and adaptable; it is not “extra time” or “maintenance time.” If you treat slack as a pool of pre-allocated work time, you are effectively eliminating it. Signs you or your team may be guilty of this:

  • You don’t make explicit allocations of time to operations or maintenance upkeep

  • You don’t “have enough time” to properly operate or maintain your services

  • You can’t solve problems or complete remediation items identified by your organization’s problem management program

So I should do nothing then?

Well, no. At least some of your slack needs to be spent idle, though. Remember that the concept of slack is rooted in queueing theory, where there’s a well-known relationship between utilization and response time. The relationship is dramatically nonlinear: for a simple M/M/1 queue, response time is proportional to 1/(1 − utilization), so it blows up as your team approaches full utilization. You can see it for yourself below:

Relationship between utilization and response time

We can tell by looking at this graph that responsiveness falls apart at about 70% utilization, which means you should keep at least 30% of your time unallocated.
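If you want to reproduce the shape of that curve yourself, here is a quick sketch. It assumes the simplest queueing model (M/M/1), where mean response time is the service time divided by (1 − utilization):

# Mean response time for an M/M/1 queue: R = S / (1 - rho),
# where S is mean service time and rho is utilization.
S = 1.0  # one unit of service time, so R reads as a multiplier

for rho in (0.30, 0.50, 0.70, 0.90, 0.95, 0.99):
    R = S / (1 - rho)
    print(f"utilization={rho:.0%}  response_time={R:.1f}x")

At 50% utilization, a piece of work takes 2x its service time to finish; at 70%, about 3.3x; at 90%, 10x. That knee is why keeping roughly 30% of your time unallocated is a defensible rule of thumb.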

Unallocated? Why can’t I just devote 30% of my time to upkeep?

Because upkeep, the maintenance and operation of your service, is a required activity. Entropy means that, left unkept, your service will degrade over time, and the decay accelerates if your service is experiencing growth. Your databases will bloat, your latency will increase, your 99.99% success rate will fall to 99.9% (or worse), your service will become difficult to add features to, and eventually your users will go somewhere else.

Instead of thinking about it like this:

Wrong way to manage slack

Think about it like this:

Right way to manage slack

In this model, you explicitly allocate time to upkeep and maintain a slack pool.

How much time should I spend on upkeep versus product and feature work?

I don’t have a good guideline for you, sorry. You’ll need to determine this based on your organization’s or team’s goals and any SLAs you may have.

For example, if you’re operating a service with a service-level objective of a 99.99% success rate (0.01% error rate), then you generally need to allocate more time to upkeep than a service targeting a 99.9% success rate.

Note that this will change and vary over time. If you’re already deep in technical debt, your upkeep allocation will need to be much higher to pay off some of your principal. Once you’ve done that, you’ll probably be able to meet your goals with a much lower allocation later on.

Call to action

I urge everyone to start thinking about slack and upkeep this way. Take a close look at your team’s goals and commitments and explicitly allocate time for reaching those goals. Doing so will allow your team to properly maintain the services which it operates while also being very responsive.

What I Do as an SRE

Sometimes people ask me what I do and I’m not really sure how to answer them. My answer tends to depend on social setting, what I’ve been working on, and if I was on call that week. No matter the circumstances, it usually comes out pretty boring and terribly short.

This really sucks though, because I actually really like my job and think that it’s interesting, if only I could articulate it.

So here’s some attempt at explaining what I do:

  • I’m an SRE, or Service Reliability Engineer, at Heroku. Typically, SRE stands for Site Reliability Engineer, however we’ve modernized it at Heroku because what is even a site anymore?

  • My week-to-week is wildly unpredictable. This week I’m conducting an operational review of one of our key platform components, last week I was investigating and addressing database bloat, and the week before I was the on-call incident commander and quite busy due to several incidents that occurred.

  • Speaking of the incident commander role, part of my job includes defining how we respond to incidents. At first glance it seems easy: Get paged and show up. And then you respond to your first 24-hour slow-burning incident and realize that you’ve got more work to do.

  • Following incidents, I also schedule and facilitate retrospectives. We practice blameless postmortems, and these tend to be incredibly constructive.

  • I also analyze past incident data and look for patterns and trends. Wondering if there’s a day of week that has a higher probability of experiencing an incident? Yeah, it’s Friday.

  • When all is quiet, I review dashboards and investigate anomalies. Wondering what that weird spike or dip is that seems to happen every once in a while? Ask me, I’ve probably pulled that thread before (and if I haven’t, I’ll be terribly curious).

  • And sometimes I build integration tests and tools. I wrote elbping, for instance, because ELBs were terrible to troubleshoot during an incident.

  • And, most importantly, I mentor other SREs and software engineers. This is the single biggest thing I can do in terms of impact, and probably the most rewarding, too.

So there you have it, that’s what I do.

P.S. - If this sounds interesting to you, we’re hiring!

Troubleshooting ELBs With Elbping

Troubleshooting ELBs can be pretty painful at times because they are largely a black box. There aren’t many metrics available, and the ones that do exist are aggregated across all of the nodes of an ELB. This can be troublesome at times, for example when only a subset of an ELB’s nodes are degraded.

ELB Properties

ELBs have some interesting properties. For instance:

  • ELBs are made up of 1 or more nodes
  • These nodes are published as A records for the ELB name
  • These nodes can fail, or be shut down, and connections will not be closed gracefully
  • It often requires a good relationship with Amazon support ($$$) to get someone to dig into ELB problems

NOTE: Another interesting, though slightly less pertinent, property is that ELBs were not designed to handle sudden spikes of traffic. They typically require 15 minutes of heavy traffic before they will scale up, though they can be pre-warmed on request via a support ticket.

Troubleshooting ELBs (manually)

Update: Since writing this blog post, AWS has migrated all ELBs to use Route 53 for DNS. In addition, all ELBs now have an all.$elb_name record that will return the full list of nodes for the ELB, so you can get the full list by doing something like dig all.$elb_name. Route 53 is also able to return up to 4KB of data while still using UDP, so the +tcp flag may no longer be necessary.

Knowing this, you can do a little bit of troubleshooting on your own. First, resolve the ELB name to a list of nodes (as A records):

$ dig +tcp ANY $elb_name

The +tcp flag is suggested because your ELB could have too many records to fit inside of a single UDP packet. You also need to perform an ANY query because Amazon’s nameservers will only return a subset of the nodes otherwise. Running this command will give you output that looks something like this (trimmed for brevity, with names and addresses shown as placeholders):

;; ANSWER SECTION:
$elb_name    60   IN   SOA   <ns_record> <hostmaster> 1376719867 3600 900 7776000 60
$elb_name    600  IN   NS    <ns_record>
$elb_name    60   IN   A     <ip_of_node_1>
$elb_name    60   IN   A     <ip_of_node_2>

Now, for each of the A records, use e.g. curl to test a connection to the ELB. Of course, you also want to isolate your test to just the ELB without connecting to your backends. One final property and little-known fact about ELBs:

  • The maximum size of the request method (verb) that can be sent through an ELB is 127 characters. Any larger and the ELB will reply with an HTTP 405 - Method not allowed.

This means that we can take advantage of this behavior to test only that the ELB is responding:

$ curl -X $(python -c 'print "A" * 128') -i http://ip.of.individual.node
HTTP/1.1 405 METHOD_NOT_ALLOWED
Content-Length: 0
Connection: Close

If you see HTTP/1.1 405 METHOD_NOT_ALLOWED then the ELB is responding successfully. You might also want to adjust curl’s timeouts to values that are acceptable to you.
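If you’re curious what automating this looks like, here is a rough sketch in Python of the same check ( is a placeholder for your ELB’s DNS name); the next section introduces a proper tool:

import http.client
import socket

ELB_NAME = ""  # placeholder: your ELB's DNS name

# Resolve the ELB to node IPs. Note: a plain lookup may return only a
# subset of nodes; prefer the all.$elb_name record where available.
ips = sorted({info[4][0] for info in
              socket.getaddrinfo(ELB_NAME, 80, proto=socket.IPPROTO_TCP)})

for ip in ips:
    conn = http.client.HTTPConnection(ip, 80, timeout=5)
    try:
        # A method longer than 127 characters never reaches a backend,
        # so a 405 response proves the ELB node itself is answering.
        conn.request("A" * 128, "/")
        resp = conn.getresponse()
        print(f"{ip}: code={resp.status}")
    except (OSError, http.client.HTTPException) as exc:
        print(f"{ip}: error={exc!r}")
    finally:
        conn.close()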

Troubleshooting ELBs using elbping

Of course, doing this can get pretty tedious, so I’ve built a tool to automate it called elbping. It’s available as a Ruby gem, so if you have RubyGems then you can install it by simply doing:

$ gem install elbping

Now you can run:

$ elbping -c 4 http://$elb_name
Response from <ip_of_node_1>: code=405 time=210 ms
Response from <ip_of_node_2>: code=405 time=189 ms
Response from <ip_of_node_1>: code=405 time=191 ms
Response from <ip_of_node_2>: code=405 time=188 ms
Response from <ip_of_node_1>: code=405 time=190 ms
Response from <ip_of_node_2>: code=405 time=192 ms
Response from <ip_of_node_1>: code=405 time=187 ms
Response from <ip_of_node_2>: code=405 time=189 ms
--- <ip_of_node_1> statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 187/194/210 ms
--- <ip_of_node_2> statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 188/189/192 ms
--- total statistics ---
8 requests, 8 responses, 0% loss
min/avg/max = 187/192/210 ms

Remember, if you see code=405 then that means that the ELB is responding.

Next Steps

Whichever method you choose, you will at least know if your ELB’s nodes are responding or not. Armed with this knowledge, you can either turn your focus to troubleshooting other parts of your stack or be able to make a pretty reasonable case to AWS that something is wrong.

Hope this helps!