Charles Hooper

Thoughts and projects from a site reliability engineer

It’s All Fun and Games Until Someone Loses a Life

Two weeks ago, I uninstalled every game I own from my computers and I’m SUPER glad I did.

For a long time, I thought computer games were really fun. I’ve been playing them ever since I was a single digit in age, when my Dad bought a CD-ROM drive and a SoundBlaster 16 sound card. The CD-ROM drive came with a number of CDs, which included evaluation versions of Myst and the original Doom.

These games were like “wow!” They became even more interesting when we could dial into a friend’s game and play deathmatch or cooperatively. This was a lot of fun but it took some coordination since you needed to have a dedicated group of local people who could play at the same time as you. Also I was probably eight at the time so these were all of my Dad’s friends.

As I grew older, the Internet became more ubiquitous even if it was only dial-up at the time. We still had to play with folks who were local, due to latency considerations, but now we could play on dedicated servers (well, sort of) with complete strangers. Thanks to software like GameSpy, we could discover the servers closest to us. Despite the ease of discoverability, the latency restriction and relatively small number of dedicated servers meant we had stronger communities. We all knew each other and, in fact, some communities had entire websites, forums, and other BBS-like functionality built around their respective games.

This pattern would continue for probably another ten years until broadband became commonplace and games began imposing the “matchmaking” pattern. I think that maybe the people who produced and sold games wanted to make it easier for casual or new gamers to get started in multiplayer, which is totally valid, but I feel like this pattern completely killed the community aspect. At least for me.

Still though, I continued to play games regularly. Any time I got a free moment and felt like I had nothing else I needed to do, I’d jump into a game of some kind. If I had a short period of time available to me, I’d hop into a fast-paced FPS (or, eventually, Rocket League). If I had an entire day available, well hello Civilization and Eve Online!

Then, one day maybe two years ago now, Erica asked me to explain what I enjoyed about gaming. I had told her that I enjoyed strategy games like Civ and Eve Online because they were mentally stimulating and FPSes because they’re exciting and distracting. I could describe the precise sensations I got from each game I played but there was something off. The problem was that I didn’t think any of these things were fun! Not anymore, anyway.

Still, that didn’t stop me from playing.

Not until we bought a house and had a lot of things put into perspective. It turns out that, when you buy a house that needs work, literally everything is more important or higher priority than playing video games. I thought I could balance gaming with important things and even other hobbies but I had already failed at that for literally twenty years!

On top of that, for the seventh year in a row I caught myself wishing I wrote more, read more, did more archery, learned some engineering topics really well, started a business, and literally everything else besides playing games.

A friend of mine, Russell, has a side business he started despite having two kids and a demanding full time job. It was the day that I caught myself telling him how jealous I was that I realized my priorities were skewed.

I decided right then that, as an experiment, I would uninstall every single PC game I owned. I reasoned that, since I use Steam, I could recover all of the games if I ever needed to but, since my Internet connectivity is so bad right now, it would take me a solid day just to get a single game reinstalled. That sounded like the perfect barrier to entry! My hypothesis was that, with the barrier to gaming being so high now, I would spend my free time doing literally anything else more productive.

And, guess what? I was right.

It’s only been about two weeks but I’ve been writing about 500 words per day. This isn’t a lot, mind you, but it’s way better than the one thousand words I wrote in all of 2014! I’ve also been reading. I’m still reading Stranger in a Strange Land which I think is a beautiful book but I’ve been “reading it” for probably an entire year now! And, finally, I’ve been working around the house and the yard. I recently built Erica and me a compost bin and we’re getting our new garden ready for our first Spring in our new home.

I know that not everyone has the same prioritization problems as I do and that some people genuinely find gaming fun but I don’t have an ounce of regret about ditching gaming for good.

Network Imagineering

If I made a thousand dollars each time I dreamed of starting an ISP, I would have a couple thousand dollars and be a hundred thousand dollars in debt.

I go through this exercise, planning and designing an ISP, every few years and each time I find that it’s untenable for one reason or another. It turns out that starting an ISP is a CapEx-heavy venture, usually with shitty margins, and slow-starting to boot. For the jargon-shy, CapEx is short for “capital expenditure” and denotes spending a lot of money up front as opposed to paying some money e.g. each month as part of operations.

On top of the heavy up-front investment, your local “wholesale” provider of Internet access tends to hold a monopoly and is also in the retail business so, if you do decide to start your own regional ISP, you’re literally buying service from your competition.

After the military, I started my systems engineering career at a small local ISP as a network engineer. I did a lot of “sysadmin”-type work but I also spent half my time logged into Cisco equipment and even became a Cisco Certified Network Associate (CCNA).

It was so much fun! I actually really liked it and, for a few years at least, I thought I would pursue a career as a network engineer. I considered getting my CCNP (the next level past CCNA) and onward but eventually lost focus when my career took a turn down the Linux systems engineering path.

In any event, the company I worked for was headquartered across from AT&T which is who we bought all of our connectivity solutions from. You know, our provider as well as our competition. At the time, our alleged value-add was that our connectivity was “managed” which to this day I’m still not really sure what that means.

But today I found myself having this dream again. You see, yesterday morning I woke up at 3AM and my Internet connectivity was totally shitty. I was experiencing over 40% packet loss and I was furious! I managed to find my ISP’s number but learned that their tech support, which consists of a single person, didn’t open until 8AM. I patiently waited until 8AM to call, setting up a Raspberry Pi with Smokeping as a monitoring solution and, right at 7:55AM, my Internet connectivity recovered. It was as if someone rolled into the office and rebooted a router.

Having been in the shoes of the early-morning office arriver, this was a totally plausible scenario. But I was mad! I recently made the decision to move to nearly the middle of nowhere, so shitty Internet was always in the cards, but as an engineer who works remotely, crappy Internet is totally unacceptable! So I found myself, again, wondering which obstacles stood in my way of starting my own ISP.

I found this amazing blog series called Tales from the tower and, while the author is a little ranty at times, it’s generally informative and engaging. My general impression from reading so far is that it might be somewhat straightforward (which is not the same as “easy”) to start up a small WISP but radio (RF) engineering has a lot more influence on technical success than network engineering. Additionally, equipment has come way down in price making the “heavy CapEx problem” much more manageable.

While I have no idea what’s next, I hope to continue exploring this idea, even if I never take it further. I hope that it will be interesting for you to read (and for me to document!) how I approach the problem of providing respectable Internet connectivity in the middle of nowhere.

Hsleep: `sleep` With a Countdown

Today I’m open sourcing hsleep. hsleep is a utility which behaves just like GNU sleep(1) in coreutils – and its BSD counterpart – with the addition of a countdown timer which is emitted to standard error.

hsleep counting down

I wrote hsleep because I sometimes find myself needing to delay commands for a few minutes and I couldn’t stand not knowing how much time is left!

hsleep is available on GitHub or – if you have Go installed – can be installed with:

go install

I Have a New Home and New Job!

Almost three years ago, I wrote a post about moving to San Francisco and was happier than a pig in shit. Well, today I’m even more excited to announce that I’ve moved to Oregon!

When I was between the ages of 12 and 14, my parents moved us to a very small farm and I absolutely loved it. I had lots of space to myself, fresh air, and animals to care for. Unfortunately, it wasn’t long before my parents lost the farm and we ended up moving back to the suburbs.

I went the rest of my life (so far :)) reminiscing about that farm and, late last year, Erica and I started looking at homes. We started in the Petaluma area but we found it a bit too expensive for what we were looking for. We gradually continued our search further and further north until we found the Grants Pass/Medford/Ashland, Oregon area.

We put an offer on our house in December and we finally moved in at the beginning of February. It’s been a month since then and we still wake up every day and look out across the valley to exclaim “wow, I still can’t believe we live here!”

View from home

Around the time we put an offer on this house, I also changed jobs. I ultimately wound up at Stripe and, if you’re interested in solving challenging problems, you should apply to come work for us :)

I’m currently assigned to the Systems team as a Site Reliability Engineer working on how Stripe’s engineering teams reliably run and consume services at scale.

Anyway, these last three months have been amazing! I’m looking forward to seeing what the next three bring.

Briefly: Operator Requirements

On any given day, there are a number of people discussing user requirements and prioritizing the work ahead of them based on those requirements. There’s an oft-underrepresented group of users, however: your operators. Typically, the things your operators need are buried in your project’s list of “non-functional requirements”, if they’re captured at all.

In this brief, I would like to provide you with a de facto set of “operator requirements” for your project. This list is likely incomplete and I’m discovering more every day. I may update this post from time to time to add things or clarify them as I journey towards understanding.

An application that satisfies these requirements will be more scalable, easier to operate, and likely have a lower Mean Time To Recovery than an application that does not.

  1. In general you should strive to adhere to 12factor if you’re building a web application. 12factor creates a clean contract between your application and the operating system, enables simpler deployments, and results in applications that are mostly horizontally scalable by default. If you cannot adhere to 12factor, then I would challenge you to borrow as much of it as you can before discounting the whole 12factor methodology.

  2. Your application should have plenty of logging and follow best practices.

  3. Your application should also emit metrics that create some sense of understanding of what the system is doing.

  4. Your application’s services should have health checks. The health checks should return HTTP 2xx or 3xx when the service is healthy and HTTP 5xx when it is not. The response body should contain an explanation or identifier that will allow the operator to determine why the health check failed to aid in incident recovery.

  5. Your application should use unique request IDs and add them to their logging contexts (see logging).

  6. Your application should support credential rotation. Any given secret, whether it’s a password, API key, SSL private key, or otherwise, should be changeable with minimal disruption to the service. This should be exercised often to ensure it works as designed.

  7. Your application should provide operators with toggles or feature flags — parameters that allow the operators or the system itself to turn off bits of functionality when the system is degraded.

  8. Your application should put external resources behind circuit breakers. Circuit breakers allow your app to continue operating (albeit in a degraded state) when an external resource is unavailable instead of taking your application offline.

  9. Your application should be disposable and restartable; this means that it’s restartable on the same instance or a new instance after a crash and should crash in an automatically recoverable state. If your crash is not automatically recoverable, it should scream! In addition, your application should gracefully complete existing work such as HTTP requests or jobs it picked up from a task queue. In the case of long running jobs, your application should be able to abandon the work to have it picked up by another worker or node.

These are just a start but these requirements should be imported into your project’s requirements and prioritized with maintainability in mind. By doing so, your application will be more scalable, easier to operate, and have a lower Mean Time To Recovery than an application that doesn’t satisfy these requirements.

Do you feel like I missed anything? What else would you recommend?

Briefly: Health Checks

Health checks are specially defined endpoints or routes in your application that allow external monitors to determine the health of your web application. They are so important to production health that I consider them the “13th factor” in 12factor.

If an application is healthy it will return an HTTP 2xx or 3xx status code and when it is not it will return an HTTP 5xx status code.

This type of output allows load balancers to remove unhealthy instances from rotation but can also be used to alert an operator or even automatically replace the instance.

In order to implement proper health checks, your application’s health checks should:

  1. Return an HTTP 2xx or 3xx status code when healthy

  2. Return an HTTP 5xx status code when not healthy

  3. Include the reason why the check failed in the response body

  4. Log the requests and their results along with Request IDs

  5. Not have any side effects

  6. Be lightweight and fast

If you implement health checks in your application following this advice, you’ll have a more resilient, monitorable, and manageable application.

How about you all? Is there anything you would add?

Briefly: Logs

Recently I was asked by another engineer what information I expect to be able to find in logs. For this, I mostly agree with Splunk’s best practices but I have some additional advice I want to provide. I’ll end up regurgitating some of Splunk’s recommendations anyway.

  1. Your logs should be human readable. This means logging in text (no binary logging) and in a format that can be read by angry humans. Splunk recommends key-value pairs (e.g. at=response code=200 bytes=1024) since it makes Splunking easy, but I don’t have a strong enough opinion to evangelize that. Some folks advocate for logging in JSON but I don’t actually find JSON to be very readable.

    Edit: Someone pointed out to me that this isn’t ideal when you have a large amount of logs. They preferred sending JSON logs to a service like ElasticSearch but I think sending key-value pairs to Splunk is also reasonable at some scale.

  2. Every log line should include a timestamp. The timestamp should be human readable and in a standard format such as RFC 3339/ISO 8601. Finally, even though the above specs include a timezone offset, timestamps should be stated in UTC time whenever possible.

  3. Every log line should include a unique identifier for the work being performed. In web applications and APIs, for example, this would be a request ID. The combination of a unique ID and timestamp allows for developers and operators to trace the execution of a single work unit.

  4. More is more. While I don’t particularly enjoy reading logs, I have always been more happy when an application logs more information than I need versus when an application doesn’t log enough information. Be verbose and log everything.

  5. Make understanding the code path of a work unit easy. This means logging file names, class names, function or method names, and so on. When sensible, include the arguments to these things as well.

  6. Use one line per event. Multi-line events are bad because they are difficult to grep or Splunk. Keep everything on one log line but feel free to log additional events. An exception to this rule might be tracebacks (see what I did there?)

  7. Log to stdout if you’re following 12factor otherwise log to syslog. Do not write your own log files! By writing your own log files, you are either taking log rotation off the table or signing yourself up to support exciting requirements like re-opening logs on SIGHUP (let’s not go there).

  8. Last but not least: Don’t write your own logging library! Chances are there already exists a well thought-out and standard library available in your application’s language or framework. Please use it!

So those are my recommendations about logs. What else would you recommend?

I Have a New Job at Truss!

Two weeks ago I started a new job at Truss after leaving Heroku two months ago.

Working at Heroku was an amazing experience in many ways. I achieved the highest level of work-life balance so far in my life, I had great coaches, and I solved a lot of challenging and interesting problems.

But it’s time to move on so after a month and a half of downtime I’ve joined Truss as an operations engineer.

I joined Truss for a number of reasons:

  1. I wanted to consult again; consultants are given more ownership of the problems they are tasked with solving and there’s always something new to do

  2. I believe there is a ton of opportunity for infrastructure consulting and engineering, both in government and in private industry

  3. I wanted to work with the folks on this team in particular

Thanks to all the folks who made my time at Heroku awesome and the folks who have been most welcoming at Truss. I’m already enjoying working together!

Personal Archery FAQ

When people learn that I’m a traditional archer, they tend to ask me a number of questions. I thought this might be a fun blog post, so here we go!

How far do I shoot?

I don’t shoot very far! Because I’m currently focused on form (more on that below), I am only shooting from ten to twenty meters.

How accurate am I?

I’m not accurate at all, because I’m currently focused on precision instead. This is mostly a matter of form and consistency, which is why I don’t have to “shoot” (“loose”? “let”?) from very far. I can spend an entire session at 10 meters and get very rapid feedback on this, so I do. (Note: I’m being pedantic about precision vs. accuracy; explanation in the image below.)

"Precision vs Accuracy"

A variation of this question is sometimes if I’ve ever “robin hooded” an arrow. Robin hooding is when you shoot one arrow back into another one.

Not yet - arrow damage

My answer today is “not yet!”

Have you seen the video of that Danish guy? Lars something? It’s really amazing!

Ah yes, Lars Anderson. He’s done a couple of videos but one in particular has gotten a lot of attention recently.

The videos are pretty amazing to watch and highly entertaining, but you won’t find me doing those things any time soon. I’m perfectly content right here.

"Archery point of view"

Where do I go?

There’s an archery range in Golden Gate Park which is free to use. Nearby is the San Francisco Archery Pro Shop which has good lessons and cheap all-day rentals.

How did I get started?

I had been interested in archery for several years but it wasn’t until a couple of months ago that I finally tried it.

I went to a local shop and took my first lesson. It was so much fun that I took another one and, when I tried to sign up for my third, my instructor suggested that I should make some time to practice all of the things I’ve just learned on my own.

So I rented the equipment a few times and eventually just bought my own.

What did you buy?

I bought a pretty basic recurve bow and everything else I needed:

  • Six arrows

  • A quiver

  • An arm guard and glove

  • A bow stringer

  • Target pins

  • And a carrying bag

All in all, it cost me about $350.

What Am I Supposed to Do With All of This Honey?!

Two gallons of honey

For Christmas this year, one of the wonderful gifts I received was this two gallon bucket full of fresh, local honey. My friend, the gift giver, offered the suggestion of brewing some mead which is too sweet for my tastes. So what should I do with it then?

I thought I’d brew some beer with it but two gallons of pure honey (or whatever that translates to in weight) is a lot of honey and, since I’m currently only doing one gallon batches of home brewed beer, it’ll take quite a bit of homebrew to get through it all.

A quick search for honey recipes yields some interesting ideas as well, with the top contender being these sweet and sour glazed pork chops.

So what do you think? What should I do with all this honey?