Charles Hooper

Thoughts and projects from a site reliability engineer

Intro to Operations for Early Stage Startups

I’ve spent the last few years in a variety of roles for early stage tech startups. While in these roles, I’ve noticed a pattern: Early stage startups don’t give much thought to their operations. In particular, they typically don’t hire anyone specifically for that role because they are focused on building their product. In other words, all of their technical hires are for developers.

What tends to happen in my experience is that their developers soon become overwhelmed (especially after a growth spurt) and are unable to spend their time shipping code that’s going to improve their product or make their company money. Eventually, if they’re lucky, management catches on to this and hires their first systems or operations engineer.

Because I’ve had the opportunity to be the first-hired systems engineer, I’ve experienced the effect of “working backwards” to undo a bunch of things that weren’t done according to best practices while simultaneously moving things forward.

I decided to try to educate whoever would be willing to read this (hopefully early stage startups!) about some best practices that will not only save their future operations engineers some headache, but will also improve their business. Part of this education will happen in the form of one-on-one time with these startups. For example, I spent the last couple of days sitting in on office hours at a startup accelerator. The other part, however, will take place by writing “Intro to…” articles and publishing them to a variety of places, including this blog.

Specifically, the topics I’ve chosen to write about are:

Over the next week or so, I’ll write about each one of these topics and post them to this blog. I hope people find them helpful!

2012 Annual Review

Here are some stats from 2012:

  • Only 3 posts published.
  • 50,791 pageviews.
  • 36,176 visitors (32,007 unique).
  • 71.97% bounce rate.
  • Same great position at dotCloud!

Interestingly, my stats compared to last year aren’t too different, despite the fact that I only wrote 3 blog posts this year (instead of 10 last year). What’s also interesting is the HUGE spike in bounce rate (which used to be almost non-existent). This spike begins right around the time I hit a “home run” in terms of driving new traffic.

Going forward, I am going to try to post more (again). I’ve said this before but have yet to succeed. This time around, I changed my posting rules to allow me to write about topics more personal to me or more opinionated in nature.

An Early Year-end Review – Highlights

Since I’ve only written 3 blog posts so far this year and a lot has changed, here’s a brief summary of what I’ll be writing about in my annual review:

  • I stopped doing contract work and started working for dotCloud part time,
  • As of tomorrow, I’ll officially have completed the requirements for my Bachelor’s degree,
  • I found myself, politically (I won’t be writing about this, but it is a big deal to me).

What does the future hold for me?

  • dotCloud offered me a full time position and I accepted; I will start January 2nd,
  • I have a variety of side projects planned but it’s too soon to tell how much of my free time I’ll want to spend on them,
  • I hope to become more involved with my local community and the ACM,
  • I’ve made the decision to stop writing so much on Facebook and Twitter and write more here.

That last item has big implications for this blog. It means that I’ll be writing about more personal subjects and opinions instead of wasting that effort on Twitter and Facebook (where relatives I haven’t seen in 20 years take the opportunity to rant on my wall). Some of these opinions may be controversial, and that’s OK; we’re meant to disagree on some things. Some of them may even be ill-formed, as they’ll be outside my area of expertise. That’s OK too; just write me and educate me.

Painless Instrumentation of Celery Tasks Using Statsd and Graphite

For one of my clients and side projects, we’ve been working hard to build application-level metrics into our wide portfolio of services. Among these services is one built on top of the Celery distributed task queue. We wanted a system that required as little configuration as possible to publish new metrics, so we decided on statsd and graphite. Getting statsd and graphite running was the easy part, but we needed a quick, painless way of adding the instrumentation code for the most basic metrics to our Celery-backed service.

For us, those basic metrics consisted of:

  • Number of times a worker starts on a specific task
  • Number of times a task raises an exception
  • Number of times a task completes successfully (no exceptions)
  • How long each task takes to complete

Since the code to enable these metrics just wraps the code being instrumented, it seemed only natural to use a decorator. Below is the code I wrote to do just that.

statsd_instrument.py
"""Decorator to quickly add statsd (graphite) instrumentation to Celery
task functions.

With some slight modification, this could be used to instrument just
about any (non-celery) function and be made abstract enough to customize
metric names, etc.

Stats reported include number of times the task was accepted by a worker
(`started`), the number of successes, and the number of times the task
raised an exception. In addition, it also reports how long the task took
to complete. Usage:

>>> @task
... @instrument_task
... def mytask():
...     # do stuff
...     pass

Please note that the order of decorators is important to Celery. See
http://ask.github.com/celery/userguide/tasks.html#decorating-tasks
for more information.

Uses `simple_decorator` from
http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

Limitation: Does not readily work on subclasses of celery.tasks.Task
because it always reports `task_name` as 'run'
"""

# statsd instrumentation
from celery import current_app
import statsd

# NOTE: `simple_decorator` must be defined before it is applied to
# `instrument_task` below, or Python raises a NameError at import time.
def simple_decorator(decorator):
    """Borrowed from:
    http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

    Original docstring:
    This decorator can be used to turn simple functions
    into well-behaved decorators, so long as the decorators
    are fairly simple. If a decorator expects a function and
    returns a function (no descriptors), and if it doesn't
    modify function attributes or docstring, then it is
    eligible to use this. Simply apply @simple_decorator to
    your decorator and it will automatically preserve the
    docstring and function attributes of functions to which
    it is applied."""
    def new_decorator(f):
        g = decorator(f)
        g.__name__ = f.__name__
        g.__module__ = f.__module__  # or celery throws a fit
        g.__doc__ = f.__doc__
        g.__dict__.update(f.__dict__)
        return g
    # Now a few lines needed to make simple_decorator itself
    # be a well-behaved decorator.
    new_decorator.__name__ = decorator.__name__
    new_decorator.__doc__ = decorator.__doc__
    new_decorator.__dict__.update(decorator.__dict__)
    return new_decorator

@simple_decorator
def instrument_task(func):
    """Wraps a celery task with statsd instrumentation code"""

    def instrument_wrapper(*args, **kwargs):
        stats_conn = statsd.connection.Connection(
            host=current_app.conf['STATSD_HOST'],
            port=current_app.conf['STATSD_PORT'],
            sample_rate=1)

        task_name = func.__name__

        # Count every time a worker picks up this task
        counter = statsd.counter.Counter('celery.tasks.status', stats_conn)
        counter.increment('{task_name}.started'.format(**locals()))

        # Time how long the task takes to complete
        timer = statsd.timer.Timer('celery.tasks.duration', stats_conn)
        timer.start()

        try:
            ret = func(*args, **kwargs)
        except:
            # Count the failure, then let the exception propagate to Celery
            counter.increment('{task_name}.exceptions'.format(**locals()))
            raise
        else:
            counter.increment('{task_name}.success'.format(**locals()))
            timer.stop('{task_name}.success'.format(**locals()))
            return ret

    return instrument_wrapper
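
To give a sense of how this gets wired up, here’s a minimal usage sketch. The module name statsd_instrument, the STATSD_HOST/STATSD_PORT values, and the add task are hypothetical placeholders; I’m assuming a standard Celery app object rather than showing this project’s exact setup.

# Hypothetical usage sketch -- the app name, settings, and task below
# are placeholders, not part of the original code.
from celery import Celery

from statsd_instrument import instrument_task

celery = Celery('myapp')
celery.conf.STATSD_HOST = 'localhost'  # wherever your statsd daemon listens
celery.conf.STATSD_PORT = 8125         # statsd's default UDP port

@celery.task
@instrument_task
def add(x, y):
    """Toy task; emits celery.tasks.status.add.started/.success counters
    and a celery.tasks.duration.add.success timer."""
    return x + y

Calling add.delay(2, 2) from your application will then record the started count, the success count, and the task duration in graphite, with no per-metric configuration.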

We Have the Tools but What About the Techniques?

In my previous article “Concurrent Engineering: The Foundation of DevOps” I wrote that “just because you use puppet does not necessarily mean your organization is practicing DevOps.” I didn’t spend much time on it then, but I think it bears repeating and further explanation. The DevOps “movement” has seen, and will likely continue to see, a huge influx of new tools as organizations attempt to find ways to adopt DevOps. These tools have included (and certainly have not been limited to) tools that aid in monitoring (statsd), configuration management (puppet), and continuous delivery (hubot).

Operations engineers, software developers, and managers are in a mad dash to develop, utilize, and integrate these tools within their organizations. And that’s where we’re going wrong; we are focused on a single component of the Software/Systems Engineering Process. This process model contains three main components that are central to its existence: methodologies, techniques, and tools (Valacich 2009). While I don’t need to go into each one specifically, it’s clear that the tools are just a single factor in the overall process. Following the model further, it becomes clear that the makeup of each of these components influences the other components in the process.

Put simply, DevOps is a methodology and, as such, it’s natural that we’re seeing a huge response in tools. What I feel we’re missing, however, is more information about the different techniques used throughout organizations in their software and operations engineering processes. An excellent example of this is Scott Chacon’s explanation of how Github uses Git (and Github!) to deliver continuous improvement to their service. With that said, I would like to see more organizations refine their techniques and talk about these as much as they talk about their tools.

2011 Annual Review and New-Year Updates

Happy New Year, everyone.

I thought I’d ring in the new year with some site stats from 2011.

  • Only 10 posts published.
  • 59,238 pageviews.
  • 24,829 visitors (22,634 unique).
  • 1.32% bounce rate.
  • Multiple job and business opportunities in direct response to articles I wrote, including a new job (more details below).

I really want to write more. My resolution then is to “write more.” Using a more quantified approach, I will spend at least 30 minutes a day writing for at least five days a week. That doesn’t mean I will publish five articles a week. One of my big issues with writing is the amount of time that goes into each post. I approach my writing very academically and try to back up my ideas with citations; this research takes time. I also, very frequently, solicit feedback from other people before publishing. But I really enjoy writing and I really enjoy receiving feedback on my ideas through this blog so I would like to continue doing it.

Other resolutions of mine include physical health and professional development (I’d really like to give a talk at a conference this year).

I also have some really exciting news. Starting on Monday I will officially begin my employment with dotCloud on their Site Reliability Engineering team! This is exciting for two reasons.

First, working at dotCloud is going to be an awesome experience. Everyone I’ve talked to is incredibly smart. Our CEO, Solomon Hykes, was also just named to Forbes’ “30 Under 30” list. Finally, and probably most importantly, I’m going to absolutely love the work. I love solving problems, particularly in devops, and I love writing tools that make people’s lives easier (which is precisely what dotCloud does). If you’re looking for a Platform as a Service provider, try out dotCloud and let us know how you like it.

The second reason this is exciting is that, in the process of starting at a new company, I’ve managed to expand my personal consulting practice. I don’t think I’ve said so before, but I provide systems engineering services to Loud3r Inc. I’m their only “web ops” engineer, and we’ve managed to completely turn things around in the past 6 months; we’re providing our services better (more reliable, with more frequent updates), faster, and cheaper than before. Rather than cancel the contract entirely, the CEO of Loud3r and I felt it was a good idea for me to subcontract a large portion of my workload to a trusted colleague of mine.

How about you? Is there anything exciting you would like to share about the progress you made during 2011 or large changes you’re making in the start of 2012? Tell me all about it!

Concurrent Engineering: The Foundation of DevOps

DevOps is all about trying to avoid that epic failure and working smarter and more efficiently at the same time. It is a framework of ideas and principles designed to foster cooperation, learning and coordination between development and operational groups. In a DevOps environment, developers and sysadmins build relationships, processes, and tools that allow them to better interact and ultimately better service the customer (James Turnbull).

At the time of writing, if you were to search for “devops” you would find eight results attempting to explain what devops is, one result for a conference, and one rather satirical article (although not necessarily incorrect) where the author answers the question of “how do you implement devops” with “nobody seems to know” (Ted Dziuba).

The big problem with the DevOps “movement” is that we essentially have a bunch of operations and development people promoting it and trying to implement it within their organizations. Meanwhile, those with management and business responsibilities, even when the “what” is explained to them, don’t understand the “how.” Just because you use puppet does not necessarily mean your organization is practicing DevOps.

This shortcoming is the result of us devops proponents either falsely claiming these techniques and methodologies are new or not knowing any better. If we had something more relatable for the business people (and, by principle, we should be business-oriented, too) then I think DevOps would have more of a chance.

Well, get your product and management teams together, because the truth is that DevOps is actually a form of Concurrent Engineering.

Concurrent Engineering (CE) is a systematic approach to integrated product development that emphasizes the response to customer expectations. It embodies team values of co-operation, trust and sharing in such a manner that decision making is by consensus, involving all perspectives in parallel, from the beginning of the product life-cycle (ESA – Concurrent Engineering Facility).

Concurrent Engineering encompasses several major principles which just so happen to fit the definition (however formal or informal) of devops.

I’ll list them from the Synthesis Coalition here:

  • Get a strong commitment from senior management.
  • Establish unified project goals and a clear business mission.
  • Develop a detailed plan early in the process.
  • Continually review your progress and revise your plan.
  • Develop project leaders that have an overall vision of the project and goals.
  • Analyze your market and know your customers.
  • Suppress individualism and foster a team concept.
  • Establish and cultivate cross-functional integration and collaboration.
  • Break project into its natural phases.
  • Develop metrics.
  • Set milestones throughout the development process.
  • Collectively work on all parts of project.
  • Reduce costs and time to market.
  • Complete tasks in parallel.

By approaching devops as concurrent engineering and implementing it as such, you open the movement to a well-researched, well-documented, and well-accepted product design philosophy. Framing devops this way also enables those of us pushing the movement to finally put it into a more business-oriented perspective.

Stop Letting Technical People Get Away With Social Ineptitude

As a technical person who has worked many customer-facing support roles, I’m offended by the often-cited notion that technical people have poor people skills or are poor at filling customer support roles. Earlier this week, a web host ignited a public relations nightmare when their “Technical Director” responded childishly to some customers, disabled their accounts, and deleted their backups. The company’s response was to create a new customer support position so they could keep their Technical Director yet isolate him from their customers. In the new hire’s recap of the situation, he wrote:

Jules is the Technical Director. Jules does a fantastic job looking after things and keeping the infrastructure running well. Unfortunately though, as is often the case with very technically minded people, customer service is not always his strong point.

As I stated above, I find this absolutely offensive. Unfortunately, it’s often believed to be true because there are so many engineers and technical personnel with poor people skills. I have never seen statistics showing the ratio is higher in these professions than in others, yet employers continue to treat poor people skills as acceptable in these roles. This needs to stop.

I’m of the opinion that every position is customer-facing. Sure, some positions might not interface with the company’s customers, but the position is likely to have customers of its own – whether internal or external. All it takes for someone to be good at customer service is:

  • The understanding that they represent their employer and that people are relying on them,
  • A compassionate, helpful, and courteous attitude,
  • The knowledge of whatever they’re supporting.

Frankly, if you don’t have those qualities, how can you work with anyone at all?

Problems at Scale

Over on HackerNews, saturn wrote:

Cloud computing scales the efficiencies, yes. It also scales the problems.

This is exactly right. Problems in simple architectures are relatively easy to solve. In fact, I’d go as far as to say that we’ve probably solved them in all of the traditional archetypes, both in theory and in practice.

On the other hand, complex architectures lead to exponentially more difficult problems. There are probably lots of problems in these various complex architectures that we don’t even know exist yet. And then there are those problems that we do know about that we think will only occur in very rare (or even “impossible”) circumstances so they get considerably less attention devoted to them.

Those of us who have careers, jobs, and hobbies in an engineering discipline need to remember this when we make decisions about the design of a new or existing system. Just because we can’t see the underlying platform, because it’s been abstracted away from us, doesn’t mean it doesn’t exist. For example, much of the recent AWS downtime was caused in part by design flaws in the Elastic Block Store system. If you think you should be hosted on the cloud, use it, but take the time to understand the systems under the hood.

Amazon’s Relational Database Service (RDS) – the Black Box From Hell

One morning I woke up early and checked my email. My plan was to check that my inbox was empty for some peace of mind and then go back to bed for a few more hours (I love Sundays). But that isn’t what happened. Instead, upon opening my inbox I was alerted that one of a client’s database servers was offline. I snapped out of my haze and immediately got to work.

This particular database server was an RDS instance. RDS, or Relational Database Service, is an Amazon-provided MySQL (or Oracle) server that runs on top of the EC2 platform. The advantages of this service are that backups are performed automatically (complete with point-in-time recovery), snapshots are supported, instances can be resized with more or less RAM/CPU/storage through the AWS console, and a whole bunch of other stuff (“maintenance”) is supposed to be performed for you automatically.

The disadvantages don’t make themselves apparent until you need to debug or troubleshoot a performance or availability issue. While CloudWatch metrics are included as part of the RDS package, knowing how much CPU, RAM, or storage space you’re using is only a very small part of knowing what your database instance is actually doing.

Prior to attempting recovery, the first thing I did was check the CloudWatch metrics. CloudWatch seems to have trouble reporting its data when the system is under duress: there were periods where there was data and periods where there wasn’t. The next thing I did was check the RDS event logs. Don’t get excited: the RDS event log is not a UI wrapped around system logs, it’s just a couple of entries here and there on whatever Amazon RDS decides to publish. The last entry in the event log was a backup job that had started several hours before and never finished. These typically take only one to two minutes on this instance, so I knew something was wrong.
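
As an aside, you don’t have to click around the console for this: a short script can pull the same CloudWatch metrics and RDS events. Here’s a rough sketch using boto3; the mydb instance identifier and the region are placeholders, not this client’s actual setup.

# Rough sketch using boto3; 'mydb' and the region are placeholders.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
rds = boto3.client('rds', region_name='us-east-1')

# Average CPU for the instance over the last hour, in 5-minute buckets
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'mydb'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Average'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])

# The same sparse event log the console shows, for the last 24 hours
events = rds.describe_events(
    SourceIdentifier='mydb',
    SourceType='db-instance',
    Duration=60 * 24,  # minutes
)
for event in events['Events']:
    print(event['Date'], event['Message'])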

I didn’t want to waste time trying to troubleshoot while the database was down, so I moved immediately to recovery and rebooted the instance through the AWS console. It’s like Charles McPhail says: “Respond, Restore, Resolve.” After 20 to 30 minutes the database server began accepting connections again, but the instance was never taken out of the “REBOOTING” state when it should have transitioned to “STARTED”. With the instance stuck in the “REBOOTING” state, my only option was to recover from a previous backup, as the rest of the functionality is disabled unless the instance is in a “STARTED” state.

To make matters worse, the various components in our infrastructure kept connecting to this database server, making it impossible to find out what was going on. The max connection limit was reached and I was no longer able to log in to view the process list or analyze the status variables.

At this point, I decided my only course of action was to spin up a new instance from a previous backup. I made this request through the AWS console and, two to three hours later, my new instance was finally up and running. About half an hour prior to this, the old instance was transitioned into a “FAILED” state and shut down. When your instance is in the “FAILED” state, you cannot restart it; your only option is to restore from backup. In my case, it took several hours for AWS to declare the instance failed and several more hours to restore the backup. I did not know the “FAILED” state was even possible and had no idea that AWS could just kill an instance like that. To top it all off, Amazon sent a very nice email to the owner of the account (my client, the CEO) explaining that we’d been using an unsupported storage engine all this time.

As it turns out, I missed the note in the RDS User Guide that says MyISAM is not supported, particularly when it comes to data recovery. While I understand why RDS made this decision (MyISAM corrupts easily and is sometimes hard to repair), I felt misled and uninformed about storage engine support. Yes, the note is in the RDS User Guide; however, it is not mentioned anywhere on the main page about RDS, nor is it in the RDS FAQs (where the string “MyISAM” appears only once).

A few weeks have gone by and we have taken steps to avoid and reduce the damage from these types of outages in the future. However, we still occasionally receive an alert where an RDS instance stops accepting connections for one to two minutes at a time and all the event log has to say is that the instance has been “recovered.” Recovered from what exactly? What did you do to it? Why does this keep happening? How do we make it stop?

In summary, I’ll probably never know because on RDS you do not have access to the underlying OS. This means:

  • You do not have access to the OS process list
  • You do not have access to things like top, htop, iostat, or dstat
  • You do not have access to the process list if the MySQL process isn’t accepting connections
  • You do not have access to any system logs

If you just need a quick and dirty MySQL server and you almost never want to worry about the status of your backups, go ahead and use RDS. However, if you’re concerned about reliability (that you control), being able to effectively troubleshoot problems, and knowing the state of your underlying OS, RDS is not right for you.