Charles Hooper

Thoughts and projects from a Hacker and Operations Engineer

Intro to Operations: Metrics Collection

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

Collecting metrics is another area that many early stage startups seem to overlook even though it is probably one of the most important things they can do. By metrics collection, I am referring to the gathering and storing of various metrics at several different levels. As John Allspaw identifies them in Web Operations: Keeping the Data on Time, they are:

  • High-level business and application metrics (e.g. user sign-ups)
  • Feature-specific application-level metrics (e.g. widgets processed)
  • Systems and service-level metrics (e.g. server load or database queries per second)

You’ll note that there are two levels of “application-level” metrics. The higher-level application metrics are mostly those that can be tied to business objectives, while the other category of application metrics are generally more feature specific.

The benefits of collecting these metrics are plentiful. For one, having quick access to these metrics is helpful during troubleshooting and incident response. For example, I was once hired under contract to look into why a certain company’s API had been unreliable for the previous few months. At least once per day, this company’s API would time out and not respond to client requests. After I enabled basic metrics collection for the servers and services used by the API, it quickly became obvious that the database servers were reaching their connection limits, which was preventing the API from retrieving records from the database. Not only was this problem identified very quickly, but later on we were able to look back at our metrics data to assess how close to our limits we were getting.

Another benefit is that you can integrate the metrics into your Availability monitoring system to be alerted when metrics surpass some threshold or change significantly. Not only that, but analyzing these metrics will allow you to manage your capacity intelligently and build a business case to justify infrastructure expenditures. Finally, analyzing these metrics will also give you insight into your application, how it’s used, and your business.
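As a rough illustration of alerting on a metric that "surpasses some threshold or changes significantly," here is a minimal sketch. The function name `check_metric`, the 50% default for what counts as a significant change, and the sample values are all assumptions for the example; a real monitoring system will have its own way of expressing these rules.

```python
from statistics import mean


def check_metric(samples, threshold, max_relative_change=0.5):
    """Return a list of alert reasons for the newest sample (empty if healthy).

    samples: numeric readings, oldest first (e.g. open database connections).
    threshold: absolute value the newest reading should not exceed.
    max_relative_change: allowed deviation from the average of the earlier
        readings, as a fraction (0.5 means 50%). This default is an assumption.
    """
    if not samples:
        return []

    latest = samples[-1]
    history = samples[:-1]
    reasons = []

    # Rule 1: the metric surpassed a fixed threshold.
    if latest > threshold:
        reasons.append("threshold exceeded: %s > %s" % (latest, threshold))

    # Rule 2: the metric changed significantly compared to its recent average.
    if history:
        baseline = mean(history)
        if baseline and abs(latest - baseline) / abs(baseline) > max_relative_change:
            reasons.append(
                "significant change: %s vs. recent average %s" % (latest, baseline)
            )

    return reasons


# Hypothetical readings: the last one both crosses the threshold and jumps.
alerts = check_metric([100, 102, 98, 180], threshold=150)
```

The returned reasons could then be handed to whatever availability monitoring system you use to decide whether someone should be paged.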

How you go about collecting and storing these metrics is up to you. Many engineers might be tempted to build their own solution; however, there are many open source and third party software packages that you may find helpful. Key considerations when choosing which package or packages to use are:

  • The ability to add new, custom metrics
  • Configurable resolution/storage trade-off
  • Integration with availability monitoring and alerting systems
  • Graphing/visualization

If your startup doesn’t have any metrics then you should start collecting them now. The visualization will help you in the short run and the historical data will help you in the long run.

Intro to Operations: Availability Monitoring and Alerting

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

Another area I’ve seen a lot of early stage startups lacking in is availability monitoring and alerting. The essence of availability monitoring and alerting is being notified when your service is not working as expected, including when it’s simply down, isn’t meeting formal or informal SLAs (e.g., it’s too slow), or certain functionality is broken.

What I typically see is that some effort was made to set up this type of monitoring at some point, but it was never maintained. Symptoms include poor monitoring coverage (servers missing from the config, service monitoring nearly non-existent), large numbers of false positives and false negatives, inactionable alerts, and alerts that go ignored because of the previous issues.

Symptoms on the business side include not knowing when your service is down and finding out from your customers that your service is broken. Finding out from your customers that your service is down is not only embarrassing, but it also shakes their confidence in you, affects your reputation, and may even lead to lost revenue.

The good news is that it doesn’t have to be this way. When availability monitoring is set up properly, maintained, and you and your employees agree to approach alerts a specific way, you will be able to reap a variety of benefits. Here’s what I recommend:

  1. First, collaborate with your employees to define who is in the pager rotation and the escalation policies. Ask yourself: What happens when the on call engineer is overwhelmed and needs backup? What happens when the engineer goes on vacation?

  2. Next, take inventory of what services you rely on and define an internal SLA for them. This does not have to be a super formal process, but this inventory and SLA will be helpful for deciding what thresholds to set in your monitoring to avoid false positives. Try to see the big picture and think about everything such as:

    • Servers,
    • Self-managed supporting services like web servers, databases, email services,
    • Application functionality and features - one strategy I like is exposing a “health check” service that can be checked by the monitoring agent,
    • Third party services like remote APIs.

    Your inventory and SLA definition is a living document; remember to keep it up to date!

  3. Then set up whatever monitoring package you or your engineers decided to use (self-hosted or third party) such as nagios, Zenoss, Pingdom, or CopperEgg and have your monitoring configured for those services. If you’re really good, you’ll check your configuration into its own source control repository. If you go the self-hosted route, it may also be worth having your monitoring server monitored externally. Who’s watching the watcher indeed.

  4. Think about integrating your monitoring with a pager service such as PagerDuty. Services like PagerDuty allow you to input your pager rotation and then define good rules for how to contact the on call engineer and when to escalate should the engineer be unavailable.

  5. With improved monitoring and alerting in place, you may want to think about giving certain customers “911” access. At a previous company I worked at, we had a secret email address our big customers could hit which would open a support ticket and then page the on call engineer with the ticket number. If you decide to go this route, however, you’ll want to train your customers on when it’s appropriate to use this power and how to use it most effectively.

  6. Adjust alerts and fix problems as you get paged for them. Don’t care that a particular API goes down during a known maintenance window? Schedule the notification policy accordingly.

  7. Finally, continue maintaining your inventory and your monitoring service’s configuration. For extra benefit, consider tracking your organization’s Mean Time To Respond (how long it took the engineer to acknowledge that something was wrong), Mean Time To Recover (how long it took the engineer to resolve the issue, including the time to respond), Mean Time Between Failures (self-explanatory, I hope), and Percent Availability (the percentage of time your service is functional in a given period).
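If you do track these numbers, the arithmetic is simple. Below is a minimal sketch, assuming you can export a list of incidents with three timestamps each. The `Incident` class, its field names, and the sample incidents are hypothetical. Mean Time Between Failures is computed here as total uptime divided by the number of failures, which is one common definition and an assumption on my part.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage (hypothetical record; field names are illustrative)."""
    started: datetime       # when the service broke
    acknowledged: datetime  # when the on call engineer acknowledged the page
    resolved: datetime      # when the service was working again


def mean_timedelta(deltas):
    """Average of a non-empty iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def availability_report(incidents, period_start, period_end):
    """Compute the four metrics for a period containing at least one incident."""
    period = period_end - period_start
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = period - downtime
    return {
        # Time from failure to acknowledgement.
        "mean_time_to_respond": mean_timedelta(
            i.acknowledged - i.started for i in incidents),
        # Time from failure to resolution (includes the time to respond).
        "mean_time_to_recover": mean_timedelta(
            i.resolved - i.started for i in incidents),
        # Assumption: uptime divided by the number of failures.
        "mean_time_between_failures": uptime / len(incidents),
        "percent_availability": 100.0 * (uptime / period),
    }


# Hypothetical 30-day period with two incidents.
incidents = [
    Incident(datetime(2013, 1, 5, 10, 0), datetime(2013, 1, 5, 10, 5),
             datetime(2013, 1, 5, 10, 30)),
    Incident(datetime(2013, 1, 20, 2, 0), datetime(2013, 1, 20, 2, 15),
             datetime(2013, 1, 20, 3, 30)),
]
report = availability_report(incidents, datetime(2013, 1, 1), datetime(2013, 1, 31))
```

With these sample incidents the report works out to a 10 minute Mean Time To Respond, a 1 hour Mean Time To Recover, and roughly 99.72% availability for the period.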

This concludes the management and non-ops introduction to operations; I hope you find this helpful.

Intro to Operations: Configuration Management

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

One of the areas I’ve witnessed early stage startups lacking in is configuration management. Configuration management is the process of standardizing and enforcing configurations. In other words, configuration management is about deciding on a specific configuration of services for various roles and then applying these configurations in practice. Typically, these configurations are written in a domain-specific language that is specific to the configuration management software being used, such as puppet, chef, cfengine, or salt stack.

There are many benefits to configuration management. For one, configuration management allows developers to spend more time working on the product and less time deploying new services. This is because configuration is now automated and faster as a result. In addition, environments are standardized and therefore less time is spent troubleshooting or diagnosing edge cases in different environments. Finally, when coupled with source control management, the proper use of configuration management can be used to track and audit what has changed over time and who changed it.

In many of these early stage startups, there is either very little configuration management performed at all, or configuration management exists as a series of shell scripts cobbled together to do some post-hardware setup. If you’re lucky, there exists a document somewhere that describes when and how to run these scripts to deploy new services.

The way configuration management works is that engineers create a collection of files that define how the system should be configured. This collection of files is typically called a manifest. Then, once physical or virtual hardware has been provisioned, one of these manifests is applied to the new host. During application, the configuration management software will interpret the new configuration, install software packages, manage users and credentials, alter config files, manage file permissions, run arbitrary commands, and so on. Once the manifest is fully applied, the new host should be fully configured and ready to be used! In some environments, however, there may be a post-host-provisioning step where additional work is performed afterwards, such as checking out application code from a source control repository.
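To give a feel for what "applying a manifest" means, here is a toy sketch in Python. It is not how puppet, chef, cfengine, or salt stack are implemented, and the `ManagedFile` class and `apply_manifest` function are hypothetical names. It only illustrates the core idea: you declare the desired state, and the tool changes the host only where the actual state differs.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class ManagedFile:
    """Desired state of one config file (toy resource, illustrative only)."""
    path: str
    content: str
    mode: int = 0o644

    def apply(self):
        """Bring the file to the desired state; return True if it changed."""
        changed = False

        current = None
        if os.path.exists(self.path):
            with open(self.path) as f:
                current = f.read()

        # Only rewrite the file when its content differs from the manifest.
        if current != self.content:
            os.makedirs(os.path.dirname(self.path) or ".", exist_ok=True)
            with open(self.path, "w") as f:
                f.write(self.content)
            changed = True

        # Only touch permissions when they differ from the manifest.
        if (os.stat(self.path).st_mode & 0o777) != self.mode:
            os.chmod(self.path, self.mode)
            changed = True

        return changed


def apply_manifest(resources):
    """Apply every resource and return the paths that had to be changed."""
    return [r.path for r in resources if r.apply()]


# Usage against a throwaway directory instead of a real host.
root = tempfile.mkdtemp()
manifest = [
    ManagedFile(os.path.join(root, "etc", "motd"), "Welcome to web01\n"),
    ManagedFile(os.path.join(root, "etc", "app.conf"), "workers = 4\n", mode=0o600),
]
first_run = apply_manifest(manifest)   # both files are created
second_run = apply_manifest(manifest)  # nothing to do, so no changes
```

The second run changing nothing is the important property: a manifest can be applied repeatedly, and it only acts when a host has drifted from the configuration you decided on.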

If you’re not using configuration management already then you should start now because, frankly, it’s never too early. Starting configuration management now will not only save your first hired ops/systems engineer from working backwards to write these manifests later, but will also bring benefits (such as your developers spending less time away from shipping value-added code) that will outweigh the initial learning curve.

New Platform - Sorry for the RSS Spam

Sorry for the RSS spam, everyone. I switched blogging platforms recently from Wordpress to Octopress. When I updated my feed URL in Feedburner, however, I’m told it re-posted all of my previous blog posts in multiple RSS readers as unread. This resulted in a ton of annoying spam for a lot of people, particularly those using Feedly and Google Reader.

Intro to Operations for Early Stage Startups

I’ve spent the last few years in a variety of roles for early stage tech startups. While in these roles, I’ve noticed a pattern: Early stage startups don’t give much thought to their operations. In particular, they typically don’t hire anyone specifically for that role because they are focused on building their product. In other words, all of their technical hires are for developers.

What tends to happen in my experience is that their developers soon become overwhelmed (especially after a growth spurt) and are unable to spend their time shipping code that’s going to improve their product or make their company money. Eventually, if they’re lucky, management catches onto this and hires their first systems or operations engineer.

Because I’ve had the opportunity to be the first-hired systems engineer, what I’ve experienced is the effect of “working backwards” to undo a bunch of things that weren’t done following best practices while simultaneously moving things forward to improve them.

I decided to try to educate whoever would be willing to read this (hopefully early stage startups!) about some best practices that will not only save their future operations engineers some headache, but will also improve their business. Part of this education will happen in the form of one-on-one time with these startups. For example, I spent the last couple of days sitting in on office hours at a startup accelerator. The other part, however, will take place by writing “Intro to…” articles and publishing them to a variety of places, including this blog.

Specifically, the topics I’ve chosen to write about are:

Over the next week or so, I’ll write about each one of these topics and post them to this blog. I hope people find them helpful!

2012 Annual Review

Here are some stats from 2012:

  • Only 3 posts published.
  • 50,791 pageviews.
  • 36,176 visitors (32,007 unique)
  • 71.97% bounce rate.
  • Same great position at dotCloud!

Interestingly, my stats compared to last year aren’t too different, despite the fact that I only wrote 3 blog posts this year (instead of 10 last year). What’s also interesting is the HUGE spike in bounce rate (which used to be almost non-existent). This spike begins right around the time I hit a “home run” in terms of driving new traffic.

Going forward, I am going to try to post more (again). I’ve said this before but have yet to succeed. This time around, I changed my posting rules to allow me to write about topics more personal to me or more opinionated in nature.

An Early Year-end Review – Highlights

Since I’ve only written 3 blog posts so far this year and a lot has changed, here’s a brief summary of what I’ll be writing about in my annual review:

  • I stopped doing contract work and started working for dotCloud part time,
  • As of tomorrow, I’ll officially have completed the requirements for my Bachelor’s degree,
  • I found myself, politically (I won’t be writing about this, but it is a big deal to me).

What does the future hold for me?

  • dotCloud offered me a full time position and I accepted; I will start January 2nd,
  • I have a variety of side projects planned but it’s too soon to tell how much of my free time I’ll want to spend on them,
  • I hope to become more involved with my local community and the ACM,
  • I’ve made the decision to stop writing so much on Facebook and Twitter and write more here.

That last item has big implications for this blog. It means that I’ll be writing about more personal subjects and opinions I have instead of wasting that effort on Twitter and Facebook (where relatives I haven’t seen in 20 years take the opportunity to rant on my wall). Some of these opinions may also be controversial, and that’s OK; we’re meant to disagree on some things. Some of them may even be ill-formed as they’ll be outside of my area of expertise. That’s OK too; just write me and educate me.

Painless Instrumentation of Celery Tasks Using Statsd and Graphite

For one of my clients and side projects, we’ve been working hard to build application-level metrics into our wide portfolio of services. Among these services is one built on top of the Celery distributed task queue. We wanted a system that required as little configuration as possible to publish new metrics. For this reason, we decided on using statsd and graphite. Getting statsd and graphite running was the easy part, but we needed a quick, painless way of adding the instrumentation code for the most basic metrics to our Celery-backed service.

For us, those basic metrics consisted of:

  • Number of times a worker starts on a specific task
  • Number of times a task raises an exception
  • Number of times a task completes successfully (no exceptions)
  • How long each task takes to complete

Since the code to enable these metrics just wraps the code being instrumented, it seemed only natural to use a decorator. Below is the code I wrote to do just that.

statsd_instrument.py
"""Decorator to quickly add statsd (graphite) instrumentation to Celery
task functions.

With some slight modification, this could be used to instrument just
about any (non-celery) function and be made abstract enough to customize
metric names, etc.

Stats reported include number of times the task was accepted by a worker
(`started`), the number of successes, and the number of times the task
raised an exception. In addition, it also reports how long the task took
to complete. Usage:

>>> @task
... @instrument_task
... def mytask():
...     # do stuff
...     pass

Please note that the order of decorators is important to Celery. See
http://ask.github.com/celery/userguide/tasks.html#decorating-tasks
for more information.

Uses `simple_decorator` from
http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

Limitation: Does not readily work on subclasses of celery.tasks.Task
because it always reports `task_name` as 'run'
"""

# statsd instrumentation
from celery import current_app
import statsd


# NOTE: simple_decorator must be defined before instrument_task, because
# the @simple_decorator line below is evaluated at import time.
def simple_decorator(decorator):
    """Borrowed from:
    http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

    Original docstring:
    This decorator can be used to turn simple functions
    into well-behaved decorators, so long as the decorators
    are fairly simple. If a decorator expects a function and
    returns a function (no descriptors), and if it doesn't
    modify function attributes or docstring, then it is
    eligible to use this. Simply apply @simple_decorator to
    your decorator and it will automatically preserve the
    docstring and function attributes of functions to which
    it is applied."""
    def new_decorator(f):
        g = decorator(f)
        g.__name__ = f.__name__
        g.__module__ = f.__module__ # or celery throws a fit
        g.__doc__ = f.__doc__
        g.__dict__.update(f.__dict__)
        return g
    # Now a few lines needed to make simple_decorator itself
    # be a well-behaved decorator.
    new_decorator.__name__ = decorator.__name__
    new_decorator.__doc__ = decorator.__doc__
    new_decorator.__dict__.update(decorator.__dict__)
    return new_decorator


@simple_decorator
def instrument_task(func):
    """Wraps a celery task with statsd instrumentation code"""

    def instrument_wrapper(*args, **kwargs):
        # Connection settings come from the Celery app configuration.
        stats_conn = statsd.connection.Connection(
            host=current_app.conf['STATSD_HOST'],
            port=current_app.conf['STATSD_PORT'],
            sample_rate=1)

        task_name = func.__name__

        # Count every time a worker starts on this task.
        counter = statsd.counter.Counter('celery.tasks.status', stats_conn)
        counter.increment('{0}.started'.format(task_name))

        timer = statsd.timer.Timer('celery.tasks.duration', stats_conn)
        timer.start()

        try:
            ret = func(*args, **kwargs)
        except Exception:
            # Count the failure, then let Celery see the original exception.
            counter.increment('{0}.exceptions'.format(task_name))
            raise
        else:
            # Only successful runs report a duration.
            counter.increment('{0}.success'.format(task_name))
            timer.stop('{0}.success'.format(task_name))
            return ret
        finally:
            # Drop references to the statsd objects once the task is done.
            del timer
            del counter
            del stats_conn

    return instrument_wrapper
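
The module docstring mentions that, with slight modification, the same idea could instrument non-Celery functions. Here is a minimal sketch of what that might look like. It is not the code used in the project above. It swaps the statsd client library for raw UDP packets in statsd's plain-text format (`name:1|c` for counters, `name:<ms>|ms` for timers) and swaps `simple_decorator` for the standard library's `functools.wraps`. The `instrument` name, the metric prefix, and the default host and port are assumptions.

```python
import functools
import socket
import time


def instrument(prefix, host="127.0.0.1", port=8125):
    """Return a decorator that reports basic statsd metrics for a function.

    Assumptions: a statsd daemon listens for UDP on host:port, and metric
    names follow the pattern <prefix>.<function name>.<event>.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(line):
        try:
            sock.sendto(line.encode("ascii"), (host, port))
        except OSError:
            pass  # metrics should never break the instrumented function

    def decorator(func):
        name = "%s.%s" % (prefix, func.__name__)

        @functools.wraps(func)  # preserves __name__, __module__, __doc__
        def wrapper(*args, **kwargs):
            send(name + ".started:1|c")
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
            except Exception:
                send(name + ".exceptions:1|c")
                raise
            elapsed_ms = int((time.monotonic() - start) * 1000)
            send(name + ".success:1|c")
            send("%s.duration:%d|ms" % (name, elapsed_ms))
            return result

        return wrapper

    return decorator
```

Usage would look like `@instrument("myapp.jobs")` placed above any function definition. Because UDP is fire-and-forget, the decorated function keeps working even when no statsd daemon is listening.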

We Have the Tools but What About the Techniques?

In my previously-written article “Concurrent Engineering: The Foundation of DevOps” I wrote “just because you use puppet does not necessarily mean your organization is practicing DevOps.” I didn’t spend much time on it then, but I think it bears repeating and further explanation. The DevOps “movement” has seen, and will likely continue to see, a huge influx of new tools as organizations attempt to find ways to adopt DevOps within their organizations. These tools have included (and certainly have not been limited to) tools that aid in monitoring (statsd), configuration management (puppet), and continuous delivery (hubot).

Operations engineers, software developers, and managers are in a mad dash to develop, utilize, and integrate these tools within their organizations. And that’s where we’re going wrong; we are focused on a single component of the Software/Systems Engineering Process. This process model contains three main components that are central to its existence: methodologies, techniques, and tools (Valacich 2009). While I don’t need to go into each one specifically, it’s clear that the tools are just a single factor in the overall process. Following the model further, it becomes clear that the makeup of each of these components influences the other components in the process.

Put simply, DevOps is a methodology and, as such, it’s natural that we’re seeing a huge response in tools. What I feel we’re missing, however, is more information about the different techniques used throughout organizations in their software and operations engineering processes. An excellent example of this is Scott Chacon’s explanation of how Github uses Git (and Github!) to deliver continuous improvement to their service. With that said, I would like to see more organizations refine their techniques and talk about these as much as they talk about their tools.

2011 Annual Review and New-Year Updates

Happy New Year, everyone.

I thought I’d ring in the new year with some site stats from 2011.

  • Only 10 posts published.
  • 59,238 pageviews.
  • 24,829 visitors (22,634 unique)
  • 1.32% bounce rate.
  • Multiple job and business opportunities in direct response to articles I wrote including a new job (more details below).

I really want to write more. My resolution then is to “write more.” Using a more quantified approach, I will spend at least 30 minutes a day writing for at least five days a week. That doesn’t mean I will publish five articles a week. One of my big issues with writing is the amount of time that goes into each post. I approach my writing very academically and try to back up my ideas with citations; this research takes time. I also, very frequently, solicit feedback from other people before publishing. But I really enjoy writing and I really enjoy receiving feedback on my ideas through this blog so I would like to continue doing it.

Other resolutions of mine include physical health and professional development (I’d really like to give a talk at a conference this year).

I also have some really exciting news. Starting on Monday I will officially begin my employment with dotCloud on their Site Reliability Engineering team! This is exciting for two reasons.

First, working at dotCloud is going to be an awesome experience. Everyone I’ve talked to is incredibly smart. Our CEO, Solomon Hykes, was also just named on Forbes’ “30 Under 30” list. Finally, and probably most importantly, I’m going to absolutely love the work. I love solving problems, particularly in devops, and I love writing tools that make people’s lives easier (which is precisely what dotCloud does). If you’re looking for a Platform as a Service provider, try out dotCloud and let us know how you like it.

The second reason this is exciting is that, in the process of starting at a new company, I’ve managed to expand my personal consulting practice. I don’t think I’ve said so before, but I provide systems engineering services to Loud3r Inc. I’m their only “web ops” guy and we’ve managed to completely turn things around in the past 6 months; we’re providing our services better (more reliable, more frequent updates), faster, and cheaper than before. Rather than cancel the contract entirely, the CEO of Loud3r and I felt it was a good idea for me to subcontract a large portion of my workload to a trusted colleague of mine.

How about you? Is there anything exciting you would like to share about the progress you made during 2011 or large changes you’re making in the start of 2012? Tell me all about it!