Charles Hooper

Thoughts and projects from an infrastructure engineer

Intro to Operations: Availability Monitoring and Alerting

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

Another area where I’ve seen a lot of early-stage startups fall short is availability monitoring and alerting. The essence of availability monitoring and alerting is being notified when your service is not working as expected, including when it’s simply down, isn’t meeting formal or informal SLAs (e.g., it’s too slow), or certain functionality is broken.

What I typically see is that some effort was made to set up this type of monitoring at some point, but it was never maintained. Symptoms include poor monitoring coverage (servers missing from the config, service-level monitoring nearly non-existent), large numbers of false positives and negatives, non-actionable alerts, and alerts that get ignored because of the previous issues.

Symptoms for the business include not knowing when your service is down and finding out that your service is broken from your customers. Finding out that your service is down from your customers is not only embarrassing, but it also shakes their confidence in you, affects your reputation, and may even lead to lost revenue.

The good news is that it doesn’t have to be this way. When availability monitoring is set up properly, maintained, and you and your employees agree to approach alerts a specific way, you will be able to reap a variety of benefits. Here’s what I recommend:

  1. First, collaborate with your employees to define who is in the pager rotation and the escalation policies. Ask yourself: What happens when the on call engineer is overwhelmed and needs backup? What happens when the engineer goes on vacation?

  2. Next, take inventory of what services you rely on and define an internal SLA for them. This does not have to be a super formal process, but this inventory and SLA will be helpful for deciding what thresholds to set in your monitoring to avoid false positives. Try to see the big picture and think about everything such as:

    • Servers,
    • Self-managed supporting services like web servers, databases, email services,
    • Application functionality and features - one strategy I like is exposing a “health check” service that can be checked by the monitoring agent,
    • Third party services like remote APIs.

Your inventory and SLA definition is a living document; remember to keep it up to date!
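The “health check” service mentioned in the inventory list above can be a simple endpoint that exercises each dependency and reports pass/fail. Here is a minimal sketch using only the Python standard library; the check functions (`check_database`, `check_queue`) are hypothetical placeholders for your real dependencies:

```python
import json

def check_database():
    # Replace with a real connectivity test, e.g. `SELECT 1` against your DB
    return True

def check_queue():
    # Replace with a real broker ping
    return True

CHECKS = {"database": check_database, "queue": check_queue}

def run_health_checks(checks=CHECKS):
    """Run each check; any exception or falsy result counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    return healthy, results

def health_check_app(environ, start_response):
    """WSGI endpoint: 200 if every check passes, 503 otherwise."""
    healthy, results = run_health_checks()
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = json.dumps(results).encode("utf-8")
    start_response(status, [("Content-Type", "application/json")])
    return [body]
```

Your monitoring agent then only needs to poll one URL and alert on anything other than a 200.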

  3. Then set up whatever monitoring package you or your engineers decide to use (self-hosted or third party), such as Nagios, Zenoss, Pingdom, or CopperEgg, and configure it to monitor the services in your inventory. If you’re really good, you’ll check your configuration into its own source control repository. If you go the self-hosted route, it may also be worth having your monitoring server monitored externally. Who’s watching the watcher indeed.
  4. Think about integrating your monitoring with a pager service such as PagerDuty. Services like PagerDuty allow you to input your pager rotation and then define rules for how to contact the on-call engineer and when to escalate should that engineer be unavailable.
  5. With improved monitoring and alerting in place, you may want to think about giving certain customers “911” access. At a previous company I worked at, we had a secret email address our big customers could hit which would open a support ticket and then page the on-call engineer with the ticket number. If you decide to go this route, however, you’ll want to train your customers on when it’s appropriate to use this power and how to use it most effectively.

  6. Adjust alerts and fix problems as you get paged for them. Don’t care that a particular API goes down during a known maintenance window? Schedule the notification policy accordingly.

  7. Finally, continue maintaining your inventory and your monitoring service’s configuration. For extra benefit, consider tracking your organization’s Mean Time To Respond (how long it took for an engineer to acknowledge that something is wrong), Mean Time To Recover (how long it took the engineer to resolve the issue, including the Mean Time To Respond), Mean Time Between Failures (self-explanatory, I hope), and Percent Availability (what percent of time your service is functional in a given period).
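All of these metrics fall out of a simple incident log. A sketch of the arithmetic, with made-up incident timestamps:

```python
from datetime import datetime, timedelta

# Each incident: when it started, when it was acknowledged, when it was resolved
incidents = [
    {"start": datetime(2013, 1, 1, 0, 0),
     "ack": datetime(2013, 1, 1, 0, 5),
     "resolved": datetime(2013, 1, 1, 1, 0)},
    {"start": datetime(2013, 1, 10, 12, 0),
     "ack": datetime(2013, 1, 10, 12, 15),
     "resolved": datetime(2013, 1, 10, 12, 45)},
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

# Mean Time To Respond: incident start -> acknowledgement
mttr_respond = mean([i["ack"] - i["start"] for i in incidents])

# Mean Time To Recover: incident start -> resolution (includes response time)
mttr_recover = mean([i["resolved"] - i["start"] for i in incidents])

# Mean Time Between Failures: gap between consecutive incident starts
mtbf = mean([b["start"] - a["start"] for a, b in zip(incidents, incidents[1:])])

# Percent Availability over a reporting period (here, a 31-day month)
period = timedelta(days=31)
downtime = sum((i["resolved"] - i["start"] for i in incidents), timedelta())
availability = 100 * (1 - downtime / period)
```

Even a spreadsheet version of this is enough to spot whether your response times are trending in the right direction.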

This concludes the management and non-ops introduction to operations; I hope you find this helpful.

Intro to Operations: Configuration Management

I’m writing a series of blog posts for managers and other people without an operations background in order to introduce certain best practices regarding Operations. For the rest of the blog posts, please visit the introductory Intro to Operations blog post!

One of the areas where I’ve seen early-stage startups fall short is configuration management. Configuration management is the process of standardizing and enforcing configurations. In other words, configuration management is about deciding on a specific configuration of services for various roles and then applying these configurations in practice. These configurations, typically called manifests, are written in a domain-specific language particular to the configuration management software being used, such as Puppet, Chef, CFEngine, or SaltStack.

There are many benefits to configuration management. For one, configuration management allows developers to spend more time working on the product and less time deploying new services. This is because configuration is now automated and faster as a result. In addition, environments are standardized and therefore less time is spent troubleshooting or diagnosing edge cases in different environments. Finally, when coupled with source control management, the proper use of configuration management can be used to track and audit what has changed over time and who changed it.

In many of these early stage startups, there is either very little configuration management performed at all, or configuration management exists as a series of shell scripts cobbled together to do some post-hardware setup. If you’re lucky, there exists a document somewhere that describes when and how to run these scripts to deploy new services.

The way configuration management works is that engineers create a collection of files that define how the system should be configured. This collection of files is typically called a manifest. Then, once physical or virtual hardware has been provisioned, one of these manifests is applied to the new host. During application, the configuration management software will interpret the new configuration, install software packages, manage users and credentials, alter config files, manage file permissions, run arbitrary commands, and so on. Once the manifest is fully applied, the new host should be fully configured and ready to be used! In some environments, however, there may be a post-provisioning step where additional work is performed, such as checking out application code from a source control repository.
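To make the idea concrete, here is a toy illustration of a manifest and its application. This is not the syntax of any real tool like Puppet or Chef; the resource names and apply logic are invented purely to show the declare-then-converge pattern:

```python
# Desired state for a host, declared up front
manifest = {
    "packages": ["nginx", "postgresql"],
    "users": ["deploy"],
    "files": {"/etc/motd": "Managed by configuration management\n"},
}

def apply_manifest(manifest, state):
    """Converge `state` (the host) toward the manifest, returning the
    actions taken. Only missing or divergent resources are touched."""
    actions = []
    for pkg in manifest["packages"]:
        if pkg not in state.setdefault("packages", set()):
            state["packages"].add(pkg)
            actions.append("install " + pkg)
    for user in manifest["users"]:
        if user not in state.setdefault("users", set()):
            state["users"].add(user)
            actions.append("create user " + user)
    for path, content in manifest["files"].items():
        if state.setdefault("files", {}).get(path) != content:
            state["files"][path] = content
            actions.append("write " + path)
    return actions
```

The key property this sketch demonstrates is idempotency: applying the same manifest a second time takes no actions, because the host already matches the desired state. Real configuration management tools are built around this guarantee.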

If you’re not using configuration management already, then you should start now because, frankly, it’s never too early. Starting configuration management now will not only save your first ops/systems engineer from having to work backwards to write these manifests later, but will also yield benefits (such as your developers spending more time shipping value-added code) that outweigh the initial learning curve.

Intro to Operations for Early Stage Startups

I’ve spent the last few years in a variety of roles for early stage tech startups. While in these roles, I’ve noticed a pattern: Early stage startups don’t give much thought to their operations. In particular, they typically don’t hire anyone specifically for that role because they are focused on building their product. In other words, all of their technical hires are for developers.

What tends to happen in my experience is that their developers soon become overwhelmed (especially after a growth spurt) and are unable to spend their time shipping code that’s going to improve their product or make their company money. Eventually, if they’re lucky, management catches onto this and hires their first systems or operations engineer.

Because I’ve had the opportunity to be the first systems engineer hired, what I’ve experienced is the effect of “working backwards” to undo a bunch of things that weren’t done following best practices while simultaneously moving things forward to improve them.

I decided to try to educate whoever would be willing to read this (hopefully early stage startups!) about some best practices that will not only save their future operations engineers some headache, but will also improve their business. Part of this education will happen in the form of one-on-one time with these startups. For example, I spent the last couple of days sitting in on office hours at a startup accelerator. The other part, however, will take place by writing “Intro to…” articles and publishing them to a variety of places, including this blog.

Specifically, the topics I’ve chosen to write about are:

  • Configuration management,
  • Availability monitoring and alerting.

Over the next week or so, I’ll write about each one of these topics and post them to this blog. I hope people find them helpful!

2012 Annual Review

Here are some stats from 2012:

  • Only 3 posts published.
  • 50,791 pageviews.
  • 36,176 visitors (32,007 unique)
  • 71.97% bounce rate.
  • Same great position at dotCloud!

Interestingly, my stats compared to last year aren’t too different, despite the fact that I only wrote 3 blog posts this year (instead of 10 last year). What’s also interesting is the HUGE spike in bounce rate (which used to be almost non-existent). This spike begins right around the time I hit a “home run” in terms of driving new traffic.

Going forward, I am going to try to post more (again). I’ve said this before but have yet to succeed. This time around, I changed my posting rules to allow me to write about topics more personal to me or more opinionated in nature.

An Early Year-end Review – Highlights

Since I’ve only written 3 blog posts so far this year and a lot has changed, here’s a brief summary of what I’ll be writing about in my annual review:

  • I stopped doing contract work and started working for dotCloud part time,
  • As of tomorrow, I’ll officially have completed the requirements for my Bachelor’s degree,
  • I found myself, politically (I won’t be writing about this, but it is a big deal to me).

What does the future hold for me?

  • dotCloud offered me a full time position and I accepted; I will start January 2nd,
  • I have a variety of side projects planned but it’s too soon to tell how much of my free time I’ll want to spend on them,
  • I hope to become more involved with my local community and the ACM,
  • I’ve made the decision to stop writing so much on Facebook and Twitter and write more here.

That last item has big implications for this blog. It means that I’ll be writing about more personal subjects and opinions I have instead of wasting that effort on Twitter and Facebook (where relatives I haven’t seen in 20 years take the opportunity to rant on my wall). Some of these opinions may also be controversial, and that’s OK; we’re meant to disagree on some things. Some of them may even be ill-formed, as they’ll be outside my area of expertise. That’s OK too; just write me and educate me.

Painless Instrumentation of Celery Tasks Using Statsd and Graphite

For one of my clients and side projects, we’ve been working hard to build in application-level metrics to our wide portfolio of services. Among these services is one built on top of the Celery distributed task queue. We wanted a system that required as little configuration as possible to publish new metrics. For this reason, we decided on using statsd and graphite. Getting statsd and graphite running was the easy part, but we needed a quick, painless way of adding the instrumentation code for the most basic metrics to our Celery-backed service.

For us, those basic metrics consisted of:

  • Number of times a worker starts on a specific task
  • Number of times a task raises an exception
  • Number of times a task completes successfully (no exceptions)
  • How long each task takes to complete

Since the code to enable these metrics just wraps the code being instrumented, it seemed only natural to use a decorator. Below is the code I wrote to do just that.

statsd_instrument.py
"""Decorator to quickly add statsd (graphite) instrumentation to Celery
task functions.

With some slight modification, this could be used to instrument just
about any (non-celery) function and be made abstract enough to customize
metric names, etc.

Stats reported include number of times the task was accepted by a worker
(`started`), the number of successes, and the number of times the task
raised an exception. In addition, it also reports how long the task took
to complete. Usage:

>>> @task
... @instrument_task
... def mytask():
...     # do stuff
...     pass

Please note that the order of decorators is important to Celery. See
http://ask.github.com/celery/userguide/tasks.html#decorating-tasks
for more information.

Uses `simple_decorator` from
http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

Limitation: Does not readily work on subclasses of celery.tasks.Task
because it always reports `task_name` as 'run'
"""

# statsd instrumentation
from celery import current_app
import statsd

# Note: simple_decorator must be defined before instrument_task, because
# it is used at definition time.
def simple_decorator(decorator):
    """Borrowed from:
    http://wiki.python.org/moin/PythonDecoratorLibrary#Property_Definition

    Original docstring:
    This decorator can be used to turn simple functions
    into well-behaved decorators, so long as the decorators
    are fairly simple. If a decorator expects a function and
    returns a function (no descriptors), and if it doesn't
    modify function attributes or docstring, then it is
    eligible to use this. Simply apply @simple_decorator to
    your decorator and it will automatically preserve the
    docstring and function attributes of functions to which
    it is applied."""
    def new_decorator(f):
        g = decorator(f)
        g.__name__ = f.__name__
        g.__module__ = f.__module__  # or celery throws a fit
        g.__doc__ = f.__doc__
        g.__dict__.update(f.__dict__)
        return g
    # Now a few lines needed to make simple_decorator itself
    # be a well-behaved decorator.
    new_decorator.__name__ = decorator.__name__
    new_decorator.__doc__ = decorator.__doc__
    new_decorator.__dict__.update(decorator.__dict__)
    return new_decorator

@simple_decorator
def instrument_task(func):
    """Wraps a celery task with statsd instrumentation code"""

    def instrument_wrapper(*args, **kwargs):
        stats_conn = statsd.connection.Connection(
            host=current_app.conf['STATSD_HOST'],
            port=current_app.conf['STATSD_PORT'],
            sample_rate=1)

        task_name = func.__name__

        counter = statsd.counter.Counter('celery.tasks.status', stats_conn)
        counter.increment('{task_name}.started'.format(**locals()))

        timer = statsd.timer.Timer('celery.tasks.duration', stats_conn)
        timer.start()

        try:
            ret = func(*args, **kwargs)
        except Exception:
            counter.increment('{task_name}.exceptions'.format(**locals()))
            raise
        else:
            counter.increment('{task_name}.success'.format(**locals()))
            timer.stop('{task_name}.success'.format(**locals()))
            return ret
        finally:
            try:
                del timer
                del counter
                del stats_conn
            except NameError:
                pass

    return instrument_wrapper

We Have the Tools but What About the Techniques?

In my previous article “Concurrent Engineering: The Foundation of DevOps,” I wrote “just because you use puppet does not necessarily mean your organization is practicing DevOps.” I didn’t spend much time on it then, but I think it bears repeating and further explanation. The DevOps “movement” has seen, and will likely continue to see, a huge influx of new tools as organizations attempt to find ways to adopt DevOps within their organizations. These tools have included (and certainly have not been limited to) tools that aid in monitoring (statsd), configuration management (puppet), and continuous delivery (hubot).

Operations engineers, software developers, and managers are in a mad dash to develop, utilize, and integrate these tools within their organizations. And that’s where we’re going wrong; we are focused on a single component of the Software/Systems Engineering Process. This process model contains three main components that are central to its existence: methodologies, techniques, and tools (Valacich 2009). While I don’t need to go into each one specifically, it’s clear that the tools are just a single factor in the overall process. Following the model further, it becomes clear that the makeup of each of these components influences the other components in the process.

Put simply, DevOps is a methodology and, as such, it’s natural that we’re seeing a huge response in tools. What I feel we’re missing, however, is more information about the different techniques used throughout organizations in their software and operations engineering processes. An excellent example of this is Scott Chacon’s explanation of how Github uses Git (and Github!) to deliver continuous improvement to their service. With that said, I would like to see more organizations refine their techniques and talk about these as much as they talk about their tools.

2011 Annual Review and New-Year Updates

Happy New Year, everyone.

I thought I’d ring in the new year with some site stats from 2011.

  • Only 10 posts published.
  • 59,238 pageviews.
  • 24,829 visitors (22,634 unique)
  • 1.32% bounce rate.
  • Multiple job and business opportunities in direct response to articles I wrote including a new job (more details below).

I really want to write more. My resolution then is to “write more.” Using a more quantified approach, I will spend at least 30 minutes a day writing for at least five days a week. That doesn’t mean I will publish five articles a week. One of my big issues with writing is the amount of time that goes into each post. I approach my writing very academically and try to back up my ideas with citations; this research takes time. I also, very frequently, solicit feedback from other people before publishing. But I really enjoy writing and I really enjoy receiving feedback on my ideas through this blog so I would like to continue doing it.

Other resolutions of mine include physical health and professional development (I’d really like to give a talk at a conference this year).

I also have some really exciting news. Starting on Monday I will officially begin my employment with dotCloud on their Site Reliability Engineering team! This is exciting for two reasons.

First, working at dotCloud is going to be an awesome experience. Everyone I’ve talked to is incredibly smart, and our CEO, Solomon Hykes, was also just named on Forbes’ “30 Under 30” list. Most importantly, though, I’m going to absolutely love the work. I love solving problems, particularly in devops, and I love writing tools that make people’s lives easier (which is precisely what dotCloud does). If you’re looking for a Platform as a Service provider, try out dotCloud and let us know how you like it.

The second reason this is exciting is that, in the process of starting at a new company, I’ve managed to expand my personal consulting practice. I don’t think I’ve said so before, but I provide systems engineering services to Loud3r Inc. I’m their only “web ops” engineer and we’ve managed to completely turn things around in the past 6 months; we’re providing our services better (more reliable, more frequent updates), faster, and cheaper than before. Rather than cancel the contract entirely, the CEO of Loud3r and I felt it was a good idea for me to subcontract a large portion of my workload to a trusted colleague of mine.

How about you? Is there anything exciting you would like to share about the progress you made during 2011 or large changes you’re making in the start of 2012? Tell me all about it!

Concurrent Engineering: The Foundation of DevOps

DevOps is all about trying to avoid that epic failure and working smarter and more efficiently at the same time. It is a framework of ideas and principles designed to foster cooperation, learning and coordination between development and operational groups. In a DevOps environment, developers and sysadmins build relationships, processes, and tools that allow them to better interact and ultimately better service the customer (James Turnbull).

At the time of writing, if you were to search for “devops” you would find eight results attempting to explain what devops is, one result for a conference, and one rather satirical article (although not necessarily incorrect) where the author answers the question of “how do you implement devops” with “nobody seems to know” (Ted Dziuba).

The big problem with the DevOps “movement” is that we essentially have a bunch of operations and development people promoting it and trying to implement it within their organizations. Meanwhile, those with management and business responsibilities, even if explained the “what,” don’t understand the “how.” Just because you use puppet does not necessarily mean your organization is practicing DevOps.

This shortcoming is the result of us devops proponents either falsely claiming these techniques and methodologies are new or not knowing any better. If we had something more relatable for the business people (and, by principle, we should be business-oriented, too) then I think DevOps would have more of a chance.

Well, get your product and management people together, because the truth is that DevOps is actually a form of Concurrent Engineering.

Concurrent Engineering (CE) is a systematic approach to integrated product development that emphasizes the response to customer expectations. It embodies team values of co-operation, trust and sharing in such a manner that decision making is by consensus, involving all perspectives in parallel, from the beginning of the product life-cycle (ESA – Concurrent Engineering Facility).

Concurrent Engineering encompasses several major principles which just so happen to fit the definition (however formal or informal) of devops.

I’ll list them from the Synthesis Coalition here:

  • Get a strong commitment from senior management.
  • Establish unified project goals and a clear business mission.
  • Develop a detailed plan early in the process.
  • Continually review your progress and revise your plan.
  • Develop project leaders that have an overall vision of the project and goals.
  • Analyze your market and know your customers.
  • Suppress individualism and foster a team concept.
  • Establish and cultivate cross-functional integration and collaboration.
  • Break project into its natural phases.
  • Develop metrics.
  • Set milestones throughout the development process.
  • Collectively work on all parts of project.
  • Reduce costs and time to market.
  • Complete tasks in parallel.

By approaching the issues of devops as concurrent engineering and implementing it as such, you tie the movement to a well-researched, well-documented, and well-accepted product design philosophy. Framing the devops methodologies this way enables those of us pushing the movement to finally put it into a more business-oriented perspective.

Stop Letting Technical People Get Away With Social Ineptitude

As a technical person who has worked many customer-facing support roles, I’m offended by the often-cited notion that technical people have poor people skills or are poor at filling customer support roles. Earlier this week, a web host incited a Public Relations nightmare when their “Technical Director” responded childishly to some customers, disabled their accounts, and deleted their backups. The company’s response was to create a new customer support position so they could keep their Technical Director yet isolate him from their customers. In this new customer support personnel’s recap of the situation, he wrote:

Jules is the Technical Director. Jules does a fantastic job looking after things and keeping the infrastructure running well. Unfortunately though, as is often the case with very technically minded people, customer service is not always his strong point.

As I stated above, I find this absolutely offensive. Unfortunately, it’s often believed to be true because there are so many engineers and technical personnel with poor people skills. I have never seen statistics to support the notion that the ratio is higher in these professions than in others, yet employers often decide that people with poor people skills are still employable in these roles. This needs to stop.

I’m of the opinion that every position is customer-facing. Sure, some positions might not interface with the company’s customers, but the position is likely to have customers of its own – whether internal or external. All it takes for someone to be good at customer service is:

  • The understanding that they represent their employer and that people are relying on them,
  • A compassionate, helpful, and courteous attitude,
  • The knowledge of whatever they’re supporting.

Frankly, if you don’t have those qualities, how can you work with anyone at all?