Charles Hooper

Thoughts and projects from a Hacker and Operations Engineer

On Slack and Upkeep

A term I hear often in the context of engineering and project management is “slack.” It’s often used to refer to a magical pool of time that all of a service’s upkeep, including maintenance and operations, will somehow come out of. This is wrong, though. Here’s why:

  • That’s not what slack is for

  • Mismanaged slack is equivalent to non-existent slack

What is it then?

I subscribe to the definition in Tom DeMarco’s Slack, which is “the degree of freedom required to effect change.”

Slack is something you maintain so that your team stays responsive and adaptable; it is not “extra time” or “maintenance time.” If you treat slack as either of those, you are effectively allocating that time and thus eliminating your slack pool. Signs you or your team may be guilty of this:

  • You don’t make explicit allocations of time to operations or maintenance upkeep

  • You don’t “have enough time” to properly operate or maintain your services

  • You can’t solve problems or complete remediation items identified by your organization’s problem management program

So I should do nothing then?

Well, no. At least some of your slack needs to be spent idle, though. Remember that the concept of slack is rooted in queueing theory, and there is a well-known relationship between utilization and response time. That relationship is nonlinear: the more highly utilized your team is, the disproportionately longer your response time becomes. You can see it for yourself below:

Relationship between utilization and response time

We can tell by looking at this graph that our responsiveness falls apart at about 70% utilization, which means you should keep at least 30% of your time unallocated.
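
To make the shape of that curve concrete, here is a minimal Ruby sketch (my illustration, assuming the classic M/M/1 queueing model; the graph above may be drawn from a different model):

# mm1_sketch.rb - under M/M/1, time in system = service_time / (1 - utilization)
service_time = 1.0 # one normalized unit of work
[0.50, 0.70, 0.90, 0.95, 0.99].each do |utilization|
  response_time = service_time / (1.0 - utilization)
  puts format("utilization %3.0f%% -> response time %5.1fx", utilization * 100, response_time)
end

At 50% utilization a request takes about 2x the bare service time, at 70% about 3.3x, and at 90% a full 10x, which is why the curve takes off so sharply.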

Unallocated? Why can’t I just devote 30% of my time to upkeep?

Because upkeep, the maintenance and operations of your service, is a required activity. Entropy means that, left unkept, your service will degrade over time, and this decay accelerates if your service is experiencing growth. Your databases will bloat, your latency will increase, your 99.99% success rate will fall to 99.9% (or worse), your service will become difficult to add features to, and eventually your users will go somewhere else.

Instead of thinking about it like this:

Wrong way to manage slack

Think about it like this:

Right way to manage slack

In this model, you explicitly allocate time to upkeep and maintain a slack pool.

How much time should I spend on upkeep versus product and feature work?

I don’t have a good guideline for you, sorry. You’ll need to determine this based on your organization’s or team’s goals and any SLAs you may have.

For example, if you’re operating a service with a service-level objective of a 99.99% success rate (0.01% error rate), then you generally need to allocate more time to upkeep than a service targeting a 99.9% success rate.
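
To put rough numbers on that, here is a quick error-budget calculation in Ruby (a sketch of my own, assuming a 30-day month):

# error_budget_sketch.rb - monthly error budgets for two success-rate targets
minutes_per_month = 30 * 24 * 60
[0.999, 0.9999].each do |target|
  budget_minutes = minutes_per_month * (1 - target)
  puts format("%.2f%% target -> %.1f minutes of errors per month", target * 100, budget_minutes)
end

A 99.9% target leaves roughly 43 minutes of errors per month, while 99.99% leaves only about 4.3 minutes: an order of magnitude less room for things to go wrong, which your upkeep allocation has to be funded to defend.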

Note that this will change and vary over time. If you’re already deep in technical debt, your upkeep allocation will need to be much higher to pay off some of your principal. Once you’ve done that, you’ll probably be able to meet your goals with a much lower allocation later on.

Call to action

I urge everyone to start thinking about slack and upkeep this way. Take a close look at your team’s goals and commitments and explicitly allocate time for reaching those goals. Doing so will allow your team to properly maintain the services it operates while staying responsive.

What I Do as an SRE

Sometimes people ask me what I do and I’m not really sure how to answer them. My answer tends to depend on social setting, what I’ve been working on, and if I was on call that week. No matter the circumstances, it usually comes out pretty boring and terribly short.

This really sucks, because I actually like my job and find it interesting; I just have trouble articulating it.

So here’s some attempt at explaining what I do:

  • I’m an SRE, or Service Reliability Engineer, at Heroku. Typically, SRE stands for Site Reliability Engineer; however, we’ve modernized it at Heroku because, what is even a site anymore?

  • My week-to-week is wildly unpredictable. This week I’m conducting an operational review of one of our key platform components, last week I was investigating and addressing database bloat, and the week before I was the on-call incident commander and quite busy due to several incidents that occurred.

  • Speaking of the incident commander role, part of my job includes defining how we respond to incidents. At first glance it seems easy: Get paged and show up. And then you respond to your first 24-hour slow-burning incident and realize that you’ve got more work to do.

  • Following incidents, I also schedule and facilitate retrospectives. We practice blameless postmortems and these tend to be incredibly constructive.

  • I also analyze past incident data and look for patterns and trends. Wondering if there’s a day of week that has a higher probability of experiencing an incident? Yeah, it’s Friday.

  • When all is quiet, I review dashboards and investigate anomalies. Wondering what that weird spike or dip is that seems to happen every once in a while? Ask me; I’ve probably pulled that thread before (and if I haven’t, I’ll be terribly curious).

  • And sometimes I build integration tests and tools. I wrote elbping, for instance, because ELBs were terrible to troubleshoot during an incident.

  • And, most importantly, I mentor other SREs and software engineers. This is the single biggest thing I can do in terms of impact, and probably the most rewarding, too.

So there you have it, that’s what I do.

P.S. - If this sounds interesting to you, we’re hiring!

An Actual Email From a Recruiter

OK, I was going to lead up to this carefully so this didn’t seem like yet another awful blog post about technical recruiters (can we seriously stop posting those?) but I have no words. Here is the fifth email sent from one particular recruiter inside of a week:

Hey Charles,

I hope all is well man! I hadn’t heard back from you after my previous e-mail so I was toying around with putting together a search party to make sure you didn’t fall into a well or something. I’m not sure if that actually even happens anymore but this dog came up to me on my way to work and was barking at me in a way that seemed he was attempting to tell me something.

Naturally, I put two and two together and realized that perhaps I misunderstood my new furry friend and instead of saying “Chance is in danger” he said “Charles is in danger”. Which still doesn’t shed light on the where abouts of my missing cat Chance, but if you are in fact in danger, don’t hesitate to let me know. I can hold my breath for 45 seconds (not that long) and I can run fast-ish for short distances. I also have this thing with blood where I “pass out”.

On second thought, if your in danger you should probably just call 911 or yell “fire” (I guess “help” doesn’t work for some reason). However, if you’re even remotely interested in discussing the possibility of chatting about new opportunities, I’m your guy! Hope to hear from you soon Charles!

Yuuuuup.

Troubleshooting ELBs With Elbping

Troubleshooting ELBs can be pretty painful at times because they are largely a black box. There aren’t many metrics available, and the ones that do exist are aggregated across all of the nodes of an ELB. This can be troublesome at times, for example when only a subset of an ELB’s nodes are degraded.

ELB Properties

ELBs have some interesting properties. For instance:

  • ELBs are made up of 1 or more nodes
  • These nodes are published as A records for the ELB name
  • These nodes can fail, or be shut down, and connections will not be closed gracefully
  • It often requires a good relationship with Amazon support ($$$) to get someone to dig into ELB problems

NOTE: Another interesting but slightly less pertinent property is that ELBs were not designed to handle sudden spikes of traffic. They typically require about 15 minutes of heavy traffic before they will scale up, though they can be pre-warmed on request via a support ticket.

Troubleshooting ELBs (manually)

Update: Since writing this blog post, AWS has migrated all ELBs to use Route 53 for DNS. In addition, every ELB now has an all.$elb_name record that will return the full list of nodes for the ELB. For example, if your ELB name is elb-123456789.us-east-1.elb.amazonaws.com, then you would get the full list of nodes by doing something like dig all.elb-123456789.us-east-1.elb.amazonaws.com. Also, Route 53 is able to return up to 4KB of data while still using UDP, so the +tcp flag may no longer be necessary.

Knowing this, you can do a little bit of troubleshooting on your own. First, resolve the ELB name to a list of nodes (as A records):

$ dig @ns-942.amazon.com +tcp elb-123456789.us-east-1.elb.amazonaws.com ANY

The +tcp flag is suggested because your ELB could have too many records to fit inside a single UDP packet. You also need to perform an ANY query because Amazon’s nameservers will only return a subset of the nodes otherwise. Running this command will give you output that looks something like this (trimmed for brevity):

;; ANSWER SECTION:
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN SOA ns-942.amazon.com. root.amazon.com. 1376719867 3600 900 7776000 60
elb-123456789.us-east-1.elb.amazonaws.com. 600 IN NS ns-942.amazon.com.
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN A 54.243.63.96
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN A 23.21.73.53

Now, for each of the A records, use e.g. curl to test a connection to the ELB. Of course, you also want to isolate your test to just the ELB without connecting to your backends. One final property and little-known fact about ELBs:

  • The maximum size of the request method (verb) that can be sent through an ELB is 127 characters. Any larger and the ELB will reply with an HTTP 405 - Method not allowed.

We can take advantage of this behavior to test that the ELB itself is responding, without touching the backends:

$ curl -X $(python -c 'print "A" * 128') -i http://ip.of.individual.node
HTTP/1.1 405 METHOD_NOT_ALLOWED
Content-Length: 0
Connection: Close

If you see HTTP/1.1 405 METHOD_NOT_ALLOWED then the ELB is responding successfully. You might also want to adjust curl’s timeouts to values that are acceptable to you.
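
Here is a rough Ruby sketch of what that per-node loop looks like when scripted (my own illustration; note that a plain DNS lookup may return only a subset of nodes, per the dig discussion above):

# elb_node_check.rb - sketch: send an over-long HTTP method to each resolved
# node, expecting a 405 back from any node that is responding
require 'resolv'
require 'net/http'

elb_name = 'elb-123456789.us-east-1.elb.amazonaws.com' # example name from above

Resolv.getaddresses(elb_name).each do |ip|
  http = Net::HTTP.new(ip, 80)
  http.open_timeout = http.read_timeout = 5
  started = Time.now
  begin
    code = http.send_request('A' * 128, '/').code # 128 chars exceeds the 127-char limit
    puts format('Response from %s: code=%s time=%d ms', ip, code, (Time.now - started) * 1000)
  rescue StandardError => e
    puts "No response from #{ip}: #{e.class}"
  end
end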

Troubleshooting ELBs using elbping

Of course, doing this can get pretty tedious, so I’ve built a tool called elbping to automate it. It’s available as a Ruby gem, so if you have RubyGems then you can install it by simply doing:

$ gem install elbping

Now you can run:

$ elbping -c 4 http://elb-123456789.us-east-1.elb.amazonaws.com
Response from 54.243.63.96: code=405 time=210 ms
Response from 23.21.73.53: code=405 time=189 ms
Response from 54.243.63.96: code=405 time=191 ms
Response from 23.21.73.53: code=405 time=188 ms
Response from 54.243.63.96: code=405 time=190 ms
Response from 23.21.73.53: code=405 time=192 ms
Response from 54.243.63.96: code=405 time=187 ms
Response from 23.21.73.53: code=405 time=189 ms
--- 54.243.63.96 statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 187/163/210 ms
--- 23.21.73.53 statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 188/189/192 ms
--- total statistics ---
8 requests, 8 responses, 0% loss
min/avg/max = 188/189/192 ms

Remember, if you see code=405 then that means that the ELB is responding.

Next Steps

Whichever method you choose, you will at least know if your ELB’s nodes are responding or not. Armed with this knowledge, you can either turn your focus to troubleshooting other parts of your stack or be able to make a pretty reasonable case to AWS that something is wrong.

Hope this helps!

My DEF CON 21 Experience

I’ve just returned from DEF CON this year and wanted to share my experience. I’ve only been to DEF CON one other time, which I believe was DEF CON 16. During DEF CON 16, I mostly stuck to the hallway track and, to be perfectly honest, didn’t get a lot out of it as I mostly hung out with coworkers.

This time around I went with my good friend Japhy and no one else.

Logistics

We flew in separately on Thursday and stayed at the Bellagio. We initially chose the Bellagio because it was cheaper and we didn’t think a 15-minute walk every day was going to be a big deal. As it turns out, the walk itself was fine (even in the 98°F weather), but it meant we were effectively separated from the conference for most of the day. Next time I go, I would like to stay in the same hotel as the conference.

Thursday

Thursday was my day of travel. The flight was late leaving SFO, but this isn’t unusual as planes to and from SFO are pretty much never on time, it seems. Blame the fog.

Anyways, I arrived mid-afternoon and just hung out around the Bellagio since Japhy wasn’t in yet. I ate some pho, drank some good bourbon, and played some video poker. Eventually, Japhy arrived and we grabbed a beer together before turning in.

Friday

Friday morning we woke up and went and got our badges. They were pretty sweet looking and I was curious about the crypto challenge. There was apparently a talk where the badges were explained, but I missed it, so I mostly chatted with random people about them and compared notes and hypotheses. My badge, the Ace of Phones, translated to “in the real order the”. There was also an XOR gate on it, but I never got far enough to know what it was for.

Badges aside, Friday is the day that I went to the most talks.

The first talk I went to was about Offensive Forensics. The speaker asserted that an attacker could use many of the same techniques that would be used by a forensics investigator during their attack. For example, an attacker could easily recover and steal files that were previously deleted. The talk was good, but I felt the speaker spent too much time trying to convince the audience that it was a good idea; I and everyone I’ve talked to agreed up front that it was.

After leaving this talk I ended up catching the tail end of Business Logic Flaws In Mobile Operators Services. I wish I had seen more of this, but the speaker more or less explained that many mobile operator services have big flaws in their business logic (just like the title, eh?), such as relying on Caller ID for authentication. He also gave a live demo of an (unnamed) customer service line that, instead of disconnecting you on the third entry of an invalid PIN, actually grants you access.

Next I caught the end of Evil DoS Attacks and Strong Defenses where Matt Prince (CEO of CloudFlare) described some very large DDoS attacks and what they looked like. Someone afterwards also showed a variety of online banking websites where the “logout” button doesn’t actually do anything, leaving users vulnerable.

Immediately following that session, two guys got up and gave their talk on Kill ‘em All — DDoS Protection Total Annihilation!. I enjoyed the format of the talk, where the speakers would describe a DDoS protection technique and then how to bypass it. The bottom line: a) look like a real client, b) perform whatever handshakes are necessary (a lot of DDoS mitigators rely on odd protocol behaviors), c) use the OS TCP/IP stack when possible (see (a) and (b)), d) do what it takes to bypass any front-end caches, and e) try to keep your attack just below the threshold where anyone will notice you.

At night, there were a bunch of DEF CON parties. At some point the fire alarm went off a few times. A voice came over the intercom shortly after stating that they weren’t sure why their alarm system entered test mode but that “the cause was being investigated.” Later, it happened again and the hallway strobes for the fire alarm stayed on, adding kind of a cool effect to the party. Hmm.

Saturday

On Saturday I only saw two talks.

  1. Wireless village - In the wireless village I listened to a Q&A session by a pen tester whose expertise was in wireless assessments. My favorite quote from this talk was:

    Q: When you do these wireless assessments, is your goal just to get onto the network or do you look at wireless devices, such as printers, as well?

    A: I pulled 700 bank accounts from a financial institution 6 weeks ago [during a pen test]. We like printers.

  2. Skytalks - One of the skytalks I saw the first half of was about “big data”: the techniques used in analyzing this data, their weaknesses, and how you could use those techniques to stay below the radar, so to speak. It was interesting but rather abstract, and I’m not totally certain how to apply it in practice.

For the rest of the day, I brought my laptop and just kind of tinkered with stuff.

Sunday

I flew home early Sunday morning so I didn’t do anything on this day.

Why I Moved to San Francisco

It’s been three months since I first moved to San Francisco, so I decided I should share why I moved here in the first place. The primary reasons were my career and a desire to be around more like-minded people.

Career-wise, what made San Francisco appealing to me is the number and diversity of employment opportunities. In Connecticut, if you want to work with “technology,” you work for one of the many insurance companies headquartered there or for an agency of some kind. Beyond the sheer number of opportunities, there is also more parity between the Bay Area job market and my skill set and experience. For example, today Indeed.com reports 184 results for “Python” in the entire state of Connecticut, while there are nearly 2,700 results for the San Francisco Bay Area. At one point in my life, I was told I was wasting my life messing around with GNU/Linux and other open source software. Things would have been a little better if I had moved two hours away to either Boston or New York, but if I’m going to move, I might as well get better weather out of it, too.

I also moved to San Francisco to be around more like-minded people. Things that interest me (besides gardening and home brewing) are startups and tech. There were a few interesting groups around my old location, but they typically required an hour-long drive to attend their events. Oftentimes, the groups failed early due to a lack of participation (including the hackerspace I founded, but that story is for another day).

Thoughts so far

As I mentioned, I’ve now been here for three months. My thoughts so far are:

  • The place really is quite small. Especially in tech. Everyone seems to know everyone, which can be fun socially, but you need to watch what you say when you’re talking shop.

  • The Silicon Valley/SF Bay tech isolation chamber is real (and so is the echo chamber). Companies sometimes seem huge when you’re in the bay, but if you talk to anyone from outside the area, they’re like “Who?”

  • San Francisco’s neighborhoods are really awesome. SF is divided into a bunch of small neighborhoods, each with their own unique attributes. There really is a place for everyone.

Conclusion

I moved to San Francisco because I thought it would be good for my career and because I thought I would meet more like-minded people. This has certainly proved to be the case. What I was not expecting but have experienced so far is how small and isolated the SF tech scene actually is.

How I Hacked My High School

When I was a freshman or sophomore in high school, I cracked my high school’s network and got in “a lot of trouble” for it. I’ve only told the story maybe half a dozen times in my life, but after telling it a few times in the past month or so, I decided to write about it.

I went to a high school in a fairly decent school district that apparently had enough money to build several computer labs in each school, network them all together as a single autonomous network, and provide reasonably fast Internet access. This was somewhere between the year 1999 and the year 2000, the exact year I can’t quite remember.

Another interesting thing to note about my high school is that it carried an 80-hour community service requirement for graduation. I’ve always been into computers and networking, so it was natural for me to volunteer to do my community service assisting the technical staff in the computer labs. Community service in this way typically involved installing printer drivers, updating software, and pushing boxes around on a cart.

My school’s computer network consisted of a mix of several flavors of Windows, including Windows 95, Windows 98, and Windows NT 4.0. Despite the large Microsoft-only network, we were using Novell for authentication and authorization. Each school in the district was networked together and part of this same Novell network.

My school had reasonable filtering and monitoring in place for web traffic and it wasn’t uncommon to hear of students getting into trouble for looking at, let’s say “questionable content.”

One weird thing my school did was restrict access to the local hard drive. If you went into Explorer, the C:\ volume simply wasn’t listed. The justification was that we were supposed to use our network-attached “S drive,” which was unique to each user. The “Run” dialog from the Start menu was also disabled.

This had its flaws, however, and one day I hand-wrote (in English class) a 6-page paper detailing the variety of ways even an unskilled person could bypass this precaution. Most notably, these included:

  • Create a shortcut on the desktop to C:\
  • Open Internet Explorer and browse to file:///...
  • Certain “Save As” and “Open” dialogs still listed the C: drive and let you browse it
  • You could still get to a “Run” dialog by opening the Task Manager and finding it in there. This allowed you to a) browse to the C: drive anyway, and b) open up a command prompt

There were probably some other flaws I missed. I argued that restricting access to the local disk was rather pointless, as people could simply drop to it during boot (which many students were already doing in order to get a kick out of running deltree C:\Windows for some reason), and that the real issue was that local password caching was enabled. In other words, whenever a user logged in, a hashed version of their password was stored in the form of a Windows “PWL file.”

The next time I did my community service, I handed my paper to the head technician, who responded “What? This could never happen” and threw my paper in the garbage.

What?! I couldn’t believe that someone would actually throw away what was, at the time, the longest paper I had ever written after barely having read it. And to do so with a 4-word response, no less!

Due to my shock and an apparent lack of maturity, I decided that I would show them. I mean, when you approach someone with what you think is an obvious, logical opinion and they don’t believe you, your next option is to show them, right? Right?

It didn’t take me too long to gather up a fairly large collection of these so-called “PWL files” on a floppy disk. Distinguishing between staff and students was trivial too due to the naming convention in place. Students’ usernames were always in the form of {ExpectedGraduationYear}{LastName}{FirstInitial}. Mine was 03hooperc. Faculty and staff, on the other hand, simply followed the convention {FirstInitial}{LastName}. I decided that I didn’t care for any of the student logins and just discarded them.

When I had a good collection of staff PWL files, I used a tool called Cain & Abel on my home computer and ran it overnight.

In the morning, I was surprised to see how many passwords had been cracked. Among them, one staff login stuck out at me: the technician who threw my paper out. I knew that because he mostly worked on network issues, unlike the rest of the technical staff, he must have a lot more access than anyone else. His password was hilariously bad and insecure. It consisted of two three-letter words in all caps:

THEMAN

The man, huh? I giggled and went to school ready to launch my next attack. I went to the computer lab during lunch and waited for the opportunity to log into my new staff account. I had to be careful, though, as my school set the background of staff accounts to something very noticeable and I didn’t want the computer lab attendant sitting in the back to notice that I was on an account that didn’t belong to me. At some point, she left briefly and I quickly logged onto The Man’s account.

Once I was logged in, I was more excited and nervous than anything. Even still, I moved on to the next step of creating a backdoor account with admin privileges. I made up a name, created a new account following the faculty/staff naming convention, and granted it every privilege I could. As soon as I was done, I logged off. The log-off couldn’t have been slower. I felt like Peter in Office Space when he’s trying to duck out of work early and his computer keeps coming up with a bunch of last-minute tasks before it can log off.

I logged into my backdoor account exactly once to verify that it worked and never logged into it again. I fucked up and made the classic mistake of telling a friend. He was one of the few people I knew who also used Linux and we used to trade books, CDs, and manuals all the time. My favorite trade was giving him Debian GNU/Linux: Guide to Installation and Usage for a manual from Bell Canada about this crazy thing called SONET and T-carrier transmission systems (e.g., T-1 lines). Who knew that I would one day work for an ISP and later move on to manage thousands of Linux hosts based on a Debian derivative?

Some months later, I started hearing from other students that my buddy had told a bunch of people my secret and had even been logging into the account to prove it (OK, I definitely shouldn’t have shared the credentials!). I decided that the best thing to do was to confess to the Director of Information Technology, whom I used to bug from time to time asking why we didn’t use Linux, since it’s free. (Yes, I was probably a very annoying fifteen-year-old. For those wondering, he said it was because of a “licensing issue,” which I think was professional speak for “fuck off, kid.”) So, I walked into his office one day and dropped the bomb.

I honestly don’t know what I was expecting to happen. He listened to my story and asked me how long I’d had this access. When I told him, he remarked that their backups didn’t go back that far and that, in any case, he’d have to report me to the administration. He walked me over to the dean of discipline’s office and I went through a grueling session of more questions.

Something I should probably note is that this was during a period when a bunch of schools, including mine, had decided to institute a so-called “zero tolerance policy.” These zero-tolerance policies were simply a matrix with “violations” on one axis and “number of prior offenses” on the other. In other words, getting caught smoking in the bathroom for the first time carried a pre-determined penalty which was slightly less bad than getting caught a second time. Students could never argue that they didn’t know how much trouble they would get into for anything, because this matrix and every other school/district policy was distributed to us at the beginning of the year and we were required to have it on us at all times.

The obvious flaw with this policy, besides the fact that it ignores extenuating circumstances, is that in 1999 there wasn’t a row describing the penalties for any of the things I had done. In fact, I hadn’t technically violated any of the school or district policies at all! (Federal and local laws may be another issue.) I was sent home with the dean of discipline still deciding what to do with me, but I figured I was probably going to be suspended for a few days.

I went home and told my parents what happened and, very very very surprisingly, they said I did the right thing (albeit many months late). Something else amazing happened, though. My parents called the school and, after talking to the dean, we learned that my school was actually considering expulsion! My parents made some phone calls and tried to pull some strings that we didn’t think we had, but ultimately told me that we wouldn’t know my fate until the next day at school.

The next day I went into school and was immediately called into the dean’s office. He explained that they had decided I was guilty of stealing, and he kept using this “stole the key to the bank” metaphor. He actually told me that had I not brought the password files home on a floppy disk, I would probably be in less trouble. What the fuck?

And then I received my sentencing: Ten day suspension.

Ten days! That was huge. For comparison:

  • Ten days is the maximum amount of time you could be suspended for. Anything longer would be considered an expulsion and would require a hearing of some kind.
  • You could punch someone in the face (you know, assault) and the punishment for doing that would be 4-6 days… on the second offense.

I don’t remember how I got home but it was like my parents weren’t even mad. In fact, they seemed more proud than anything. I think they even thought that this incident would get me hired somewhere someday, too.

After my suspension, I returned to school to learn that, in addition to my now-served suspension, I was no longer allowed to use any of the computers at school. This lasted over a year and, after signing a paper agreeing to let my school sue me if I committed another offense, I was given my computer access back.

I’m terrible at concluding stories so… The End.

My Personal Kaizen

Kaizen is Japanese for “good change” or “improvement” and is frequently used within the context of lean manufacturing. Today, though, I’m going to talk about a few things I want to improve for myself, both personally and professionally.

Hard Skills

I’m fascinated by new technology and new ways of doing work, so polishing up some of my so-called “hard skills” comes relatively naturally to me. Things like learning a new programming language or a new tool require only time, which I’m more than willing and able to invest. Without further ado, here are the hard skills I’d like to improve:

  • Become proficient in Go. I find Go very appealing as a language, specifically for systems-level uses.
  • Become proficient in Ruby. A lot of the software I’m responsible for maintaining at Heroku is written in Ruby.
  • Become proficient in a functional programming language, such as Erlang. I’ve been interested in learning a functional programming language for a while. I decided on Erlang after hearing a talk from Chris Meiklejohn (of Basho) about Riak and watching several talks about Erlang and Erlang/OTP.

Soft Skills

Addressing soft skills is something that’s a little more difficult for me, but something that I think is important. These are the soft skills I’d like to improve:

  • Become better at empathy. Sometimes when listening to someone, it’s easy to jump to conclusions about what they are trying to say or how they came to that point. What I have been working on is understanding why, in particular, they feel a certain way.
  • Become a better listener. I have been working on this for a while, but one area I would still like to improve is learning to ask the right questions. I’m always impressed when someone follows up something I said with an engaging question, and I would love to gain that ability.
  • Become more well-spoken. I’d like to be a better speaker, both publicly and privately.

About This Site (Technical)

I was playing around with my site today and realized that I hadn’t disclosed much about how it works. Here’s that nerd “look at my site architecture you guys” post.

If you couldn’t tell by the default theme, my blogging software of choice is Octopress, which generates my site’s HTML from Markdown. I use vim for most of my editing and do so on my Lenovo Thinkpad X220 running Ubuntu desktop. I use a tiling window manager called Awesome WM.

I also use Fastly for my caching layer. For the purposes of this discussion, let’s just call Fastly “Varnish as a service.”

Prior to today, I just used a single Linode in the Newark, NJ datacenter as my web server. The Linode runs Ubuntu server and serves static HTML with nginx.

As of today, though, I’m experimenting with Heroku US and have configured my Heroku clone of the site as a second backend server for Fastly. Octopress does some magic here by serving the HTML through Rack, a minimal webserver interface for Ruby.
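
For the curious, the Rack piece is tiny. Here is a sketch of the kind of config.ru involved (my approximation, not necessarily the exact file Octopress generates), serving the generated static files out of public/:

# config.ru - sketch: serve the generated static site through Rack
require 'rack'

use Rack::Static,
  urls: ['/images', '/javascripts', '/stylesheets'],
  root: 'public'

run lambda { |env|
  # Fall back to the front page for anything Rack::Static didn't serve
  [200, { 'Content-Type' => 'text/html' }, File.open('public/index.html', File::RDONLY)]
}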

My site is also monitored by Pingdom and I use Munin for metrics collection.

There are some really amazing takeaways from this. For one, my only expense currently is the roughly $20/mo I pay for the Linode. I would happily pay for any of these services for a real, money-making website but, for this site, I’ll continue to use the free plans.

Second is how much technology is available to us these days. I own zero physical infrastructure, and the only thing slightly resembling traditional infrastructure is my Linode. Contrast this to several years ago, when I had a third-generation HP ProLiant DL380 colocated and was starting to experiment with virtualization. I mean, holy shit, you can literally host a website for nothing!

This second point is why I think I love working in the PaaS (platform-as-a-service) space so much. PaaS is still young enough that many skeptics don’t believe it’s a working model. I think we’ve proven that it works, now we just need to prove we can grow with the needs of our customers.

Catching Up

tl;dr

  • Moved to San Francisco
  • Took time off to decompress
  • Joined team at Heroku
  • Loving it

--verbose

It’s been a while since I’ve posted, but I wanted to let you know that I’m still alive! Things were very hectic due to a few life changes, but much of the dust has settled now and I’m excited to talk about what those changes are.

But first, some history.

In 2011, I was a junior at university studying Business Information Systems. I had some prior systems engineering and operations experience and was paying my tuition by performing contract work for a company that no longer exists today. While at this company, I worked very closely with a developer, and the two of us were responsible for running the company’s infrastructure. Neither of us had the time or the willpower to be bothered by operational tasks, so we automated everything. Even though our environment was much smaller (somewhere between 30-50 instances in EC2) than many others, we had a lot of things that many companies are lacking:

  • New services were in configuration management
  • New deployments were almost entirely automated
  • We had good visibility with a variety of metrics being reported to Ganglia and Graphite
  • We even had a reasonable Nagios configuration

Around this time, the term devops was being tossed around on Twitter more and more frequently. I can remember actually having a Google Alert to email me when there were new blog posts about the term, back when that wasn’t spammy. I have a love/hate relationship with the term, and in December 2011 I wrote my blog post Concurrent Engineering: The Foundation of DevOps, in which I argued that the ideas behind devops weren’t new, but were old ideas from business that had recently been independently re-discovered.

The blog post wasn’t very popular, but Solomon Hykes (CEO of dotCloud) managed to see it and, thinking we had very similar ideas about devops, invited me to interview for a position on their newly formed Site Reliability Engineering team. I got the job at dotCloud, and up until April of this year that’s where I stayed.

In mid-April, I resigned from my position at dotCloud and moved to San Francisco. There were a number of reasons for the resignation but chief among them was that I needed some time to decompress and all of my paid time-off had been used up following my youngest brother’s car accident. This ended up being an awesome decision because it gave me my much-needed decompression time and I was able to explore my new city.

I took about four weeks off before I put any real effort towards a job search. The move to San Francisco and the job search alone could fill two entirely-too-verbose blog posts but the end result was that I moved here safely and joined the team at Heroku!

Fast forward to today and I’ve just finished my third week at Heroku. It’s an amazing experience, a great team, and an awesome culture that is encapsulated in the following quote:

I like our culture. We welcome failure into our house and then kick its teeth in.

I’ll write more about this another time. For now, thank you for listening to my story.

— Charles