The guys at DevOps Cafe have been preaching the
gospel of sharing in operations, and I’ve been agreeing as loudly as possible.
This is my first step towards putting up (instead of shutting up).
Startups Need DevOps
DevOps problems don’t just exist in tiered organizations with a big wall separating
the development team from the operations team. In our case, they exist when
part of the dev team and the whole ops team share a brain. Over a few posts,
I’d like to share what our experience has been in solving these problems,
including the missteps we’ve had along the way and the problems we’re still
working on. First though, I’d like to explain why DevOps and a focus on
Culture, Automation, Metrics and Sharing (CAMS) has worked well for us and
why I think it’s a critical element of success in startups.
Startups are fundamentally about learning and building in the face of uncertainty.
For this and many other reasons, agile development is almost universally seen
as the way to build a software product (I’m going to take this as a given from
here on out). If you’re doing software development in a startup, you should be
using an agile process because the requirements are uncertain. They’re not
just uncertain because requirements are hard to write down. They’re uncertain
because not only do you not know what your customers want but you might not even
know who your customer is. Agile development means rapid iteration, and one
of the first things you run into is that it’s difficult to keep an application
deployed and operational when you’re changing large parts of your application
on a regular basis (especially early). Agile development doesn’t work with
slow, rigid operations.
Your startup is chugging along, iterating quickly and deploying lightly-tested
code to your one server by doing `git pull && touch deploy.wsgi`. You’re
learning tons as you get rapid feedback from customers and people love you
because you can push a fix for their bug before you even hang up the phone!
Then, one day, you find a HUGE bug in production and How Could This Happen!?
The next week, your application goes down because you tested that new feature
using python 2.6 locally and the server has 2.5 and urllib hates you and you
were going to update the server to 2.6 before the next `git pull` but Fred
deployed his feature first and you remember telling Fred about the python 2.6
thing and now your co-founder is questioning this rapid iteration thing and
your customers don’t care about your 5-minute turnaround on bug fixes because
they were in a presentation when you went down and… arg.
The worst reaction is to slow down development, and there are plenty of ways to do it:
- You can shame/punish developers for any bugs or mistakes
- You can require every update go through a manual set of several hundred test
cases that takes a few days to click through
- You can lengthen your release cycle so that the rapid iterations only happen
on the internal build and customers only see the changes every couple of
months after a rigorous manual QA process.
All of those options work in the sense that you can improve uptime and quality
over your old way of doing things. The problem is that in the meantime, your
rapid iteration, and the learning that comes with it, grinds to a halt:
- Punishing errors won’t actually solve your problem, but it has the misfortune
of making shamers feel better. This is a mindset that we’ve had to constantly
fight in order to build a culture where it’s ok to make a mistake exactly
once. Unilateral mistake prevention isn’t possible without HUGE costs.
- Days worth of manual tests are the opposite of rapid. Before automated
testing was a viable alternative, this might have been the only option to
ensure quality, but every delay lengthens the distance between you and
learning from your customers. It’s also very possible that your time running
manual tests on known problems cuts into the time you have for exploratory
testing to find the actual bugs.
In the beginning, this was the path we went down until we got smart and began
investing in a suite of Selenium tests (in
conjunction with unit tests) to solve the same problem. Now all our code
changes require accompanying Selenium tests, and we don’t have the kind of bugs
that manual tests would have caught. We also run these tests many times per
day, which means we know on an hourly basis what our quality looks like.
- Lengthening your release cycle might seem like a good idea, but if that’s all
you’re changing, it’s just as likely to cause lower quality as it is to
raise quality. Instead of building the minimum set of features and iterating
on feedback, longer cycles encourage you to build too much and “too much” is
more difficult to test than Just Enough. See our group collaboration story
for an example. It’s kind of a dead horse, but longer release cycles are more
waterfall, and waterfall is bad.
We need to deliver changes to our customers quickly because that’s the only way
we can find out what they actually need and give it to them.
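The "every change ships with its tests" discipline described above is easy to sketch. The helper and test below are hypothetical stand-ins for real application code, and the Selenium browser tests we pair them with are omitted because they need a live browser; this shows only the unit-test half of the pattern:

```python
import unittest

# Hypothetical helper standing in for real application code under change.
def policy_title(raw):
    """Normalize a policy name for display."""
    return raw.strip().title()

class TestPolicyTitle(unittest.TestCase):
    def test_strips_and_titles(self):
        self.assertEqual(policy_title("  budget policy "), "Budget Policy")

# Run the suite programmatically; in practice a change only ships when
# suites like this (plus the Selenium tests) pass, many times per day.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPolicyTitle)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite on every push is what turns "we think quality is fine" into the hourly signal mentioned above.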
A Bad Solution Hurts the Business
In 2008, when our update cycle was still in the 6-8 week range (because
releases were a pain), I built a group collaboration piece to help users give
feedback on policies in advance of a meeting. It took the full 8 week cycle to
nail down and get built and tested. We then spent another 6 week cycle refining
the idea based on direct meetings with customers. I thought it was pretty
awesome and the major sponsor of the feature seemed happy. Later down the road,
we integrated Mixpanel to start tracking exactly what
our customers were using and we got some bad news. Not only was the sponsor
customer the only one using the group collaboration, they were only using it a
few times a month (versus hundreds of uses a week for our “core” features) AND
they were only using a subset of the features we added. If we were keeping
metrics to track usage, we would have known right away that this feature was
not the most important. If our devops process was better and we didn’t fear
releases, we could have built the core functionality that they were actually
interested in instead of spending weeks on parts that were “nice to have.” As
a customer-development side note, a bit more customer interaction would have
led us to build something
very different. We added lightweight in-line collaboration this year in a
single 2-week cycle. This would have fulfilled 90% of the original requirements
and is now utilized about 40% of the available time instead of the low
single-digit utilization for the larger, more-complicated group collaboration
feature. Our devops problem cost us ~10 weeks of desperately-valuable
development time that our business and customers needed.
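The lesson from the Mixpanel data doesn't require anything fancy: log an event per feature interaction and roll the counts up. A minimal in-process sketch (feature names and counts are hypothetical, standing in for a real analytics integration):

```python
from collections import Counter

# Hypothetical event log standing in for a hosted analytics service:
# every feature interaction records an event.
events = []

def track(feature_name):
    events.append(feature_name)

# Simulated week of traffic: core features dwarf the collaboration piece.
for _ in range(300):
    track("core_policy_edit")
for _ in range(2):
    track("group_collaboration")

# The rollup makes the usage gap impossible to miss.
weekly_usage = Counter(events)
print(weekly_usage.most_common())
```

A gap like 300-to-2 in the weekly rollup is exactly the early warning that would have told us the feature missed the mark.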
Another Solution, Almost as Bad
Another bad solution is to realize that you have a deployment/release problem,
discover that there are good technical solutions and then spend the next year
completely automating your processes, collecting metrics, and changing behavior
to build better culture while you shut down the development side of things.
Obviously, a startup can’t survive treading water for months while working on
“internal” improvements and you wouldn’t lay off a development team (or pay
them to sit) in the meantime. In our case though, there was no line between the
development team and the operations team. Time spent on operations was time not
spent on development. This balance has been a continuing struggle for us, but
the flexibility that gives us to always work on the most important problem has been worth it.
A Better Way
Recognizing that your business has a problem and that there is a solution seems
like most of the battle, but there’s still the problem of how to get from where
you are to where you want to be. The good thing about DevOps in general is
that it’s not a set of things you have to do to be “compliant.” It’s a process
that your startup can use starting today to make things a bit better. It
starts with a culture that treats operations and development as two sides of the
same coin and that values continuous improvement. The manifestations of this
include things like striving to automate all of your operations activities and
putting infrastructure code in the same source control your application lives in.
If you have a problem with a developer using a different version of a library
than you deployed to production, give that developer the ability and
responsibility to change your pip requirements
file and review their code change in your source control. Just like
development’s job is to find the highest value/priority thing to work on every
day, operations should be looking for the place with the most pain and working
on removing that pain through automation so that it
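The requirements-file fix described above can be as small as pinning exact versions so every environment installs the same thing. Package names and versions here are hypothetical:

```
# requirements.txt -- pinned and reviewed like any other code change
Django==1.3.1
python-dateutil==1.5
```

With pins in place, `pip install -r requirements.txt` produces the same environment on a developer laptop as on the production server, and a version bump arrives as a visible, reviewable diff instead of a surprise.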