PolicyStat's Dev Blog

Network Latency Between EC2 Instances: Public vs Private IP and Same vs Different Availability Zone

How Realistic Is Your EC2 => EC2 Load Test?

Today I was reading about a cool distributed load-testing tool called Bees With Machine Guns (awesome name), and a commenter asked how testing EC2 to EC2 might affect the load test. After all, real clients aren’t going to have local-network connection speeds to your application, which is going to change the characteristics of how your application responds. With slow clients you get increased resource usage, depending on your setup, and you’d probably like to know whether or not putting that reverse proxy in front of your application server actually helps.

I was going to suggest in the comments that using a different availability zone would actually solve the problem, but then I realized I was just guessing. So I ran some simple, unscientific tests.

Results Summary

[EC2 network latency graph]

In a nutshell, if you want mostly-realistic EC2 => EC2 load tests, put your load-testing instances in another region. As you would suspect, using the internal vs the external DNS entry doesn’t matter (except for your bill), and moving to another availability zone still gives you a super-quick connection. While moving your tests to another region won’t give you the variance you need for truly realistic tests (there are tools for that, though), it is a dead-simple way to get a more realistic picture of what your application does under load.

The average response times I received:

  •  0.5 ms - Same Region, Same Availability Zone
  •  2.0 ms - Same Region, Different Availability Zone
  • 81.7 ms - Different Region

Test Methodology

I spun up four instances from 32-bit Ubuntu 10.04-based images and ran pings between them using both the public and the private DNS entries. I ran a couple of warm-up pings first so the DNS lookups were already cached, and then ran:

$ ping address.to.ping -c 20

Instances:

  • instance.a: high-cpu medium in us-east-1b
  • instance.b: high-cpu medium in us-east-1b
  • instance.c: high-cpu medium in us-east-1d
  • instance.d: high-cpu medium in us-west-1a
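
For what it’s worth, the whole run is easy to script. Below is a minimal Python sketch of the same procedure (a couple of warm-up pings so the DNS lookup is cached, then 20 timed pings, then pull the average out of ping’s summary line). The hostnames are placeholders rather than the instances above, and this isn’t the exact tooling I used; I just ran ping by hand.

import os
import re
import subprocess

# Placeholder hostnames -- substitute your own instances' public or
# private DNS entries.
HOSTS = [
    'same-zone.example.compute.amazonaws.com',
    'other-zone.example.compute.amazonaws.com',
    'other-region.example.compute.amazonaws.com',
]


def avg_rtt_ms(host, count=20):
    """Ping a host and return the average round-trip time in milliseconds."""
    # A couple of warm-up pings so the DNS lookup is already cached.
    with open(os.devnull, 'w') as devnull:
        subprocess.call(['ping', '-c', '2', host], stdout=devnull)
    proc = subprocess.Popen(['ping', '-c', str(count), host],
                            stdout=subprocess.PIPE)
    output = proc.communicate()[0].decode('utf-8')
    # ping summarizes as: rtt min/avg/max/mdev = 0.397/0.540/0.629/0.061 ms
    match = re.search(r'= [\d.]+/([\d.]+)/', output)
    return float(match.group(1)) if match else None


for host in HOSTS:
    print('%s: %s ms average' % (host, avg_rtt_ms(host)))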

Detailed Results

Same Region, Same Availability Zone

instance.a => instance.b

Public DNS

--- ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19087ms
rtt min/avg/max/mdev = 0.397/0.540/0.629/0.061 ms

Private DNS

--- ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19096ms
rtt min/avg/max/mdev = 0.362/0.544/0.733/0.084 ms

Same Region, Different Availability Zone

instance.a => instance.c

Public DNS

--- ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19187ms
rtt min/avg/max/mdev = 1.817/2.035/3.113/0.260 ms

Private DNS

--- ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19185ms
rtt min/avg/max/mdev = 1.790/1.936/2.051/0.078 ms

Different Regions

instance.a => instance.d

Public DNS

--- ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19176ms
rtt min/avg/max/mdev = 81.517/81.672/82.059/0.281 ms

Private DNS

Not available; the private DNS name resolves to a private address that isn’t reachable from an instance in another region.

DevOps at Our Startup

The guys at DevOps Cafe have been preaching the gospel of sharing in operations, and I’ve been agreeing as loudly as possible. This is my first step towards putting up (instead of shutting up).

Startups Need DevOps

DevOps problems don’t just exist in tiered organizations with a big wall separating the development team from the operations team. In our case, they exist even when part of the dev team and the whole ops team share a brain. Over a few posts, I’d like to share our experience in solving these problems, including the missteps we’ve made along the way and the problems we’re still working on. First, though, I’d like to explain why DevOps and a focus on Culture, Automation, Metrics and Sharing (CAMS) have worked well for us and why I think they’re a critical element of success in startups.

Startups are fundamentally about learning and building in the face of uncertainty. For this and many other reasons, agile development is almost universally seen as the way to build a software product (I’m going to take this as a given from here on out). If you’re doing software development in a startup, you should be using an agile process because the requirements are uncertain. They’re not just uncertain because requirements are hard to write down. They’re uncertain because not only do you not know what your customers want, you might not even know who your customers are. Agile development means rapid iteration, and one of the first things you run into is that it’s difficult to keep an application deployed and operational when you’re changing large parts of it on a regular basis (especially early on). Agile development doesn’t work with slow, rigid operations.

Bad Solutions

Your startup is chugging along, iterating quickly and deploying lightly-tested code to your one server by doing git pull && touch deploy.wsgi. You’re learning tons as you get rapid feedback from customers, and people love you because you can push a fix for their bug before you even hang up the phone! Then, one day, you find a HUGE bug in production and How Could This Happen!? The next week, your application goes down because you tested that new feature using Python 2.6 locally and the server has 2.5 and urllib hates you and you were going to update the server to 2.6 before the next git pull but Fred deployed his feature first and you remember telling Fred about the Python 2.6 thing and now your co-founder is questioning this rapid iteration thing and your customers don’t care about your 5-minute turnaround on bug fixes because they were in a presentation when you went down and… argh.

The worst reaction is to slow down development.

  • You can shame/punish developers for any bugs or mistakes.
  • You can require every update to go through a manual set of several hundred test cases that takes a few days to click through.
  • You can lengthen your release cycle so that the rapid iterations only happen on the internal build and customers only see the changes every couple of months, after a rigorous manual QA process.

All of those options work in the sense that you can improve uptime and quality over your old way of doing things. The problem is that in the meantime, your business suffers.

  • Punishing errors won’t actually solve your problem; mostly it just makes the shamers feel better. This is a mindset that we’ve had to constantly fight in order to build a culture where it’s OK to make a mistake exactly once. Preventing every mistake outright isn’t possible without HUGE costs.
  • Days’ worth of manual tests are the opposite of rapid. Before automated testing was a viable alternative, this might have been the only way to ensure quality, but every delay lengthens the distance between you and learning from your customers. It’s also very possible that the time spent running manual tests on known problems cuts into the time you have for exploratory testing to find the actual bugs.

    In the beginning, this was the path we went down, until we got smart and began investing in a suite of Selenium tests (in conjunction with unit tests) to solve the same problem. Now all of our code changes require accompanying Selenium tests, and we don’t have the kind of bugs that manual tests would have caught. We also run these tests many times per day on our Hudson server, which means we know on an hourly basis what our quality looks like. (A sketch of what one of these tests can look like appears after this list.)

  • Lengthening your release cycle might seem like a good idea, but if that’s all you’re changing, it’s just as likely to lower quality as to raise it. Instead of building the minimum set of features and iterating on feedback, longer cycles encourage you to build too much, and “too much” is more difficult to test than Just Enough. See our group collaboration story below for an example. It’s kind of a dead horse, but longer release cycles are more waterfall, and waterfall is bad.
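
To make the testing side concrete, here is a minimal sketch of what one of those browser-level Selenium tests can look like in Python. It is not one of our actual tests; the URL, field names, and page title are invented for illustration. The shape is the point: drive a real browser through a user-visible workflow and fail loudly when it breaks.

import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class LoginSmokeTest(unittest.TestCase):
    """Hypothetical example: log in and check that the dashboard loads."""

    def setUp(self):
        self.browser = webdriver.Firefox()

    def tearDown(self):
        self.browser.quit()

    def test_login_shows_dashboard(self):
        # Walk through the same flow a user would take.
        self.browser.get('http://staging.example.com/login/')
        self.browser.find_element(By.NAME, 'username').send_keys('testuser')
        self.browser.find_element(By.NAME, 'password').send_keys('testpass')
        self.browser.find_element(By.NAME, 'password').submit()
        self.assertTrue('Dashboard' in self.browser.title)


if __name__ == '__main__':
    unittest.main()

Run from a continuous integration server on every change, a suite of tests like this catches the “How Could This Happen!?” class of bug before a customer ever sees it.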

We need to deliver changes to our customers quickly because that’s the only way we can find out what they actually need and give it to them.

A Bad Solution Hurts the Business

In 2008, when our update cycle was still in the 6-8 week range (because releases were a pain), I built a group collaboration piece to help users give feedback on policies in advance of a meeting. It took the full 8-week cycle to nail down, build, and test. We then spent another 6-week cycle refining the idea based on direct meetings with customers. I thought it was pretty awesome, and the major sponsor of the feature seemed happy.

Later down the road, we integrated Mixpanel to start tracking exactly what our customers were using, and we got some bad news. Not only was the sponsor customer the only one using the group collaboration, they were only using it a few times a month (versus hundreds of uses a week for our “core” features), AND they were only using a subset of the features we added. If we had been keeping metrics to track usage, we would have known right away that this feature was not the most important thing to build. If our devops process had been better and we didn’t fear releases, we could have built the core functionality they were actually interested in instead of spending weeks on the parts that were “nice to have.” As more of a customer development note, a bit more customer interaction would have led us to build something very different.

We added lightweight in-line collaboration this year in a single 2-week cycle. It would have fulfilled 90% of the original requirements and is now used about 40% of the available time, versus the low single-digit utilization of the larger, more complicated group collaboration feature. Our devops problem cost us ~10 weeks of desperately valuable development time that our business and customers needed.

[Usage graph]

Another Solution, Almost as Bad

Another bad solution is to realize that you have a deployment/release problem, discover that there are good technical solutions, and then spend the next year completely automating your processes, collecting metrics, and changing behavior to build a better culture while you shut down the development side of things. Obviously, a startup can’t survive treading water for months while working on “internal” improvements, and you wouldn’t lay off a development team (or pay them to sit idle) in the meantime. In our case, though, there was no line between the development team and the operations team: time spent on operations was time not spent on development. That balance has been a continuing struggle for us, but the flexibility it gives us to always work on the most important problem has been valuable.

A Better Way

Recognizing that your business has a problem and that there is a solution seems like most of the battle, but there’s still the problem of how to get from where you are to where you want to be. The good thing about DevOps in general is that it’s not a set of things you have to do to be “compliant.” It’s a process that your startup can use, starting today, to make things a bit better. It starts with a culture that treats operations and development as two sides of the same coin and that values continuous improvement. The manifestations of this include things like striving to automate all of your operations activities and putting infrastructure code in the same source control your application lives in. If you have a problem with a developer using a different version of a library than you deployed to production, give that developer the ability and responsibility to change your pip requirements file and review their change in your source control. Just like development’s job is to find the highest-value, highest-priority thing to work on every day, operations should be looking for the place with the most pain and working on removing that pain through automation so that it stays gone.
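
To make that library-version point concrete: a pinned pip requirements file, checked in next to the application, is the smallest possible version of “infrastructure in source control.” The packages and versions below are hypothetical, not our actual stack.

# requirements.txt -- lives in the same repository as the application,
# so changing a dependency is a code change that gets reviewed like any other
Django==1.2.5
lxml==2.2.6

A developer who needs a newer library bumps the pin in a branch, the change gets reviewed and run through the test suite, and the deploy automation installs exactly those versions, so local and production environments can’t silently drift apart.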