Cheap and reproducible testing environments on AWS

Giorgio Sironi

If you are looking at these slides on your pc, there are speaker notes in the HTML

Giorgio Sironi (@giorgiosironi)

Software Engineer in Test (automates stuff for a living)
What do I do
- Distributed systems
- Automated complex tests, integrating many different projects
- Continuous Delivery
- Pasta and risotto

An unreadable diagram

Production environment

12x

End2end environment

12x

It is called this because the end2end tests run there. The end2end environment is built for automated testing only (and sometimes experimentation). It is identical to the production environment, which matters especially when you use many managed services like RDS, ElastiCache, S3 buckets, SQS queues and so on. Otherwise, how can you find out about bugs in the provisioning and orchestration? Where can you actually integrate with the outsourced technologies? Where can you test an architectural change like replacing disks with SSDs? This has saved us multiple times, when we found out about missing configurations, caching problems, or services misunderstanding their communication protocol. End2end tests are slower and more brittle than project-level tests, but the top of the pyramid can't be completely substituted. There are some simplifications you can make, like having 2 instances as a minimal cluster even if in production you have 3. Moreover, some of the instances needed there are quite powerful so that they can run lots of tests in parallel: one of the backends had a c4.4xlarge to run many image resizes in parallel.

Ci environment

The ci environment is called this because the first layer of tests (a project in isolation) runs there. Feature branches living 1 or 2 days are also built there, in addition to all new commits on the mainline. The ci environment is simpler than end2end because you only need one virtual machine per service: test suites are usually limited to one process and are not smart enough to effectively use more than one virtual machine for a given codebase. Sometimes you have data-oriented tests that benefit from many (expensive) CPU cores, like when we run the entire corpus of years of articles through a conversion process, but it's rare to need more for a single service. Yet there are many instances anyway because of the sheer number of different services.

Breaking the bank

"We have funding, you know, but I'm not going to run a c4.8xlarge instance to make tests run faster," says the sysadmin. "It costs a lot compared to these cheap t2 instances." It makes sense, as costs are not added together but multiplied with each other: number of services, times number of virtual machines needed per service, times number of environments. Running one virtual machine is relatively cheap compared to the cost of developers, but at scale you need some optimization. Yet testing environments are very valuable: the ci layer provides feedback to developers in minutes on every new commit; the end2end layer provides feedback (albeit more slowly) that you couldn't possibly get elsewhere if not by trying things in production (for the happiness of your customers).
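To make the multiplication concrete, here is a small back-of-the-envelope calculation. The three counts are made-up placeholders, not figures from our infrastructure; the only real number is the t2.medium price of $1.46/day from the cost slide.

```shell
# Hypothetical counts, for illustration only.
services=20          # number of services (assumed)
vms_per_service=2    # virtual machines per service (assumed)
environments=3       # e.g. ci, end2end, production (assumed)
price_per_day=1.46   # t2.medium price per day, from the cost slide

# Costs multiply: services x machines per service x environments.
instances=$((services * vms_per_service * environments))

# Shell arithmetic is integer-only, so use awk for the dollar amount.
daily_cost=$(awk -v n="$instances" -v p="$price_per_day" \
  'BEGIN { printf "%.2f", n * p }')

echo "$instances instances, \$$daily_cost per day if they never stop"
```

With these placeholder numbers the script reports 120 instances and $175.20 per day; doubling any single factor doubles the bill.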

Objections

Containers
"Serverless"
Travis CI (and similar)

I should just deploy applications in the form of containers everywhere. They have a fast startup time and can share the resources of an EC2 fleet for hosting. But there are advantages to virtual machines, like maturity and strong isolation of resources, and the fact that all existing projects that predate Docker-mania were built like that. You could reimplement everything in the form of JavaScript functions stored as blobs on S3 and integrating with each other through proprietary protocols. What could possibly go wrong? Travis CI and other build-as-a-service products are great for portable libraries, but I haven't found them flexible enough to build customized servers or a production-like environment (unless you deploy on Travis CI as a production environment?). In CI environments you have a maintenance problem, where the same setup of the application has to be kept in sync between the Travis build and what you use in production. In end2end environments you have to connect together multiple services in a complex topology, provisioning DNS entries, databases and all sorts of resources, so a single build doesn't cut it.

Builds and test suites are a dynamic load that varies widely during the day and the week. For example, if you are not geographically distributed, the load is high during the day and low at night (scheduled jobs). It's very low during weekends and holidays (right?!). But when everyone is churning out new pull requests on a busy morning, there are spikes. Isn't the promise of the cloud that we can dynamically adapt to load, paying only for what we use? For AWS EC2 this promise is fulfilled only if you have a robust orchestration layer, and with a granularity of 1 hour (recently changing to the order of minutes). I'm here to help you with that robust orchestration layer. It's pretty clear then that we can't keep our environments functional all the time, but should instead pursue an on-demand strategy.

Server-based resources	Shared resources
Web servers, databases	Queues, CDNs, ...
EC2, RDS, ElastiCache	S3, SQS, CloudFront
Pay by the hour	Pay per use
Optimize	Don't worry about it

You may want to build EC2 instances from scratch starting from an AMI, or even to dynamically instantiate a CloudFormation template when you need it. However, the overhead of provisioning new resources is still significant for test suites that take 5 (single project) or 20 (end2end) minutes to run. It's not efficient to create everything from scratch whenever you need it, as for some resources you only pay for usage and there are lots of things that can go wrong during the initial creation. For example, provisioning a CloudFront CDN for end2end tests takes ~1 hour. However, you only pay for the data actually transferred through it. And you don't want to wait 1 hour to find out that the DNS name conflicts with something else and the creation is rolled back.

Persistence

EC2 lifecycle

Lifecycle of an EC2 instance, showing start and stop commands and their transitions

Costs EC2 vs EBS

t2.small	$0.74/day
t2.medium	$1.46/day
t2.large	$2.88/day
t2.xlarge	$5.86/day
c4.4xlarge	$23.93/day
SSD gp2, 10 GB	$1.20/month
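A rough comparison of what this table implies: a stopped instance no longer accrues instance charges, but its EBS volume is still billed. The sketch below uses the c4.4xlarge and gp2 prices from the table; the 8 hours of daily use and the 30-day month are assumptions for illustration, and the billing granularity is ignored.

```shell
instance_per_day=23.93   # c4.4xlarge, from the table above
ebs_per_month=1.20       # 10 GB gp2 volume, from the table above
hours_running=8          # assumed hours per day the instance is needed

# Always on: full instance price plus one day of EBS (30-day month assumed).
always_on=$(awk -v i="$instance_per_day" -v e="$ebs_per_month" \
  'BEGIN { printf "%.2f", i + e / 30 }')

# Started on demand: instance price only for the hours it runs,
# while the EBS volume is paid for around the clock.
on_demand=$(awk -v i="$instance_per_day" -v e="$ebs_per_month" \
  -v h="$hours_running" 'BEGIN { printf "%.2f", i * h / 24 + e / 30 }')

echo "always on: \$$always_on/day, running ${hours_running}h: \$$on_demand/day"
```

Under these assumptions the daily cost drops from $23.97 to $8.02, and the disk that makes the instance persistent accounts for only about 4 cents of either figure.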

Starting

aws ec2 start-instances --instance-ids i-1234567890
# poll for the running state:
aws ec2 describe-instances --instance-ids i-1234567890
# poll with ssh until you can connect
# (optionally) poll for some smoke test to pass
# update DNS with the new public IP
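The steps listed above can be sketched as two shell functions. This is a minimal sketch rather than the actual scripts we run: the function names, retry counts and delays are illustrative assumptions. It relies on the AWS CLI's built-in `aws ec2 wait instance-running` waiter instead of a hand-written polling loop, and it leaves out the smoke test and the DNS update.

```shell
# Start an instance, block until EC2 reports it as running,
# then print its current public IP address.
start_and_wait() {
  local instance_id=$1
  aws ec2 start-instances --instance-ids "$instance_id" > /dev/null || return 1
  # Built-in waiter: polls describe-instances until the state is "running".
  aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
  # The public IP changes on every start unless an Elastic IP is attached.
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# "running" does not mean sshd is up yet: keep trying to connect.
# Usage: wait_for_ssh <host> [attempts]
wait_for_ssh() {
  local host=$1
  local attempts=${2:-30}   # assumed default: 30 tries, 10 seconds apart
  while [ "$attempts" -gt 0 ]; do
    ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" true && return 0
    attempts=$((attempts - 1))
    sleep 10
  done
  return 1
}
```

A caller would run something like `ip=$(start_and_wait i-1234567890) && wait_for_ssh "$ip"` before updating DNS with the new address.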

Stopping

aws ec2 stop-instances --instance-ids i-1234567890
# poll for the stopped state:
aws ec2 describe-instances --instance-ids i-1234567890

Lock system: Jenkins and Lockable resources plugin

Starting and stopping instances periodically would be dangerous without a mechanism for mutual exclusion between builds and lifecycle operations. Not only do you not want to run builds for the same project on the same instance if they interfere with each other, but you definitely don't want an instance to be shut down while a build is still running. Therefore, we wrap both these lifecycle operations and builds in per-resource locks, using the Jenkins Lockable Resources plugin. If the periodic stopping task tries to stop an instance where a build is running, it will have to wait to acquire the lock. This ensures that machines that see many builds do not get stopped easily, while idle ones are stopped at the end of their already paid hour.
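The idea of the lock can be illustrated outside Jenkins. The sketch below is not the Lockable Resources plugin: it is a stand-in that uses an atomic `mkdir` on a single machine, so that a build and a stop operation naming the same resource can never run at the same time. The function name, lock directory and resource names are hypothetical.

```shell
# Directory holding one lock per resource (hypothetical location).
LOCK_ROOT=${LOCK_ROOT:-/tmp/resource-locks}

# Run a command while holding the lock named after a resource.
# Usage: with_lock <resource> <command> [args...]
with_lock() {
  local resource=$1
  shift
  local lock_dir="$LOCK_ROOT/$resource.lock"
  mkdir -p "$LOCK_ROOT"
  # mkdir is atomic: exactly one caller succeeds, the others keep waiting.
  until mkdir "$lock_dir" 2>/dev/null; do
    sleep 1
  done
  "$@"
  local status=$?
  rmdir "$lock_dir"
  return "$status"
}

# Example (not executed here): both sides name the same resource.
# with_lock my-service--ci ./run-build.sh
# with_lock my-service--ci aws ec2 stop-instances --instance-ids i-1234567890
```

Unlike a real lock manager, this stand-in leaves a stale lock behind if the process is killed while holding it, which is one reason to delegate the job to the CI server.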

Error:
InsufficientInstanceCapacity
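EC2 can return this error from a start request when it has no capacity for that instance type at that moment. One possible way to cope, shown here as a sketch and not necessarily what we do, is to retry the start a few times with a pause in between. The `retry` helper is hypothetical, and it retries on any failure, not only on this specific error.

```shell
# Retry a command a limited number of times, sleeping between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up: every attempt failed
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Example (not executed here): up to 5 attempts, 60 seconds apart.
# retry 5 60 aws ec2 start-instances --instance-ids i-1234567890
```

If all attempts fail, the build should fail loudly rather than wait forever; the attempt count and delay above are arbitrary.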

There's more!

RDS can start/stop too
EC2 is starting to bill by the second

Conclusions

Know the pain points of your cloud architecture
Work to solve these problems leveraging its strengths
Optimize in the right place

Thanks!

@giorgiosironi @eLife

g.sironi@elifesciences.org

We are hiring!

Image credits

https://peurdunoir.deviantart.com/art/Harry-Potter-and-the-Order-of-the-Phoenix-384799893
https://commons.wikimedia.org/wiki/File:Clock_face_one_hand.png