How to test 2 years of behavior
for 16... 70 countries
in 4 minutes... 18 minutes
Giorgio Sironi
Since well-engineered things tend to act as a center of gravity, the latest data I have on this test suite suggests that it has been greatly expanded since I first gave this talk.
I'm a developer (writes code, designs stuff)
Interested in
Automated testing and TDD
Object-oriented programming
Distributed systems
If you are looking at these slides on your pc, press S to see the notes
I built this test infrastructure with (the help of) the Onebip team, where I worked for the last 3 years. Now I work at eLife, a scientific journal that open-sources everything it does in the spirit of open access in science.
Context
Onebip is a payment provider that takes money from your SIM card and uses it to pay for digital goods. There are hundreds of mobile phone operators to integrate with. Different people may work on each system. I'm ignoring reporting here and concentrating on the transactional part: what is needed to perform a payment.
What we wanted to do
daily deployment
refactor the hidden legacy code in the core system
deal with moving and complex requirements, always know what's live
Test suites support multiple goals, like safety (I deploy and don't break what was working before), but also change (refactoring with a safety net) and documentation (readability of steps and avoiding writing tens of thousands of lines of test code).
The standard solution
unit: fast, isolated, technical, 1000s of tests ("A return code of 709143 is FAILED")
end-to-end: slower, all-encompassing (covering even legacy), customer-facing, 100s of tests ("I am billed with the message 'Thank you for your purchase'")
Bicycle toolkit
Cycling
So I'm showing here how to write effective and fast end-to-end tests, while saying at the same time that unit tests are just as important. They are just easier to write, so here is some advice on the end-to-end ones.
Introducing simulators
"Just test with the real thing you mockist design-damager!"
A run of the suite when tested with real mobile phone operators (which have no preproduction systems) costs about 500 EUR
Substituting the real APIs of operators. Implementing the Fake pattern.
Simulator API
SimulatedUser::addBilling($phoneNumber, $amount) : boolean
new SimulatedUser($mongoCollection);
It's not as simple as mocking: simulators need to be backed by an actual database, because their code may run in many different processes and servers. Don't make the mistake of going for "as simple as possible" solutions like writing to files, as they will drive you crazy with race conditions.
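To make this concrete, here is a minimal sketch of a database-backed Fake matching the Simulator API above; it assumes the legacy MongoCollection driver, and the document fields are invented for the example rather than taken from the real Onebip schema.

<?php
// A Fake operator backed by MongoDB, so that every PHP process and
// server involved in a test sees the same simulated state.
class SimulatedUser
{
    private $users;

    public function __construct(MongoCollection $users)
    {
        $this->users = $users;
    }

    public function addBilling($phoneNumber, $amount)
    {
        $user = $this->users->findOne(['phone_number' => $phoneNumber]);
        if ($user === null || $user['credit'] < $amount) {
            return false;
        }
        // Atomic decrement: many processes may bill the same user concurrently
        $this->users->update(
            ['phone_number' => $phoneNumber, 'credit' => ['$gte' => $amount]],
            ['$inc' => ['credit' => -$amount]]
        );
        return true;
    }
}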
If you can't test results through the API (need to look into the database directly for results),
you're missing a way to monitor and drive the real system.
Surprise: not only do unit tests drive design, forcing low coupling and high cohesion of objects; end-to-end tests also drive the design of a self-sufficient, complete and concurrent API.
Tests are just another client of:
Merchant notifications (SUBSCRIPTION_ACTIVATED)
Logs of calls between components
Domain Events raised
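For example, merchant notifications can be captured by a simulated merchant and asserted on through its own API instead of peeking into the system's database; the sketch below is hypothetical, with invented class and field names, and again backed by MongoDB so every process sees the same notifications.

<?php
// Illustrative sketch: records the notifications the merchant endpoint
// receives, so tests can ask "was the merchant notified of X?"
class SimulatedMerchant
{
    private $notifications;

    public function __construct(MongoCollection $notifications)
    {
        $this->notifications = $notifications;
    }

    // Called by the fake endpoint that receives the HTTP notification
    public function record($phoneNumber, $type)
    {
        $this->notifications->insert([
            'phone_number' => $phoneNumber,
            'type' => $type,
            'received_at' => new MongoDate(),
        ]);
    }

    // Called by test code, e.g. for SUBSCRIPTION_ACTIVATED
    public function hasBeenNotifiedOf($phoneNumber, $type)
    {
        return $this->notifications->count([
            'phone_number' => $phoneNumber,
            'type' => $type,
        ]) > 0;
    }
}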
"Stop introducing APIs just for testing you mockist design-damager!"
If executed manually by someone, a run of our suite would take several weeks to complete. You can test with real databases, but not with real time.
1st try: PHPUnit
public function testAnIusacellUserWithMoneyCanActivateASubscription($serviceId, ...)
{
    $this->given(
            $this->withServiceId($serviceId)
                 ->withOperator('Iusacell')
        )
        ->when(
            $this->subscribing()
        )
        ->then(
            $this->userReceivesBillings($firstBillingMessage)
                 ->forTimes(1)
                 ->merchantIsNotifiedOf('SUBSCRIPTION_ACTIVATED')
                 ->userReceivesBillings($renewalMessage)
                 ->forTimes(3)
        );
}
2nd try: Behat
Given I am a US user
When I subscribe to US_BROWSER_GAME
Then I am billed 3.00 USD
Separate automation from specification, which is one thing that Cucumber-style frameworks force you to do. Also, you cannot write much data inside Gherkin as it's not a structured format, so you have to select the most relevant data, provide sensible defaults, and make the job of the users of your system easier by asking them to fill in only what is written here (no 10-page form).
2nd try: Behat
Ah, decoupling. Apparently it works with testing frameworks too.
(potentially you could ship the Client object as an SDK for PHP)
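A minimal sketch of that decoupling, assuming Behat 3 and a hypothetical OnebipClient class standing in for the real HTTP client: the Context only parses steps and delegates, so the same Client could be shipped as a PHP SDK.

<?php
// Step definitions stay thin; all knowledge of the system's API lives
// in the (hypothetical) OnebipClient.
use Behat\Behat\Context\Context;

class SubscriptionContext implements Context
{
    private $client;
    private $phoneNumber;

    public function __construct()
    {
        $this->client = new OnebipClient('http://example.com');
    }

    /**
     * @Given I am a US user
     */
    public function iAmAUsUser()
    {
        $this->phoneNumber = $this->client->createSimulatedUser('US');
    }

    /**
     * @When I subscribe to :serviceId
     */
    public function iSubscribeTo($serviceId)
    {
        $this->client->subscribe($this->phoneNumber, $serviceId);
    }

    /**
     * @Then I am billed :amount :currency
     */
    public function iAmBilled($amount, $currency)
    {
        $this->client->assertBilled($this->phoneNumber, $amount, $currency);
    }
}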
The renewals problem
Given I am a US user
And I have subscribed to US_BROWSER_GAME a week ago
Then I am renewed with a 3.00 USD billing?
Execution time: 168 hours
Time passing is a Command to the system
Given I am a US user
And I have subscribed to US_BROWSER_GAME
When a week has passed
Then I am renewed with a 3.00 USD billing
Business time API
$ curl -X POST 'http://example.com/subscription/42/clock' \
  -H 'X-Some-Authentication: ...' \
-d 'ticks=168'
{
"now": "2014-05-16T09:30:00Z"
}
Needs to be aggregate-specific, in this case impacting a single subscription or a single user. That opens up the possibility of parallelizing later.
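One possible shape for the server side of that endpoint, sketched under the assumption that each subscription document stores its own clock offset (class name, fields and storage are invented, not the real implementation):

<?php
// Each subscription carries its own business clock, so advancing one
// subscription's time does not affect any other aggregate and tests
// can later run in parallel.
class SubscriptionClock
{
    private $subscriptions;

    public function __construct(MongoCollection $subscriptions)
    {
        $this->subscriptions = $subscriptions;
    }

    // Advance the business time of a single subscription by $ticks hours
    // and return the new "now" for that aggregate.
    public function advance($subscriptionId, $ticks)
    {
        $this->subscriptions->update(
            ['_id' => $subscriptionId],
            ['$inc' => ['clock_offset_hours' => (int) $ticks]]
        );
        $subscription = $this->subscriptions->findOne(['_id' => $subscriptionId]);

        $now = new DateTime('now', new DateTimeZone('UTC'));
        $now->modify(sprintf('+%d hours', $subscription['clock_offset_hours']));
        return $now->format(DateTime::ATOM);
    }
}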
The race condition problem
They expose design problems that may become serious bugs in production under large traffic.
When you compress a payment flow that lasts tens of seconds into a single scenario of 1 second, you get concurrent accesses to the same rows and responses that come too early.
The culprit may not necessarily be the last commits: a problem can emerge because they produce a difference in timing patterns, or because of a change in the CI machine
A test suite where 1 test randomly fails is flaky and may not be trusted
Read high-level logs to find out which interactions are racing
(another example of system TDD driving you towards better auditing)
Once you have found the problem, reproduce it reliably by inserting sleep() calls in the right place to exacerbate the race, as in the sketch below
Sometimes you need support from people from multiple projects to find out the culprit
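A sketch of the sleep() technique on a hypothetical read-modify-write sequence (not the real payment code): widening the window between the read and the write turns a rare lost update into a failure you can reproduce on every run.

<?php
// Hypothetical repository used only to illustrate the technique.
class SubscriptionRepository
{
    private $subscriptions;

    public function __construct(MongoCollection $subscriptions)
    {
        $this->subscriptions = $subscriptions;
    }

    public function activate($subscriptionId)
    {
        $subscription = $this->subscriptions->findOne(['_id' => $subscriptionId]);

        // Temporary, for debugging only: another process (e.g. the operator
        // callback) now has 5 seconds to modify the same document, so the
        // lost update below happens deterministically.
        sleep(5);

        $subscription['status'] = 'ACTIVE';
        $this->subscriptions->save($subscription);
    }
}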
The "integrated tests are slow" problem
$ time vendor/bin/behat features/
... much later
real 121m6.114s
If you run the tests serially, one after the other, you can grow a beard before you have a result.
Discovery: our applications support concurrency
The tests can be designed so that they are isolated from each other, following the boundaries of the Aggregates (Domain-Driven Design jargon) in the system. For example, use different randomly-generated phone numbers for each test, and only assert things that happen to that user such as their credit being affected by the payment and their subscription to a magazine being now active.
Also, any test should be able to run no matter what the state of the database is, so that large batches of setup and teardown operations are not necessary.
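A sketch of that isolation, with invented prefixes and formats: each scenario generates its own phone number, and every assertion is scoped to that user's aggregate.

<?php
// Hypothetical helper: gives every scenario its own phone number so
// tests never share aggregates and no global setup/teardown is needed.
class PhoneNumbers
{
    private static $prefixes = ['US' => '+1', 'MX' => '+52'];

    public static function randomFor($country)
    {
        // 10 random digits; collisions are unlikely enough for a test run
        return self::$prefixes[$country] . sprintf('%010d', mt_rand(0, 999999999));
    }
}

// Usage inside a Behat step definition:
// $this->phoneNumber = PhoneNumbers::randomFor('US');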
The code for parallelization
grep -nHR "Scenario:" features/ | sort -t: -k1,1 -k2,2n \
| cut -d ":" -f1,2 | parallel -P 200% --gnu --halt-on-error=0 \
--keep-order "php bin/behat {}"
by @badkill
Concurrency enables parallelism: if the tests are isolated from each other, you can run them in parallel. Specify a fixed process pool in order not to overload the system (here, 2 processes per CPU).
Final results
X countries with 2 years of simulated time, in X minutes
Configuring new countries, merchants and services with confidence
Stopping regressions while integrating (and deploying) the work of multiple teams and microservices daily
Living documentation (at least for the developers)
Enabling cleanup of legacy code now covered
Simulating the feasibility of a scenario in minutes
Built a sandbox for integration of merchants where they do not spend real money
We did not concentrate on BDD or on business communication; we started by solving our verification problems, selecting good practices coming from experts. They worked, and also had positive spillovers such as offering a sandbox.
Interested?
These are good, bullshit-free practical books that teach how to write good test suites which also have good side-effects on architecture.