
Detox: Gray Box End to End Testing Framework for Mobile Apps

by Rotem Mizrachi-Meidan, May 5th, 2017

End-to-end tests are at the tip of the testing pyramid. They are supposed to give the most confidence that the system under test works, but with most end-to-end testing frameworks we find ourselves fighting “flaky” tests and ending up not trusting the test suite. We hope we can change that with Detox.

Inception

Our story starts with the Wix App, our official native iOS/Android app.

It’s written from scratch in React Native
Started working on it a bit more than 18 months ago (March 2016)

In terms of engineering efforts:

The app is a cross-company effort. It currently incorporates code from 6 different product groups.
There are currently 40 developers working on this project or supporting it with infrastructure tools.
The majority of the code is written in JavaScript

Because Google Play and the Apple App Store are our means of distribution, our releases are inherently not continuous deployment, so we have a release train (2 platforms every week). But the distribution mechanism is not the real reason we don’t do true CD.

We rely on manual testing, a lot!

Currently, the full regression QA test suite contains 300 tests and takes 14 person-days to run. Since it’s so big, we can’t finish it in time for the next release, so we only run ~70 of the tests, which still takes a long time: 3 person-days on one device.
In fact, if we ran the entire suite on both platforms, with just two OS versions on each, we’d end up with 56 person-days (2 platforms x 2 OS versions x 14 person-days) for a full regression. But it gets even worse.

This ice cream cone is a software testing anti-pattern.

QA Doesn’t Scale

The QA test suite will always grow, meaning that even if development continues at the same pace, QA will have additional work on each release, so we’ll either need to hire more QA engineers or give up on some tests.
Mobile development at Wix grows around 25% per quarter, and the pace is increasing.

Let’s take a simplified example: if development goes at a constant pace of 2 features or bug fixes per week, then QA will have two additional tests each week, meaning that in week 1 they will have 2 tests, and by week 7 the suite will be 7 times larger.

Add a growing product to the mix, which needs to hire more developers and increase the development rate, and the QA test suite just explodes. A higher rate of features means that the QA regression suite grows even faster.

The graph on the bottom shows the number of newly developed features every week; the graph on the top shows the number of tests in the QA test suite.
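
The growth described here can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes every feature or bug fix adds exactly one manual test, and the "growing team" pace (one extra feature each week) is a made-up illustration, not Wix data.

```javascript
// Cumulative size of the QA regression suite, assuming every new feature or
// bug fix adds exactly one manual test.
function suiteSizes(weeks, featuresInWeek) {
  const sizes = [];
  let total = 0;
  for (let week = 1; week <= weeks; week++) {
    total += featuresInWeek(week);
    sizes.push(total);
  }
  return sizes;
}

// Constant pace: 2 features or bug fixes every week.
const constantPace = suiteSizes(7, () => 2);

// Growing team (hypothetical): the weekly pace itself rises by one feature each week.
const growingPace = suiteSizes(7, (week) => 2 + (week - 1));

console.log(constantPace); // [2, 4, 6, 8, 10, 12, 14]
console.log(growingPace);  // [2, 5, 9, 14, 20, 27, 35]
```

Even at a constant pace the suite in week 7 is 7 times its week 1 size; once the pace itself grows, the suite grows faster than linearly.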

Automated tests are the future!

That’s nothing new though…

We don’t want to hire an army of people to do manual QA. We want automated tests, with a modern continuous workflow that runs on CI and a very short development feedback loop, so that if all tests are green we have all the confidence we need to release a new version.

This is the testing pyramid. Since you already know why tests are so important and understand the different types of tests, we’ll focus on how we test each type rather than explaining what they are. Let’s break down E2E into two parts: pure UI automation (meaning, not testing external services), and full E2E, mimicking a user with real server data. Both must run on a device or a simulator.

Let’s focus on mobile development, and React Native specifically.

What do we know how to do?

**Unit testing**: Business logic is mostly in JS, so it’s easy to test on Node with Jest. React Native, like React, uses the Flux architecture to control app data flow. One of the most popular Flux implementations is Redux, which we use app-wide. Although Redux is widely popular, we never felt comfortable unit testing Redux apps, so we’ve developed methodologies and a test kit for testing them; check out redux-testkit for more information. Another popular Flux implementation is MobX, which is much less opinionated than Redux and has great testing capabilities. We’ve created an opinionated flavor of it, Remx, to make it easier for our engineers. Remx can be tested quite easily: unit tests can be vanilla JavaScript, totally unaware of the underlying implementation. We will add more information about Remx in the near future.
**Component testing**: These tests also run on Node. We rely on Enzyme by Airbnb, and use Enzyme Drivers to help with mocking.
**UI Automation/End to End**: But what do you do about end-to-end tests? These tests give the most confidence because they’re pretty much a robot running your app on a device. Maintaining an end-to-end test suite is hard, and it isn’t as reliable as the others. But why?

Flakiness

E2E tests are often considered flaky on all platforms: web, iOS and Android.

Tests may fail for no apparent reason, even without code changes.
Tests are nondeterministic. There are many moving parts inside the app, and they may finish executing in a different order on different runs.
We can’t really be sure when the application is idle, since it is unclear when the app has finished handling a user interaction.
Users of E2E frameworks often have to deal with synchronization manually, so they find themselves adding multiple sleeps in strategic locations just to make tests pass.

Manual Synchronization

Manual synchronization is so common that we incorporate it into our testing frameworks’ infrastructure: API calls are filled with loops containing sleep() functions. This is an example I took from Aaron Greenwald’s talk at React Amsterdam; it’s an actual piece of code we used to test our React Native app on our previous testing framework.

sleep(a_lot);

How unreliable is a flaky test suite?

To understand how big a problem flakiness is, let’s calculate the probability of a test suite failing, assuming tests fail independently of each other.

q: probability of a single test failing
n: number of tests

1-q is the probability of a single test succeeding.
(1-q)^n is the probability of the entire suite succeeding.
1-(1-q)^n is the probability of at least one test failing.

If a test is flaky 0.5% of the time (q = 0.005):

With 20 tests: 1-(1-0.005)^20 ≈ 9.5%

With 50 tests: 1-(1-0.005)^50 ≈ 22.2%

With 100 tests: 1-(1-0.005)^100 ≈ 39.4%

You get the point, very unreliable…
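
To try the formula with other numbers, here is a small sketch. It assumes, as the formula does, that tests fail independently of each other; the function name is ours.

```javascript
// Probability that at least one of n independent tests fails,
// where q is the probability that a single test fails.
function suiteFailureProbability(q, n) {
  return 1 - Math.pow(1 - q, n);
}

for (const n of [20, 50, 100]) {
  const p = suiteFailureProbability(0.005, n);
  console.log(`${n} tests: ${(p * 100).toFixed(1)}% chance of a red build`);
}
// 20 tests: 9.5% chance of a red build
// 50 tests: 22.2% chance of a red build
// 100 tests: 39.4% chance of a red build
```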

Past Experience

So, this is a complex issue… and we’ve had experience with a few frameworks in the past.

**Appium**: The most popular solution out there, and the de facto standard in the industry. We also checked what other companies with mobile products do about end-to-end tests, and found that many don’t even have automation, and those who do use Appium. Internally, Appium’s driver is implemented using Instruments (iOS) and UIAutomator (Android), which are essentially external ways to interact with the device, just like a user.

We used Appium for 2 years in general and for 8 months with React Native, and found that we spent an unreasonable portion of our time writing tests and petting the system rather than actually writing features.

We found that End to End testing is really hard:

Tests are flaky: we got different results on different machines and frequent failures in CI, and the only solution was adding sleeps, which slowed the tests down.

Tests were already slow, since Apple’s UIAutomation tool is limited to performing one action per second. There is a hack that removes this cap, Instruments without delays, but it is already unmaintained, and after each release of a new Xcode we would have to wait for a patch before upgrading.

**Magneto**: It is also worth noting Magneto, an E2E testing framework for Android only, built by Everything.me (where I previously worked) with UIAutomator as the main driver.

It was much more stable, but we still could not eradicate flakiness.
We were 12 mobile developers, and one developer was dedicated to petting the framework and CI system.
We saw ~5–10% false negatives.

Other frameworks, like Robotium and Calabash, are no longer under active development.

The main resemblance between these frameworks is that they are all black box testing frameworks.

Black Box Testing

A box

What is black box testing? It’s a method of testing stuff from the outside, without knowing what’s going on internally. In mobile, black box E2E frameworks essentially go over the view hierarchy and look for an element (if it’s not found, they sleep and keep looping in this manner until a certain timeout), then interact with that view. The same principles apply in web black box E2E.

Now, think how unfair it is to ask users to provide this timeout: they have no idea what’s going on inside the operating system, or even inside the application. That is the main cause of flakiness.
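
The find, sleep and retry loop looks roughly like the sketch below. It is not taken from any specific framework: `findElement` is a hypothetical stand-in for a view hierarchy lookup, and `timeoutMs` is the number the test author is forced to guess.

```javascript
// Hypothetical black box helper: poll for an element until a user-supplied timeout.
// `findElement` stands in for a view hierarchy lookup; it is not a real framework API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForElement(findElement, timeoutMs, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const element = findElement();
    if (element) return element; // found: the test may now interact with it
    await sleep(intervalMs);     // not found: sleep, then look again
  }
  throw new Error(`Element not found within ${timeoutMs} ms`);
}
```

A timeout that is too short fails the test on a slow CI machine; one that is too long makes every real failure take that long to report. The loop has no way of knowing which case it is in.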

Black Box Testing + React Native

E2E gets even more flaky when used on react-native apps…

Rendering

On native apps, there’s only one thread responsible for rendering the UI (the main thread). With React Native it’s a bit trickier: its unique architecture adds complexity to the system. UI rendering starts at the reconciler, which calculates which parts of the UI have changed. This is done on the JavaScript thread, and the result is then passed over an asynchronous bridge and translated into native instructions for the main thread to render a real layout.

Because of this asynchronous rendering mechanism, there are now two threads controlling the rendering, so black box testing frameworks have even greater trouble controlling React Native apps.

Loading and parsing the bundle

When a React Native app starts, it loads a bundle, either from a local packager server or from an asset on the device. In either case this is an asynchronous process that takes an undetermined amount of time. A black box testing framework needs to sleep during this process as well, but for how long? There is no real answer.

Black box was a dead end; we needed a different approach…

Detox

Gray box, not black box

Detox does gray box, not black box, to allow the test framework to monitor the app from the inside and actually synchronize with it.

Gray box essentially uses a piece of code planted in the app, which helps us see what’s going on inside.

Unlike black box, gray box runs in the same process, has access to memory, and can monitor the execution process. Being able to read internal memory gives it the ability to detect what’s happening inside the process: whether there are network requests in flight, and whether the main thread is idle, other threads are idle, animations have ended, and the React Native bridge is idle. It can execute on the main thread, to make sure that nothing in the UI hierarchy changes while it performs actions.

But there are also downsides. Usually, when testing with gray box frameworks, the app goes through a different compilation/running process, since it needs extra code that is executed from inside the process. For us, it was worth accepting this trade-off for the huge value we get in return.

Uses EarlGrey and Espresso

The leading native gray box drivers are developed by Google — EarlGrey for iOS and Espresso for Android. These frameworks can synchronize with the application, making sure to only interact with the app when it’s idle.

The underlying synchronization mechanism used in these gray box frameworks works in the following way.

Instead of retrying an action/expectation on the UI, they query internal resources every few milliseconds, or listen to callbacks from them announcing that they have switched to idle mode. The test will not continue until all of them report idle, and only then, when the app is idle, will it interact with the UI.
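
As a rough illustration of the idea, here is a toy model of idling resources in plain JavaScript. The class and method names are hypothetical and do not mirror the EarlGrey or Espresso APIs; the sketch only shows the "wait until everything reports idle, then act" pattern, using the callback style.

```javascript
// Toy model of idling-resource synchronization (hypothetical names throughout).
class IdlingResource {
  constructor(name) {
    this.name = name;
    this.busyCount = 0;
    this.listeners = [];
  }
  isIdle() { return this.busyCount === 0; }
  begin() { this.busyCount++; }
  end() {
    this.busyCount--;
    if (this.isIdle()) {
      // Callback style: tell waiters that this resource has switched to idle.
      const waiting = this.listeners;
      this.listeners = [];
      waiting.forEach((notify) => notify());
    }
  }
  onIdle(notify) { this.listeners.push(notify); }
}

class Synchronizer {
  constructor(resources) { this.resources = resources; }

  // Resolves only once every resource reports idle at the same check.
  async waitForIdle() {
    for (;;) {
      const busy = this.resources.find((r) => !r.isIdle());
      if (!busy) return;
      await new Promise((resolve) => busy.onIdle(resolve));
    }
  }

  // Run a UI action only when the app is idle.
  async perform(action) {
    await this.waitForIdle();
    return action();
  }
}

const network = new IdlingResource('network');
const animations = new IdlingResource('animations');
const synchronizer = new Synchronizer([network, animations]);

network.begin();                     // a request is in flight
setTimeout(() => network.end(), 50); // ...and completes 50 ms later
synchronizer.perform(() => console.log('tap: app is idle')).catch(console.error);
```

The test never guesses a timeout here: the tap runs as soon as the last busy resource reports idle, and not before.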

Idling Resources

Does not rely on WebDriver

Detox does not rely on WebDriver, since this is not the web. Detox communicates with its native driver (which extends EarlGrey and Espresso) using a JSON-based reflection mechanism. This allows a common JavaScript implementation to invoke native methods directly on the device.

Simple API

Protractor-like API, written in JavaScript.
Minimal boilerplate and a very small configuration process.
Cross platform: test code is unaware of the platform it tests, so it can be shared between platforms.
Synchronized: no need to manually sync tests with the app. Detox is inherently synchronized and executes its commands only when the app is idle. No more sleeps!
Debuggable: using native constructs such as modern async-await, instead of putting everything in a promise queue, means that breakpoints work as expected.

A simple login flow test written with Detox
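
A login flow test could look roughly like the sketch below. The testIDs, credentials and 'Welcome' text are hypothetical and depend on the app under test. In a Detox project `describe`, `beforeEach` and `it` come from the test runner, and `device`, `element`, `by` and `expect` are provided by Detox as globals; they are passed in as parameters here only so that the sketch has no hidden dependencies.

```javascript
// Sketch of a login flow spec (hypothetical testIDs and texts).
function defineLoginSpec({ describe, beforeEach, it, device, element, by, expect }) {
  describe('Login', () => {
    beforeEach(async () => {
      // Start every scenario from a freshly loaded React Native bundle.
      await device.reloadReactNative();
    });

    it('should show the welcome screen after a successful login', async () => {
      await element(by.id('emailInput')).typeText('user@example.com');
      await element(by.id('passwordInput')).typeText('correct-horse');
      await element(by.id('loginButton')).tap();
      // No sleeps: each step runs only once the app is idle.
      await expect(element(by.text('Welcome'))).toBeVisible();
    });
  });
}
```

In a real spec file the body of `defineLoginSpec` would sit at the top level, since the runner and Detox already expose those names.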

React Native support

Detox is built from the ground up for native mobile and has deep first-class support for React Native apps.

We found out that React Native pretty much reimplements iOS and Android, so apart from the basic synchronization support of EarlGrey and Espresso for native apps, we had to create special synchronization mechanisms for React Native as well.

Evaluating expectations on the device

Traditionally, test frameworks evaluate expectations in the test script running on the computer. Detox evaluates expectations natively, directly in the tested app running on a simulator. This enables operations that were previously impossible because of scope or performance limitations.

How Detox Works

Let’s take a look at a high-level diagram; hopefully it will help us understand how Detox works.

Test Runner: Execution of an action or expectation (awaiting on a promise)
Tester: Expectation being serialized into a nested invocation JSON
Server: relaying a message
Testee: Invocation of EarlGrey through method reflection
Invocation will only execute when app is idle
Testee: Invocation result is sent back through websocket
Tester: resolve/reject the expectation promise
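
The steps above can be sketched in a few lines. The JSON shape used here (`{ target, method, args }`) and the `Matchers` and `Screen` objects are assumptions made for illustration; they are not Detox's actual wire format or EarlGrey classes. The point is only to show a nested invocation being serialized on the tester side and resolved by name on the testee side.

```javascript
// Tester side: describe a native call as plain JSON (hypothetical shape).
function call(target, method, ...args) {
  return { target, method, args };
}

// Testee side: resolve nested invocations depth-first, then call the method by name.
function invoke(invocation, registry) {
  const resolve = (value) =>
    value && typeof value === 'object' && 'method' in value
      ? invoke(value, registry)
      : value;

  const target =
    typeof invocation.target === 'string'
      ? registry[invocation.target]
      : resolve(invocation.target);
  const args = invocation.args.map(resolve);
  return target[invocation.method](...args);
}

// Hypothetical registry standing in for the native classes inside the app.
const registry = {
  Matchers: { byId: (id) => ({ matches: (view) => view.id === id }) },
  Screen: {
    views: [{ id: 'loginButton', visible: true }],
    assertVisible(matcher) {
      return this.views.some((view) => matcher.matches(view) && view.visible);
    },
  },
};

// This string is what would travel from the tester, through the server, to the testee.
const message = JSON.stringify(
  call('Screen', 'assertVisible', call('Matchers', 'byId', 'loginButton'))
);

console.log(invoke(JSON.parse(message), registry)); // true
```

Because the message is plain data, the same JavaScript test code can drive any platform whose testee knows how to resolve the names.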

For more in-depth information on how Detox works, visit the docs.

UI Automation with Detox

Let’s get back to our testing pyramid.

So we now have a stable end-to-end testing framework. But tests may still be flaky due to network and server issues.

To fix that, we need to remove the tests’ dependency on the network. With expected requests and responses served in a consistent, well-timed manner, we get pure UI automation (hermetic UI tests).

react-native-repackager

react-native-repackager is a mocking mechanism for our React Native JS code. It extends the packager’s ability to override bundled files with any other file, creating an easy way to mock environments in React Native.

So you can create your own pre-packaged responses or point your endpoints to a local mock server, which helps a lot with separating your testing concerns.
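
As a minimal sketch of the idea, the two objects below stand for two files with the same interface: the real module and the one bundled in its place for hermetic UI tests. The file names, the endpoint URL and the response shape are all hypothetical, and how the override is wired up is left to react-native-repackager's own documentation.

```javascript
// api.js (hypothetical): the real implementation, which talks to the server.
const realApi = {
  fetchSites: () => fetch('https://example.com/api/sites').then((res) => res.json()),
};

// api.mock.js (hypothetical): same interface, canned response, no network.
const mockApi = {
  fetchSites: () => Promise.resolve([{ id: 1, name: 'My Test Site' }]),
};

// App code only ever calls the shared interface, so it cannot tell which
// implementation was bundled.
function loadSiteNames(api) {
  return api.fetchSites().then((sites) => sites.map((site) => site.name));
}
```

With the mock in place, a UI test always sees the same data at the same speed, so any failure points at the app or the test rather than the network.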

react-native-repackager turns Detox into a UI automation framework as well. The pyramid is all green, no excuses :) We can start testing!

Detox in Action

The surprising thing is that not only is gray box more stable than black box, it’s also much faster. There are no more sleeps or waitUntil calls; code executes the millisecond the app becomes idle, so it’s about 5–10 times faster than black box solutions. In fact, it’s so fast that it runs the full Detox test project suite (79 tests) in 4 minutes.

Detox’s own test suite running on an iOS Simulator

These are Detox’s own E2E tests, which are of course written with Detox.
The simulator running the app is on the right.
The console running our tests is on the left.
We use Mocha as our test runner but you can use whatever you like; people have already set up Detox successfully with Jest.
Every line you see in the console is a new isolated test scenario, so it restarts everything from the beginning, and can be sharded in the future.
As you can see, it’s pretty fast.

Cross platform, right? Where is Android?

Many have been asking about Detox support for Android. This feature is eagerly anticipated inside Wix as well. Now, open source is a wonderful thing: it can form collaborations that take projects to the next level. A few months back we were contacted by Simon Rácz from KPN (a major telecom company in the Netherlands), who offered to help with Detox for Android. Since then he has practically become a member of the Detox team, implementing key features in our upcoming Android support.

Let’s see how it looks

Detox’s own test suite running on an Android Emulator

This is the same test suite we have been using to test our iOS implementation. It was virtually untouched, which is how we could ensure our API is truly cross platform. Detox for Android is almost ready; in fact, very few things are missing. For more details on our Android release, keep an eye on our releases page on GitHub.

Detox is a TDD project

It has 100% code coverage, and builds are set to fail if coverage drops below that.
Detox’s E2E API is tested with Detox: we run the entire Detox API on a special test app on every build.
It is designed to accept contributions: builds run on Travis CI, and only contributions that meet the standard are accepted. We try to stay very open and engaged with the community, and are very happy to receive any kind of help.

wix/detox: Gray Box E2E Tests and Automation Library for Mobile Apps (github.com)

I would like to thank the team members behind Detox: Leo Natan, Tal Kol, Sergey Ilyevsky, Simon Rácz, Elad Bogomolny, Daniel Schmidt, and all of our other internal and external contributors. Thank you guys, you’re awesome!

Summary

Our initial mission was to create a framework we can trust, one that gives us confidence that when builds are green we can release our new version, and with that to build a real continuous deployment workflow. To achieve that, we needed to change our state of mind: there is no such thing as a flaky test. Either there’s a bug in the app or the test framework is lacking. Our top priority is to fight flakiness, but this is not a very easy task.

We’re very excited about Detox, and we hope it will be as useful to others as it is to us.