Assertions are the go-to checking mechanism in unit tests. However, when applied to testing interfaces, specifically GUIs, I consider them to be toxic. Thankfully, there is a promising alternative.
JUnit was a huge success, being the single most used library in all of Java. And JUnit brought with it the famous Assert.assert… methods. This mechanism is designed to check one thing at a time, in isolation. When testing a single unit, this is the most sensible approach: we want to ignore as much volatile context as possible and focus, ideally, on checking only a single aspect of the unit under test. This creates maximally durable tests. If a test depends only on a single aspect of the code, then it only needs to change if that aspect changes. Assertions are a natural and intuitive mechanism to achieve that. Since such a test sits “inside” the software, where practically all of the internals are exposed in one way or another, anything else would hardly have made sense.
Because of its success, JUnit is considered the state of the art of test automation—and rightfully so. As such, its mechanisms were also applied to non-unit testing, i.e. to interface testing (e.g. GUI testing). And intuitively, this makes sense: as the individual features stack up towards the interface, the interface becomes very volatile. Testing only individual aspects of the system seems to solve this problem.
Except that it doesn’t. It is already hard, albeit still feasible, to achieve that degree of separation on the unit level. On the interface level, where integration is inevitable, it is outright impossible. And practice shows exactly that. One of the reasons for the shape of the famous test pyramid is that tests on that level tend to break often and require a lot of maintenance effort.
Imagine that you want to test a single aspect of the code—the calculation of the number of items a single user has ever bought. On the unit level, all you need is a user object and some associated items or transactions. Depending on the complexity of the system, you can create these objects either on demand or mock them. Then you can test just the code that counts the items.
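As a rough sketch of such a unit test (the domain classes and the counting method here are hypothetical, chosen purely to illustrate the idea), it might look like this:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class PurchaseCountTest {

    // Hypothetical, minimal stand-ins for the real domain classes.
    static class Transaction {
        final List<String> items;
        Transaction(List<String> items) { this.items = items; }
    }

    static class User {
        final List<Transaction> transactions;
        User(List<Transaction> transactions) { this.transactions = transactions; }
    }

    // The single aspect under test: counting all items a user has ever bought.
    static int totalItemsBought(User user) {
        return user.transactions.stream().mapToInt(t -> t.items.size()).sum();
    }

    @Test
    public void countsAllItemsEverBoughtByUser() {
        User user = new User(Arrays.asList(
                new Transaction(Arrays.asList("book", "pen")),
                new Transaction(Arrays.asList("lamp"))));

        // No persistence, no login, no GUI -- just the counting logic.
        assertEquals(3, totalItemsBought(user));
    }
}
```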
However, on the GUI level, you first need to log into the system with an existing user. Then you need to navigate to the page where the relevant information is shown. So even if you create only a single assertion to check the number of items, your test still depends on a working persistence layer, on a predefined state (e.g. the user existing with the correct number of items), on the ability of the user to log in, and on the navigation. How well is this test isolated?
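A Selenium-based version of the same check makes these dependencies visible. This is only a sketch: the URL, the element locators and the seeded data are made up, and it assumes a running system plus a matching chromedriver.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class PurchaseCountGuiTest {

    @Test
    public void showsNumberOfItemsBought() {
        WebDriver driver = new ChromeDriver();
        try {
            // Depends on a deployed system with seeded data ...
            driver.get("https://shop.example.com/login");
            // ... on the login page and the login flow ...
            driver.findElement(By.id("username")).sendKeys("alice");
            driver.findElement(By.id("password")).sendKeys("secret");
            driver.findElement(By.id("login")).click();
            // ... and on the navigation to the profile page ...
            driver.findElement(By.linkText("My profile")).click();
            // ... before the single assertion is ever reached.
            assertEquals("3", driver.findElement(By.id("itemsBought")).getText());
        } finally {
            driver.quit();
        }
    }
}
```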
In an integrated test, it is basically impossible to ignore context. Involuntarily, we always depend on numerous aspects that have nothing to do with what we want to test. We suffer from the multiplication of effects. This is the reason for the famous test pyramid. However, if we cannot ignore context, maybe we should embrace it instead?
Imagine, just for a second, we could somehow mitigate the multiplication of effects. Then we could check the complete state of the system instead of individual aspects. We could check everything at once!
So, because interfaces are fragile, we now want to include more context, making our tests even more fragile? Because instead of depending on single aspects, the test now depends on everything at once? Who would want that? Well … everybody who wants to know if the interface changed. If you think about it, the same question applies to version control. A version control system is a system in which, every time you change just about anything in any file, you have to manually approve that change. What a multiplication of efforts! What a waste of time! Except that not using one is a very bad idea.
True for both Manual and Automated Test Execution
Because people change things all the time without meaning to do so. And they change the behaviour of the system without meaning to do so. Which is why we have regression tests in the first place. But sometimes we really do want to change the behaviour. Then we have to update the regression tests. Actually, regression tests are a lot like version control.
With the mindset that software changes all the time, an assertion is just a means to detect a single such change. So writing assertions is like blacklisting changes. The alternative is to check everything at once, and then permanently ignore individual changes—effectively whitelisting them.
Whitelisting of changes vs. blacklisting of changes
When creating a firewall configuration, which approach would you rather choose? Blacklisting (i.e. “closing”) individual ports or whitelisting (i.e. “opening”) individual ports? Likewise with testing … do you want to detect a change and later recognise that it isn’t problematic, or would you rather ignore all changes except the ones for which you manually created checks? Google introduced whitelist testing, because they didn’t want to miss the dancing pony on the screen again. Whitelisting means to err on the side of caution.
Tools for pixel-based comparison aka visual testing
Of course, I am not the first one to come up with this idea. In his book Working Effectively with Legacy Code, Michael Feathers called this approach characterization testing; others call it Golden Master testing. Today, there are two possibilities: pixel-based and text-based comparison. Because pixel-based comparison (often called visual regression testing) is easy to implement, there are many tools for it. For text-based comparison, there are essentially two specific testing tools: ApprovalTests and TextTest. But both pixel-based and text-based approaches suffer from the multiplication of effects.
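To give a flavour of text-based comparison with ApprovalTests: instead of asserting individual aspects, the whole output is compared against a previously approved “golden master” file, and any difference fails the test until it is explicitly approved. The renderProfilePage helper below is hypothetical; this is just a sketch of the pattern.

```java
import org.approvaltests.Approvals;
import org.junit.Test;

public class UserProfilePageTest {

    @Test
    public void profilePageLooksAsApproved() {
        // Hypothetical helper that returns the page content as text
        // (e.g. an HTML or component-tree dump of the system under test).
        String page = renderProfilePage("alice");

        // Compare the complete output against the approved file;
        // every change must be reviewed and approved once.
        Approvals.verify(page);
    }

    private String renderProfilePage(String user) {
        // ... would call into the system under test ...
        return "user: " + user + "\nitems bought: 3\n";
    }
}
```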
On the GUI level, many things depend on one another, because isolation is not really possible. Imagine you wrote automated tests naively, as a series of actions. Then, if someone changed the navigation or the login screen, this single change would most likely affect each and every test. This way, the implicit or explicit dependencies of the tests potentially cause a multiplication of the effects of a single change.
How can we contain that multiplication of effects? One possibility is to create an additional layer of abstraction, as is done with page objects or object maps. But this requires manual effort in advance, in order to later reap the reward of reduced maintenance if the anticipated change actually happens. According to YAGNI, implementing that abstraction “just in case” is actually a bad thing to do.
What other possibilities do we have to contain the multiplication of effects? When doing refactorings in programming, we find ourselves in the same situation. A method is probably called in dozens or even hundreds of places. So when renaming a single method (please only do that in internal, not-exposed APIs), we also need to change every place where that method is called. In some cases we can derive these places from the abstract syntax tree. In other cases (properties files, documentation, …) we have to rely on text-based search and replace. If we forget or overlook something, this often shows only in certain situations—usually when executing the software. But for tests, this is different. Because tests, by definition, are already executing the software. So all the places where something changed are revealed to us (i.e. by failing tests). Now we just need a mechanism to “mass-apply” similar changes.
There are two different kinds of changes: differences in layout and differences in flow.
If, for instance, the login button is now called “sign in”, or has a different internal name, XPath or xy-coordinates, this is a difference in layout. Differences in layout are relatively easy to address with object maps.
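A minimal object map could look like the following sketch (the locators are made up). The point is that tests refer to elements by a logical name, so a changed label, internal name or XPath only needs to be fixed in one place.

```java
import org.openqa.selenium.By;

// A minimal object map: tests refer to elements by logical names, so when the
// login button gets a new label, internal name or XPath, only this one entry
// needs to be updated.
public final class ObjectMap {

    public static final By LOGIN_BUTTON   = By.id("login");
    public static final By USERNAME_FIELD = By.name("username");
    public static final By PASSWORD_FIELD = By.name("password");

    private ObjectMap() {}
}
```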
But, surprisingly, differences in layout are also relatively easy to address if we have more context. If we know the whole puzzle instead of only individual pieces, we can create one-to-one assignments. This makes for very robust object recognition.
Imagine we have a form to which some elements are added. And we want to recognize the “Accept” button to submit the form. If everything about the button changes, we can still recognize it, based on a one-to-one assignment of the remaining unused UI components.
And mass-applying these changes is also easy. We can just apply every similar change at once, e.g. combine all instances of the change from “Accept” to “Save” into a single change that needs to be reviewed only once.
With such a strong mechanism, redundancy is suddenly not a problem anymore. So we can collect many attributes of our UI components, making our recognition of them even more robust.
So we can gather the XPath, name, label and pixel coordinates. If some of the values change, we still have the remaining values to identify the element. And mass-applying changes keeps this easy to maintain.
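A simple sketch of that kind of multi-attribute recognition (not the actual implementation, just the idea): the element is identified by the candidate that matches the most of its recorded attributes, so a single changed attribute does not break recognition.

```java
import java.util.Map;

public class ElementMatcher {

    /** Counts how many recorded attributes (xpath, name, label, x, y, ...) still match. */
    static int score(Map<String, String> recorded, Map<String, String> candidate) {
        int matches = 0;
        for (Map.Entry<String, String> attribute : recorded.entrySet()) {
            if (attribute.getValue().equals(candidate.get(attribute.getKey()))) {
                matches++;
            }
        }
        return matches;
    }

    /** Picks the candidate on the current screen that best matches the recorded element. */
    static Map<String, String> bestMatch(Map<String, String> recorded,
                                         Iterable<Map<String, String>> candidates) {
        Map<String, String> best = null;
        int bestScore = -1;
        for (Map<String, String> candidate : candidates) {
            int s = score(recorded, candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }
}
```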
Sometimes, the use cases or internal processes of the software change. These can be minor changes (e.g. an additional step is required—filling in a captcha or resetting a password). Sometimes they are major changes—a workflow changes completely. In the latter case, it is probably easier to rewrite the tests. But this happens rarely. More often, we just need to slightly adapt the tests.
Differences in flow cannot be addressed by object maps. Instead, we need other forms of abstraction: extracting recurring flows as “functions” or “procedures” and reusing them. This can be achieved with page objects, but it requires manual effort and the right abstraction.
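For example, with a page object the login flow is extracted once and reused by every test, so a change in the flow only needs to be fixed in one place. Again a hypothetical Selenium-based sketch with made-up URL and locators:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

// A page object extracting the recurring login flow as a reusable procedure.
public class LoginPage {

    private final WebDriver driver;

    public LoginPage(WebDriver driver) {
        this.driver = driver;
    }

    public void loginAs(String user, String password) {
        driver.get("https://shop.example.com/login");
        driver.findElement(By.name("username")).sendKeys(user);
        driver.findElement(By.name("password")).sendKeys(password);
        driver.findElement(By.id("login")).click();
    }
}
```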
Instead, I propose a different approach: passive update. What do I mean by that? Traditionally, we have to actively identify all the occurrences of a specific situation in the tests and update them manually. So if we need to adjust the login process, we have to find all the instances where the tests log in. Then we have to change each of them manually. This is active update.
Passive update means to instead specify the situation that needs updating, together with a rule about how to update it. So instead of finding all the login attempts, we specify the situation: the login page is filled with credentials and the captcha is showing. Then we add a rule about how to update a test script that finds itself in that situation — filling in the captcha. We do this by deleting or inserting individual actions, or a combination thereof. That update is then applied passively, upon execution of the tests. This means we are essentially turning the extraction of a procedure on its head.
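A very simplified sketch of the idea (the representation of actions as plain strings and the rule shape are hypothetical): one rule describes the situation and the change to the recorded actions, and it is applied at execution time to every test that runs into that situation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PassiveUpdateSketch {

    /** Rule: if the login button is about to be clicked while a captcha is showing,
        insert the missing "fill captcha" step first. */
    static List<String> applyCaptchaRule(List<String> actions, boolean captchaShowing) {
        List<String> updated = new ArrayList<>();
        for (String action : actions) {
            if (captchaShowing && action.equals("click login")) {
                updated.add("fill captcha");
            }
            updated.add(action);
        }
        return updated;
    }

    public static void main(String[] args) {
        List<String> recorded = Arrays.asList(
                "enter username", "enter password", "click login", "open profile");
        // Applied passively at execution time, without editing the recorded test.
        System.out.println(applyCaptchaRule(recorded, true));
        // -> [enter username, enter password, fill captcha, click login, open profile]
    }
}
```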
This approach has various advantages. Being able to address the multiplication of effects allows us to embrace the whole context of a test, rather than trying to ignore it. It promises to make test automation and result checking both more powerful and more robust.
We have already implemented this approach for Java Swing. Now we want to create an open source tool to foster widespread adoption. Any support is highly appreciated — give us feedback, back us or spread the word.
Thank you!