Selenium testing: a new hope

Part I. Problem and first solutions.

The Selenium project, launched in 2004, has become an industry standard for browser automation. However, if your QA department is relatively large, sooner or later you will run into the limitations of the recommended Selenium architecture. In this article I would like to show you how to build a scalable and fault-tolerant Selenium solution easily.

Problem

Selenium's architecture has changed radically several times since 2004, when its first prototype was created. The current architecture, introduced in the 2.0 branch, is called Selenium Grid. It works as follows:

Usually a cluster consists of two daemon applications: Selenium Hub and Selenium Node. A hub is an API that handles user requests and redirects them to the respective nodes. A node is the actual request executor: it launches browser processes and sends them the requested test steps. In theory an unlimited number of Selenium Nodes can be connected to one Selenium Hub, and every node can launch any installed browser. But what happens in practice?

This architecture has a single point of failure. Selenium Hub is the only access point to the browsers. If it goes down or stops responding, all browsers become unavailable. The same happens if the datacenter hosting the hub loses power or its network fails.
Selenium Grid does not scale well. Our 5+ years of experience running Selenium clusters show that even under moderate load a hub can work with only a limited number of connected nodes. Depending on the hardware, even a few dozen connected nodes can dramatically increase hub response time.
No quota functionality. You can’t create users and limit how many browsers each of them consumes.

Solutions

The simplest scalable approach is to use multiple Selenium Hubs distributed across multiple datacenters. However, standard Selenium libraries can only work with one Selenium Hub, so we need to teach them to work with such a distributed system.

Client-side load balancing

The first approach, which we used successfully several years ago, was a client library that did client-side load balancing. This is how it works:

We launch multiple Selenium Hubs and respective Nodes in multiple datacenters.
A list of hub hostnames with their supported browsers is saved to a file.
A Selenium user adds a small client library as a dependency to their tests and requests a Selenium session through it.
The library reads the file with hubs and randomly selects one that has the desired browser. Then it requests a browser using the standard Selenium client.
If the session is created successfully, the test steps start executing. Otherwise the library tries another hub host until a session is created. Different hubs can provide different numbers of browsers, so to distribute load uniformly we assign a weight to each hub host and select hosts in proportion to their weights.
If the client fails to create a session on every hub in the list, it should throw an error.
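The steps above can be sketched in a few lines of Java. This is only a minimal illustration, not our actual library: the class name WeightedHubFinder, the Hub type and the createSession callback (which stands in for the real Selenium session request) are all hypothetical, and weights are assumed to be positive integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Function;

// Hypothetical sketch of a client-side hub finder; not part of Selenium.
final class WeightedHubFinder {

    // A hub host and its weight (for example, the number of browsers it offers).
    static final class Hub {
        final String url;
        final int weight;
        Hub(String url, int weight) { this.url = url; this.weight = weight; }
    }

    // Picks one hub at random, with probability proportional to its weight.
    static Hub pickWeighted(List<Hub> hubs, Random random) {
        int total = 0;
        for (Hub hub : hubs) total += hub.weight;
        int ticket = random.nextInt(total); // 0 <= ticket < total
        for (Hub hub : hubs) {
            ticket -= hub.weight;
            if (ticket < 0) return hub;
        }
        throw new IllegalStateException("unreachable when all weights are positive");
    }

    // Tries hubs in weighted random order until createSession yields a session.
    // createSession is a stand-in for the real Selenium session request.
    static <S> S findSession(List<Hub> hubs,
                             Function<Hub, Optional<S>> createSession,
                             Random random) {
        List<Hub> remaining = new ArrayList<>(hubs);
        while (!remaining.isEmpty()) {
            Hub hub = pickWeighted(remaining, random);
            Optional<S> session = createSession.apply(hub);
            if (session.isPresent()) return session.get();
            remaining.remove(hub); // do not retry a hub that has just failed
        }
        throw new IllegalStateException("No hub could create a session");
    }
}
```

A failed hub is removed from the candidate list before the next draw, so the loop always terminates: either some hub returns a session or the list runs out and an error is thrown.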

Only a single line of test code (the new session request) has to change to support this library. For example, in Java tests a new session request may look like this:

WebDriver driver = new RemoteWebDriver(new URL("http://my-hub.example.com:4444/wd/hub"), caps);

All Selenium classes in this code come from the standard Selenium Java client. If the client-side library is called, say, SeleniumHubFinder, a new session request will look like this:

WebDriver driver = SeleniumHubFinder.find(caps);

No Selenium hub URL appears in the updated code: this information is stored inside the client library. That’s it! This approach worked for years, and hundreds of software testers in our company were satisfied. So what are the drawbacks of using a client library?

An extra library has to be added to every test project. You can’t launch your tests without it.
A separate client library has to be implemented for each language. Your company may have Selenium tests in JavaScript, Java, or Python, for example. In that case you need to maintain several client libraries and make sure their hub lists stay in sync.

That’s why a server-side solution is necessary.

Server-side load balancing

Drawing on our experience with the client-side solution, we set the following natural requirements for the server-side one:

The server should look like a Selenium hub to client libraries. To achieve this it should implement the Selenium JSON Wire Protocol.
Any number of server nodes can be installed in any datacenter. They can be installed behind any software or hardware load balancer.
Server instances are stateless. They use neither a database server nor a queue server to share state.
The server should support multiple users and quotas.

We called the server GridRouter because the only thing it does is route user requests to the correct Selenium Grid Hub. Here’s the new architecture:

The load balancer distributes user requests across multiple GridRouter instances.
Every GridRouter instance stores information about all available Selenium Hubs, like the client-side library did.
To handle a new session request, GridRouter uses the same weighted random selection algorithm.
As you probably know, every new browser session in Selenium automatically obtains an ID called the session ID. According to the Selenium JSON Wire Protocol this ID is passed with every subsequent request. GridRouter attaches information about the selected Selenium Hub to this session ID and returns the enriched session ID to the user.
After the session is obtained, GridRouter extracts the hub information from the enriched session ID on each subsequent request and simply proxies the request to the corresponding hub. Since all routing information is stored in the session ID, there’s no need to synchronize GridRouter instances. This is why GridRouter is stateless.
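To make the enriched session ID idea more concrete, here is a minimal Java sketch. The exact encoding is not described in this article, so the scheme below is an assumption for illustration only: a fixed-length MD5 key derived from the hub address is placed in front of the hub's own session ID. The class and method names (SessionRouting, enrich, resolve) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of stateless routing via an enriched session ID.
final class SessionRouting {

    private static final int KEY_LENGTH = 32; // an MD5 digest is 32 hex characters

    // Derives a fixed-length routing key from a hub address such as "host:4444".
    static String routeKey(String hubAddress) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(hubAddress.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is available on every Java platform
        }
    }

    // Called once, when the session is created: put the hub key in front of the hub's ID.
    static String enrich(String hubAddress, String hubSessionId) {
        return routeKey(hubAddress) + hubSessionId;
    }

    // Every router instance can build the same table from the same hub list.
    static Map<String, String> routingTable(List<String> hubAddresses) {
        Map<String, String> table = new HashMap<>();
        for (String address : hubAddresses) table.put(routeKey(address), address);
        return table;
    }

    // Called on every later request: returns {hubAddress, originalSessionId}.
    static String[] resolve(String enrichedId, Map<String, String> table) {
        if (enrichedId.length() <= KEY_LENGTH) {
            throw new IllegalArgumentException("not an enriched session ID");
        }
        String hub = table.get(enrichedId.substring(0, KEY_LENGTH));
        if (hub == null) throw new IllegalArgumentException("unknown hub key");
        return new String[] {hub, enrichedId.substring(KEY_LENGTH)};
    }
}
```

Because the key is computed from the hub address alone, any router instance with the same hub list can resolve an ID issued by any other instance, without shared storage.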

GridRouter

Initially we implemented GridRouter using Java, Jetty and Spring Framework. Its source code is available on GitHub. This implementation uses a plain-text properties file to store the list of users and an XML file to store the list of Selenium hubs for each user. A typical users list (by default /etc/grid-router/users.properties) looks like this:

user:password, user
user2:password2, user

Every line corresponds to one user. Passwords in the current implementation are stored in plain text, without any encryption, because users exist mainly to account for browser consumption by different teams. Selenium hub lists are stored in XML files of the following format (by default /etc/grid-router/quota/*.xml):

<qa:browsers xmlns:qa="urn:config.gridrouter.qatools.ru">
    <browser name="firefox" defaultVersion="33.0">
        <version number="33.0">
            <region name="us-west">
                <host name="ff33-hub-1.example.com" port="4444" count="5"/>
            </region>
            <region name="us-east">
                <host name="ff33-hub-2.example.com" port="4444" count="5"/>
            </region>
        </version>
        <version number="37.0">
            <region name="us-west">
                <host name="ff37-hub-1.example.com" port="4444" count="3"/>
                <host name="ff37-hub-2.example.com" port="4444" count="4"/>
            </region>
            <region name="us-east">
                <host name="ff37-hub-3.example.com" port="4444" count="2"/>
            </region>
        </version>
    </browser>
    <browser name="chrome" defaultVersion="42.0">
        <version number="42.0">
            <region name="us-west">
                <host name="ch42-hub-1.example.com" port="4444" count="10"/>
            </region>
            <region name="us-east">
                <host name="ch42-hub-2.example.com" port="4444" count="10"/>
            </region>
        </version>
    </browser>
</qa:browsers>

You can see that we define the available browser names, their versions, and a set of hosts distributed across multiple regions. A region in our terms is just a datacenter. Region information matters mainly when a datacenter goes down: if the first session attempt fails, we select a host from another datacenter. This increases the probability of creating a Selenium session quickly.
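The region fallback can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the GridRouter source: the RegionFallback and Host names are hypothetical, and the rule shown is simply "prefer hosts in regions that have not failed yet".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of region-aware retry; names are illustrative only.
final class RegionFallback {

    static final class Host {
        final String name;
        final String region;
        Host(String name, String region) { this.name = name; this.region = region; }
    }

    // Returns the hosts worth trying next: those in regions that have not failed yet
    // or, if every region has already failed, all hosts that are still untried.
    static List<Host> nextCandidates(List<Host> untried, Set<String> failedRegions) {
        List<Host> preferred = new ArrayList<>();
        for (Host host : untried) {
            if (!failedRegions.contains(host.region)) preferred.add(host);
        }
        return preferred.isEmpty() ? new ArrayList<>(untried) : preferred;
    }
}
```

With the Firefox 37.0 hosts from the XML above, a failed attempt on ff37-hub-1 (us-west) leaves ff37-hub-3 (us-east) as the only preferred candidate, so a datacenter outage costs at most one wasted attempt.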

Using GridRouter in tests

As I said previously, GridRouter implements the standard Selenium protocol and is fully compatible with all existing client libraries. The remaining topic is how to authenticate in GridRouter, i.e. specify which quota we want to use. All Selenium client libraries support only one authentication method, HTTP Basic authentication, which is why it is the only method GridRouter supports. Usually a Selenium hub URL looks like this:

http://example.com:4444/wd/hub

As you probably know, the HTTP Basic authentication username and password can be embedded in the URL like this:

http://username:password@example.com:4444/wd/hub

This is the only change you need to make in your code to use GridRouter instead of Selenium Hub. The majority of Selenium client libraries, including the Java and Python implementations, work with this notation. Some Selenium-based JavaScript tools, however, require you to specify the username and password as separate configuration options.
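If you are curious what the URL notation turns into on the wire, the following small sketch shows how the user info part of a URL maps to a standard HTTP Basic Authorization header. It uses only the Java standard library; the class name BasicAuthFromUrl is hypothetical, and how a particular Selenium client performs this step internally is not covered here.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: derives a Basic Authorization header from a hub URL.
final class BasicAuthFromUrl {

    // Returns the Authorization header value for the credentials embedded in the URL,
    // or null if the URL carries no user info.
    static String authorizationHeader(String hubUrl) {
        String userInfo = URI.create(hubUrl).getUserInfo(); // e.g. "username:password"
        if (userInfo == null) return null;
        String encoded = Base64.getEncoder()
                .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

For http://username:password@example.com:4444/wd/hub this returns "Basic dXNlcm5hbWU6cGFzc3dvcmQ=". Tools that ask for the username and password as separate options end up sending the same header.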

Selenograph

GridRouter allowed us to stop using client-side libraries. It gave users of different languages access to a scalable Selenium installation. To scale a GridRouter installation you just add more Selenium hubs to its XML configuration; all changes are applied automatically without a service restart. To serve more requests per second you add more GridRouter hosts behind the load balancer. Our experience shows that GridRouter works perfectly while the share of browsers in use, for any version, stays below ~80%. Problems begin when peak load arrives and browser consumption grows to 90–100% of total capacity. In this case the random distribution of session attempts becomes inefficient.

We try to obtain a Selenium session on a fully occupied hub too often and make attempts on several hubs before returning a session to the user. This increases session start time and slows down tests. The next stage in our Selenium cluster development, aimed at resolving these issues, was a new product called Selenograph. Selenograph is a Java server based on the GridRouter source code and fully compatible with its configuration files. The main differences are:

It is stateful. To be more efficient under high load, Selenograph uses a more sophisticated algorithm for choosing hub hosts. The main idea is to adjust each hub host’s weight dynamically, taking into account the total number of sessions already running. This number has to be kept in storage shared among Selenograph nodes; we use MongoDB for this.
It provides more statistics and a user-friendly interface. For example, the Selenograph API can return the total number of concurrently running sessions at any moment. Although Selenograph is stateful, it has proven to work correctly under high load, with each instance serving hundreds of requests per second.
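The dynamic weight idea can be illustrated with a short Java sketch. It is not Selenograph's actual algorithm: here I simply assume that a hub's weight is its free capacity (configured browser count minus running sessions), and the in-memory map of running sessions stands in for the shared storage. The LoadAwareSelector name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of load-aware hub selection; not Selenograph's actual code.
final class LoadAwareSelector {

    // capacity: hub -> number of browsers it offers (the "count" attribute in the quota XML).
    // running:  hub -> sessions currently running (kept in shared storage in a real setup).
    // Returns a hub chosen with probability proportional to its free capacity,
    // or null when every hub is fully occupied.
    static String pick(Map<String, Integer> capacity,
                       Map<String, Integer> running,
                       Random random) {
        Map<String, Integer> free = new LinkedHashMap<>();
        int totalFree = 0;
        for (Map.Entry<String, Integer> e : capacity.entrySet()) {
            int left = e.getValue() - running.getOrDefault(e.getKey(), 0);
            if (left > 0) {
                free.put(e.getKey(), left);
                totalFree += left;
            }
        }
        if (totalFree == 0) return null;
        int ticket = random.nextInt(totalFree); // 0 <= ticket < totalFree
        for (Map.Entry<String, Integer> e : free.entrySet()) {
            ticket -= e.getValue();
            if (ticket < 0) return e.getKey();
        }
        throw new IllegalStateException("unreachable");
    }
}
```

Fully occupied hubs get zero weight and are never selected, which is exactly the case where purely random selection wastes attempts at 90–100% load.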

Conclusion

In this part I described the scalability problems of standard Selenium and how they can be resolved with minor changes to your cluster architecture. In the next part we’ll discuss topics like:

How to prepare worker nodes for a big cluster so that it scales well
Some thoughts about the near future of Selenium
How to run Selenium inside Docker containers
Which new open source tools will help you deploy an efficient Selenium cluster with low resource consumption

Stay tuned…