Scaling Selenium

Imagine you have just one Selenium test. What can make it unstable? How do you speed it up? Now imagine there are two tests. Now imagine a hundred. How do you get such a pile of tests to run quickly? And what happens if the number of tests keeps growing?


In this article, Simon Stewart walks us down the hard road of scaling, from one test to hundreds of tests running in parallel. We will get acquainted with the problems that arise along the way, and with practical methods for solving them. There will be Java code and some thoughts on how test infrastructure evolves.



This article is based on Simon Stewart's talk at Heisenbug 2017 Moscow. Simon is the creator of WebDriver, a technology that is now almost 11 years old, and he became the Selenium project lead about 9 years ago. At Google he worked on scaling Selenium from several tens of thousands to several million tests a day on their infrastructure; then he moved to Facebook. He is currently working on the WebDriver specification at the W3C, as part of the W3C Browser Testing and Tools Working Group. One could say that a standard is being created on the basis of WebDriver.


In the course of the article, I want to take a simple test and show how it can be scaled. First we will run it on a personal laptop, and by the end it will run in the cloud and impress everyone around, including your boss, who will stroke his chin and say: “I’m clearly not paying you enough.” That’s what we write software for, isn’t it?


First, let's decide why we need tests at all. We don't write them because we like the color green (or red), and not because we enjoy the activity for its own sake. Their only function is to confirm that the software works as intended. End-to-end tests (such as Selenium tests) should be part of a balanced test diet: if you consume nothing besides them, no good will come of it. But more on that later.



Let's start with a simple example that I have open in my IDE. It's called longAndWrong() — just how a test should be, right? We create a new FirefoxDriver so that execution happens locally. Then we create an explicit wait, a WebDriverWait. After that we navigate to http://localhost:8080 , enter an email address and a password, wait until the “create a todo” element appears, and finally click on it. Everyone has seen Selenium driven like this; there is nothing unusual in this particular code.
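The slide itself isn't reproduced here, so below is a minimal sketch of what such a test might look like. The locators, the local URL and the class name are assumptions, and the Selenium 3-era API is used throughout:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class TodoTest {
  public void longAndWrong() {
    WebDriver driver = new FirefoxDriver();             // runs the browser locally
    WebDriverWait wait = new WebDriverWait(driver, 30); // explicit wait, capped at 30 s

    try {
      driver.get("http://localhost:8080");
      driver.findElement(By.name("email")).sendKeys("user@example.com");
      driver.findElement(By.name("password")).sendKeys("secret");

      // Wait until the "create a todo" element appears, then click it.
      wait.until(ExpectedConditions.presenceOfElementLocated(By.id("create-todo")))
          .click();
    } finally {
      driver.quit();
    }
  }
}
```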


This test is terrible. Let's see why. First of all, it succeeds by luck. Having loaded the page, we do not wait for the element we type into to appear; we simply expect WebDriver to guess the page load time correctly. In this example the HTML is rendered immediately, so the approach works, but in other cases there will be problems.


Even more important is the craziness we see in another section of the code: driver.findElements(By.tagName("button")).stream() and so on. Filtering happens, and if nothing is found, an AssertionError is thrown for some reason. Only after all these operations is the click performed, because only then do we know we have everything we need. Does everyone agree that this looks very creepy?
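For the record, a sketch of the pattern being described (the button text is an assumption):

```java
// The "creepy" pattern: enumerate every button, filter client-side, throw if missing.
driver.findElements(By.tagName("button")).stream()
    .filter(button -> "Create a todo".equals(button.getText()))
    .findFirst()
    .orElseThrow(() -> new AssertionError("No 'Create a todo' button found"))
    .click();
```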




And this part, highlighted in green on the slide — what is that? The problem with the whole construction is its fragility. And not only this one. For example, do any of you use excessively long XPath expressions? Long means longer than one line. Fragile locators are one of the main reasons tests become creepy. You can learn a lot about solving this problem from the talk about Selenide.


There are several ways to remedy the situation. First, you can rewrite the application under test itself, if you have access to the source code and the right to edit it. That access is not always there. Sometimes a development team in the UK communicates with a team of testers in Romania through tickets in Jira, and one side says the application works “normally” while the other side's tests keep failing. If there is access to the code, you can add meaningful identifiers to elements: classes, specific attributes.



Second, you are most likely using WebDriver.findElement, which takes a single By as its input — and you can subclass By yourself. Find all elements with the * selector, walk all the child elements in the tree, and then filter by an attribute value. However, this is extremely inefficient: each call to the WebDriver or Selenium API is a remote procedure call, made over the network one way or another.
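A sketch of such a hand-rolled locator; the data-role attribute is an assumption. Note the cost: the findElements call plus one getAttribute call per element each cross the network as a separate remote call.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.openqa.selenium.By;
import org.openqa.selenium.SearchContext;
import org.openqa.selenium.WebElement;

// Naive custom locator: fetch everything, then filter one attribute at a time.
public class ByDataRole extends By {
  private final String role;

  public ByDataRole(String role) {
    this.role = role;
  }

  @Override
  public List<WebElement> findElements(SearchContext context) {
    return context.findElements(By.cssSelector("*")).stream()
        .filter(element -> role.equals(element.getAttribute("data-role")))
        .collect(Collectors.toList());
  }
}
```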



Instead, all of this can be written in JavaScript, and it will do exactly the same thing. On the slide it looks much more complicated and scary, but in essence we take the search context and ask: are you actually a WebDriver? If not, maybe you are a WrapsDriver? Then we can extract the wrapped WebDriver from it. And since we want to execute JavaScript, we then cast it to JavascriptExecutor. And so on down the details. Maybe not everyone knows this, but executeScript() can return elements and various other things. It goes to the browser, does some work there, brings the result back, and the result is correctly converted into Java types. What you see on the screen really works.
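A sketch of that JavaScript-backed locator. The data-role attribute is again an assumption, and note that WrapsDriver lived in org.openqa.selenium.internal in Selenium 3. The whole search now happens in a single browser round trip:

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.SearchContext;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.WrapsDriver;

public class ByDataRoleJs extends By {
  private final String role;

  public ByDataRoleJs(String role) {
    this.role = role;
  }

  @Override
  @SuppressWarnings("unchecked")
  public List<WebElement> findElements(SearchContext context) {
    // Is the context really a WebDriver? If not, maybe it wraps one.
    WebDriver driver;
    if (context instanceof WebDriver) {
      driver = (WebDriver) context;
    } else if (context instanceof WrapsDriver) {
      driver = ((WrapsDriver) context).getWrappedDriver();
    } else {
      throw new IllegalStateException("Unable to extract a WebDriver from " + context);
    }

    // executeScript can return elements; Selenium converts them to WebElements.
    return (List<WebElement>) ((JavascriptExecutor) driver).executeScript(
        "return Array.prototype.slice.call("
            + "document.querySelectorAll('[data-role=\"' + arguments[0] + '\"]'));",
        role);
  }
}
```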


Some developers dream of such a feature without knowing it has already been implemented. Suppose you have a JS framework with an element search mechanism already built in — even something simple, like generating random ids :-) You don't have to re-implement that mechanism. Feel free to ask the framework itself to find the necessary elements! It greatly simplifies life.


I'll also add why Selenium uses JavaScript. Which readers use JavaScript frameworks in their applications? jQuery, React, Angular, or some homegrown nightmare? To interact with them, you should also use JavaScript. In the example I gave, jQuery tracked the number of in-flight requests. There is no other way to get unambiguous information from the system about what is happening inside it, and sometimes that is necessary.

In addition, WebDriver tries to simulate user behavior. What does the user do? He clicks around the browser, types, interacts with elements. There are things WebDriver and Selenium cannot do: if you want to monitor HTTP status codes or network traffic, you might need a proxy. If something happens on the page, the best option may be to ask the page itself. For example, identifiers are quite often generated randomly even though they are used consistently, so you cannot always count on them being there. You can simply ask the page which identifier an element has and then use it in an ordinary locator.

This whole mechanism enables gray-box testing. A “white box” is when the inside of the system is fully accessible to you during testing; in a “black box” situation the system is hermetically sealed, as if it had arrived from space. With “gray box” testing, you first clutch your head at how complicated everything inside is, and then you start changing something here, inserting a handler there — all to make testing more stable and simple.

Some see this as a hack. Larry Wall believes a first-class developer should have three virtues: laziness, impatience, and hubris. Impatience demands that everything happen right now; thanks to it, your programs will be fast. If you are asked to do something over and over again, you won't do it by hand — there are machines for that. A lazy person is ready to work hard once so as never to work again. So in the approach described I see not a hack but laziness: I could try to figure out how it works myself, or I can just ask — and my life becomes easier. Hubris, well, that's just the desire to show off.


Another topic is waiting for an event. Selenium has a thing for this: Wait<?> .



I hope everyone is familiar with it. There are two ways to wait for something in Selenium: explicit and implicit waits. With an implicit wait we wait for some arbitrary, fixed time; with an explicit one we use what you now see in the illustration. Advice from the Selenium team: don't use implicit waits, use Wait.until ! Why? The problem is that people, as a rule, don't know how long they need to wait, so they set the implicit wait as high as a minute. If everything is in order, that's no problem at all. But if the test fails, it takes an extra minute before giving up. Because of this, a test run that usually takes 5-15 minutes can stretch into hours.
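Side by side, the two styles look like this (assuming a live driver in scope; the locator is an assumption):

```java
// Implicit wait (discouraged): one blanket timeout applied to every findElement call.
driver.manage().timeouts().implicitlyWait(60, TimeUnit.SECONDS);

// Explicit wait (preferred): wait for one named condition, and no longer than needed.
WebDriverWait wait = new WebDriverWait(driver, 15);
wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("create-todo")));
```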


If the explicit wait's timeout is shorter than the implicit one, you will essentially only ever observe the results of the implicit wait. This is very confusing; such tests are impossible to maintain. And yet people do it.


Yes, explicit waits interact very strangely with implicit ones. Strange does not mean “unpredictable” — in fact, everything is absolutely predictable. A command running under an implicit wait will not return until its time runs out. Suppose the implicit wait is 10 seconds and the explicit wait is 15. You execute a request, and after 10 seconds it reports failure. The explicit wait then compares 10 with 15, decides that 15 is greater, and makes a new attempt — again 10 seconds. You rack your brain: why did it wait 20 seconds when I asked for 15? It is not always obvious when implicit waits kick in. So everything happens quite predictably, but if you don't know this internal mechanics, the behavior looks extremely strange from the outside, and your life becomes extremely difficult. My advice: don't use implicit waits at all.

Explicit waits also carry information. Your test suite doesn't just check that the code works; it describes how the system functions to people unfamiliar with it. An explicit wait can say: at this moment a network call should happen, an AJAX request must be made, something must refresh, and so on. A person reading your test can ask: does the system really do this at that moment? Should it? Why does it? This invites a dialogue that is impossible with the other approach. In general, explicit and implicit waits do interact, not always in an obvious way, but the implicit ones always take priority.


Back to the original test.



It doesn't sleep anywhere. The naive approach says: if at some point you need to wait, just call Thread.sleep() . The problem with this approach is that the test will then wait longer than necessary. There is no need for that.


Instead, use the Wait class. Its advantage is that it uses generics, so whatever type you return from the lambda is forwarded out of until() . For example, you can write wait.until(d -> d.findElement(...)) ; it will find the element, until() will pass it through, and it is very convenient to chain .isDisplayed() right there.
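Concretely (the locator is an assumption):

```java
// until() is generic: whatever the lambda returns is passed through.
WebDriverWait wait = new WebDriverWait(driver, 15);
WebElement created = wait.until(d -> d.findElement(By.id("create-todo")));

// ...which makes it convenient to chain a call right there:
boolean shown = wait.until(d -> d.findElement(By.id("create-todo"))).isDisplayed();
```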


In addition, it can be convenient to keep a reference to a WebElement, which for some reason people avoid. I talk to clients quite a lot, visit their sites, and notice that for every interaction with an element they look it up again. Thus each action costs two remote calls instead of one. Suppose you need to clear a value and send keys. People often write driver.switchTo().activeElement().clear() first, and then driver.switchTo().activeElement().sendKeys() .
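A sketch of the cheaper alternative — look the element up once and reuse the reference:

```java
// One remote lookup, one reference, reused for both actions:
WebElement field = driver.switchTo().activeElement();
field.clear();
field.sendKeys("user@example.com");
```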


But it is the same element. The only situation in which it can change is if it is removed or detached from the DOM entirely, in which case you receive a StaleElementReferenceException . Does that exception drive you all crazy? It tells you that something updated the DOM and the element is no longer there. It also means an opportunity to set up a wait was missed — you won't be able to wait for the desired element to appear.



The code shown on the screen executes in the optimal amount of time, since waiting is kept to a minimum. And if you rerun the test with this code, the result will not change at all.


Are you still reading? I hope I haven't said anything new so far :-)


So, gray-box testing.



There are ways to make waits more efficient. Look at the isJqueryDone() method in the slide above. jQuery keeps count of active requests: if jQuery.active drops to 0, it is clear that nothing else is in flight.
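A sketch of such a check (it assumes jQuery is present on the page):

```java
// jQuery.active counts in-flight AJAX requests; 0 means the page is quiet.
private boolean isJqueryDone(WebDriver driver) {
  Object active = ((JavascriptExecutor) driver)
      .executeScript("return window.jQuery ? jQuery.active : 0;");
  return ((Number) active).intValue() == 0;
}

// Usage: wait.until(d -> isJqueryDone(d));
```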


On the other hand, why keep polling the page for these statistics? After all, JS frameworks have similar mechanisms of their own — AngularJS, for example. Why not rely on those alone? Perhaps because libraries are simply not the right level.


And yet the problem can be solved with the help of your application itself. Suppose you are making an AJAX call and want to know whether it has completed. Sometimes the DOM is not rebuilt — new content is simply thrown into it — and it becomes unclear whether the test may continue. The right move is to track this at the application level: when a meaningful operation starts, set a variable, and when the operation completes, reset it. Then the test can check that variable, and when it takes the right value, you can safely continue testing.
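A sketch of the test side of this, assuming the application itself maintains a window.pendingOperations counter around its meaningful async work (that variable name is an assumption, not a standard):

```java
// Wait until the app reports that all of its tracked operations have finished.
wait.until(d -> (Boolean) ((JavascriptExecutor) d)
    .executeScript("return window.pendingOperations === 0;"));
```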


Finally, you can ask Selenium itself for help. A stale element reference can be useful information. If you expect that your action will cause a DOM element to be replaced, grab a reference to the element before performing the action. Then you can wait for the StaleElementReferenceException , find the new element, and return it. Wait can be told to ignore certain kinds of exceptions, so this code stays clean, tidy, and easy to use — Selenium takes on part of the work. Signals that the DOM has changed, that something has shifted under the hood, are a great opportunity to make testing more stable.
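A sketch of the pattern (the locators are assumptions):

```java
// Grab the element BEFORE the action that is expected to replace it in the DOM.
WebElement oldList = driver.findElement(By.id("todo-list"));
driver.findElement(By.id("refresh")).click();

// stalenessOf() waits for the StaleElementReferenceException under the hood.
WebDriverWait wait = new WebDriverWait(driver, 15);
wait.until(ExpectedConditions.stalenessOf(oldList));

// Now it is safe to find the replacement element.
WebElement newList = driver.findElement(By.id("todo-list"));
```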


Let's move on to Page Objects. Have you heard of the god Janus from ancient Roman mythology?



Janus is the two-faced god who looks to the past and the future. That is why January is named after him.


Page Objects often fall victim to misunderstanding. At SeleniumConf there were a bunch of presentations about automatically generating Page Objects and automatically locating elements. Don't do that. It only looks attractive because it feels like “you'll write a framework” and be great; in reality everything turns out much worse.


In the original definition, a Page Object is one of Janus's faces: it exposes services in terms the user understands. If the application under test has a login page, you will want to log in through it and describe all this in the language of the domain. Show such a test to a business analyst, to the product owner, to your parents, and it will be immediately clear to them whether the application behaves correctly. But the Page Object has a second face of Janus, which requires deep knowledge of the code and the page structure. This is needed for DRY (“don't repeat yourself”), for abstraction. To make this easier, Selenium provides the LoadableComponent class. It seems to me the name Page Object doesn't quite capture the essence, since you can model smaller pieces, deliberately shrinking their size.



I have a test here using Page Objects. It does exactly the same thing as the previous test. We create a User object, create the login page, passing it the driver and a maximum wait time, and then call get() on it. This follows the LoadableComponent model. Signing up returns the main page. The advantage of how the Page Object pattern is implemented in this test is that it encodes navigation: if the login page no longer leads to MainPage, you change the signature of the signUp method and return something else. Such a test doesn't even need to be run — it simply stops compiling.
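A sketch of that test body; the class and method names here are assumptions:

```java
// Fresh data for this test, then navigate page object to page object.
User user = User.random();
SignupPage signupPage = new SignupPage(driver);
signupPage.get();                             // LoadableComponent: load the page and verify it
MainPage mainPage = signupPage.signUp(user);  // navigation returns the next page object
mainPage.createTodo("Buy milk");
```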


Let's look at the demo:



This is a simple todo list. It is not amazing, but it illustrates the idea well.



In the code of SignupPage.signUp() , nothing happens except locating elements and interacting with them.
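A sketch of what such a page object might look like; the locators, URL, and the User and MainPage types are assumptions:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.LoadableComponent;

public class SignupPage extends LoadableComponent<SignupPage> {
  private final WebDriver driver;

  public SignupPage(WebDriver driver) {
    this.driver = driver;
  }

  @Override
  protected void load() {
    driver.get("http://localhost:8080/signup");
  }

  @Override
  protected void isLoaded() throws Error {
    if (driver.findElements(By.name("email")).isEmpty()) {
      throw new Error("Signup page is not displayed");
    }
  }

  // All element lookups for signing up live here, and nowhere else.
  public MainPage signUp(User user) {
    driver.findElement(By.name("email")).sendKeys(user.email());
    driver.findElement(By.name("password")).sendKeys(user.password());
    driver.findElement(By.id("sign-up")).click();
    return new MainPage(driver);
  }
}
```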



Everything is abstracted away into this one function. If the login page code changes, this function is the only place where you will need to make corrections. If your developers change the UI workflow or rename an element, all the changes land here. The alternative is to crawl through a million tests and fix every one of them.


That concludes the basics. We now have a test that is easy to maintain. One way to scale Selenium is simply to write reliable, easily maintained tests. I once met clients who had been told their task was to make the test suite green. They solved it by removing all the assertions and, in the event of exceptions, marking the test as passed. The suite was green, but it meant nothing.


The data you use is also extremely important. Your application is unlikely to be completely stateless: most likely you have users, persistent data, and more. Data is one of the problems you will encounter when scaling tests. Here are several recommendations on how to parallelize tests.



First, static fields in Java — and the Singleton design pattern — are evil and must be avoided. The reason is that static fields are shared by all threads: if two tests change such a field at the same time, the result is unpredictable. Talking to clients, I often see a static variable used to store the WebDriver reference, accompanied by complaints about crazy test behavior under parallel runs: sometimes they work, sometimes they don't. This is called a “race condition”.
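A sketch of the anti-pattern and one safer alternative (a JUnit 4 sketch; names are assumptions):

```java
import org.junit.After;
import org.junit.Before;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// The anti-pattern: a static field shared by every test thread.
class DriverHolder {
  static WebDriver driver;  // parallel tests race on this
}

// Safer: each test instance owns its own driver.
public class TodoTest {
  private WebDriver driver;

  @Before
  public void setUp() {
    driver = new FirefoxDriver();
  }

  @After
  public void tearDown() {
    driver.quit();
  }
}
```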


Trying to avoid static , some switch to ThreadLocal . But it is also evil: it makes you rely on thread affinity. As long as you can be sure the test runs on a single thread, you can safely teleport data from layer to layer. The very fact that you have to teleport data (the WebDriver instance, a username) is already a bad sign, a code smell: such tests are poorly structured, hard to understand, hard to maintain. A test that is hard to reason about is a source of enormous trouble. A colleague once said that debugging a test requires twice the intelligence it took to write it. If writing the test used up all your abilities, debugging it will drive you into a dead end that is extremely painful to get out of.

One of the things functional programming has taught us, including in Java: ideal code is immutable and keeps no mutable state. If you have a Todo object (as in the earlier illustrations), then to change something in it you create a new object. When two threads work simultaneously with mutable state, the consequences are terrible.
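A sketch of what such an immutable Todo might look like:

```java
// An immutable Todo: "changing" it produces a new object, so two threads can
// never observe each other's half-finished state.
public final class Todo {
  private final String title;
  private final boolean done;

  public Todo(String title, boolean done) {
    this.title = title;
    this.done = done;
  }

  public Todo markDone() {
    return new Todo(title, true);
  }
}
```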


Imagine this test: the user Fred goes to example.com and registers; the registration must succeed. The first run passes. But when you run the test again, it fails, because a user named Fred is already registered. Unpleasant, isn't it? I am sure you run into this constantly. The right approach is for each test to have data prepared specifically for it.
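A sketch of per-test data (the User type is an assumption):

```java
import java.util.UUID;

// Give each test run its own user instead of a shared "Fred".
String suffix = UUID.randomUUID().toString().substring(0, 8);
User fred = new User("fred-" + suffix, "fred-" + suffix + "@example.com");
```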



This problem is rare as long as you recreate the environment between test runs. But it will certainly surface when you set up CI/CD (continuous integration and delivery). Tests run in more and more places (which is also scaling — horizontal scaling), the likelihood of data conflicts grows, and inexplicable errors start to appear.


To avoid such problems, prepare the data inside the test itself. But there is a difficulty: the ossification of data.



Ossification is the hardening of something that used to be plastic and pliable — code or data, say. Changing it gets harder and harder. In a dev environment you can most likely shove data straight into the database. In production, adding random data is no longer acceptable. And yet it would be great if tests could run in production too! A full run isn't necessary; a happy path is enough.


In a dev environment you can use data generators. It works like this: when we write some assertion (a user with this name or this password must exist, some user must have three todo items), the data generator automatically pushes matching data into the database so that the assertions hold. When the same thing runs in production, a selection is made from the existing options: the search space is narrowed until something in the database matches the criteria. Thanks to this, you can scale the test from a local run on one machine all the way up to production. Exactly how to do this is your choice; just keep in mind that the closer you get to production, the more the data has “ossified”.


Here is an example of a classic web application.



The server collects user data, which is stored in the database. It also uses an authentication service that validates usernames and passwords against LDAP. Nothing unusual here. But suppose the authentication service is not running — you may not have LDAP available, and your own stand-in version of the service is badly written. If you run the tests, they will fail, but the failures will be caused not by the problem you are trying to solve, but by the fact that one of the services is down. You know in advance that these tests will fail.

As the test suite grows, you should introduce prerequisite checks. Suppose the database has stopped working: that is always unpleasant, and many things stop with it. But the authentication service tests can most likely still pass. Most testing tools have mechanisms for tagging tests or filtering their results. You can tag tests by the backend systems they require and check those prerequisites before starting; JUnit has the Assume class for this. Or you can filter the results of tests that have already run.
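A sketch of such a prerequisite check with Assume; isAuthServiceUp() is a hypothetical health-check helper:

```java
import static org.junit.Assume.assumeTrue;

// Skip (rather than fail) tests whose backend prerequisite is down.
@Before
public void requireAuthService() {
  assumeTrue("Authentication service is down; skipping", isAuthServiceUp());
}
```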


Returning to the question of why we test at all: to know that the software works when it is released to production. If we know that one of the backend systems is down, the results of that system's tests should not affect our decision about whether to ship. Tests are valuable only insofar as people trust them and can rely on them. Without trust, the tests might as well be deleted. It happens that you run a test, it fails, you ask the author — “Is this normal?” — and they answer, “Don't worry, it always fails.” Delete that test. It destroys trust and does more harm than good: neither its failures nor its successes tell you anything.


So, at this point we have figured out how to organize user data for testing, and the tests themselves can run in parallel across the organization, in different environments. Our code is easy to maintain thanks to Page Objects — and perhaps the Screenplay pattern that Antony Marcano talks about. But there is another way to scale: running a huge number of tests at the same time.


How exactly does Selenium work?



Under WebDriver is the wire protocol. In essence, this is JSON-formatted data sent to URLs arranged in a particular way. There are two main dialects: the JSON Wire Protocol and the W3C dialect. The first is sometimes called the OSS dialect; it provides the familiar functionality implemented in ChromeDriver, the old FirefoxDriver, Selenium Server, PhantomJS. The W3C dialect is the version of the wire protocol that has gone through standardization. There are two fundamental differences: how a new session is created, and Actions. If you used the old version of Selenium Grid together with FirefoxDriver, you might have seen problems like broken drag-and-drop. The cause is precisely the differences between the protocols. Keep this in mind.


Here is the usual Selenium architecture: tests talk to a separate server.



But then you will most likely want additional nodes to run your tests on. This is where Selenium Grid helps. Start it in hub mode:
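For reference, a typical hub launch with the standalone server looks like this (the jar version here is an assumption):

```
java -jar selenium-server-standalone-3.9.1.jar -role hub
```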



Selenium will receive incoming requests and distribute them among the nodes. Let's look at the console:



If you run the test now, it will not work at all: you need to start at least one node. To make life simpler, a node running on the same machine as the hub registers itself automatically. Now the browser shows the standard configuration: Firefox, Chrome and — since it knows I'm on macOS — Safari:
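A node is started with the same jar (again, the version is an assumption):

```
java -jar selenium-server-standalone-3.9.1.jar -role node -hub http://localhost:4444/grid/register
```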



The problem is that you must have ChromeDriver and GeckoDriver installed, specify the paths correctly and, in general, do a lot of work. I'll be honest with you: it's simply not fashionable anymore, because it doesn't use Docker. Now, if there were Docker — that would be wow!


Fashion matters in our industry, doesn't it? Yes, most of us wear jeans and a T-shirt, relatively comfortable clothes. But we have the sharpest sense of fashion. If Docker isn't used, if there are no microservices, and a number of other things, nobody sees any point in investing in it.


Fortunately, Selenium knows you love fashionable things, so there is a subproject: Selenium Docker. It is a set of mutually compatible browser images used for testing. It no longer matters whether you have GeckoDriver or ChromeDriver installed, or which version of Firefox works with which version of GeckoDriver, and so on. As a result, your Selenium tests talk to the hub, and the nodes run in Docker containers.



Here I start a simple network with a hub and two nodes (one for Chrome, the other for Firefox).
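One common way to wire this up with the official images (the image names are real; exact tags and environment variables vary by release, so treat this as a sketch):

```
docker network create grid
docker run -d --net grid --name selenium-hub selenium/hub
docker run -d --net grid -e HUB_HOST=selenium-hub selenium/node-chrome
docker run -d --net grid -e HUB_HOST=selenium-hub selenium/node-firefox
```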



Now in the browser we see these two nodes.



When the tests run, they will use Docker, and that lets us scale the system. It's obvious how to add more nodes here. My poor laptop has only four cores, so it will quickly run out of power; I think beyond 8 parallel threads it will start to buckle. But the advantage is that these nodes can run in different places while registering with a single hub.



Often you don't need all the nodes running at once, and you don't know in advance what you will need. You want to be able to start Docker instances on demand, from the hub. This is what the Zalenium project provides. It is quite easy to use.
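Zalenium's documented quick start is roughly this (check the project README for the current form):

```
docker run --rm -ti --name zalenium -p 4444:4444 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  dosel/zalenium start
```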



If you run it in Docker, you are offered a choice of two browsers:



On a reasonably powerful machine, the hub can start nodes as needed. Or you can combine the approaches and keep some of the nodes running continuously. But there is a nuance: few people have real Windows hardware readily available on which to run, for example, tests for the Edge browser.



From here comes the third type of architecture: we need to get those machines from the cloud. Fortunately, Zalenium has a very simple mechanism for this.



When Zalenium starts, you pass the --sauceLabsEnabled true option on the command line and log in to Sauce Labs with a username and password. Then you can see in the browser that tests can run on the Sauce Labs infrastructure whenever you cannot run them locally.



In addition, Zalenium can show you everything that happens, in real time.



Zalenium has a dashboard at the /dashboard URL. When a test completes, a video of everything that happened appears there, along with the logs and so on. With this whole system, you can maintain your own network of hub and nodes inside your company. And when you run out of capacity and need to scale, or when you simply don't have a Windows or macOS machine, you can turn to Sauce Labs, BrowserStack, or some other cloud provider.


You can also go straight to a cloud provider. The problem with this approach is that between each test command and its result, your data crosses the Internet twice. The cloud receives the request and hands it to a Selenium server, which drives the browser, which sends HTTP requests to your server, and then the results travel all the way back. All of this is terribly slow — but you can go this way if you don't want to maintain the infrastructure yourself.


There are several pitfalls in scaling tests. The first is accidental DDoS: you run so many tests that some unfortunate server falls over. If you start seeing random errors, analyze the root causes. Quite often the problem is that some part of your infrastructure has run out of resources. Selenium users often try to fix errors without understanding their causes: the errors don't go away, more convoluted constructions appear in the code, and people start resorting to creepy hacks.



Another problem that arises is the so-called “ice cream cone”: the situation where most of the tests are end-to-end (the wide top of the cone) and there are very few unit tests (the tip). The sight of browsers merrily flickering probably inspires confidence and boosts self-esteem. But this approach is almost useless, because end-to-end tests don't give you precise data. The only thing you learn from them is that an error occurred somewhere in the stack — what exactly caused it is unclear.


Ideally, the test structure should look like a pyramid: many unit tests and a small number of end-to-end tests.



If you want to turn an ice cream cone into a pyramid, analyze the root causes of failures in the end-to-end tests, and then write the smallest possible tests specifically for those failures — ideally unit tests. My usual advice: stop writing Selenium tests. If you find yourself trying to scale Selenium, that in itself is a sign of a problem.


Sometimes people run all their tests on every code change. Don't. If you haven't changed anything on the authentication server, you don't need to run that server's tests. Modern build tools like Buck or Bazel perform dependency-graph analysis, so you don't have to run everything at once. But even simple test tagging can help here.


Finally, try to avoid running every test in every browser on every platform. It feels safe, but it's overkill. Most browsers today follow the common standards, so if everything works in one browser, it will most likely work in the rest. You can limit cross-browser runs to a smoke suite that verifies nothing essential is broken. Ideally, you will also have JavaScript tests that you can run in every browser on every change. How do you choose which browser to run a test in? Look at your user data and logs, and it will become clear what people actually use; focus testing there. (This doesn't apply if you are about to change how the site is used.)


A minute of advertising. As you probably know, we organize conferences. The nearest testing conference is Heisenbug 2018 Piter, which will take place on May 17-18, 2018 in St. Petersburg. You can come, listen to the talks (you've already seen in this article what kind of talks there are), and chat in person with practicing testing experts and developers of all kinds of fashionable technologies. In short, drop by — we'll be waiting for you!