Friday, May 19, 2017

A simple Selenium load test based on step

This morning I published a tutorial on YouTube to help people get started from scratch with Selenium load testing, using step as an execution platform. In less than 30 minutes you'll be able to create a script, design a test plan and use step's agent grid to run your Selenium scripts in a parallel and massively scalable way.

Here's a link to the video: https://www.youtube.com/watch?v=_D4PQjdbjMI

The main reason I'm writing this blog entry is that I wanted to publish the code of the script I ended up with at the end of the tutorial, so that people can easily copy-paste it if needed. Other than changing the path of the ChromeDriver binary, you should be able to use it as it comes.

So here it is:

package scripts;

import java.util.HashMap;

import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import step.handlers.javahandler.AbstractScript;
import step.handlers.javahandler.Function;
import step.handlers.javahandler.ScriptRunner.ScriptContext;

public class SeleniumScripts extends AbstractScript {

    @Function
    public void CreateDriver() {
        System.setProperty("webdriver.chrome.driver", "path_to_chrome\\chromedriver.exe");
        ChromeDriver chrome = new ChromeDriver();

        // Store the driver in the session so that other keywords can reuse it
        session.put("driver", chrome);
    }

    @Function
    public void SampleKeyword() {
        long sleepTime = input.getInt("sleepTime");

        // Reuse the driver created by CreateDriver()
        ChromeDriver chrome = (ChromeDriver) session.get("driver");
        chrome.get("http://exense.ch");

        sleep(sleepTime);

        WebElement el = chrome.findElement(By.xpath("//a[text()='Consulting']"));
        el.click();
        chrome.findElement(By.xpath("//h3[text()='Performance Analysis']"));

        sleep(sleepTime);
    }

    private void sleep(long duration) {
        try {
            Thread.sleep(duration);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Function
    public void CloseDriver() {
        ChromeDriver chrome = (ChromeDriver) session.get("driver");
        // quit() terminates the chromedriver process, unlike close()
        chrome.quit();
    }

    @Test
    public void testSampleKeyword() {
        ScriptContext sc = new ScriptContext(new HashMap<String, String>());
        sc.run("SampleKeyword", "{}");
    }

}



One more thing I would like to point out is that I made a mistake while writing my keyword "on the fly" in the video: in the CloseDriver() method, we really should be using the quit() method instead of close(), because close() will just close the window but won't terminate the ChromeDriver process. So in order to avoid leaking chromedriver.exe process instances, make sure to call quit().

Also, here's a link to the docs, which essentially guide you through all of this and provide more details. Keep in mind that not all of step's functionality has been documented yet, so you'll need to explore and poke around if you want to benefit from everything the platform has to offer. We're available on GitHub and via the contact form on our website if you need help with any aspect of the tool.

I'm probably going to redo this video because I believe I could do a much better job and pack it into a 5-minute clip. I'm also planning another series of tutorials like the one I did last year to get people started with djigger. We feel step is reaching a maturity level that allows us to produce this kind of content effectively: the APIs have stabilized, so the content will stay up to date, and it really is that easy to get started with the tool now.

I hope this tutorial will help you get started and make you want to join our community, as we believe step really brings a lot to the table for whoever wants to do automation in a way that is both effective and elegant.

Not only are we addressing important technical issues such as scalability, compatibility and portability, and enhancing comfort and ease of use with a modern, central application for our users, but we're also providing a ton of flexibility.

What I mean by flexibility is that if you adopt step, you won't have to build yet another cluster for every new tool you want to use. You'll be reusing our agent technology no matter what simulation engine you've selected. Hell, it could even be your own custom technology you deploy on our agent grid! JMeter wants you to use its own master/slave logic. Grinder wants you to use its workers. Selenium wants you to use its own grid. And so on. And then it's the same problem when it comes to analyzing results: each one has its own charts, logs, file formats, etc. So it's time we unify and rationalize this mess, and finally work together on one target platform for the execution and the analysis of test results.

With step, we're doing just that. We're bringing that central solution that fits all of our test automation use cases and addresses the challenges that we've come across as a community in about 10 years of experience testing applications together.

Wednesday, May 17, 2017

The iterative process of achieving scale (3/3) - Unnecessary synchronization

What you'll find in this blog entry : the E2E and step-by-step analysis of a performance problem and a couple of concrete tips to help you spot synchronization issues faster.

I'll conclude this three-part analysis session with a simpler problem which still affected step's maximal keyword throughput in v3.2 after we corrected the aggregation problem (part 1) and the HTTP connection pooling problem (part 2). I call it simple because I feel I can diagnose this type of problem quickly, but I think it's definitely worth going over, as I'll be able to showcase some of my favorite features in djigger, our open-source profiler and production monitoring tool.

So if you recall, at this point in our story, we're trying to optimize step's code to reach a desired target Keyword load of 2000 Keyword executions per second or more in a single-controller environment. I should point out that the term "optimization" was not entirely appropriate in our context up until now, since the first two problems were actually major issues affecting the stability of our test executions, thus destroying our ability to service the user who wanted to use step for that particular kind of testing (which, again, was new for us/step). In particular, the HTTP connection pooling issue from part 2 was not even a throughput problem - constantly recreating HTTP connections doesn't cost that much CPU and/or time - but definitely more of a stability issue, revealed only in an endurance test (leaky sockets).

At the end of part 2 however, we can pretty much say we're moving toward "optimization" because we're able to achieve stable throughput at a level close to our target. So now it's all about hunting down the last few bottlenecks and hopefully getting rid of them without having to undertake any major architectural rework.

So again, we started a load scenario with a target of about 2000 kw/s, with proper corresponding values for threading and pacing, and we found ourselves saturating around the 1600 kw/s mark. And again, I used djigger to produce a picture of the distribution of samples (which can be used as an approximation of elapsed time) across the different packages inside of the controller component.

And looking at the Reverse view with a filter on step's active thread stacks, this is what I found:



See anything interesting in here?

Remember, when we look at the Reverse tree view, we see the top-most methods of the thread stacks. And after excluding park(), wait() and other intentionally "inactive" methods, and focusing on stacks going through our own packages (.*step.*), we'll end up with method calls which are supposed to be doing some concrete thing for us.

We could take the liberty of calling these "on-cpu" methods, even though the term wouldn't technically be 100% correct. Either way, these leaves (or roots, depending on the view you've chosen) are usually low-level methods from java or sun packages (for instance, methods involved with data manipulation, I/O, etc.).

Knowing that, it makes sense to find methods like socketRead0(). Threads busy with this method call are either blocking synchronously and waiting for a message from a remote process (here in our situation, it could be the DB or an Agent), or are actively reading that message, using up some CPU time. We won't see this distinction just looking at this view because it's a native method and from the Java world, we can't see which line of code inside that method is being executed or which other native methods are called by this method.

Either way, the best way to investigate this time consumer, assuming you have a legitimate reason to do so, would be by unfolding that branch and further analyzing which packages lead to these socketRead0() calls and whether their associated percentages make sense, according to what you know the application is doing or should be doing.

Here since we're sampling step's controller, they're totally legitimate because the controller should spend most of its time delegating work to its agents, which results in blocking calls to socketRead0().

Now, something which I found a lot more interesting is the method PluginManager.invoke(). First of all, it's the only method belonging to "business code" (i.e. non-java, non-sun, etc.) in which a significant amount of time is being spent. Secondly, just judging by its name, it's clearly not the kind of method that will do a lot of work on its own (like iterating through a large collection and computing stuff). Its name is "invoke". Of course, there's always a chance that the sampler catches any method, even the dumbest/fastest ones, on top of a thread stack by "accident", but when 18% of the samples point to that method being on top of the stack, it can't be an accident anymore.

If you have prior experience analyzing these kinds of patterns, you should already know what the problem is at this point. For the other readers, let's take a closer look at the context of this method call, by unfolding that branch and displaying code line numbers as a node-differentiating criterion in our tree view:



So first of all, it's interesting to note that after splitting on code lines, we still find all of the samples busy in the exact same spot (i.e. line 72). This should be another strong hint at what is going on with that method.

Unfolding the branch allows us to confirm the origin of that method call, and as a developer, I now know that this method is called at the end of the execution of each step of my test plan. From a performance standpoint, that's a big deal: it's a highly parallel and hence hot code path, so special care should be given to what kind of work is being done here and how it's being done.

Finally, let's take a look at that line of code:



Yeah, of course, it's part of a synchronized block. I say "of course" because this was the only logical possibility that could fit with the symptoms we just went over.

Synchronization is a quick and easy way to make stuff safe and it's fine to use. But if you're going to synchronize methods or blocks which are part of very hot code paths such as in this example, you better have a damn good reason to do so.

And in this case, we didn't. It was an unnecessary safety precaution around an object that was already thread-safe.
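
To make the pattern concrete, here's a minimal sketch (not step's actual code) of what this kind of bottleneck looks like: a hot-path method that serializes all threads through a synchronized block guarding an object that is already thread-safe, and the trivial fix of dropping the lock.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class PluginManagerSketch {

    // CopyOnWriteArrayList is already safe for concurrent iteration
    private final List<Runnable> plugins = new CopyOnWriteArrayList<>();

    // Before: every keyword execution funnels through this single lock
    public void invokeBefore() {
        synchronized (plugins) { // unnecessary: the collection is thread-safe
            for (Runnable plugin : plugins) {
                plugin.run();
            }
        }
    }

    // After: same behavior, no contention on the hot code path
    public void invokeAfter() {
        for (Runnable plugin : plugins) {
            plugin.run();
        }
    }
}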

There you have it: we've taken premature optimization to a whole new level, or rather we've created a new branch in that family: premature thread-safety concerns.

In any case, we removed that synchronized block and were finally able to reach our target load.

I hope this series of step-by-step analysis sessions was either helpful or fun to most of the readers. I might do another round of these in the next few months since I've again analyzed and fixed a bunch of performance issues while cleaning up R3.3. Each new feature (especially the ones that involve changes in core packages or in these "hot" code path areas) introduces a new set of risks, calls for a new series of tests, and leads to a new session of old-fashioned analysis and code cleaning.

Wednesday, April 5, 2017

The iterative process of achieving scale (2/3) - HTTP Connection Pooling

What you'll find in this blog entry: the E2E and step-by-step analysis of a performance problem, some thoughts about endurance testing and a couple of concrete tips regarding HTTP connection pooling in a JVM.

Today I'm going to describe the second performance problem we ran into while working on step R3.2. I'm going to have to step up my blogging game though because I'm starting to drift behind releases (we've released v3.3 this week and there's some interesting stuff to talk about in there as well).

As a quick reminder of the last entry, in 3.2 we pushed the performance of our current architecture to its limits in order to support highly granular test scenarios in step, which is the test automation platform we've been building for the last few years.

The Symptoms


So here's how things went. After we dealt with the mongo aggregation problem and implemented static views, we were immediately able to scale up to about 1600 keyword executions per second, which was already about an order of magnitude higher than what we'd been doing until then.

At this load level however, and about 20 minutes into the test, we started to run into some strange socket errors, apparently relating to timeouts (java.net.SocketTimeoutException: connect timed out). Here they are in the execution view (results of each keyword call):



At first glance, it seemed like some sort of connection issue. But the immediate question that comes to mind is "why only after 20 minutes"? And what you'll find here is a good example of why endurance testing is a crucial part of the QA process.

Of course the classic example is the good ole memory leak "hunt". But memory leaks are not the only problems that need time or repetition to reveal themselves. There are plenty of other problems that can occur which relate to some sort of accumulation or exhaustion somewhere in the technical stack.

The Analysis


So let's take a closer look at one of the stack traces leading to one of these socket exceptions (in this case, taken from sampling results with djigger):


As you can see, this thread is attempting to create a new HTTP connection and to connect to some remote endpoint. At the time of the call to connect(), a BindException is thrown.

Now unless the peer went unresponsive or terminated, on a local network, a connect should always succeed within a few milliseconds. Reaching a timeout - and reaching it suddenly 20 minutes into the test - means in this case that something must be wrong with socket management at the system level.

This is what led me to take a closer look at the TCP connection activity on the different machines involved. And what I saw was this:


Tens of thousands of connections accumulating in a TIME_WAIT status (this is obviously just a sample of the netstat output).

Of course connections get recycled, but this process takes time, and soon enough all of the machine's ports were exhausted.

As it turned out, we weren't pooling connections between our controller and agents. And once again, this problem was brought to light because of the suddenly very high number of controller-agent calls per second in this new, more granular setup. In older contexts, where controller-agent communication was less frequent, the average recycling rate of TCP connections was probably enough to cover up the problem.

The Solution (or one of them)


There are a few ways of fixing this problem (some more elegant than others) but I decided to repeat the test after adding the following system parameters to both JVMs (controller and agent, although the truly necessary one is the flag set on the controller side):

-Dhttp.keepAlive=true -Dhttp.maxConnections=100
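
These are standard JDK networking properties for the built-in HttpURLConnection stack: http.keepAlive enables connection reuse, and http.maxConnections caps the number of idle connections kept alive per destination. If you'd rather set them in code, a minimal sketch could look like the following; set them before the first HTTP request is issued, and when in doubt prefer the -D flags, which are guaranteed to be in place early enough.

public class HttpPoolingDefaults {
    public static void main(String[] args) {
        // Same effect as the -D flags, applied before any HTTP traffic
        System.setProperty("http.keepAlive", "true");     // reuse connections instead of closing them after each request
        System.setProperty("http.maxConnections", "100"); // idle connections kept alive per destination
        // ... start the controller or agent as usual
    }
}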

In that retest, connections weren't leaked anymore and the test remained stable for hours.

Sunday, February 5, 2017

The iterative process of achieving scale (1/3) - Aggregations in MongoDB

What you'll find in this blog entry: the detailed analysis of a performance problem, starting from the symptoms and going all the way to the root cause, with intermediate steps in the different layers of a distributed application (called "step").


Taking step out of its comfort zone and to the next level through release 3.2 was an interesting challenge. In this series of three performance-centric blog entries, I will reflect on some of the technical issues we've faced, analyzed and solved. This is basically a lessons-learned session involving open-source components, which I believe is worth sharing and can be valuable to some people. I'll also put a certain emphasis on my thought process as a performance analyst during the entire exercise.

As you'll notice, every time visibility into the JVM runtime was necessary, I used djigger in its simplest form (i.e. with the basic JMX connector) to diagnose the problem. That means that - as in 80% of my everyday-type performance problem cases - I never had to engage with instrumentation or any other complex form of monitoring or data-capturing approach.

That's because sampling is an absolute beast if you know how to process and interpret the resulting data.

Of course, in addition to using a good tool, and before answering the question "why is this problem happening?" you'll first have to precisely answer the questions "what is happening?" and "where is it happening?".

And you need to start somewhere, right? I've seen some interesting approaches over time, and most people dealing with performance tend to start where their obsession or experience takes them. People who understand the OS well will go for a bottom-up approach. People who are interested in GC behaviour will start there, etc.

Without setting a goal, there's no such thing as achieving performance.


Personally, I always initially focus on:

1) understanding the inputs of my scenario (what is the user trying to achieve)
and 2) understanding the developer's and architect's intent (how did they technically plan to satisfy the user?)

These two pieces of information are the beginning of an answer to what is happening. And this is probably the common property shared by all performance problems: they all come from some sort of mismatch or miscommunication between an expectation and the actual behavior of the system that's supposed to satisfy that expectation.

If you skip this step, there's a high chance that you'll end up investigating the wrong thing or investigating the right thing for the wrong reasons. Everything can be tuned, optimized and if needed, redesigned, if you put in the work. From using indexes or rewriting queries to caching objects, to redesigning business cases and workflows... Caching. Pooling. Sizing. You name it. So let's keep this in mind as we analyze this series of problems.

By the way, if you're interested in my take on tooling and my "minimalist" approach to performance analysis, you can check out the manifesto-style FAQ I wrote for djigger, and also this past blog entry. It all basically revolves around the necessary and sufficient concept, which I've also touched on a bit in my presentation at JUG last year (the slides are in English and are available for download on that web page).

A bit of context on the tested system


First off, I feel the need to write a few words about step, which is the indirect object of our study here and the system whose performance will be dissected in the next paragraphs and blog entries. We like to describe step as the first open-source test automation platform for unified agile testing.  You might hear or see these words get thrown around a lot, but we've extensively documented how exactly we believe step implements and supports these concepts. It's not just a fancy way to advertise our stuff.

In addition to bringing a new stack and modern approach to the table, step fits very easily into just about any legacy technical environment, and so it makes transitioning from old-school in-house testing to our more modern and efficient approach a lot easier. However, unifying testing practices and moving towards continuous delivery mechanics means higher performance and scalability requirements for us, as software providers. The flexibility we provide also means that step has to accept and support a wide range of possible use cases at that kind of scale. This is what brings us to this series of performance analysis entries.

The first important bit of information to process in order to understand the rest of this blog entry is that step is highly distributed in nature and is based on a 1-n controller-agent architecture.



The central controller is responsible for dispatching keyword executions (i.e. the execution of test fragments or scripts) and their inputs, while agents are responsible for their actual local execution and for returning an output to the controller. Agents are horizontally scalable. The controller can be duplicated, but there is no built-in support or semantics for inter-controller communication or synchronization at this point. This means that if you want to add controllers, you have to distribute load across the cluster of controllers yourself. It's possible and actually not that hard, but since a single controller instance has handled the load requirements of all of our users thus far, we've focused on implementing other, more pressing features.

While scaling up to handle new target load levels and new types of load, we realized there was still a lot of untapped performance potential left in this architecture, so it made sense to optimize it and reasonably push it to its limits before changing it or implementing additional controller clustering features.

More specifically, one of the things we were trying to achieve as part of R3.2 was making sure that people who use step for API-style testing (in the context of micro-services for example) wouldn't run into any deal-breaking bottleneck while ramping up to thousands of calls per second with a single controller instance. You see, step was initially designed to handle longer, E2E-style test cases (built on top of heavy-duty stacks like Selenium+Firefox), meaning that most of the heavy lifting would be handled on the agent side, and the overhead of controller-agent communication was not significant. This is why you'll find a dynamic grid system in step, which allows for effortless scaling of agent nodes.

However, as we moved our way toward finer, more granular test case designs such as user-action-level test cases or even more extreme ones, such as these very fast API calls to simple services, we progressively shifted the load and pressure from the agent toward the controller in the form of controller-agent communication and controller-side, keyword-related task executions. In other words, we put the step framework and in particular the code that runs on the controller under heavier pressure.

Spoiler alert: Mongo's aggregation framework does not scale infinitely


The user who first ran into the problem I'm about to describe is the first person who pushed the architecture to its current (now old) limits. I've talked about why his setup was so interesting to me and why what he did would have been difficult to achieve with other tools.

Once his test setup was fully operational and the first basic concurrent tests were completed, we started to increase the number of threads in our simulation in order to reach the target load. Since we were caching and reusing the same webservice port for several thousand iterations, the pressure put on the agent was minimal and we were able to start 30 and then 60 threads with a single agent node without any problem.

However, after a while, we noticed in RTM that our transaction throughput was slowly decreasing even though the response times based on the business code timers were stable. We quickly realized that the CPU consumption of step's MongoDB instance was increasing at the same time. Not only that, but the CPU usage looked very "spiky". Eventually, the test would hardly make any progress anymore and parts of the execution view would become unresponsive.

This is what the throughput started to look like in RTM:


And the DB host CPU looked something like this:



This is the kind of curve you don't like to see when you're load testing something, especially if you're the provider of the tool that's used as part of the test.

A quick look at the MongoDB log file, where queries exceeding a certain time threshold (default = 100 ms) are automatically reported, showed me that certain aggregation queries were taking an exponentially increasing amount of time to complete.

Here's a sample entry from the mongo log file:

2016-12-02T18:10:36.886+0100 I COMMAND [conn4] command step.reports command: aggregate { aggregate: "reports", pipeline: [ { $match: { executionID: "58418c026de47b2ffefe28c8", _class: "step.artefacts.reports.CallFunctionReportNode" } }, { $group: { _id: "$status", sum: { $sum: 1 } } } ] } keyUpdates:0 writeConflicts:0 numYields:16032 reslen:101 locks:{ Global: { acquireCount: { r: 32092 } }, Database: { acquireCount: { r: 16046 } }, Collection: { acquireCount: { r: 16046 } } } protocol:op_query 16014ms

A db.currentOp() also confirmed that nothing substantial was really going on in the DB other than these queries.

So I started where I always start, which is with the symptoms. Here's a summary of my observations at this point:

A) There's a continuous decrease in transaction throughput which isn't correlated with longer transaction response times
B) MongoDB uses an increasing amount of CPU to do - as far as we know - the "same amount" of work (i.e. we're not increasing the load as we go)
C) Parts of the GUI, in particular the real-time monitoring views, are becoming unresponsive
D) The duration of aggregation queries in MongoDB is increasing exponentially over time according to its log file

First conclusions: we've clearly got a bottleneck involving the database, apparently related to aggregation routines. Seeing that we've got this continuously worsening pattern, it also seems to relate tightly to time or to the accumulation of some sort of object (i.e. the test's duration or the number of executed test steps).

At this point, if you know the product as a developer, the mongo logs are already giving away some critical information, because if you're familiar with the way you're querying your backend and how/why, it won't take very long until you figure out the whole cascade of events. That was the case for me here. But since I don't like shortcuts, and for the sake of making this blog post worth something, I went ahead and checked out step 3.1.1 and reproduced the problem to show you what I'd have done if I hadn't had any prior knowledge of step's code. Most of my career, that's the scenario I was put in, after all.

So we've got a throughput decrease correlated with an increase in CPU utilization on the DB. Mongo also gives us a solid hint as to which queries are taking the longest to execute, and they're consistently the same. Here, we've isolated what is taking time. To me, the name of the game now is to figure out where these queries come from (are they issued by components running the test itself, by some other functionality, or are they user-triggered, and if so from the GUI?). The only thing we know is that in step's architecture, the controller is the only process accessing the MongoDB instance.

So let's perform a bit of sampling on the controller and find out what's going on.

Since we're interested in DB queries here, we'll directly take a look at the reverse tree to see if we can quickly spot the code paths that lead to reads in Mongo. And this is what we get with a simple filter on step packages (we do this to reduce noise and make the stats in the tree more relevant):


Okay, so we've got 92% of all of the "step" code busy with socket reads (excluding the FileWatcher, which runs alongside the main threads and does nothing most of the time). Not too surprising to find a lot of time reading on the network (in essence, waiting for something), considering the symptoms we're already aware of. Let's unfold this branch though, in order to find out where these reads come from:



I had to edit the picture to get the complete stack to appear at once.

What we learn here, as we go down the stack, is that:

  1. the jongo driver tells us we've spotted the Java-side aggregation code paths which lead to the queries we've seen in the mongo logs (see the class Aggregate and the method call "executeAggregateCommand"); a rough sketch of what such a call looks like follows below.
  2. all of this stuff comes from the same business code: step's ProgressUnit plugin
  3. this plugin code is invoked via Jersey, which is step's REST provider.
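
For readers who haven't worked with jongo, here's a rough sketch of what the Java-side aggregation behind that logged query looks like (this is not step's actual ProgressUnit code; the collection name, fields and pipeline are taken from the mongo log entry above, and the result class is a placeholder):

import com.mongodb.MongoClient;
import org.jongo.Jongo;
import org.jongo.MongoCollection;

public class StatusAggregationSketch {

    public static class StatusCount {
        public String _id; // the status we group on
        public int sum;    // number of report nodes with that status
    }

    public static void main(String[] args) {
        Jongo jongo = new Jongo(new MongoClient().getDB("step"));
        MongoCollection reports = jongo.getCollection("reports");

        // One aggregation per refresh of the progress view: match all report
        // nodes of the execution, then group and count them by status
        Iterable<StatusCount> counts = reports
                .aggregate("{$match: {executionID: #, _class: 'step.artefacts.reports.CallFunctionReportNode'}}",
                        "58418c026de47b2ffefe28c8")
                .and("{$group: {_id: '$status', sum: {$sum: 1}}}")
                .as(StatusCount.class);

        for (StatusCount count : counts) {
            System.out.println(count._id + ": " + count.sum);
        }
    }
}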

Assuming no one is using a custom client or script to poll the information, we now know that these queries are directly or indirectly triggered by the user through step's webapp.

So I restarted the test, went back to the execution view and turned on network capture in Chrome's developer tools. The first thing I noticed, other than the Ajax calls taking longer and longer to complete, was that they were issued automatically at regular time intervals. That made sense, because auto-refresh of the views is enabled by default (the user can turn it off via a button in the top right corner). Let's take a look at this.



Very early on in the test, it's not that bad, but there's a clear increase in one of the backend calls. We start at around 2 ms (which you can't see here) and then the response time for that call keeps increasing (a few hundred milliseconds in this first picture, and then up to tens of seconds when the test actually chokes up completely).

Later, at choke time:



Now, looking at the kind of data this backend call was retrieving and looking at the JavaScript code, if I hadn't already identified the guilty view element based on the query itself, it would quickly have become clear that the main culprit was the progress bar of the execution view:



If we're computing aggregations every 5 seconds and if, for some reason, MongoDB can't deliver a result in less than 5 seconds, then the next call is going to overlap with or "stack" on top of the previous one, further increasing the amount of stress we put mongo under, which will in turn cause that second call to be slightly slower than the first one, thus starting an obvious vicious cycle. Eventually, the DB resources are exhausted and no more progress is made.

Also, interestingly enough, early in the reproduction of the problem I was able to correlate CPU spikes with each backend call (yes, they were that costly!). This is one of the reasons why I drew quicker conclusions from the query I had found in the DB logs and didn't really need sampling data to confirm my hypothesis:


Either way, at the end of this session, it was clear to me that it was time to question the way the aggregation of keyword counts is computed within a test execution. The aggregation routine was clearly reaching its limit and would not scale past a few million data points grouped in a single test execution. And it would especially not scale with additional users. I mean, we were already torturing the database with one or two web browsers open!

What a shame, how could you design something like this?


Of course, as always when it comes to performance problems, you could say this was a major flaw in the architectural design of the program and blame the architect and developers for going for a convenient way to implement aggregation pipelines and not anticipating that this would become an issue. The thing is, I'll take a bad yet real performance problem any day over a performance problem due to premature optimization.

For starters, this implementation had worked for years in other production environments, with little to no complaints as far as the performance of this view goes and with no substantial impact on mongo's CPU usage. So you could argue that this "quick implementation" bought us 2 or 3 years, during which we were happily working on other, more dire problems.

By "quick implementation", I mean that we were already using MongoDB and it seemed worth attempting to use this convenient feature prior to moving to anything more complex or less generic (i.e. optimizing).

Also, had we anticipated this issue, there's no guarantee that the other implementation would have been more successful or sustainable for a longer period of time, at least not without additional testing, which always comes at a cost. Let's remember here that the reason we hadn't endurance-tested this functionality was that the number of test steps tied to a single test execution was orders of magnitude lower until we started supporting fine-granular keyword designs and pushing that kind of setup to thousands of executions per second.

Some would argue that introducing an endurance test early on would have been a positive thing regardless of the old requirements, because more tests mean a better product. I agree to a certain extent - depending on the context and which tests we're talking about - but you can also kill your productivity with that kind of dogmatic behavior.

Obsessing over test coverage can be a very bad thing sometimes and it can be a way to lose focus on more important things. In this case, considering the stage the product was in at the time, it was clearly benefiting more from adding important missing features than from maximizing test coverage in areas where we weren't sure we'd eventually make a push. I do believe in TDD and automated performance testing in some situations but not all situations. I don't even know if it should be the go-to or default development method.

Bottom line is, you cannot effectively plan for, let alone execute, every possible performance test in advance. However, now that this has become an issue, we'll definitely make sure it never, ever becomes an issue again. And we'll test for it. I do believe in writing a test for every bug or significant issue I come across, because statistically speaking they're more likely to show up again as a regression in some later version of my software.

The aftermath


Looking back at the first of these performance battles as part of R3.2, although I'd like to say I learnt a lot, I actually think my beliefs were just reinforced, and I don't think we did that much wrong here, other than doing pragmatic development and solving a problem which was part of an authentically new experiment. What actually happened is that we used a piece of software which was originally designed and tuned for a certain set of inputs and assumptions (not-so-granular E2E tests) to perform with different inputs (a much more granular test suite, at higher load).

This is often what lays the groundwork not just for performance problems but for non-functional problems in general. Security breaches also frequently arise from the misuse of software or its use under new circumstances which were not initially accounted for. Just because something seems like it will work (because it's compatible from a functional standpoint) doesn't mean that it will in practice.

And I think our users are more worried about our ability to react, provide support and in this specific case, diagnose the performance problem and solve it quickly, than about the performance of the product in any specific area at a given point in time. This is why sound analysis skills and battle-tested tooling are so important.

Where I do learn a lot these days though, is in the support department. Some of the practical ways I've had to deal with issues are awful, including in the example of this blog entry. Sometimes I've had to do video calls with a webcam aimed at a monitor, and had to tell users what to do command by command. I've got some big plans to improve our ability to remote-diagnose issues in step. They include adding some advanced archiving/dumping functionality for troubleshooting and improving step's embedded monitoring and diagnostic functionality.

Now, to wrap up the technical side of the story: we decided to add and maintain incremental counters on the fly during test execution (i.e. the queries are now pre-computed). This approach relieved the pressure on MongoDB entirely.

It is a much less dynamic approach, but we actually didn't really need to compute these aggregations dynamically in that spot. We're just looking to feed a fairly static dashboard in that part of the app. Exploratory and interactive analysis with slice and dice functionality is something that we provide in RTM.
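
Conceptually, the change boils down to something like the following sketch (not step's actual implementation): one in-memory counter per status, incremented as report nodes complete, so the progress view reads pre-computed values instead of triggering a $group aggregation over millions of documents.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class StatusCounters {

    private final Map<String, LongAdder> countsByStatus = new ConcurrentHashMap<>();

    // Called once per completed keyword/report node, on the hot path
    public void increment(String status) {
        countsByStatus.computeIfAbsent(status, s -> new LongAdder()).increment();
    }

    // Called by the progress view: cheap, no aggregation query needed
    public long getCount(String status) {
        LongAdder adder = countsByStatus.get(status);
        return adder == null ? 0 : adder.sum();
    }
}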

Stay tuned for parts 2 & 3 of this performance adventure!


Saturday, February 4, 2017

Why fine-granular scripts matter

aka. load-testing with a complex client stack


One of our users was working on a project where he needed to load-test heavily secured webservices. He was the first person to run into the bottleneck that I described in the first entry of my series on achieving scale via iterative analysis. The target load he needed to reach was approximately 1000 tps and his project required that the client-side code, used as part of the test harness, be written in a custom way and in Java. He couldn't use our SoapUI plugin or any other traditional HTTP/S testing tool. Let's see why.

The first reason was that this was a Test as a Service scenario where our user was not the developer of the service and had no access to the code base. He had only been sent a few jar files, certificates, config files and the endpoint URL. His time budget was limited and so he needed to implement his test client in the same language as the client-side libraries (i.e. Java). After fighting a bit with the different security libraries, he got his test client code to work. He then managed to package it as a library and went on to expose its functionality as step keywords via annotations.

The other reason he wanted to use the client libraries shipped by the team responsible for the service was that this would make for much more relevant test results (i.e. more realistic measurements). Indeed, if you've load-tested complex distributed systems before, and more specifically their RPC interfaces and endpoints, then you know that a lot of problems can lie inside client code. Whether it's distributed memory leaks, cache issues or other performance issues resulting from the incorrect use of business objects on the client side, there are many reminders that the devil hides in the details and that load tests need to be as complete and "end to end" as possible to be relevant. And as part of this process of simulating conditions as similar as possible to those of production, including the actual client-side libraries can be crucial.

step's support for custom Java keywords was a nice fit for this use case, and there's no realistic scenario in which he would have been able to build this test harness with a standard HTTP tool such as SoapUI, at least not within the time budget he had.

Another interesting result of this approach is that this user was in full control of his test scenario and measurement points. step's keyword API and test plan semantics allow you to build and orchestrate the execution of code blocks at exactly the level of granularity you need for your test scenario.

Applied to the test plan in our webservice example, this is what it looked like:


As you can see, highlighted in yellow, calling the webservice is a 2-step process (as is usually the case with RPC mechanisms). The first step is about initialization and instantiation. It is assumed to be done only once (per what exactly depends on the technical context), but the resulting objects are then expected to be cached and reused many times. This initialization phase is often a costly step for both the client and server, as it involves building or loading many complex objects and classes, such as the webservice "port" object, authentication tokens or other framework components. The second step is usually less costly, as its duration depends almost only on the provider-side business code and suffers from much smaller protocol-induced overhead.

In the context of our user, the initialization of the webservice client lasted 1 second on average, but it could then be cached and each subsequent service call would last only 20 ms. He didn't anticipate this aspect and initially did a few tests with a more monolithic version of his keyword (essentially tying the initialization and business calls together).

Not only did this lead him to produce unrealistic response time measurements, potentially misguiding project stakeholders in the end, but it also caused the server to become unstable and the agents to use much more CPU than they should have.

In production, it was projected that applications would indeed cache the webservice port, and so the only truly representative load test harness would be one in which the load is primarily generated through concurrent executions of step 2, not step 1. So we solved this rather elegantly by splitting the keyword in half and, as you can see in the test plan, by isolating the initialization step prior to entering the ForEach block in which we iterate over a datapool to simulate different business calls (with different input data).

Illustrating this in a web workflow, it would be equivalent to making sure we simulate the expected number of HTTP session objects by having each user only log in once in a while (after executing several functional scenarios), rather than recreating new sessions at the end of each action or iteration. The impact on memory usage is significant and simply cannot be ignored if you intend to run meaningful performance tests and produce valid results.


Now, this test plan worked well in a single-threaded "baseline" scenario, but since step distributes its load heavily across both the agents and a given agent's execution units (tokens), there is no guarantee that each unit would already have a valid webservice client at its disposal. So eventually we tweaked our test plan a bit to check whether a client object was present in our context or not, and initialize it if needed.
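
In keyword code, the resulting pattern looks roughly like the sketch below, written in the style of the Java keyword class shown in the Selenium post above. MyServiceClient, its call() method and the "payload" input are hypothetical placeholders, since the real client came from the jars shipped by the service team.

public class WebserviceScripts extends AbstractScript {

    @Function
    public void InitClient() {
        // Only create the costly client (about 1 second) if this token's session doesn't hold one yet
        if (session.get("client") == null) {
            session.put("client", new MyServiceClient()); // hypothetical client class
        }
    }

    @Function
    public void CallService() {
        // Cheap business call (about 20 ms), reusing the cached client
        MyServiceClient client = (MyServiceClient) session.get("client");
        client.call(input.getString("payload")); // hypothetical method and input
    }
}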

And so we were able to ensure that :
  • each business call could be made using a correctly instantiated webservice port
  • the initialization keyword would only be executed once per context (token)
  • the whole setup would work in step's highly concurrent and distributed, yet (by default) stateless context

Another way to do this would have been to use a FunctionGroup control to enter a stateful mode in which keywords are guaranteed to land on the same agent and agent unit (token). In a single-tenant context (only one user submitting tests to the platform), and without complex dispatch rules, stateful execution isn't a big deal, but it can become a source of problems in other situations. I also just find it good practice to refrain from introducing strong dependencies whenever I can.

Finally, we made sure to wrap the different interesting blocks of code with different timers using step's Measurement API (which directly feeds data to RTM for real-time analysis). This would allow us to understand exactly which parts of the code were costing the most execution time. It also allowed us to isolate backend time in a very fine way (i.e, distinguish between that time and the rest of the client-side time), thus providing the project stakeholders with highly relevant and detailed metrics, upon which they could take action if needed.