vendredi 19 mai 2017

A simple selenium load test based on step

This morning I published a tutorial on youtube to help people get started from scratch with selenium load testing, using step as an execution platform. In less than 30 minutes you'll be able to create a script, design a test plan and use step's agent grid to run your selenium scripts in a parallel and massively scalable way.

Here's a link to the video : https://www.youtube.com/watch?v=_D4PQjdbjMI

The main reason I'm writing this blog entry is that I wanted to publish the code of the script I ended up with at the end of the tutorial, so that people could easily copy-paste it if needed. Other than changing the path of the chrome driver binary, you should be able to use it as it comes.

So here it is:

package scripts;

import java.util.HashMap;

import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import step.handlers.javahandler.AbstractScript;
import step.handlers.javahandler.Function;
import step.handlers.javahandler.ScriptRunner.ScriptContext;

public class SeleniumScripts extends AbstractScript {
 
 @Function
 public void CreateDriver(){
  // Adjust this path to point to your local chromedriver binary
  System.setProperty("webdriver.chrome.driver", "path_to_chrome\\chromedriver.exe");
  ChromeDriver chrome = new ChromeDriver();
  
  // Store the driver in the session so that the other keywords can reuse it
  session.put("driver", chrome);
 }
 
 @Function
 public void SampleKeyword(){
  
  // Read the sleep duration from the keyword's input
  long sleepTime = input.getInt("sleepTime");
  
  // Reuse the driver created by CreateDriver and stored in the session
  ChromeDriver chrome = (ChromeDriver) session.get("driver");
  chrome.get("http://exense.ch");

  sleep(sleepTime);
  
  WebElement el = chrome.findElement(By.xpath("//a[text()='Consulting']"));
  el.click();
  chrome.findElement(By.xpath("//h3[text()='Performance Analysis']"));
  
  sleep(sleepTime);

 }
 
 private void sleep(long duration){
  try {
   Thread.sleep(duration);
  } catch (InterruptedException e) {
   e.printStackTrace();
  }
 }
 
 @Function
 public void CloseDriver(){
  ChromeDriver chrome = (ChromeDriver) session.get("driver");
  // Use quit() rather than close(): quit() also terminates the chromedriver process
  chrome.quit();
 }
 
 @Test
 public void testSampleKeyword(){
  ScriptContext sc = new ScriptContext(new HashMap<String, String>());
  // Create the driver first so that SampleKeyword finds it in the session,
  // and pass the sleepTime input that it expects
  sc.run("CreateDriver", "{}");
  sc.run("SampleKeyword", "{\"sleepTime\":1000}");
  sc.run("CloseDriver", "{}");
 }

}



One more thing I would like to point out is that I made a mistake while writing my keyword "on the fly" in the video: in the CloseDriver() method, we really should be using the quit() method instead of close(), because close() only closes the window and won't terminate the chromedriver process. So in order to avoid leaking chromedriver.exe processes, make sure to call quit().

Also here's a link to the docs, which essentially guide you through all of this and provide more details. Keep in mind that not all of step's functionality has been documented yet, and that you'll need to explore and poke around if you want to benefit from all of what the platform has to offer. We're available on github and via the contact form of our website if you need help with any aspect of the tool.

I'm probably going to redo this video because I believe I could do a much better job and pack it into a 5-minute clip. I'm also planning another series of tutorials like the one I did last year to get people started with djigger. We feel step has reached a level of maturity that allows us to produce this kind of content effectively: the APIs have stabilized, so the content will remain up-to-date, and it really is just that easy to get started with the tool now.

I hope this tutorial will help you get started and make you want to join our community, as we believe step brings a lot to the table for anyone who wants to do automation in a way that is both effective and elegant.

Not only are we addressing important technical issues such as scalability, compatibility and portability, and enhancing comfort and ease of use with a modern, central application for our users, but we're also providing a ton of flexibility.

What I mean by flexibility is that if you adopt step, you won't have to build yet another cluster for every new tool you want to use. You'll be reusing our agent technology no matter which simulation engine you've selected. Hell, it could even be your own custom technology that you deploy on our agent grid! JMeter wants you to use its own master/slave logic. Grinder wants you to use its workers. Selenium wants you to use its own grid. And so on. It's the same problem when it comes to analyzing results: each tool has its own charts, logs, file formats, etc. So it's time we unify and rationalize this mess, and finally work together on one target platform for the execution of tests and the analysis of their results.

With step, we're doing just that. We're bringing that central solution that fits all of our test automation use cases and addresses the challenges that we've come across as a community in about 10 years of experience testing applications together.

mercredi 17 mai 2017

The iterative process of achieving scale (3/3) - Unnecessary synchronization

What you'll find in this blog entry : the E2E and step-by-step analysis of a performance problem and a couple of concrete tips to help you spot synchronization issues faster.

I'll conclude this three-part analysis session with a simpler problem which still affected step's maximal keyword throughput in v3.2 after we corrected the aggregation problem (part 1) and the HTTP connection pooling problem (part 2). I say it's a simple type of problem because I feel I can diagnose these quickly, but I think it's definitely worth going over, as I'll be able to showcase some of my favorite features in djigger, our open-source profiler and production monitoring tool.

So if you recall, at this point in our story, we're trying to optimize step's code to reach a target keyword load of 2000 keyword executions per second or more in a single-controller environment. I should point out that the term "optimization" was not entirely appropriate in our context up until now, since the first two problems were actually major issues affecting the stability of our test executions, thus destroying our ability to serve the user who wanted to use step for that particular kind of testing (which, again, was new for us). In particular, the HTTP connection pooling issue from part 2 was not even a throughput problem - constantly recreating HTTP connections doesn't cost that much CPU or time - but much more of a stability issue, revealed only in an endurance test (leaky sockets).

At the end of part 2 however, we can pretty much say we're moving toward "optimization" because we're able to achieve stable throughput at a level close to our target. So now it's all about hunting down the last few bottlenecks and hopefully getting rid of them without having to undertake any major architectural rework.

So again, we started a load scenario with a target of about 2000 kw/s, with proper corresponding values for threading and pacing, and we found ourselves saturating around the 1600 kw/s mark. And again, I used djigger to produce a picture of the distribution of samples (which can be used as an approximation of elapsed time) across the different packages inside the controller component.

And looking at the Reverse view with a filter on step's active thread stacks, this is what I found:



See anything interesting in here?

Remember, when we look at the Reverse tree view, we see the top-most methods of the thread stacks. And after excluding park(), wait() and other intentionally "inactive" methods, and focusing on stacks going through our own packages (.*step.*), we'll end up with method calls which are supposed to be doing some concrete thing for us.

We could take the liberty to call these "on-cpu" methods, even though the term wouldn't technically be 100% correct. Either way, these leaves (or roots, depending on the view you've chosen) are usually low-level methods from java or sun packages (for instance, methods involved with data manipulation, I/O, etc).

Knowing that, it makes sense to find methods like socketRead0(). Threads busy with this method call are either blocking synchronously, waiting for a message from a remote process (in our situation, the DB or an agent), or actively reading that message, using up some CPU time. We can't see this distinction in this view because it's a native method, and from the Java world we can't see which line of code inside that method is being executed or which other native methods it calls.

Either way, the best way to investigate this time consumer, assuming you have a legitimate reason to do so, would be to unfold that branch and further analyze which packages lead to these socketRead0() calls, and whether their associated percentages make sense according to what you know the application is doing or should be doing.

Here since we're sampling step's controller, they're totally legitimate because the controller should spend most of its time delegating work to its agents, which results in blocking calls to socketRead0().

Now, something I found a lot more interesting is the method PluginManager.invoke(). First of all, it's the only method belonging to "business code" (i.e. non-java, non-sun, etc.) in which a significant amount of time is being spent. Secondly, just judging by its name, it's clearly not the kind of method that does a lot of work on its own (like iterating through a large collection and computing stuff). Its name is "invoke". Of course, there's always a chance that the sampler catches any method, even the dumbest/fastest ones, on top of a thread stack by "accident", but when 18% of the samples show that method on top of the stack, it can't be an accident anymore.

If you have prior experience analyzing these kinds of patterns, you should already know what the problem is at this point. For the other readers, let's take a closer look at the context of this method call, by unfolding that branch and displaying code line numbers as a node-differentiating criterion in our tree view:



So first of all, it's interesting to note that after splitting on code lines, we still find all of the samples busy in the exact same spot (i.e, line 72). This should be another strong hint at what is going on with that method.

Unfolding the branch allows us to confirm the origin of that method call. As a developer, I now know that this method is called at the end of the execution of each step of my test plan, which, from a performance standpoint, is a big deal: it's a highly parallel and hence hot code path, and special care should be given to what kind of work is being done here and how it's being done.

Finally, let's take a look at that line of code:



Yeah, of course, it's part of a synchronized block. I say "of course" because this was the only logical possibility that could fit the symptoms we just went over.

Synchronization is a quick and easy way to make things thread-safe, and it's fine to use. But if you're going to synchronize methods or blocks which are part of very hot code paths, such as in this example, you'd better have a damn good reason to do so.

And in this case, we didn't. It was an unnecessary safety precaution around an object that was already thread-safe.
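To illustrate the pattern (with made-up names, this is not step's actual PluginManager), here's roughly what the before/after looks like: a hot dispatch method serializing every caller on a lock, even though the underlying collection is already safe to iterate without one.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch with hypothetical names: the dispatch method sits on a hot code path,
// yet every executing thread queues up on the same lock, capping throughput.
public class HotPathDispatcher {

 private final List<Runnable> plugins = new CopyOnWriteArrayList<>();

 // Before: unnecessary lock, all threads serialize here.
 public synchronized void invokeAllSynchronized() {
  for (Runnable plugin : plugins) {
   plugin.run();
  }
 }

 // After: the lock is simply dropped, because iterating a CopyOnWriteArrayList
 // is safe without any external synchronization.
 public void invokeAll() {
  for (Runnable plugin : plugins) {
   plugin.run();
  }
 }
}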

There you have it: we've taken premature optimization to a whole new level, or rather we've created a new branch in that family: premature thread-safety concerns.

In any case, we removed that synchronized block and were finally able to reach our target load.

I hope this series of step-by-step analysis sessions was either helpful or fun for most readers. I might do another round of these in the next few months, since I've again analyzed and fixed a bunch of performance issues while cleaning up R3.3. Each new feature (especially the ones that involve changes in core packages or in these "hot" code path areas) introduces a new set of risks, calls for a new series of tests, and leads to a new session of old-fashioned analysis and code cleaning.

mercredi 5 avril 2017

The iterative process of achieving scale (2/3) - HTTP Connection Pooling

What you'll find in this blog entry : the E2E and step-by-step analysis of a performance problem, some thoughts about endurance testing and a couple of concrete tips regarding http connection pooling in a JVM.

Today I'm going to describe the second performance problem we ran into while working on step R3.2. I'm going to have to step up my blogging game though because I'm starting to drift behind releases (we've released v3.3 this week and there's some interesting stuff to talk about in there as well).

As a quick reminder of the last entry, in 3.2 we pushed the performance of our current architecture to its limits in order to support highly granular test scenarios in step, which is the test automation platform we've been building for the last few years.

The Symptoms


So here's how things went. After we dealt with the mongo aggregation problem and implemented static views, we were immediately able to scale up to about 1600 keyword executions per second, which was already about an order of magnitude higher than what we'd been doing until then.

At this load level however, about 20 minutes into the test, we started to run into some strange socket errors, apparently related to timeouts (java.net.SocketTimeoutException: connect timed out). Here's what it looked like in the execution view (which shows the results of each keyword call):



At first glance, it seemed like some sort of connection issue. But the immediate question that comes to mind is "why only after 20 minutes"? And what you'll find here is a good example of why endurance testing is a crucial part of the QA process.

Of course the classic example is the good ole memory leak "hunt". But memory leaks are not the only problem that needs time or repetition to reveal itself. There are plenty of other problems that can occur which relate to some sort of accumulation or exhaustion somewhere in the technical stack.

The Analysis


So let's take a closer look at one of the stack traces leading to one of these Socket exceptions (in this case, taken from sampling results with djigger) :


As you can see, this thread is attempting to create a new http connection and to connect to some remote endpoint. At the time of the call to connect(), a BindException is thrown.

Now, unless the peer went unresponsive or terminated, on a local network a connect should always succeed within a few milliseconds. Reaching a timeout - and reaching it suddenly, 20 minutes into the test - means in this case that something must be wrong with socket management at the system level.

This is what led me to take a closer look at the TCP connection activity on the different machines involved. And what I saw was this:


Tens of thousands of connections accumulating in a TIME_WAIT status (this is obviously just a sample of the netstat output).

Of course connections get recycled, but this process takes time, and soon enough all of the machine's ports were exhausted.

As it turned out, we weren't pooling connections between our controller and agents. And once again, this problem was brought to light by the suddenly very high number of controller-agent calls per second in this new, more granular setup. In older contexts, where controller-agent communication was less frequent, the average recycling rate of TCP connections was probably enough to cover up the problem.

The Solution (or one of them)


There are a few ways of fixing this problem (some more elegant than others), but I decided to repeat the test after adding the following system properties to both JVMs (controller and agent, although the truly necessary one is the flag set on the controller side):

-Dhttp.keepAlive=true -Dhttp.maxConnections=100
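If it's easier to manage from code than from startup scripts, the same two properties can also be set programmatically before the first connection is opened. A minimal sketch; note that these flags only affect the JDK's built-in HttpURLConnection keep-alive cache, while third-party HTTP clients have their own pooling configuration:

// Minimal sketch: the equivalent of the -D flags above, set from code before any HTTP call is made.
public class HttpPoolingDefaults {

 public static void apply() {
  System.setProperty("http.keepAlive", "true");      // reuse sockets instead of opening a new one per request
  System.setProperty("http.maxConnections", "100");  // number of idle connections kept alive per destination
 }
}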

In that retest, connections weren't leaked anymore and the test remained stable for hours.

dimanche 5 février 2017

The iterative process of achieving scale (1/3) - Aggregations in MongoDB

What you'll find in this blog entry : the detailed analysis of a performance problem, starting from the symptoms and going all the way to the root cause, with intermediate steps in the different layers of a distributed application (called "step").


Taking step out of its comfort zone and to the next level through release 3.2 was an interesting challenge. In this series of three performance-centric blog entries, I will reflect on some of the technical issues we've faced, analyzed and solved. This is basically a lessons-learned session involving open-source components which I believe are worth sharing and can be valuable to some people. I'll also put a certain emphasis on my thought process as a performance analyst during the entire exercise.

As you'll notice, every time visibility into the JVM runtime was necessary, I used djigger in its most simple form (i.e with the basic JMX connector) to diagnose the problem. That means that - as in 80% of my everyday-type performance problem cases - I never had to engage with instrumentation or any other complex form of monitoring or data-capturing approach.

That's because sampling is absolutely beast if you know how to process the resulting data and interpret it.

Of course, in addition to using a good tool, and before answering the question "why is this problem happening?" you'll first have to precisely answer the questions "what is happening?" and "where is it happening?".

And you need to start somewhere right? I've seen some interesting ideas over time and a lot of people I've seen dealing with performance aspects tend to start where their obsession or experience takes them. People who understand the OS well will go for a bottom-up approach. People who are interested in GC behaviour will start there, etc.

Without setting a goal, there's no such thing as achieving performance.


Personally, I always initially focus on:

1) understanding the inputs of my scenario (what is the user trying to achieve)
and 2) understanding the developer's and architect's intent (how did they technically plan to satisfy the user?)

These two pieces of information are the beginning of an answer to what is happening. And this is probably the common property shared by all performance problems : they all come from some sort of mismatch or miscommunication between some expectation and the actual behavior of the system that's assigned with satisfying that expectation.

If you skip this step, there's a high chance that you'll end up investigating the wrong thing or investigating the right thing for the wrong reasons. Everything can be tuned, optimized and if needed, redesigned, if you put in the work. From using indexes or rewriting queries to caching objects, to redesigning business cases and workflows... Caching. Pooling. Sizing. You name it. So let's keep this in mind as we analyze this series of problems.

By the way, if you're interested in my take on tooling and my "minimalist" approach to performance analysis, you can check out the manifesto-style FAQ I wrote for djigger, and also this past blog entry. It all basically revolves around the necessary and sufficient concept, which I've also touched on a bit in my presentation at JUG last year (the slides are in English and are available for download on that web page).

A bit of context on the tested system


First off, I feel the need to write a few words about step, which is the indirect object of our study here and the system whose performance will be dissected in the next paragraphs and blog entries. We like to describe step as the first open-source test automation platform for unified agile testing.  You might hear or see these words get thrown around a lot, but we've extensively documented how exactly we believe step implements and supports these concepts. It's not just a fancy way to advertise our stuff.

In addition to bringing a new stack and modern approach to the table, step can fit very easily into just about any legacy technical environment, and so it makes transitioning from old-school in-house testing to our more modern and efficient approach a lot easier. However, unifying testing practices and moving towards continuous delivery mechanics means higher performance and scalability requirements for us, as software providers. The flexibility we provide also means that step has to accept and support a wide range of possible use cases at that kind of scale. This is what brings us to this series of performance analysis entries.

The first important bit of information to process in order to understand the rest of this blog entry is that step is highly distributed in nature and is based on a 1-n controller-agent architecture.



The central controller is responsible for dispatching keyword executions (i.e. the execution of test fragments or scripts) and their inputs, while agents are responsible for the actual local execution and for returning an output to the controller. Agents are horizontally scalable. The controller can be duplicated, but there is no built-in support or semantics for inter-controller communication or synchronization at this point. This means that if you want to add controllers, you have to distribute load across the cluster of controllers yourself. It's possible, and actually not that hard, but since a single controller instance has handled the load requirements of all of our users thus far, we've focused on implementing other, more pressing features.

While scaling up to handle new target load levels and new types of load, we realized there was still a lot of untapped performance potential left in this architecture, so it made sense to optimize it and reasonably push it to its limits before changing it or implementing additional controller clustering features.

More specifically, one of the things we were trying to achieve as part of R3.2 was making sure that people who use step for API-style testing (in the context of micro-services for example), wouldn't run into any deal-breaking bottleneck while ramping up to thousands of calls per second with a single controller instance. You see, step was initially designed to handle longer, E2E-style test cases (built on top of heavy-duty stacks like Selenium+Firefox), meaning that most of the heavy lifting would be handled on the agent-side, and the overhead of controller-agent communication was not significant. This is why you'll find a dynamic grid system in step, which allows for effortless scaling of agent nodes.

However, as we moved our way toward finer, more granular test case designs such as user-action-level test cases or even more extreme ones, such as these very fast API calls to simple services, we progressively shifted the load and pressure from the agent toward the controller in the form of controller-agent communication and controller-side, keyword-related task executions. In other words, we put the step framework and in particular the code that runs on the controller under heavier pressure.

Spoiler alert : Mongo's aggregation framework does not scale infinitely


The user who first ran into the problem I'm about to describe is the first person who pushed the architecture to its current (now old) limits. I've talked about why his setup was so interesting to me and why what he did would have been difficult to achieve with other tools.

Once his test setup was fully operational and the first basic concurrent tests were completed, we started to increase the amount of threads in our simulation in order to reach the target load. Since we were caching and reusing the same webservice port for several thousand iterations, the pressure put on the agent was minimal and we were able to start 30 and then 60 threads with a single agent node without any problem.

However, after a while, we noticed in RTM that our transaction throughput was slowly decreasing even though the response times based on the business code timers were stable. We quickly realized that CPU consumption of step's MongoDB instance was increasing concurrently to that phenomenon. Not only that, but the CPU usage looked very "spiky". Eventually, the test would hardly make any progress anymore and parts of the execution view would become unresponsive.

This is what the throughput started to look like in RTM :


And the DB host CPU looked something like this :



This is the kind of curve you don't like to see when you're load testing something, especially if you're the provider of the tool that's used as part of the test.

A quick look at the mongodb log file, where queries exceeding a certain time threshold (default = 100 ms) are automatically reported, showed me that certain aggregation queries were taking an exponentially increasing amount of time to complete.

Here's a sample entry from the mongo log file :

2016-12-02T18:10:36.886+0100 I COMMAND [conn4] command step.reports command: aggregate { aggregate: "reports", pipeline: [ { $match: { executionID: "58418c026de47b2ffefe28c8", _class: "step.artefacts.reports.CallFunctionReportNode" } }, { $group: { _id: "$status", sum: { $sum: 1 } } } ] } keyUpdates:0 writeConflicts:0 numYields:16032 reslen:101 locks:{ Global: { acquireCount: { r: 32092 } }, Database: { acquireCount: { r: 16046 } }, Collection: { acquireCount: { r: 16046 } } } protocol:op_query 16014ms

A db.currentOp() also confirmed that nothing substantial was really going on in the DB other than these queries.

So I started where I always start, which is with the symptoms. Here's a summary of my observations at this point :

A) There's a continuous decrease in transaction throughput which isn't correlated with longer transaction response times
B) MongoDB uses an increasing amount of CPU to do - as far as we know - the "same amount" of work (i.e. we're not increasing the load as we go)
C) Parts of the GUI, in particular the real-time monitoring views, are becoming unresponsive
D) The durations of aggregation queries in MongoDB are increasing exponentially over time, according to its log file

First conclusions: we've clearly got a bottleneck involving the database and which apparently relates to aggregation routines. Seeing that we've got this continuous worsening pattern, it also seems to relate tightly to time or to the accumulation of some sort of object (i.e, the test's duration or the amount of executed test steps).

At this point, if you know the product as a developer, the mongo logs are already giving away some critical information, because if you're familiar with the way you're querying your backend, and how and why, it won't take very long until you figure out the whole cascade of events. That was the case for me here. But since I don't like shortcuts, and for the sake of making this blog post worth something, I went ahead and checked out step 3.1.1 and reproduced the problem to show you what I'd have done if I hadn't had any prior knowledge of step's code. For most of my career, that's the scenario I was put in, after all.

So we've got a throughput decrease correlated with an increase in CPU utilization on the DB. Mongo also gives us a solid hint as to which queries are taking the longest to execute, and they're consistently the same. Here, we've isolated what is taking time. To me, the name of the game now is to figure out where these queries come from (are they issued by components running the test itself, by some other functionality, or are they user-triggered, and if so from the GUI?). The only thing we know is that in step's architecture, the controller is the only process accessing the MongoDB instance.

So let's perform a bit of sampling on the controller and find out what's going on.

Since we're interested in DB queries here, we'll directly take a look at the reverse tree to see if we can quickly spot the code paths that lead to reads in Mongo. And this is what we get with a simple filter on step packages (we do this to reduce noise and make the stats in the tree more relevant):


Okay, so we've got 92% of all of the "step" code busy with socketReads (excluding the FileWatcher which runs alongside the main threads and does nothing most of the time). Not too surprising to find a lot of time reading on the network (in essence waiting for something), considering the symptoms we're already aware of. Let's unfold this branch though, in order to find out where these reads come from :



I had to edit the picture to get the complete stack to appear at once.

What we learn here, as we go down the stack, is that:

  1. the jongo driver tells us we've spotted the Java-side aggregation code paths which lead to the queries we've seen in the mongo logs (see the class Aggregate and the method call "executeAggregateCommand"); a rough sketch of what such a call looks like follows below
  2. all of this stuff comes from the same business code: step's ProgressUnit plugin
  3. this plugin code is invoked via Jersey, which is step's REST provider.
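For readers who haven't used jongo, here's roughly what such an aggregation call looks like on the Java side. This is an illustrative sketch only (assuming jongo 1.x), not the actual code of the ProgressUnit plugin; it simply mirrors the pipeline from the mongod log entry shown earlier:

import org.jongo.Jongo;
import org.jongo.MongoCollection;

import com.mongodb.MongoClient;

// Rough sketch of a jongo aggregation equivalent to the one visible in the mongod log.
public class ReportAggregationSketch {

 public static class StatusCount {
  public String _id; // the keyword status ($status)
  public int sum;    // number of report nodes with that status
 }

 @SuppressWarnings("deprecation") // getDB() is the legacy API that jongo 1.x expects
 public static void main(String[] args) {
  Jongo jongo = new Jongo(new MongoClient("localhost").getDB("step"));
  MongoCollection reports = jongo.getCollection("reports");

  // Match all keyword report nodes of one execution, then group and count them by status.
  Iterable<StatusCount> counts = reports
    .aggregate("{$match: {executionID: #, _class: 'step.artefacts.reports.CallFunctionReportNode'}}",
      "58418c026de47b2ffefe28c8")
    .and("{$group: {_id: '$status', sum: {$sum: 1}}}")
    .as(StatusCount.class);

  for (StatusCount count : counts) {
   System.out.println(count._id + ": " + count.sum);
  }
 }
}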

Assuming no one is using a custom client or script to poll the information, we now know that these queries are directly or indirectly triggered by the user through step's webapp.

So I restarted the test, went back to the execution view and turned on network capture in Chrome's developer tools. The first thing I noticed, other than the fact that the Ajax calls were taking longer and longer to complete, was that they were issued automatically at regular time intervals. This made sense because auto-refresh of the views is enabled by default (the user can turn it off via a button in the top right corner). Let's take a look at this.



Very early on in the test, it's not that bad, but there's a clear increase in one of the backend calls. We start at around 2 ms (which you can't see here) and then the response time for that call keeps increasing (a few hundred milliseconds in this first picture, and then up to tens of seconds when the test actually chokes up completely).

Later, at choke time :



Now looking at the kind of data this backend call was retrieving and looking at the javascript code, if I hadn't already identified the guilty view element based on the query itself, it would quickly have become clear that the main culprit was the progress bar of the execution view :



If we're computing aggregations every 5 seconds and if for some reason, MongoDB can't deliver a result within less than 5 seconds, then the next call is going to overlap with or "stack" on top of the previous one, further increasing the amount of stress we put mongo under, which will in turn, cause for that second call to be slightly slower than the first one, thus starting an obvious vicious cycle. Eventually, the DB resources are exhausted and no more progress is made.

Also, interestingly enough, early in the reproduction of the problem I was able to correlate CPU spikes with each backend call (yes, they were that costly!). This is one of the reasons why I drew quick conclusions from the query I had found in the db logs and didn't really need sampling data to confirm my hypothesis:


Either way, at the end of this session, it was clear to me that it was time to question the way the aggregation of keyword counts is computed within a test execution. The aggregation routine was clearly reaching its limit and would not scale past a few million data points grouped in a single test execution. And it would especially not scale with additional users. I mean, we were already torturing the database with one or two web browsers open!

What a shame, how could you design something like this?


Of course, as always when it comes to performance problems, you could say this was a major flaw in the architectural design of the program and blame the architect and developers for going for a convenient way to implement aggregation pipelines and not anticipating that this would become an issue. The thing is, I'll take a bad yet real performance problem any day over a performance problem due to premature optimization.

For starters, this implementation had worked for years in other production environments, with little to no complaints as far as the performance of this view goes and with no substantial impact on mongo's CPU usage. So you could argue that this "quick implementation" bought us 2 or 3 years, during which we were happily working on other, more pressing problems.

With "quick implementation", I mean that we were already using mongoDB and it seemed it was worth attempting to use this convenient feature prior to moving to anything more complex or less generic (i.e, optimizing).

Also, had we anticipated this issue, there's no guarantee that the other implementation would have been more successful or sustainable for a longer period of time, at least not without additional testing, which always comes at a cost. Let's remember here that the reason we hadn't endurance-tested this functionality was that the order of magnitude of test steps tied to a single test execution was much lower until we started supporting fine-granular keyword designs, and started to push that kind of setup to thousands of executions per second.

Some would argue that introducing an endurance test early on would have been a positive thing regardless of the old requirements, because more tests mean a better product. I agree to a certain extent - depending on the context and which tests we're talking about - but you can also kill your productivity with that kind of dogmatic behavior.

Obsessing over test coverage can be a very bad thing sometimes and it can be a way to lose focus on more important things. In this case, considering the stage the product was in at the time, it was clearly benefiting more from adding important missing features than from maximizing test coverage in areas where we weren't sure we'd eventually make a push. I do believe in TDD and automated performance testing in some situations but not all situations. I don't even know if it should be the go-to or default development method.

Bottom line is, you cannot effectively plan for or execute every possible performance test in advance. However, now that this has become an issue, we'll definitely make sure it never, ever becomes an issue again. And we'll test for it. I do believe in writing a test for every bug or significant issue I come across, because statistically speaking they're more likely to show up again as a regression in some later version of my software.

The aftermath


Looking back at the first of these performance battles as part of R3.2, although I'd like to say I learnt a lot, I actually think my beliefs were just reinforced, and I don't think we did that much wrong here, other than doing pragmatic development and solving a problem which was part of a genuinely new experiment. What actually happened is that we used a piece of software which was originally designed and tuned for a certain set of inputs and assumptions (not-so-granular E2E tests) to perform with different inputs (a much more granular test suite, at higher load).

This is often what creates the bed not just for performance problems but for non-functional problems in general. Security breaches also frequently arise from the misuse of software, or from use under new circumstances which were not initially accounted for. Just because something seems like it will work (because it's compatible from a functional standpoint) doesn't mean that it will in practice.

And I think our users are more worried about our ability to react, provide support and in this specific case, diagnose the performance problem and solve it quickly, than about the performance of the product in any specific area at a given point in time. This is why sound analysis skills and battle-tested tooling are so important.

Where I do learn a lot these days though, is in the support department. Some of the practical ways I've had to deal with issues are awful, including in the example of this blog entry. Sometimes I've had to do video calls with a webcam aimed at a monitor, and had to tell users what to do command by command. I've got some big plans to improve our ability to remote-diagnose issues in step. They include adding some advanced archiving/dumping functionality for troubleshooting and improving step's embedded monitoring and diagnostic functionality.

Now, to wrap up the technical side of the story: we decided to add and maintain incremental counters on the fly during test execution (i.e. the results are now pre-computed instead of being aggregated on demand). This approach relieved the pressure on MongoDB entirely.

It is a much less dynamic approach, but we didn't really need to compute these aggregations dynamically in that spot. We're just looking to feed a fairly static dashboard in that part of the app. Exploratory and interactive analysis with slice-and-dice functionality is something that we provide in RTM.
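To illustrate the idea (with a hypothetical class, not the actual step implementation), maintaining such counters on the hot path can be as simple as this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: instead of asking MongoDB to re-aggregate millions of report nodes
// on every refresh, keep one counter per status and bump it as each keyword completes.
public class ExecutionProgress {

 private final Map<String, LongAdder> countByStatus = new ConcurrentHashMap<>();

 // Called once per keyword execution, on the hot path, so it must be cheap and lock-free.
 public void report(String status) {
  countByStatus.computeIfAbsent(status, s -> new LongAdder()).increment();
 }

 // Called by the progress bar / dashboard: a simple read, no aggregation query needed.
 public long count(String status) {
  LongAdder adder = countByStatus.get(status);
  return adder == null ? 0 : adder.sum();
 }
}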

Stay tuned for parts 2 & 3 of this performance adventure !


samedi 4 février 2017

Why fine-granular scripts matter

aka. load-testing with a complex client stack


One of our users was working on a project where he needed to load-test heavily secured webservices. He was the first person to run into the bottleneck that I described in the first entry of my series on achieving scale via iterative analysis. The target load he needed to reach was approximately 1000 tps and his project required that the client-side code, used as part of the test harness, be written in a custom way and in Java. He couldn't use our SoapUI plugin or any other traditional HTTP/S testing tool. Let's see why.

The first reason was that this was a Test as a Service scenario where our user was not the developer of the service and had no access to the code base. He had only been sent a few jar files, certificates, config files and the endpoint URL. His time budget was limited and so he needed to implement his test client in the same language as the client-side libraries (i.e Java). After fighting a bit with the different security libraries, he got his test client code to work. He then managed to package it as a library and went on to expose its functionality as step keywords via annotations.

The other reason he wanted to use the client libraries shipped by the team responsible for the service was that this would make for much more relevant test results (i.e, more realistic measurements). Indeed, if you've load tested complex distributed systems before, and more specifically their RPC interfaces / endpoints, then you know that a lot of problems can lie inside client code. Whether it's distributed memory leaks, cache issues or other performance issues resulting from the incorrect use of business objects on the client side, there are many reminders that the devil hides in the details and that load tests need to be as complete and "end to end" as possible to be relevant. And as part of this process aiming at simulating conditions as similar as possible to those of production, including the actual client-side libraries can be crucial.

step's support for custom Java keywords was a nice fit for this use case, and there's no realistic scenario in which he would have been able to build this test harness with standard HTTP tools such as SoapUI, or at least not within the time budget he had.

Another interesting result of this approach is that this user was in full control of his test scenario and measurement points. step's keyword API and test plan semantics allow you to build and orchestrate the execution of code blocks at the exact level that you wish to have in order to design your test scenario.

Applied to the test plan in our webservice example, this is what this looked like :


As you can see, highlighted in yellow, calling the webservice is a 2-step process (as is usually the case with most RPC mechanisms). The first step is about initialization and instantiation. It is assumed to be done only once (per what exactly depends on the technical context), but the resulting objects are then expected to be cached and reused many times. This initialization phase is often costly for both the client and the server, as it involves building or loading many complex objects and classes, such as the webservice "port" object, authentication tokens or other framework components. The second step is usually much cheaper, as its duration depends almost only on the provider-side business code and suffers from much smaller protocol-induced overhead.

In the context of our user, the initialization of the webservice client lasted 1 second on average, but it could then be cached and each subsequent service call would last only 20 ms. He didn't anticipate this aspect and initially did a few tests with a more monolithic version of his keyword (essentially tying the initialization and business calls together).

Not only did this lead him to produce unrealistic response time measurements, potentially misguiding project stakeholders in the end, but it also caused the server to become unstable and the agents to use much more CPU than they should.

In a production scenario, it was projected that applications would indeed cache the webservice port, and so the only truly representative load test harness would be one in which the load is generated primarily through concurrent executions of step 2, not step 1. So we solved this rather elegantly by splitting the keyword in half and, as you can see in the test plan, by isolating the initialization step prior to entering the ForEach block in which we iterate over a datapool to simulate different business calls (with different input data).

Illustrating this with a web workflow, it would be equivalent to making sure we're simulating the expected number of HTTP Session objects by having each user log in only once in a while (after executing several functional scenarios), rather than recreating new sessions at the end of each action or iteration. The impact on memory usage is significant and simply cannot be ignored if you intend to run meaningful performance tests and produce valid results.


Now, this test plan worked well in a single-threaded "baseline" scenario, but since step distributes its load heavily across both the agents and a given agent's execution units (tokens), there is no guarantee that each unit would already have a valid webservice client at its disposal. So eventually we tweaked our test plan a bit to make sure we test whether a client object is present in our context or not, and initialize it if needed.
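One way to express that same check-and-initialize idea directly in keyword code would be something like the following sketch (made-up names, with a placeholder class standing in for the real generated webservice port):

package scripts;

import step.handlers.javahandler.AbstractScript;
import step.handlers.javahandler.Function;

// Illustrative sketch only: the expensive client is created lazily, stored in the token's
// session and reused by every subsequent business call landing on that token.
public class WebserviceScripts extends AbstractScript {

 @Function
 public void CallService(){
  ServiceClient client = (ServiceClient) session.get("client");
  if (client == null) {
   // First call on this token: pay the ~1 second initialization cost once...
   client = new ServiceClient();
   session.put("client", client);
  }
  // ...then each iteration only pays for the actual business call (~20 ms in our case).
  client.call();
 }

 // Placeholder standing in for the real webservice port and its security setup.
 public static class ServiceClient {
  public ServiceClient() { /* build the port, load certificates, fetch tokens... */ }
  public void call() { /* the actual webservice invocation */ }
 }
}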

And so we were able to ensure that :
  • each business call could be made using a correctly instantiated webservice port
  • the initialization keyword would only be executed once per context (token)
  • the whole setup would work in step's highly concurrent and distributed, yet (by default) stateless context

Another way to do this would have been to use a FunctionGroup control to enter a stateful mode in which keywords are guaranteed to land on the same agent and agent unit (token). In a single-tenant context (only one user submitting tests to the platform), and without complex dispatch rules, stateful execution isn't a big deal, but it can become a source of problems in other situations. I also just find it good practice to refrain from introducing strong dependencies whenever I can.

Finally, we made sure to wrap the different interesting blocks of code with different timers using step's Measurement API (which feeds data directly to RTM for real-time analysis). This allowed us to understand exactly which parts of the code were costing the most execution time. It also allowed us to isolate backend time in a very fine-grained way (i.e. distinguish between that time and the rest of the client-side time), thus providing the project stakeholders with highly relevant and detailed metrics upon which they could take action if needed.
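For illustration, here's roughly what such a wrapper looks like in keyword code. This is a hypothetical sketch: the reporting call (output.addMeasure) is an assumption about the Measurement API and its exact name and signature may differ; the point is simply to time the interesting block explicitly and report it as a named measurement:

package scripts;

import step.handlers.javahandler.AbstractScript;
import step.handlers.javahandler.Function;

// Hypothetical sketch: output.addMeasure is assumed here and may differ from the actual API.
public class MeasuredScripts extends AbstractScript {

 @Function
 public void CallServiceMeasured(){
  long start = System.currentTimeMillis();
  // ... the actual business call to the webservice goes here ...
  long duration = System.currentTimeMillis() - start;
  output.addMeasure("backend_call", duration); // assumed reporting call, feeds the custom timer to RTM
 }
}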

jeudi 15 décembre 2016

A plea for modern testing and the keyword approach



Observing how mainstream corporations try to adapt and keep up with modern trends such as agile development, REST-based architectures, continuous integration and browser-based emulation has been fascinating to me lately. While these new ideas always seem very sexy on paper, they're not always easy to implement and take advantage of. And it's not just because of lack of talent or proper management.

At the same time, their IT systems are often ticking bombs, unsupported stacks crawl around, and when delayed too long, major technical decisions tend to become radical and a bit brutal. So figuring out how to transition, toward what (there's a whole ocean of technology out there, and it's getting bigger and more confusing by the minute) and with which goals in mind, is definitely a key element in the success of these companies these days.

For instance, the SaaS transformation which has taken many markets by storm over the last few years still poses many technical as well as legal problems to insurances, banks or administrations, and so most of these companies still need to produce, test and operate their own software at scale. Software which happens to be very complex due to the nature of their businesses.

But today I'd like to focus on just one of these practices. I want to talk about how testing has been done in many of these environments, and how testing should be done, in my opinion. What I'm about to describe here is my personal experience working with and for such companies and what I think the future holds for them as far as testing goes.

And I want to start by mentioning that their biggest challenges are not just about scaling services to millions of users; they're also about ensuring solid test coverage in applications which may produce hundreds of thousands of very different and peculiar use cases, and which contain a lot of oddities due to historical reasons or regulations. These aspects impact a lot of things down the line when you're trying to modernize your system. They also turn test automation and test artefact reusability into absolutely critical factors which need to be taken into account when designing or redesigning a test factory.

The way testing has been done in most of these environments over the past decades involves an approach where each person or team who's concerned with a particular aspect of testing would implement not just their own strategy, but also their own scripts, buy their own tools, set up their own environment, etc. I call this "siloed testing".

There are many reasons for why I believe this approach is ill-advised, but I'm just going to cover a few of them here. These are big ones though, so buckle up, because if you're still testing in an unstructured way with old funky legacy tools, you're in for a ride.

Oh one last thing : I'm going to assume the reader understands the benefits of test automation here. I believe certain situations call for a bit of manual testing (let's not be dogmatic about anything), but I also don't think there's a single project I've worked on where anyone could argue in a solid way against a target of at least 80% of automated test coverage.

So let's dive into it!

The obvious reason : redundancy


I can't even count anymore how often I've come across corporations who have at least 3, but sometimes 4 or 5, different teams in charge of different types of tests for the same application or group of applications.

It seems like a healthy approach, since we're just trying to spread responsibility across different people and distribute workload here. But this configuration is only sustainable and scalable if you can do it efficiently enough. In my experience, it's not unusual to see a developer team, a functional testing team, a system/integration team, a load and performance testing team and then sometimes even a synthetic monitoring team re-script the same use case 5 times, maybe with 5 different tools, and often without even being aware of it.

Worse, they'll probably throw these scripts away entirely when the next release comes around.




Scripting a use case correctly can be pretty difficult and will sometimes involve hours or days of work, depending on how complex the scenario is and how proper the tooling and simulation strategy are.

I don't think there is a situation where that kind of waste makes sense. And I'm barely even going to mention the fact that in many cases, the resulting data - if any - is often meaningless because it's corrupt, incomplete or misinterpreted in some way. The complexity of this practice, sometimes the lack of skilled testers, but also the poor quality of the tools that people use, are all part of the problem here.

Nevertheless, it's important we make sure to get rid of this island of script waste in the middle of our ocean of Testing.


The serious reason : competency


Not only are we wasting assets such as scripts and test artefacts here, but we're also wasting some very precious time.

While a developer can probably write a solid Selenium or Grinder script easily, it's much less the case, or at least less natural, for a functional tester who works with tables like an Excel spreadsheet, or even for a performance engineer, who's more comfortable analyzing system behaviour, looking at charts and trying to identify incoherent, faulty or slow behaviour.

Asking a monitoring team to write a million scripts to monitor your services synthetically is also a bad investment. They already have a lot on their plate, trying to put everything together and making some sense out of the data that they're already gathering. Sometimes, if you're the CTO or an application owner for example, you'll come in and bug them for another dashboard so you can see what's going on in your system, but maybe you're making them waste a ton of time on things that seem mindless to them and that they're not truly equipped for. And maybe, that's the main reason why the dashboard you've asked for isn't already available.



Bottom line is (and it seems people forget or ignore this in testing departments sometimes) : to work efficiently, people should be working on what they're good at, with the correct tool and proper training. But wait now, this doesn't mean that people shouldn't learn new things or talk to each other. Our philosophy sits at quite the opposite end of the spectrum actually, which I'll demonstrate in the last section of this blog post.


The unacceptable reason : maintainability


So at this point in this old school scenario, we've got the wrong people working on the wrong tasks, possibly with the wrong tools, which is already pretty bad, but we're also having them do that very same task up to 5 times in a completely redundant way.

But wait. It gets even worse, much worse.

I remember distinctly at least 5 or 6 times in my career where I had to take over a testing project from someone for various reasons. Either they were gone, or they were busy working on something else by the time the new release came out and was ready to be tested. Either way, I had to try and reuse their code, and in almost every case, I was unable to reuse what they had done. And that's not because I'm just horrible at scripting or didn't put in the time and effort, I beg you to trust me here.

It's because of two main factors :
  • big long monolithic code is awfully hard, and in some cases, impossible to maintain
  • the target applications we were supposed to test changed very rapidly
That's a very bad yet very common combination.

Sometimes people would come and try to help me make their own code work but it still wouldn't work. Sometimes that person would even be myself. Trying to test the same application with my own code 6 months after I wrote it was a true challenge. Of course, you're going to say :


"The answer is easy : just write more modular code!"

And... this is where the tooling problem comes into play. I don't want to call out the proprietary and sometimes legacy tools which I was literally forced to use at the time, but let's just say they don't particularly make the job easy for you in that department. It's very hard to create and maintain library code with these tools (you know who you are), when it is even possible at all. And it's even harder to share it with colleagues. Sure, we tried managing our script repository professionally, versioning the scripts and everything, but the process was always awkward, slow and buggy because these tools simply were not designed to work that way.

These tools are what I call "throw-away" oriented because they both expect you to throw your code away at the end of your testing campaign and make you want to throw them away too (the tools themselves).

I must say I did encounter the occasional big, long-term project which had managed to make it work, because they had direct insight into the code changes and a huge budget to pay people to keep maintaining the scripts and test artefacts. But that's just a silly, brute-force way of solving the problem which absolutely cannot scale and should never be duplicated, as far as my opinion goes.

Thankfully, tools such as step and Selenium have emerged in the meantime and the options provided by the open-source scene are much more satisfying these days. I must also say that some of the older open-source tools like Grinder or JMeter (which still seems to have a strong community to this day) didn't work too badly and relieved some of the pain in the performance testing department at times. It just depended on what exactly you had to do with them. The main problem being that, again, they were not initially designed with the idea of sharing and reusing across an entire dev-ops pipeline in mind.

But tools themselves are not enough and I can't blame this entirely on them. A lot also had to do with the way we were approaching these things and back when I was mostly a Performance Tester, I'll confess I wasn't what I now consider a good developer in the sense that no one had really taught me what a good way of organizing my code was, nor why. Not even when I was in college, studying software engineering. Or did I just not pay enough attention to that part of the course back then? ;)



True modularity : the keyword approach


So there I was, stuck in that situation where no one really had a thoughtful approach to solving this problem at the industrial level at the time. And then things changed. I found out about the keyword approach and got a chance to work on a very innovative project for a client, in which we were able to design our testing platform from scratch. And we decided to embrace some of the concepts of that keyword approach and shoot for the best possible tradeoff between modularity, performance, maintainability and ease-of-use.

If I had to summarize this approach in one sentence I would say it's a way for people to create their own framework, their own domain-specific language almost, which will then allow them to describe the aspects of automation and test logic in a way that's relevant to their own business or application model.

And we'll do this in a very clean way, meaning that we'll use some solid interfaces for passing inputs and outputs to and from these keywords. We'll treat each little script as a little black box, so that higher-level users don't have to concern themselves with code. Instead, they'll just use scripts like Lego bricks to implement their testing strategy.

As a result of that initial descriptive effort, many services and tasks can be streamlined and served in a unified way, where before, each would potentially have their own implementation based on a number of factors.

This basically means that we're going to analyze our needs and split the business or application logic into atomic chunks of behaviour. The level of atomicity, or the "amount of behaviour" we wrap into a keyword object, is key here: the smaller the chunks, the more scripts and keywords there are to maintain and execute; but the larger the chunks, the less modular and reusable they become.

Anyhow: once you've exposed your scripts as keywords with a definition of their expected inputs and outputs, you can use them as building blocks to design your test cases.
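For example, using the same keyword API as in the Selenium tutorial above, a couple of business-level keywords might look like the following made-up sketch; the point is the clean input/output contract, not the implementation hidden behind it:

package scripts;

import step.handlers.javahandler.AbstractScript;
import step.handlers.javahandler.Function;

// Made-up example of business-level keywords: each @Function wraps one atomic chunk of
// application behaviour behind a clean interface, so test plans can combine them freely.
public class OnlineShopKeywords extends AbstractScript {

 @Function
 public void Login(){
  // ... drive the application (Selenium, HTTP client, custom libraries...) to log a user in ...
  session.put("loggedIn", Boolean.TRUE); // keep state for the next keywords of the same session
 }

 @Function
 public void SearchProduct(){
  // ... perform the search and report the outcome through the keyword's output ...
 }
}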

Whether it's the logic that varies (you're testing different scenarios by using different sequences or combinations of keywords) or the data (you're performing data-driven scenarios with large volumes of different inputs but using mostly the same logic and thus, keyword combination), it becomes very easy to combine them into a test plan. The image below illustrates this concept.




By the way, in practice, we find that the Pareto rule applies and that on average, 20% of the keywords (the most important ones) will be involved in 80% of the test scenarios. Therefore, and in order to limit costs, it is advised to identify these key functions early and implement them first. You can always decide to maximize your automated test coverage down the line.

Now, as I stated earlier, one of the beauties of this, is that once you've found the granularity level that you need, you'll be able to streamline all of the different tasks related to testing, based on that framework that you've just created for yourself.

Here are some of these services you'll be able to deliver in a truly unified way, meaning independently from the underlying scripting tool, coding style or simulation approach, just as long as you're building your tests around keywords:
  • instantly exposing all test artefacts to other testers and users for reuse
  • presenting test results, errors and performance data (and in a way that makes sense to both technical and business teams)
  • reusing and deploying test cases for multiple purposes and types of tests (synthetic monitoring, performance testing, batched functional testing, etc)
and also (although these are not pure advantages from the keyword approach, but they become much easier to implement and build into your platform if you're using that approach) :
  • monitoring your tests in real time
  • managing test results and test meta data
  • rationalizing and managing your test infrastructure resources across all projects
  • planning, orchestrating and triggering test executions

Now, this is what the word "efficiency" means to us. We're not just talking about isolating a piece of code as a library anymore here !

On a more technical note, in the context of the step project, all of this meant that we had to come up with a dedicated architecture and a fully optimized stack to make this possible at scale and to provide our users with the best possible experience. And this was quite a challenge. I'll write more on this in future blog entries because I think there's a lot of interesting technical content to share there.


Signing off from Le Bonhomme.


jeudi 27 octobre 2016

Taken by storm


Hi everyone,

I'm typing these lines from a Tim Horton's in Toronto, as our month-long trip through the US and Canada nears its end. I don't normally post personal stuff on here, as it's not my primary intent, but I thought I would drop a few lines for those wondering what's up, and also to briefly mention what I did before going on vacation and didn't get a chance to write about.

I really wanted to put up some more material and tutorials before we left the country but I was so busy working on our first release of STEP that I just had to give up on that. And I never really got a chance to sit down and write a blog post since then either.

We're pretty excited about this release though, and I'm planning on working on a few tutorials when I get back home, which will be similar to those I did for djigger. You can expect a quick-start video guide to get you going with a basic example, and then in a second video, I'll probably demo our integration with Selenium with a script that's closer to a real-life scenario.

I'll show off a little and put up a pic of the NFL game I went to (Falcons @ Seahawks, at CenturyLink Field in Seattle) ;)


Signing off from Toronto.

Dorian