What you'll find in this blog entry : the E2E and step-by-step analysis of a performance problem, some thoughts about endurance testing and a couple of concrete tips regarding http connection pooling in a JVM.
Today I'm going to describe the second performance problem we ran into while working on step R3.2. I'm going to have to step up my blogging game though because I'm starting to drift behind releases (we've released v3.3 this week and there's some interesting stuff to talk about in there as well).
As a quick reminder of the last entry, in 3.2 we pushed the performance of our current architecture to its limits in order to support highly granular test scenarios in step, which is the test automation platform we've been building for the last few years.
So here's how things went. After we dealt with the mongo aggregation problem and implemented static views, we were immediately able to scale up to about 1600 keyword executions per second, which was already about an order of magnitude higher than what we'd been doing until then.
At this load level however, and about 20 minutes into the test, we started to run into some strange socket errors, apparently relating to time outs (java.net.SocketTimeoutException: connect timed out). Here from the execution view (results of each keyword call) :
At first glance, it seemed like some sort of connection issue. But the immediate question that comes to mind is "why only after 20 minutes"? And what you'll find here is a good example of why endurance testing is a crucial part of the QA process.
Of course the classic example is the good ole memory leak "hunt". But memory leaks are not the only problem that need time or repetition to reveal themselves. There are plenty of other problems that can occur and which relate to some sort of accumulation or exhaustion somewhere in the technical stack.
So let's take a closer look at one of the stack traces leading to one of these Socket exceptions (in this case, taken from sampling results with djigger) :
As you can see, this thread is attempting to create a new http connection and to connect to some remote endpoint. At the time of the call to connect(), a BindException is thrown.
Now unless the peer went unresponsive or terminated, on a local network, a connect should always succeed within a few milliseconds. Reaching a timeout - and reaching it suddenly 20 minutes into the test - means in this case that something must be wrong with socket management at the system level.
This is what lead me to taking a closer look at the TCP connection activity on the different machines involved. And what I saw was this :
Tens of thousands of connections accumulating in a TIME_WAIT status (this is obviously just a sample of the netstat output).
Of course connections get recycled, but this process takes time and soon enough, all of the machine ports were exhausted.
As it turned out we weren't pooling connections between our controller and agents. And once again, this problem was brought into light because of the suddenly very high number of controller-agent calls per second in this new, more granular setup. In older contexts, where controller-agent communication was less frequent, The average recycling rate of TCP connections was probably enough to cover up the problem.
There are a few ways of fixing this problem (some more elegant than others) but I decided to repeat the test after adding the following system parameters to both JVMs (controller and agent, although the truely necessary one is the flag set on the controller side) :
In that retest, connections weren't leaked anymore and the test remained stable for hours.
Today I'm going to describe the second performance problem we ran into while working on step R3.2. I'm going to have to step up my blogging game though because I'm starting to drift behind releases (we've released v3.3 this week and there's some interesting stuff to talk about in there as well).
As a quick reminder of the last entry, in 3.2 we pushed the performance of our current architecture to its limits in order to support highly granular test scenarios in step, which is the test automation platform we've been building for the last few years.
The Symtoms
So here's how things went. After we dealt with the mongo aggregation problem and implemented static views, we were immediately able to scale up to about 1600 keyword executions per second, which was already about an order of magnitude higher than what we'd been doing until then.
At this load level however, and about 20 minutes into the test, we started to run into some strange socket errors, apparently relating to time outs (java.net.SocketTimeoutException: connect timed out). Here from the execution view (results of each keyword call) :
At first glance, it seemed like some sort of connection issue. But the immediate question that comes to mind is "why only after 20 minutes"? And what you'll find here is a good example of why endurance testing is a crucial part of the QA process.
Of course the classic example is the good ole memory leak "hunt". But memory leaks are not the only problem that need time or repetition to reveal themselves. There are plenty of other problems that can occur and which relate to some sort of accumulation or exhaustion somewhere in the technical stack.
The Analysis
So let's take a closer look at one of the stack traces leading to one of these Socket exceptions (in this case, taken from sampling results with djigger) :
As you can see, this thread is attempting to create a new http connection and to connect to some remote endpoint. At the time of the call to connect(), a BindException is thrown.
Now unless the peer went unresponsive or terminated, on a local network, a connect should always succeed within a few milliseconds. Reaching a timeout - and reaching it suddenly 20 minutes into the test - means in this case that something must be wrong with socket management at the system level.
This is what lead me to taking a closer look at the TCP connection activity on the different machines involved. And what I saw was this :
Tens of thousands of connections accumulating in a TIME_WAIT status (this is obviously just a sample of the netstat output).
Of course connections get recycled, but this process takes time and soon enough, all of the machine ports were exhausted.
As it turned out we weren't pooling connections between our controller and agents. And once again, this problem was brought into light because of the suddenly very high number of controller-agent calls per second in this new, more granular setup. In older contexts, where controller-agent communication was less frequent, The average recycling rate of TCP connections was probably enough to cover up the problem.
The Solution (or one of them)
There are a few ways of fixing this problem (some more elegant than others) but I decided to repeat the test after adding the following system parameters to both JVMs (controller and agent, although the truely necessary one is the flag set on the controller side) :
-Dhttp.keepAlive=true -Dhttp.maxConnections=100
In that retest, connections weren't leaked anymore and the test remained stable for hours.