What you'll find in this blog entry : the detailed analysis of a performance problem, starting from the symptoms and going all the way to the root cause, with intermediate steps in the different layers of a distributed application (called "step").
Taking step out of its comfort zone and to the next level through release 3.2 was an interesting challenge. In this series of three performance-centric blog entries, I will reflect on some of the technical issues we've faced, analyzed and solved. This is basically a lessons-learned session involving open-source components, which I believe is worth sharing and can be valuable to some people. I'll also put a certain emphasis on my thought process as a performance analyst throughout the entire exercise.
As you'll notice, every time visibility into the JVM runtime was necessary, I used djigger in its most simple form (i.e. with the basic JMX connector) to diagnose the problem. That means that - as in 80% of my everyday performance problem cases - I never had to resort to instrumentation or any other complex form of monitoring or data capture. That's because sampling is an absolute beast if you know how to process and interpret the resulting data.
Of course, in addition to using a good tool, and before answering the question "why is this problem happening?", you'll first have to precisely answer the questions "what is happening?" and "where is it happening?".
And you need to start somewhere, right? I've seen some interesting ideas over time, and most of the people I've watched deal with performance tend to start wherever their obsession or experience takes them. People who understand the OS well will go for a bottom-up approach, people who are interested in GC behaviour will start there, and so on.
Without setting a goal, there's no such thing as achieving performance.
Personally, I always initially focus on:
1) understanding the inputs of my scenario (what is the user trying to achieve)
and 2) understanding the developer's and architect's intent (how did they technically plan to satisfy the user?)
These two pieces of information are the beginning of an answer to what is happening. And this is probably the common property shared by all performance problems: they all come from some sort of mismatch or miscommunication between an expectation and the actual behavior of the system that's supposed to satisfy that expectation.
If you skip this step, there's a high chance that you'll end up investigating the wrong thing or investigating the right thing for the wrong reasons. Everything can be tuned, optimized and if needed, redesigned, if you put in the work. From using indexes or rewriting queries to caching objects, to redesigning business cases and workflows... Caching. Pooling. Sizing. You name it. So let's keep this in mind as we analyze this series of problems.
By the way, if you're interested in my take on tooling and my "minimalist" approach to performance analysis, you can check out the manifesto-style FAQ I wrote for djigger, and also this past blog entry. It all basically revolves around the necessary and sufficient concept, which I've also touched on a bit in my presentation at JUG last year (the slides are in English and available for download on that web page).
A bit of context on the tested system
First off, I feel the need to write a few words about step, the object of our study here and the system whose performance will be dissected in the next paragraphs and blog entries. We like to describe step as the first open-source test automation platform for unified agile testing. You might hear or see these words thrown around a lot, but we've extensively documented how exactly we believe step implements and supports these concepts. It's not just a fancy way to advertise our stuff.
In addition to bringing a new stack and a modern approach to the table, step fits very easily into just about any legacy technical environment, which makes transitioning from old-school in-house testing to our more modern and efficient approach a lot easier. However, unifying testing practices and moving towards continuous delivery mechanics means higher performance and scalability requirements for us as software providers. The flexibility we provide also means that step has to accept and support a wide range of possible use cases at that kind of scale. This is what brings us to this series of performance analysis entries.
The first important bit of information to process in order to understand the rest of this blog entry is that step is highly distributed in nature and is based on a 1-n controller-agent architecture.
The central controller is responsible for dispatching keyword executions (i.e. the execution of test fragments or scripts) and their inputs, while agents are responsible for the actual local execution and for returning an output to the controller. Agents are horizontally scalable. The controller, however, can be duplicated, but there is no built-in support or semantics for inter-controller communication or synchronization at this point. This means that if you want to add controllers, you have to distribute load across the cluster of controllers yourself. It's possible, and actually not that hard, but since a single controller instance has handled the load requirements of all of our users thus far, we've focused on implementing other, more dire and important features.
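To make that controller-agent contract a bit more concrete, here is a deliberately simplified sketch of what such an interface could look like in Java. The names and types are illustrative only, not step's actual API.

```java
import java.util.Map;

// Deliberately simplified sketch of the controller-agent contract described above
// (illustrative names, not step's actual interfaces): the controller sends a keyword
// name and its inputs to an agent, the agent executes it locally and returns an output.
public interface AgentClient {

    // Invoked by the controller; blocks until the agent has executed the keyword
    // (a test fragment or script) and returned its output.
    KeywordOutput execute(String keywordName, Map<String, String> inputs);

    class KeywordOutput {
        public String status;               // e.g. PASSED / FAILED / TECHNICAL_ERROR
        public Map<String, String> payload; // functional results reported back to the controller
    }
}
```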
While scaling up to handle new target load levels and new types of load, we realized there was still a lot of untapped performance potential left in this architecture, so it made sense to optimize it and reasonably push it to its limits before changing it or implementing additional controller clustering features.
More specifically, one of the things we were trying to achieve as part of R3.2 was making sure that people who use step for API-style testing (in the context of micro-services for example), wouldn't run into any deal-breaking bottleneck while ramping up to thousands of calls per second with a single controller instance. You see, step was initially designed to handle longer, E2E-style test cases (built on top of heavy-duty stacks like Selenium+Firefox), meaning that most of the heavy lifting would be handled on the agent-side, and the overhead of controller-agent communication was not significant. This is why you'll find a dynamic grid system in step, which allows for effortless scaling of agent nodes.
However, as we moved toward finer, more granular test case designs, such as user-action-level test cases or even more extreme ones like these very fast API calls to simple services, we progressively shifted the load and pressure from the agent to the controller, in the form of controller-agent communication and controller-side, keyword-related task executions. In other words, we put the step framework, and in particular the code that runs on the controller, under heavier pressure.
Spoiler alert : Mongo's aggregation framework does not scale infinitely
The user who first ran into the problem I'm about to describe is the first person who pushed the architecture to its current (now old) limits. I've talked about why his setup was so interesting to me and why what he did would have been difficult to achieve with other tools.
Once his test setup was fully operational and the first basic concurrent tests were completed, we started to increase the number of threads in our simulation in order to reach the target load. Since we were caching and reusing the same webservice port for several thousand iterations, the pressure put on the agent was minimal and we were able to start 30 and then 60 threads with a single agent node without any problem.
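As a side note, here is a minimal sketch of the kind of client caching I'm talking about, with a hypothetical helper and factory rather than the user's actual test code; it just illustrates why each keyword iteration put so little work on the agent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Sketch of the port-caching idea (hypothetical helper, not the actual test code):
// build the expensive webservice client once per endpoint and reuse it across
// thousands of keyword iterations, so each iteration only pays for the actual call.
public class PortCache {

    private static final ConcurrentMap<String, Object> CACHE = new ConcurrentHashMap<>();

    // 'createPort' stands in for whatever costly factory the real keyword uses,
    // e.g. parsing a WSDL and building a JAX-WS proxy.
    @SuppressWarnings("unchecked")
    public static <T> T getOrCreate(String endpoint, Supplier<T> createPort) {
        return (T) CACHE.computeIfAbsent(endpoint, key -> createPort.get());
    }
}
```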
However, after a while, we noticed in RTM that our transaction throughput was slowly decreasing even though the response times based on the business code timers were stable. We quickly realized that CPU consumption of step's MongoDB instance was increasing concurrently to that phenomenon. Not only that, but the CPU usage looked very "spiky". Eventually, the test would hardly make any progress anymore and parts of the execution view would become unresponsive.
This is what the throughput started to look like in RTM :
And the DB host CPU looked something like this :
This is the kind of curve you don't like to see when you're load testing something, especially if you're the provider of the tool that's used as part of the test.
A quick look at the mongodb log file, where queries exceeding a certain time threshold (default = 100 ms) are automatically reported, showed me that certain aggregation queries were taking an exponentially increasing amount of time to complete.
Here's a sample entry from the mongo log file :
2016-12-02T18:10:36.886+0100 I COMMAND [conn4] command step.reports command: aggregate { aggregate: "reports", pipeline: [ { $match: { executionID: "58418c026de47b2ffefe28c8", _class: "step.artefacts.reports.CallFunctionReportNode" } }, { $group: { _id: "$status", sum: { $sum: 1 } } } ] } keyUpdates:0 writeConflicts:0 numYields:16032 reslen:101 locks:{ Global: { acquireCount: { r: 32092 } }, Database: { acquireCount: { r: 16046 } }, Collection: { acquireCount: { r: 16046 } } } protocol:op_query 16014ms
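For readers less familiar with the aggregation framework: the pipeline in that log entry first filters the report nodes of one execution ($match) and then counts them per status ($group). Issued through the jongo driver (which we'll meet again in the sampling data below), such a call looks roughly like the sketch that follows; the class and field names are illustrative, not step's actual code.

```java
import org.jongo.Jongo;
import org.jongo.MongoCollection;

// Rough sketch of the kind of Jongo call that produces the aggregation seen in the
// log above (illustrative names, not step's actual code): count the report nodes of
// one execution, grouped by status.
public class StatusAggregationSketch {

    public static class StatusCount {
        public String _id; // the status value used as the $group key
        public int sum;    // number of report nodes with that status
    }

    public static Iterable<StatusCount> countByStatus(Jongo jongo, String executionID) {
        MongoCollection reports = jongo.getCollection("reports");
        return reports
            .aggregate("{$match: {executionID: #, _class: 'step.artefacts.reports.CallFunctionReportNode'}}", executionID)
            .and("{$group: {_id: '$status', sum: {$sum: 1}}}")
            .as(StatusCount.class);
    }
}
```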
A db.currentOp() also confirmed that nothing substantial was really going on in the DB other than these queries.
So I started where I always start, which is with the symptoms. Here's a summary of my observations at this point :
A) There's a continuous decrease in transaction throughput which isn't correlated with longer transaction response times
B) MongoDB uses an increasing amount of CPU to do - as far as we know - the "same amount" of work (i.e we're not increasing the load as we go)
C) Parts of the GUI, in particular the real-time monitoring views are going unresponsive
D) The duration of aggregation queries in MongoDB is increasing exponentially over time, according to its log file
First conclusions: we've clearly got a bottleneck involving the database, and it apparently relates to aggregation routines. Seeing this continuous worsening pattern, it also seems tightly tied to time or to the accumulation of some sort of object (i.e., the test's duration or the number of executed test steps).
At this point, if you know the product as a developer, the mongo logs already give away some critical information: if you're familiar with how and why you query your backend, it won't take very long to figure out the whole cascade of events. That was the case for me here. But since I don't like shortcuts, and for the sake of making this blog post worth something, I went ahead and checked out step 3.1.1 and reproduced the problem to show you what I would have done without any prior knowledge of step's code. That's the scenario I've been put in for most of my career, after all.
So we've got a throughput decrease correlated with an increase in CPU utilization on the DB. Mongo also gives us a solid hint as to which queries are taking the longest to execute, and they're consistently the same. Here, we've isolated what is taking time. To me, the name of the game now is to figure out where these queries come from (are they issued by components running the test itself, by some other functionality, or are they user-triggered, and if so, from the GUI?). The only thing we know is that in step's architecture, the controller is the only process accessing the MongoDB instance.
So let's perform a bit of sampling on the controller and find out what's going on.
Since we're interested in DB queries here, we'll directly take a look at the reverse tree to see if we can quickly spot the code paths that lead to reads in Mongo. And this is what we get with a simple filter on step packages (we do this to reduce noise and make the stats in the tree more relevant):
Okay, so we've got 92% of all of the "step" code busy with socketReads (excluding the FileWatcher which runs alongside the main threads and does nothing most of the time). Not too surprising to find a lot of time reading on the network (in essence waiting for something), considering the symptoms we're already aware of. Let's unfold this branch though, in order to find out where these reads come from :
I had to edit the picture to get the complete stack to appear at once.
What we learn here, as we go down the stack, is that:
- the presence of the jongo driver tells us we've spotted the Java-side aggregation code paths, which lead to the queries we've seen in the mongo logs (see the class Aggregate and the method call "executeAggregateCommand")
- all of this comes from the same business code: step's ProgressUnit plugin
- this plugin code is invoked via Jersey, which is step's REST provider.
Assuming no one is using a custom client or script to poll the information, we now know that these queries are directly or indirectly triggered by the user through step's webapp.
So I restarted the test, went back to the execution view and turned on network capture in Chrome's developer tools. The first thing I noticed, other than the Ajax calls taking longer and longer to complete, was that they were issued automatically at regular time intervals. This made sense because auto-refresh of the views is enabled by default (the user can turn it off via a button in the top right corner). Let's take a look at this.
Very early on in the test, it's not that bad, but there's a clear increase in one of the backend calls. We start at around 2 ms (which you can't see here) and then the response time for that call keeps increasing (a few hundred milliseconds in this first picture, and then up to tens of seconds when the test actually chokes up completely).
Later, at choke time :
Now, looking at the kind of data this backend call was retrieving and at the JavaScript code, even if I hadn't already identified the guilty view element based on the query itself, it would quickly have become clear that the main culprit was the progress bar of the execution view :
If we're computing aggregations every 5 seconds and if for some reason, MongoDB can't deliver a result within less than 5 seconds, then the next call is going to overlap with or "stack" on top of the previous one, further increasing the amount of stress we put mongo under, which will in turn, cause for that second call to be slightly slower than the first one, thus starting an obvious vicious cycle. Eventually, the DB resources are exhausted and no more progress is made.
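To illustrate the mechanism, here is a small Java analogy with hypothetical names (this is not step's actual front-end code): a timer that fires every 5 seconds and issues an asynchronous backend call will happily keep firing while previous calls are still in flight; a simple guard that skips a tick while a refresh is still running is enough to break the cycle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Java analogy of the "stacking" effect described above (hypothetical names, not step's
// actual front-end code): the timer ticks every 5 seconds and fires an asynchronous
// request; once a single aggregation takes longer than 5 seconds, requests pile up
// unless we skip ticks while a refresh is still in flight.
public class RefreshSchedulingSketch {

    private static final AtomicBoolean refreshInProgress = new AtomicBoolean(false);

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(RefreshSchedulingSketch::refreshIfIdle, 0, 5, TimeUnit.SECONDS);
    }

    private static void refreshIfIdle() {
        // Without this guard, a new aggregation request would be fired on every tick,
        // overlapping with the slow ones and feeding the vicious cycle described above.
        if (refreshInProgress.compareAndSet(false, true)) {
            CompletableFuture
                .runAsync(RefreshSchedulingSketch::queryProgressAggregation)
                .whenComplete((result, error) -> refreshInProgress.set(false));
        }
    }

    private static void queryProgressAggregation() {
        // placeholder for the REST call that triggers the MongoDB aggregation on the controller
    }
}
```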
Also, interestingly enough, early in the reproduction of the problem I was able to correlate CPU spikes with each backend call (yes, they were that costly!). This is one of the reasons why I drew quicker conclusions from the query I had found in the db logs and didn't really need sampling data to confirm my hypothesis :
Either way, at the end of this session, it was clear to me that it was time to question the way the aggregation of keyword counts is computed within a test execution. The aggregation routine was clearly reaching its limit and would not scale past a few million data points grouped in a single test execution. And it would especially not scale with additional users. I mean, we were already torturing the database with one or two web browsers open!
What a shame, how could you design something like this?
Of course, as always when it comes to performance problems, you could say this was a major flaw in the architectural design of the program and blame the architect and developers for going for a convenient way to implement aggregation pipelines and not anticipating that this would become an issue. The thing is, I'll take a bad yet real performance problem any day over a performance problem due to premature optimization.
For starters, this implementation had worked for years in other production environments, with little to no complaints about the performance of this view and with no substantial impact on mongo's CPU usage. So you could argue that this "quick implementation" bought us 2 or 3 years, during which we were happily working on other, more dire problems.
By "quick implementation", I mean that we were already using mongoDB and it seemed worth attempting to use this convenient feature before moving to anything more complex or less generic (i.e., optimizing).
Also, had we anticipated this issue, there's no guarantee that the other implementation would have been more successful or sustainable for a longer period of time, at least not without additional testing, which always comes at a cost. Let's remember here that the reason we hadn't endurance-tested this functionality was that the order of magnitude of test steps tied to a single test execution was much lower until we started supporting fine-granular keyword designs and pushing that kind of setup to thousands of executions per second.
Some would argue that introducing an endurance test early on would have been a positive thing regardless of the old requirements, because more tests mean a better product. I agree to a certain extent - depending on the context and which tests we're talking about - but you can also kill your productivity with that kind of dogmatic behavior.
Obsessing over test coverage can be a very bad thing sometimes and it can be a way to lose focus on more important things. In this case, considering the stage the product was in at the time, it was clearly benefiting more from adding important missing features than from maximizing test coverage in areas where we weren't sure we'd eventually make a push. I do believe in TDD and automated performance testing in some situations but not all situations. I don't even know if it should be the go-to or default development method.
Bottom line is, you cannot effectively plan for or execute every possible performance test in advance. However, now that this has become an issue, we'll definitely make sure it never, ever becomes an issue again. And we'll test for it. I do believe in writing a test for every bug or significant issue I come across, because statistically speaking, they're more likely to show up again as a regression in some later version of my software.
The aftermath
Looking back at the first of these performance battles as part of R3.2, although I'd like to say I learnt a lot, I actually think my beliefs were just reinforced, and I don't think we did much wrong here, other than doing pragmatic development and solving a problem that was part of an authentically new experiment. What actually happened is that we used a piece of software originally designed and tuned for a certain set of inputs and assumptions (not-so-granular E2E tests) to perform with different inputs (a much more granular test suite, at a higher load).
This is often what creates the bed not just for performance problems but for non-functional problems in general. Security breaches also frequently arise from the misuse of software, or from its use under new circumstances that were not initially accounted for. Just because something seems like it will work (because it's compatible from a functional standpoint) doesn't mean that it will in practice.
And I think our users are more worried about our ability to react, provide support and in this specific case, diagnose the performance problem and solve it quickly, than about the performance of the product in any specific area at a given point in time. This is why sound analysis skills and battle-tested tooling are so important.
Where I do learn a lot these days though, is in the support department. Some of the practical ways I've had to deal with issues are awful, including in the example of this blog entry. Sometimes I've had to do video calls with a webcam aimed at a monitor, and had to tell users what to do command by command. I've got some big plans to improve our ability to remote-diagnose issues in step. They include adding some advanced archiving/dumping functionality for troubleshooting and improving step's embedded monitoring and diagnostic functionality.
Now, to wrap up the technical side of the story: we decided to add and maintain incremental counters on the fly during test execution (i.e. the queries are now pre-computed). This approach relieved the pressure on MongoDB entirely.
It is a much less dynamic approach, but we actually didn't really need to compute these aggregations dynamically in that spot. We're just looking to feed a fairly static dashboard in that part of the app. Exploratory and interactive analysis with slice and dice functionality is something that we provide in RTM.
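For the sake of completeness, here is a minimal sketch of the pre-computed counter idea, using an illustrative collection name and Jongo's update API; it's not our actual implementation, just the shape of it: bump a per-execution, per-status counter whenever a keyword report is saved, so that the progress view becomes a cheap point read instead of a full aggregation.

```java
import org.jongo.Jongo;
import org.jongo.MongoCollection;

// Minimal sketch of the incremental counter approach (illustrative collection and field
// names, not step's actual implementation): instead of aggregating millions of report
// nodes on every refresh, increment a counter each time a keyword report node is persisted.
public class ProgressCounterSketch {

    private final MongoCollection counters;

    public ProgressCounterSketch(Jongo jongo) {
        this.counters = jongo.getCollection("executionCounters");
    }

    // Called once per completed keyword execution; 'status' comes from a closed set of
    // values (e.g. PASSED, FAILED, TECHNICAL_ERROR), so concatenating it into the field
    // name is safe in this sketch.
    public void onKeywordReport(String executionID, String status) {
        counters.update("{executionID: #}", executionID)
                .upsert()
                .with("{$inc: {'statusCounts." + status + "': 1}}");
    }
}
```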
Stay tuned for parts 2 & 3 of this performance adventure !