Wednesday, May 17, 2017

The iterative process of achieving scale (3/3) - Unnecessary synchronization

What you'll find in this blog entry: the end-to-end, step-by-step analysis of a performance problem and a couple of concrete tips to help you spot synchronization issues faster.

I'll conclude this three-part analysis session with a simpler problem which still affected step's maximal keyword throughput in v3.2 after we corrected the aggregation problem (part 1) and the HTTP connection pooling problem (part 2). I say it's a simple type of problem because I feel I can diagnose these quickly, but I think it's definitely worth going over, as it lets me showcase some of my favorite features in djigger, our open-source profiler and production monitoring tool.

So if you recall, at this point in our story, we're trying to optimize step's code to reach a target Keyword load of 2000 Keyword executions per second or more in a single-controller environment. I should point out that the term "optimization" was not entirely appropriate in our context up until now, since the first two problems were actually major issues affecting the stability of our test executions, thus destroying our ability to serve the user who wanted to use step for that particular kind of testing (which, again, was new for us/step). In particular, the HTTP connection pooling issue from part 2 was not even a throughput problem - constantly recreating HTTP connections doesn't cost that much CPU or time - but more of a stability issue, revealed only in an endurance test (leaky sockets).

At the end of part 2, however, we can pretty much say we're moving toward "optimization", because we're able to achieve stable throughput at a level close to our target. So now it's all about hunting down the last few bottlenecks and hopefully getting rid of them without having to undertake any major architectural rework.

So again, we started a load scenario with a target of about 2000 kw/s, with proper corresponding values for threading and pacing, and we found ourselves saturating around the 1600 kw/s mark. And again, I used djigger to produce a picture of the distribution of samples (which can be used as an approximation of elapsed time) across the different packages inside the controller component.
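(A quick note on those numbers, using purely illustrative values rather than our actual test settings: the throughput you can reach is roughly the number of threads divided by the pacing, so a pacing of 1 second per keyword would require about 2000 threads to sustain 2000 kw/s, while a pacing of 100 ms would only require about 200. Our scenario used its own mix of threading and pacing, but the arithmetic behind those "proper corresponding values" is just that.)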

And looking at the Reverse view with a filter on step's active thread stacks, this is what I found:



See anything interesting in here?

Remember, when we look at the Reverse tree view, we see the top-most methods of the thread stacks. And after excluding park(), wait() and other intentionally "inactive" methods, and focusing on stacks going through our own packages (.*step.*), we'll end up with method calls which are supposed to be doing some concrete thing for us.

We could take the liberty of calling these "on-cpu" methods, even though the term wouldn't technically be 100% correct. Either way, these leaves (or roots, depending on the view you've chosen) are usually low-level methods from java or sun packages (for instance, methods involved in data manipulation, I/O, etc.).

Knowing that, it makes sense to find methods like socketRead0(). Threads busy with this method call are either blocking synchronously and waiting for a message from a remote process (here in our situation, it could be the DB or an Agent), or are actively reading that message, using up some CPU time. We won't see this distinction just looking at this view because it's a native method and from the Java world, we can't see which line of code inside that method is being executed or which other native methods are called by this method.
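To make that concrete, here's a minimal, self-contained example of a thread that would show socketRead0() at the top of its stack while it sits waiting for the remote side to answer. This is not step's code; the class name and target host are made up, and depending on your JDK version the exact native frame may differ, but the idea is the same:

    import java.io.InputStream;
    import java.net.Socket;

    // Illustration only: a thread blocked on a socket read.
    // If you sample this thread while it waits for the server's response,
    // the top frame will be a native read method such as SocketInputStream.socketRead0().
    public class BlockingReadDemo {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("example.org", 80);
                 InputStream in = socket.getInputStream()) {
                socket.getOutputStream().write("GET / HTTP/1.0\r\n\r\n".getBytes());
                socket.getOutputStream().flush();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {   // blocks in socketRead0() until data arrives
                    System.out.println("read " + read + " bytes");
                }
            }
        }
    }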

Either way, the best way to investigate this time consumer, assuming you have a legitimate reason to do so, would be to unfold that branch and further analyze which packages lead to these socketRead0() calls and whether their associated percentages make sense, according to what you know the application is doing or should be doing.

Here, since we're sampling step's controller, these calls are totally legitimate, because the controller should spend most of its time delegating work to its agents, which results in blocking calls to socketRead0().

Now, something I found a lot more interesting is the method PluginManager.invoke(). First of all, it's the only method belonging to "business code" (i.e. non-java, non-sun, etc.) in which a significant amount of time is being spent. Secondly, just judging by its name, it's clearly not the kind of method that will do a lot of work on its own (like iterating through a large collection and computing stuff). Its name is "invoke". Of course, there's always a chance that the sampler catches any method, even the dumbest/fastest ones, on top of a thread stack by "accident", but when 18% of the samples point to that method being on top of the stack, it can't be an accident anymore.
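For readers who haven't worked with a sampling profiler before, here is a deliberately naive sketch of what sampling boils down to. This is not djigger's actual implementation, and the class and variable names are mine; it just illustrates why a method that sits on top of the stack in 18% of the samples is significant:

    import java.util.HashMap;
    import java.util.Map;

    // Naive sampler sketch: periodically capture all thread stacks
    // and count how often each method sits on top of a stack.
    public class NaiveSampler {
        public static void main(String[] args) throws InterruptedException {
            Map<String, Integer> topOfStackCounts = new HashMap<>();
            for (int i = 0; i < 1000; i++) {                                   // take 1000 samples
                for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                    if (stack.length > 0) {
                        String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                        topOfStackCounts.merge(top, 1, Integer::sum);          // count top-of-stack hits
                    }
                }
                Thread.sleep(10);                                              // sampling interval
            }
            // A method that shows up on top in a large share of the samples is not there by accident.
            topOfStackCounts.forEach((method, count) -> System.out.println(method + ": " + count));
        }
    }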

If you have prior experience analyzing these kinds of patterns, you should already know what the problem is at this point. For the other readers, let's take a closer look at the context of this method call, by unfolding that branch and displaying code line numbers as a node-differentiating criterion in our tree view:



So first of all, it's interesting to note that after splitting on code lines, we still find all of the samples busy in the exact same spot (i.e., line 72). This should be another strong hint at what is going on with that method.

Unfolding the branch allows us to confirm the origin of that method call. As a developer, I now know that this method is called at the end of the execution of each step of my test plan, which, from a performance standpoint, is a big deal: it's a highly parallel and hence hot code path, and special care should be given to what kind of work is being done there and how it's being done.

Finally, let's take a look at that line of code:



Yeah, of course, it's part of a synchronized block. I say "of course" because this was the only logical possibility that could fit with the symptoms we just went over.

Synchronization is a quick and easy way to make stuff safe and it's fine to use. But if you're going to synchronize methods or blocks which are part of very hot code paths such as in this example, you better have a damn good reason to do so.

And in this case, we didn't. It was an unnecessary safety precaution around an object that was already thread-safe.
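To illustrate the pattern, here's a reconstruction of the kind of code we're talking about. This is not step's actual source; the class, field and method names are made up for the example, and I'm assuming a plugin list that is populated once at startup and only read afterwards:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Reconstruction of the pattern, not step's actual source code.
    public class PluginManagerSketch {
        interface Plugin { void onEvent(String event); }

        // Effectively thread-safe: populated once at startup, then only read.
        private final List<Plugin> plugins = new CopyOnWriteArrayList<>();

        // Before: every keyword execution from every thread funnels through
        // this one monitor, serializing a very hot code path.
        public void invokeSynchronized(String event) {
            synchronized (plugins) {
                for (Plugin plugin : plugins) {
                    plugin.onEvent(event);
                }
            }
        }

        // After: the collection is already safe for concurrent iteration,
        // so the synchronized block can simply be removed.
        public void invoke(String event) {
            for (Plugin plugin : plugins) {
                plugin.onEvent(event);
            }
        }
    }

Under load, every worker thread queues up on that single monitor at the end of every step execution, which is exactly the kind of serialization that caps throughput well below what the hardware could deliver.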

There you have it: we've taken premature optimization to a whole new level, or rather we've created a new branch in that family: premature thread-safety concerns.

In any case, we simply removed that synchronized block and were finally able to reach our target load.

I hope this series of step-by-step analysis sessions was either helpful or fun for most readers. I might do another round of these in the next few months, since I've again analyzed and fixed a bunch of performance issues while cleaning up R3.3. Each new feature (especially the ones that involve changes in core packages or in these "hot" code path areas) introduces a new set of risks, calls for a new series of tests, and leads to a new session of old-fashioned analysis and code cleaning.
