Tuesday, April 19, 2016

Why, when and how you should avoid using agent mode



So I thought I would write a short ironic post today.

Java agents are great for bytecode instrumentation, but as intrusive as they are, they still sometimes fall short of their goal. They also come with a certain overhead in resources and configuration maintenance, because they require modifications and updates on the monitored JVM side. I could summarize this thought by saying that they carry a higher "TCO" than the simple, risk-free, collector-mode sampling approach.
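To give a concrete sense of that maintenance burden, here's a generic Java agent skeleton using the standard java.lang.instrument mechanism (just a sketch, not djigger's actual agent): the agent jar must be referenced on every monitored JVM's command line, and every agent update means touching the target's startup configuration.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Generic agent skeleton (not djigger-specific). The jar must declare
// "Premain-Class: MyAgent" in its MANIFEST.MF and be wired into the
// monitored JVM's startup command, e.g.:
//   java -javaagent:/path/to/agent.jar=options -jar app.jar
// That per-JVM wiring is exactly the configuration-maintenance overhead
// discussed above.
public class MyAgent {

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Rewrite and return the class bytes here to instrument them,
                // or return null to leave the class untouched.
                return null;
            }
        });
    }
}
```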

Or I could illustrate it in a more explicit way. Depending on your context, using instrumentation might be like owning a Ferrari. Even if you liked sports cars and had enough money to buy one, wouldn't renting a Ferrari for a day or two when you want to race or take a drive down the coast make more sense than owning one, with all the ramifications that implies? I feel like I'm still not nailing this analogy entirely, seeing as most Ferrari customers are probably not the most pragmatic and rational buyers, and they probably don't care about efficiency when it comes to their sports car.

But the reason this post is slightly ironic is that I just uploaded a YouTube tutorial today showing how you can use djigger in agent mode for instrumentation purposes.

Obviously, in many cases you do need instrumentation. And in certain cases you'll want it on at all times, for instance if you end up building your business insights on top of it. But is that always a requirement, or is there some way you could use sampling data to come up with the same, or "good enough", information to understand and solve your problem?

I've partially covered this topic in my simulated "Q&A" session, but I felt I needed to explain myself a little more and illustrate my point with a recent example.

Here's an upcoming feature in djigger (which should be published some time this week in R 1.5.1) that will allow you to answer the classic question: "is method X being called very frequently, or do just a few calls take place, each lasting a long time?"

This is the question you'll ask (or have asked) yourself almost every time after sampling a certain chunk of runtime behavior. In theory, the nature of the sampling approach and the logic behind stack trace aggregation (explained here) leave us blind to that information: we simply "lose" it.

However, there is a way to extract a very similar piece of information out of the stacktrace samples. Here's how.

When sampling at a given frequency, say every 50 ms, method calls lasting less than 50 ms might sometimes be "invisible" to the sampler. However, every time a stack trace changes compared to the previous snapshot (i.e., a sub-method is called, the method call itself finishes, or a different code line number is on the stack), you know for sure that if you find that method or code line number again in one of the next snapshots, the call count must have increased by at least one.

This is what we call computing a min bound for method call counts. We're very excited about releasing this feature, as answering that question is one of the primary reasons people turn to instrumentation.
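To make the idea concrete, here's a minimal sketch of the simplest variant of that logic (hypothetical code, not djigger's actual implementation). It only counts reappearances of the method after an absence from the stack; the line-number heuristic described above would tighten the bound further.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the min-bound idea. Each snapshot is a stack trace,
// modeled here as a list of frames of the form "Class.method:line",
// innermost frame first.
public class MinCallCountEstimator {

    // Lower bound on the number of calls to targetMethod: each time the
    // method (re)appears on the stack after having been absent, at least
    // one new invocation must have started. Calls that begin and end
    // entirely between two snapshots stay invisible, which is why this
    // is a minimum, not an exact count.
    static int minCallCount(List<List<String>> snapshots, String targetMethod) {
        int minCount = 0;
        boolean presentBefore = false;
        for (List<String> stack : snapshots) {
            boolean presentNow = stack.stream()
                    .anyMatch(frame -> frame.contains(targetMethod));
            if (presentNow && !presentBefore) {
                minCount++; // method reappeared: a new call must have begun
            }
            presentBefore = presentNow;
        }
        return minCount;
    }

    public static void main(String[] args) {
        List<List<String>> snapshots = Arrays.asList(
                Arrays.asList("Main.run:10"),
                Arrays.asList("Dao.query:42", "Main.run:12"), // query appears
                Arrays.asList("Main.run:14"),                 // query finished
                Arrays.asList("Dao.query:42", "Main.run:15")  // appears again
        );
        // Prints 2: at least two separate invocations of Dao.query were observed.
        System.out.println(minCallCount(snapshots, "Dao.query"));
    }
}
```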

Again, to be clear: we have nothing against instrumentation, and we offer instrumentation capabilities ourselves through our own Java agent. However, there are numerous reasons (simplicity, overhead, risk, speed of analysis, etc.) why we love being able to refine our "data mining" logic at the sampling-results level.

My next YouTube tutorial will either provide in-depth coverage of that functionality or cover collector mode. Either way, I can't wait to show you more of the benefits of djigger. I'll also try to wrap up part 2 of my silly benchmark tomorrow, so I can tell for sure what the impact of the network link looked like (see my previous entry on this blog).

Until then, I'm signing off from Le Bonhomme.

Wednesday, April 13, 2016

A benchmark of JDBC fetch sizes with Oracle's driver



Recently I investigated a small performance issue in the search function of a Java application, where sampling results had shown that a ton of time (91% of the total duration) was being spent in Oracle fetches.

After figuring out that the individual fetch times were normal and acceptable, I put together a small benchmark to isolate the round-trip overhead and fine-tune the fetch size. That way I was able to find out how much time I could save by getting rid of unnecessary round-trips between the application server and the Oracle database. I have to admit this was close to being a "mind candy" kind of exercise, as I had already convinced the developer to come up with a more selective query (he was basically querying and iterating through an entire table, with a WHERE clause that filtered out hardly anything, and I knew this was probably the wrong approach). Nevertheless, I thought the benchmark would be interesting for future reference, and it would optimize that function regardless of how selective the new query turned out to be.

This is what the sampling session looked like in djigger:



You can see the next() and fetch() call sequence taking 91% of the time in there.

The use case was a simple search screen in a Java application. A search returning no results would take 12 seconds to complete (not very nice for the end user). The result-set iteration causing that 12-second delay didn't even involve the table against which the actual search was performed: it hit some kind of security table that had to be queried before the actual search results were even retrieved from the business table.

I also found out that, by default, the fetch size was set to 10. So I did a quick benchmark reusing the same SELECT as the application's, running locally in my Eclipse workspace but connecting to the same database. I measured the query duration and, more importantly, the time it took to iterate over the result set, then changed the fetch size a few times and ran the benchmark again. I made sure the query results were cached by Oracle, and I ran each benchmark three times to make sure I wasn't picking up any strange statistical outliers (due to, say, network round-trip time variations or jobs running on the database or database host).

I tested the following values: 1, 10, 100, 1000 and 10000. More than that would have been pointless, since the table only contained 88k records.

Using the default size (10), I got pretty much the same duration for iterating over the entire result set as in the real-life scenario (just a little less, which makes total sense): 11 seconds. As you can see in the results below, the optimal value seemed to float around 1000, sparing us almost 6 seconds (half of the overall response time). I'd say that was worth the couple of hours I spent investigating the issue.



Interestingly enough, the duration of the query execution itself increased slightly as I increased the fetch size. I'm not 100% positive just yet (I could investigate that with djigger but probably won't), but I assume it has something to do with the initial fetch: since I set the fetch size before executing the query, Oracle presumably already sends a first batch of rows back to the client when replying to the query execution call (at cursor return time). In any case, the overall duration (executeQuery + result set iteration) was greatly improved.
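As a back-of-the-envelope sanity check (assuming round-trips dominate the iteration time): 88k rows at a fetch size of 10 means roughly 8,800 round-trips, versus about 88 at a fetch size of 1000. Spreading the ~6 seconds saved over the ~8,700 eliminated round-trips works out to roughly 0.7 ms per round-trip, which is plausible for an internal network hop.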

Here's the code I executed (I had to anonymize a few things, obviously, but you can recreate this very easily if you're interested):
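(What follows is a minimal sketch along those lines; the connection URL, credentials, table name and query are placeholders rather than the anonymized originals.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Times executeQuery() and full result-set iteration for a range of
// JDBC fetch sizes against the same table.
public class FetchSizeBenchmark {

    private static final int[] FETCH_SIZES = {1, 10, 100, 1000, 10000};

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "password")) {
            for (int fetchSize : FETCH_SIZES) {
                benchmark(conn, fetchSize);
            }
        }
    }

    private static void benchmark(Connection conn, int fetchSize) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM SECURITY_TABLE")) {
            // Number of rows transferred per client/server round-trip;
            // Oracle's driver defaults to 10.
            stmt.setFetchSize(fetchSize);

            long t0 = System.currentTimeMillis();
            try (ResultSet rs = stmt.executeQuery()) {
                long t1 = System.currentTimeMillis();
                int rows = 0;
                while (rs.next()) {
                    rows++; // every fetchSize-th call to next() triggers a round-trip
                }
                long t2 = System.currentTimeMillis();
                System.out.printf("fetchSize=%5d executeQuery=%4d ms iteration=%6d ms rows=%d%n",
                        fetchSize, t1 - t0, t2 - t1, rows);
            }
        }
    }
}
```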



EDIT: At some point I worried that my benchmark wasn't realistic enough, since I was running the test from a "local" VM (still, it was only one hop farther away on the internal network) and not from the server itself.

So I uploaded and ran the benchmark from the server itself, and I got essentially the same results. I then double-checked the approximate duration I was supposed to get with a fetch size of 10, and it matched exactly what I had initially observed in the real scenario and measured in the application itself via instrumentation (11 seconds). So everything makes sense now.

Signing off.

Sunday, April 10, 2016

The art of relaying information



Today, I finished writing comprehensive documentation for djigger. Although the tool holds no secrets for me, I believe writing proper documentation is always a challenge. I also try to approach it in a way that is as engaging as possible: I put myself in the shoes of the reader/user, and sort of reverse engineer who that person might be, what content they would expect to find, in what order, and whether they're going to succeed at using and understanding the tool.

One thing I've learnt is that it sounds easy, but it's a difficult process, and one that's often neglected. Neglected by developers, who have much more fun writing code than documentation, but also neglected by users themselves, on the receiving end, because they have very limited patience (and rightfully so): if your docs are low quality, instead of providing that feedback you need so badly, they'll just move on. We all know that; everything works that way these days. And with each day and each engineer working towards better software, documentation, presentation and so on, the patience budget of the average user shrinks.

So I was happy to take on that challenge, and I have to say I believe I did a pretty good job with that first batch of documents. There aren't too many pages, the pages don't seem excessively long to me, and I believe I've covered every feature available in R1.4.2. I took the time to capture step-by-step screenshots and really illustrate my points precisely.

On that page, I also took the time to summarize the way Q&As usually go when I present the tool to someone who's never used it, and I basically laid out the entire philosophy behind djigger. As I've stated, there are literally thousands of hours of performance analysis behind that project, and it's important to me to show that, because I believe that's the true value we're bringing to you as a user. It goes well beyond the simple Java program that we developed. Hopefully people will see that, and hopefully they'll want to interact with us as a result.

So with that, I really hope people will be able to get started with the tool and get answers to as many questions as possible. Of course, everyone is welcome to provide us with their feedback via our contact page. We'll try to answer all of your questions there too, whether they're about denkbar, djigger, technical points, our experience with APMs, or anything else.

Another quick update I wanted to make today: we're currently working on publishing a roadmap for the coming months, documenting djigger's upcoming improvements and additional features. I'm not sure yet when it'll be published (there are still conversations to be had internally), but we'll probably update djigger's page directly when it's ready.

If I have time today, I'm going to start working on a reproducer (a small program designed to showcase a problem and its solution) that will let me illustrate a case study involving djigger, based on past experience at one of my clients in France.

The other project I have is the creation of a YouTube channel where I'll be posting video tutorials on a regular basis.

EDIT: I've added a first video showing how to use djigger in sampler mode here.

Those videos will essentially mirror the content of our docs and case studies, but in a more dynamic and (hopefully) fun-to-watch format, in which people can follow and visualize every step of the logic behind each tutorial and, more generally, behind the denkbar initiative and djigger. I might start with a short one covering the installation steps of djigger on the different platforms and how to use the different connectors we've made available (see our installation page here for more details).

Remember, you can download djigger v1.4.2 for free at any time by clicking here, read the documentation I wrote here, check out the release page of our GitHub repository here, visit our website at http://denkbar.io or contact us via this page.

Signing off from Le Bonhomme.

Thursday, April 7, 2016

djigger v1.4.2 is available for download!




We are excited to announce the first public release of djigger (v1.4.2), our production-ready open source monitoring solution for Java!

This is the first step as part of the denkbar initiative, and our goal here is simply to make available to the community what started out as a small thread dump analyzer and grew into a more mature production-ready APM solution.

I want to start off by thanking all of the people who made this possible: clients who trusted us and let us implement and test out our features against a variety of complex applications, but also colleagues and friends who tried out the tool and helped us with their valuable feedback.

At this point there's still a lot of potential to unleash and great functionality to be built, but we're very confident that what we're releasing today can already make a big difference in the way many people and companies analyze performance problems.

We've really juiced the sampling approach like no one has ever done before (at least as far as we know), and we're still working on milking that information cow some more. There's just so much you can do with sampling alone; most people would be very surprised.

On top of our sampler, we've built a series of filtering, aggregation and visualization functions that let you inspect and understand your code in a very unique way. Once you're in a situation where you can no longer avoid instrumentation (which is highly intrusive), you can use our process-attach and/or agent-mode functionality to start instrumenting methods and get the answers you need. We're now actively extending the agent and instrumentation functionality, and we're planning to introduce a lot of new features very soon, such as distributed transaction tracing, getter injection and primitive-type capturing.

Also, we intend to provide active support to the best of our ability (and availability), and we'll try to build as much momentum as we can in order to give djigger a chance to fly on its own. We hope the community responds, and we'd like to thank in advance the people who'll try out djigger.

On that note, we've put a lot of effort over the last few days into packaging the tool as pleasantly as we could (which isn't necessarily easy considering the numerous supported connectors, JVM vendors, JDK versions, OS-specific start scripts, etc.), and we hope we've made life easy for you as a user.

That said, we'll definitely appreciate any feedback you can give us, especially if you're stuck early in your attempt at using the tool. In a few days, we'll put up a form on the Contact page at denkbar.io so you can get in touch with us directly, but for now, feel free to report any issues on our GitHub repository.

We'll do our best to help you get going and fix any bugs as quickly as we can!

Signing off.

Dorian

Tuesday, April 5, 2016

A fresh start.




Hello, World!

It's good to be back. I believe I haven't written a blog post since 2011. I must say, however, that I have been pretty busy, and I believe it was worth the wait.

Over the past few years I've been maturing my idea of a collaborative open source platform, which would serve as a basis for promoting open source software, discussing performance, testing and analytics-related topics, and simply sharing experience with other people in the industry.

This idea has come to fruition, thanks to the great participation of my good friend and killer Java developer Jérôme Comte. We have officially opened a new domain and website at denkbar.io.

The funny thing is, one of my last posts on my old blog was centered around the release of an open source tool called jdigger on SourceForge. It didn't get much public attention, as we did nothing to promote it really, but it planted a seed, and now that I'm re-establishing contact with some of the engineers I worked with over the last few years, I'm getting some unbelievable feedback: some say they've been using the tool every time they've had performance issues; most use it on and off depending on their needs.

To me this is just a tribute to the fact that we really cared about implementing real functionality, meaning things that we really needed and that helped us tremendously in our daily life as performance analysts. And if you do that, it almost doesn't matter how badly packaged your tool is: the people who understand it will use it.

Which leads us to a big new step for us: an official release (yeah, yeah, on social networks and all..) of djigger, the more modern and refined version of jdigger (you can bury that name forever, by the way; I take full responsibility for the bad idea, but also for the fairly clever quick fix :D).

The first public release should be officially announced some time this Thursday. We hope you enjoy the tool and give us as much feedback as we can handle.

Soon you will find what you would never have found back then in the old-school version of the tool: a comprehensive installation package, documentation, use cases illustrated via official case studies on denkbar.io, an active community of developers on GitHub, YouTube videos to help you get started with the tool, and much more.

You can follow the official communications (new releases, events, etc.) of the denkbar community over at @denkbar_io, on LinkedIn and on Facebook.

Now, I don't have a lot of time today, and it's already pretty late here, but there will be more frequent and more interesting posts in the future; you can count on me.

Before I go, a couple of words about the title of this blog. I'm a pretty big NBA fan, and as some of you may have guessed, it's a reference to a pretty old yet cool basketball movie called White Men Can't Jump. And on those words, I'll wander off and let you think about what it could actually mean to me ;)

PS: my old blog archives (and a few broken screenshots) are still available at dcransac.blogspot.com.

Signing off from Le Bonhomme.