Using Radial Plots to Visualize Time Series Data

Whenever one talks about performance graphs and time series data visualization, people tend to think of an X-Y graph. For system administrators, the image that comes to mind most of the time is an RRD/Cacti/MRTG graph like the one shown above:

While this format has become commonplace, it may not fit every situation.

One format that I really like to use when I need to display time series data is the radial plot, particularly when you want to check for certain patterns or when you need some kind of scale transformation to compare series of different magnitudes (e.g. CPU usage versus hits/s).

Besides that, the resemblance to a wall clock provides an intuitive idea of continuity.

In order to create such a plot I’ll use a slightly modified version of the apache_svc.pl script that will output:

Hour (24-hour format);mean service time;mean hits/hour

The script can be downloaded here.
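The per-hour aggregation that the modified script performs can be sketched in Python (a simplified, hypothetical stand-in for the Perl script, assuming the log has already been parsed into (hour, service time) pairs; the function name is mine):

```python
from collections import defaultdict

def hourly_summary(records, n_days):
    """Aggregate (hour, service_time) pairs into per-hour averages.

    records: iterable of (hour, service_time) tuples, one per request
    n_days:  number of days covered by the log, used to average hit counts
    Returns {hour: (mean_service_time, mean_hits_per_hour)}.
    """
    total = defaultdict(float)   # summed service time per hour
    count = defaultdict(int)     # number of requests per hour
    for hour, svctime in records:
        total[hour] += svctime
        count[hour] += 1
    return {h: (total[h] / count[h], count[h] / n_days) for h in count}
```

Fed a month of parsed requests with n_days=30, this yields one record per hour of the day, matching the three-column format above.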

After processing a full month of Apache log files this is the output:

11;1676,928965;116,250926
21;1639,624705;113,361250
5;1241,370684;17,595185
17;1703,250085;150,101250
4;1227,487078;17,148241
2;1417,202514;30,994352
22;1651,318738;104,037222
18;1675,477736;122,894306
3;1309,117393;22,372500
8;1640,770019;91,538333
16;1722,880051;170,834028
6;1360,592095;23,434630
23;1626,394824;82,701528
13;1662,728863;156,627130
1;1503,136839;41,077778
12;1667,192097;114,413519
14;1677,577372;153,831944
20;1626,475911;113,596111
15;1678,035555;147,567870
7;1547,754519;47,789167
0;1567,574930;59,452037
19;1648,602616;120,635694
10;1696,497813;128,270556
9;1683,481828;120,212315

Keep in mind that before using the script you need to modify your Apache log files as explained in this previous post.

By default R can’t do radial plots, but the plotrix package can. I’ll keep things simple and use Excel instead of R in this example.

I’ll create two plots using the same data (the service time), one using a traditional X-Y graph and the other using a radial graph:

I’ll leave the analysis and interpretation of the data to the reader, although you can see a very interesting pattern as the service time slightly decreases during the night.

The radial chart for the hits/s is as follows:

Now, in order to see if there’s a correlation between the service time and the number of hits per second, I’ll normalize these two values and plot another radial chart using the transformed data.

The method to normalize the data is pretty straightforward: choose a reasonable constant (I’ll use 100), multiply each value by this constant, and divide by the maximum value of its column. Here’s a screenshot of my Excel table that should help:
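The same transformation can be sketched in Python (the constant 100 is the one used in the spreadsheet; the function name is mine):

```python
def normalize(values, scale=100):
    """Rescale a column so its maximum becomes `scale`.

    Each value is multiplied by the constant and divided by the
    column maximum, exactly as done in the spreadsheet.
    """
    peak = max(values)
    return [v * scale / peak for v in values]
```

Applying this to both the service time and the hits/s columns puts them on a common 0-100 scale, so both series can share one radial chart.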

The resulting chart, showing both the service time and the hits per second:

Conclusions

Radial plots can be useful for displaying time series data, particularly when the behavior of your system tends to be cyclic. The resemblance to a wall clock makes it easy to visualize the activity during different times of the day.

Distributed Systems: scalability and high availability

As presented by Lucindo at QCon São Paulo:

Planned Capacity

As presented at CMG Brazil 2010 national conference.

Basic Library Booklist

Frequently people ask me which books they should read in order to get a grip on capacity planning.

While this list is far from complete, I believe that anyone who reads and understands the concepts presented in these books will be able to start doing capacity planning.

If you want a truly hands-on approach and most of your environment is running on Linux, I deeply recommend “The Art of Capacity Planning” by John Allspaw.

It’s a pretty well written and easy to read book that will make you jumpstart into the capacity planning world.

My currently preferred book on this subject is “Guerrilla Capacity Planning” by Neil J. Gunther. This is a must read for anyone that is serious about modern Capacity Planning.

Another reference book on this subject is “Capacity Planning for Web Performance“, by Daniel A. Menascé and Virgilio A. F. Almeida.

Last but not least, I recommend another book by Neil J. Gunther, “Analyzing Computer System Performance with Perl::PDQ“. There’s a whole chapter about the Linux Load Average and processing queues that changed the way I dealt with Linux performance data.

Application Performance Data Visualization

Performance Data Visualization I

This is the first post of a series on application performance data visualization. I’ll cover a simple but extremely useful technique that allows you to visualize the application behavior under different workloads.

Before we get our hands dirty I need to explain a concept called service time (S).

Service time is the time that a resource in your system is busy during a given transaction. The resource may be physical, like a disk or a processor, or logical like an application process or a thread.

If your system is, let’s say, a store, you can think of the counter as a resource in your system. The time the counter takes to check out would be its service time.

It’s easy to understand that the counter will take more time to check out a hundred items than it takes to handle just a couple of items, so its service time is directly related to the amount of work (or load) it is subjected to. We won’t delve into details like resource sharing for now, but keep in mind that in a real system the service time frequently depends on more than the load on an individual resource.

When I first started to figure out the importance of service time in application performance evaluation, I was amazed to discover that many popular software packages had no way to register the service time for a given transaction, while others did it in a poor and almost useless way.

As an example, before version 0.5.19 nginx would log the time to serve a request in whole seconds. Considering that an HTTP request usually takes less than a second for most websites, you can see how useless this is.

Currently, nginx has a variable called $request_time that allows you to log the time it takes to serve a request in seconds with millisecond resolution. The Apache Web Server has a directive called %D that logs the time it took to serve a request in microseconds. You can find more information about the Apache log directives in the mod_log_config documentation.

Unfortunately neither of these options is enabled in the default configuration files, so you need to configure them manually. I’ll use Apache 2.2.3 running on a CentOS 5.5 system as a testbed for this article.

While it’s beyond the scope of this article to describe the Apache configuration directives, I’ll try to make it simple for someone not familiar with the software to understand what we’re doing.

I’ll edit the Apache configuration file located at /etc/httpd/conf/httpd.conf (remember to make a backup of the original file before committing any changes) and look for the existing LogFormat directives.

There are four pre-configured LogFormat directives in the default configuration file:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

Most sites use the “combined” format or some variation of it, so I’ll create a new log format called “combinedcap” by simply copying the combined format and adding the %D directive. Then I’ll change the CustomLog directive so it will use our new log format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combinedcap
CustomLog logs/access_log combinedcap

If you’re not on a production server, starting httpd should be pretty straightforward; we don’t need any further configuration. Just make sure you allow incoming connections to TCP port 80 if you’re running any kind of firewall (CentOS will enforce security constraints by default).

In a standard setup your log files should be in /etc/httpd/logs/. The access_log file should look like this (note the last field of each line, which holds the information we’ll need for our analysis):

192.168.98.1 - - [14/Aug/2010:23:47:36 -0300] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8" 1523

192.168.98.1 - - [14/Aug/2010:23:47:36 -0300] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.98.129/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8" 37566

192.168.98.1 - - [14/Aug/2010:23:47:36 -0300] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.98.129/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8" 62461

192.168.98.1 - - [14/Aug/2010:23:47:36 -0300] "GET /favicon.ico HTTP/1.1" 404 288 "http://192.168.98.129/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8" 2172

So now we have the time the server received the request and how long it took to serve it. In this example I’ve used the standard Apache 2 Test Page, a static HTML with some images. The longest request took roughly 0,062 seconds and the fastest took 0,0015 seconds.
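Pulling that field out of a combinedcap line can be sketched in Python; since %D was appended at the very end of the format, the service time is simply the last whitespace-separated token (the helper name is mine):

```python
def request_time_us(log_line):
    """Return the %D field (service time in microseconds) from a
    combinedcap access_log line, where it is the final field."""
    return int(log_line.rsplit(None, 1)[-1])
```

For the first log line above this returns 1523, i.e. roughly 0,0015 seconds.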

Obviously, data from a local test server is not useful except for benchmarking purposes, so we’ll use production data from a live website for our analysis. This data was collected over one day on a relatively busy website.

In order to harness all this data we’ll use some Perl scripts and R. If you feel like using Python or any other programming language instead, feel free, and don’t forget to share it!

You can download the Perl script that I use to summarize the Apache log files here.

Once you have enough data (I’d recommend at least a couple of days on a small site), simply run the script and your output should look something like this (diagnostics and progress indicators are omitted for the sake of clarity):

gzip -dc logs/accesslog.20100701* | ./apache_svc.pl 2>/dev/null | tee apache_svc_20100701.out

rate;svctime;freq
32;45316.739953;423
127;209501.459075;562
90;148895.554217;166
71;115292.206897;174
206;355824.196262;214
118;196627.652339;791
102;172595.483986;281
200;345448.035616;365
18;24767.897799;1272
125;207119.412238;621
16;22090.723684;1292
44;68436.582031;256
55;86475.967213;305
84;135554.740506;158
27;36318.533965;633
233;406275.000000;3
161;278944.878049;328
Now we have lots of data to chew on; my resulting output file has a total of 229 data lines plus the header. Let’s take a closer look at a randomly chosen line, like “32;45316.739953;423”:

  • The average service time for an arrival rate of 32 hits per second is roughly 0,045 seconds
  • During the period analyzed there were 423 individual seconds with exactly 32 hits
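What apache_svc.pl computes can be sketched in Python (a simplified, hypothetical stand-in for the Perl script, assuming the log has already been reduced to (arrival second, service time) pairs):

```python
from collections import defaultdict

def rate_summary(requests):
    """Summarize (timestamp_second, service_time_us) pairs into
    {rate: (mean_service_time_us, freq)}, where `freq` is the number
    of individual seconds that saw that arrival rate."""
    per_second = defaultdict(list)   # second -> service times seen in it
    for second, svctime in requests:
        per_second[second].append(svctime)
    total = defaultdict(float)       # summed service time per rate
    count = defaultdict(int)         # requests contributing to each rate
    freq = defaultdict(int)          # seconds that saw each rate
    for times in per_second.values():
        rate = len(times)            # hits observed in this second
        total[rate] += sum(times)
        count[rate] += len(times)
        freq[rate] += 1
    return {r: (total[r] / count[r], freq[r]) for r in freq}
```

Printed as rate;svctime;freq, the result has the same shape as the script output above.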

I’ll import this data to R to see what kind of information we can extract from the data:

> svc <- read.csv("apache_svc_20100701.out",sep = ";")
> summary(svc)
rate          svctime            freq
Min.   :  1.0   Min.   :  1570   Min.   :   1.0
1st Qu.: 64.0   1st Qu.:102105   1st Qu.: 180.0
Median :121.0   Median :200154   Median : 279.0
Mean   :121.4   Mean   :205012   Mean   : 377.3
3rd Qu.:178.0   3rd Qu.:310029   3rd Qu.: 559.0
Max.   :308.0   Max.   :451029   Max.   :1301.0

I’ll begin by plotting a simple scatter plot that correlates the arrival rate with the service time; this gives us an idea of how the server handles traffic. Note that I’ve converted the service time to seconds simply by multiplying it by 10^-6.

So here we can see an almost linear relation between the arrival rate and the service time, with a slightly irregular behavior when we reach more than 200 hits per second. Bear in mind that this may be caused by a lack of samples, so you may want to filter the data and discard service time values for rates that don’t have enough samples; there’s no rule of thumb for that.
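Such a filter can be sketched in Python over the rate;svctime;freq rows (the threshold of 30 samples is an arbitrary illustration, not a recommendation):

```python
def drop_sparse_rates(rows, min_freq=30):
    """Discard (rate, svctime, freq) rows backed by too few samples.

    rows:     iterable of (rate, svctime_us, freq) tuples
    min_freq: minimum number of one-second samples required to keep a rate
    """
    return [(rate, svc, freq) for rate, svc, freq in rows if freq >= min_freq]
```

For instance, the "233;406275.000000;3" row above rests on just 3 seconds of observations, so a filter like this would drop it before plotting.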

Now let’s say you want to know what your system’s workload looks like. I’ll use the frequency field to visualize how many times the server was subjected to each arrival rate:

Conclusions

Service time and arrival rate can be very useful for determining the behavior of your system under different workloads, as well as for discovering the current load your system is subjected to. The techniques presented in this article can be used with almost any system, as long as it provides a way to register the service time.

It’s important to note that it takes some effort to work out how to extract information from the raw log files, but once you understand the data there’s a lot of useful information that can help you in capacity planning and even in troubleshooting some quirky behavior in your systems.

Feel free to contact me if you need any further information, I appreciate your feedback.