Introduction to Descriptive Statistics

So far in this chapter we have been concerned with computing "snapshot" quantities: numbers that describe a single device, process or file. But it is often vitally important to give meaning to groups of numbers.

In the last section, we illustrated how multimedia applications require large capacity disk drives for data storage. Anyone working with digital photographs, music or video must be able to predict how much disk space will be required for any given application. Suppose you have taken a dozen pictures with a 3 megapixel digital camera (at high quality setting), and downloaded them to your computer. A list of the files, with their sizes in bytes, is:

806912 IMG_0760.jpg  868352 IMG_0763.jpg  774144 IMG_0766.jpg  815104 IMG_0769.jpg
626688 IMG_0761.jpg  872448 IMG_0764.jpg  806912 IMG_0767.jpg  831488 IMG_0770.jpg
634880 IMG_0762.jpg  884736 IMG_0765.jpg  835584 IMG_0768.jpg  798720 IMG_0771.jpg
Suppose you are preparing to go on vacation; you will be gone for three weeks, and anticipate taking 30 to 40 pictures each day. How much storage space will you need to take with you?

We can characterize a group of numbers by their mean, or average. This is simply the sum of their values divided by how many values there are. In this example, the mean file size is

(806912 + 626688 + 634880 + 868352 + 872448 + 884736 + 774144 + 806912 + 835584 + 815104 + 831488 + 798720) / 12
= 796330.6667
which we round up to 796331 bytes (since there must be a whole number of bytes). Now we can predict that for 40 pictures per day for 21 days, we will need
796331 * 40 * 21 = 668918040 bytes
or approximately 638 MB of storage (taking 1 MB as 1,048,576 bytes). Note that none of the files has a size equal to the average, but if we multiply the average by the number of pictures, we still get a pretty good idea of how much storage we need.
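
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes 1 MB = 1,048,576 bytes, which is the convention that reproduces the 638 MB figure.

```python
import math

# File sizes in bytes, copied from the listing above.
FILE_SIZES = [
    806912, 626688, 634880, 868352, 872448, 884736,
    774144, 806912, 835584, 815104, 831488, 798720,
]


def mean_file_size(sizes):
    """Return the mean size, rounded up to a whole number of bytes."""
    return math.ceil(sum(sizes) / len(sizes))


def storage_needed(mean_bytes, pictures_per_day, days):
    """Estimate total bytes as mean size times the number of pictures."""
    return mean_bytes * pictures_per_day * days


mean_bytes = mean_file_size(FILE_SIZES)
total_bytes = storage_needed(mean_bytes, pictures_per_day=40, days=21)
total_mb = total_bytes / 2**20  # assumption: 1 MB = 1,048,576 bytes

print(mean_bytes, total_bytes, round(total_mb))  # 796331 668918040 638
```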

How good that estimate is depends on how close our individual data items (the 12 file sizes) are to the average. To quantify this, so that we know something about how good our estimates are, we can compute the range of the data in terms of the mean. The lowest data value was 626688; the highest was 884736. The difference between the mean and the lowest is

796331 - 626688 = 169643
while the difference between the highest and the mean is
884736 - 796331 = 88405.
The larger of these differences is called the maximum absolute deviation. The range is then expressed as
796331 ± 169643,
that is, the mean plus or minus the maximum absolute deviation, since all of the data items fall within this range.

We can express the range as a percentage of the mean by using the maximum percentage deviation, which is equal to the maximum absolute deviation divided by the mean, times 100%:

(169643 / 796331) * 100%
= 21.303 %
so that we can write the range as 796331 ± 21.3%. Note that the maximum percentage deviation has no units: we divided the absolute deviation by the mean, so the units canceled. This result indicates that our estimate of how much storage space we will need should not be off by much more than 20%, and we can probably get by with three 256 MB flash cards for our camera.
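
As a rough sketch, the two deviation measures can be computed as follows. The function names are hypothetical, and the mean is the rounded value used in the text.

```python
# File sizes in bytes, copied from the listing above.
FILE_SIZES = [
    806912, 626688, 634880, 868352, 872448, 884736,
    774144, 806912, 835584, 815104, 831488, 798720,
]
MEAN = 796331  # rounded mean from the text


def max_absolute_deviation(sizes, mean):
    """Largest distance between any data item and the mean."""
    return max(abs(x - mean) for x in sizes)


def max_percentage_deviation(sizes, mean):
    """Maximum absolute deviation expressed as a percentage of the mean."""
    return 100.0 * max_absolute_deviation(sizes, mean) / mean


max_dev = max_absolute_deviation(FILE_SIZES, MEAN)
max_pct = max_percentage_deviation(FILE_SIZES, MEAN)

print(max_dev, round(max_pct, 3))  # 169643 21.303
```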

Distributions

There are several well-known functions which can help us to organize our data and make additional predictions. These functions are called distributions because we will use them as models of how various kinds of data are distributed.

The most important of these distributions is the Normal or Gaussian Distribution. It describes data which is randomly distributed about the mean, and is familiar to most students as the "bell curve". Here is the normal distribution corresponding to our data (scaled by a factor of 1000 for readability):

It is symmetric about the mean, and the "ends" asymptotically approach zero. It describes the probability with which any given random value will be included in the data set: values near the mean are much more probable than values far away. The width of the curve is described using the standard deviation, and it is useful to know that approximately 68.27%, or just over two thirds, of the data items will fall in the range described by the mean plus or minus one standard deviation (within the red lines above). Thus the standard deviation gives us a convenient measure of the "spread" of a set of random data items. It can be computed using the formula

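$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - m)^2}{n}} $$
where the $x_i$ are the individual data items, $m$ is their mean, and $n$ is the number of data items.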

For the file sizes in our example, the standard deviation is then

$$ \sqrt{\frac{(806912 - 796331)^2 + (626688 - 796331)^2 + (634880 - 796331)^2 + (868352 - 796331)^2 + (872448 - 796331)^2 + (884736 - 796331)^2 + (774144 - 796331)^2 + (806912 - 796331)^2 + (835584 - 796331)^2 + (815104 - 796331)^2 + (831488 - 796331)^2 + (798720 - 796331)^2}{12}} $$
= 80359.9
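
The same computation can be sketched in Python. The function names are hypothetical; the sum of squares is divided by the number of items (12), as in the computation above. The last two lines compare the share of our 12 files that lie within one standard deviation of the mean with the 68.27% expected for an ideal normal distribution, which can be obtained from the error function.

```python
import math

# File sizes in bytes, copied from the listing above.
FILE_SIZES = [
    806912, 626688, 634880, 868352, 872448, 884736,
    774144, 806912, 835584, 815104, 831488, 798720,
]
MEAN = 796331  # rounded mean from the text


def std_dev(values, mean):
    """Square root of the mean squared deviation (divides by n)."""
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def fraction_within(values, mean, width):
    """Fraction of the values lying within mean +/- width."""
    inside = [x for x in values if abs(x - mean) <= width]
    return len(inside) / len(values)


s = std_dev(FILE_SIZES, MEAN)
print(round(s, 1))                           # 80359.9
print(fraction_within(FILE_SIZES, MEAN, s))  # 0.75 (9 of the 12 files)
print(round(math.erf(1 / math.sqrt(2)), 4))  # 0.6827 for an ideal normal curve
```

The result agrees with the hand computation above: 80359.9 bytes.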
This means that our range was significantly broader than the width of our normal distribution: our estimate of how much storage we need is probably better than we at first thought. We can verify this by examining a histogram of our data: we split the range into "bins", and plot the number of data items in each bin:

We can see from this that most of our data is clustered together. The two files which had much smaller sizes were images with less detail than the others. This is a characteristic of JPEG (Joint Photographic Experts Group) files: because JPEG is a lossy compression method, the sizes of the files it generates are closely related to the amount of detail in the original pictures.
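
A text histogram of the file sizes can be produced with a short sketch like the one below. The choice of six equal-width bins is an assumption made for illustration, and the function name is hypothetical.

```python
# File sizes in bytes, copied from the listing above.
FILE_SIZES = [
    806912, 626688, 634880, 868352, 872448, 884736,
    774144, 806912, 835584, 815104, 831488, 798720,
]


def histogram(values, num_bins):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((x - low) / width), num_bins - 1)
        counts[index] += 1
    return low, width, counts


low, width, counts = histogram(FILE_SIZES, num_bins=6)  # 6 bins: an assumption
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:9.0f} - {start + width:9.0f} | {'*' * count}")
```

With these bins the counts are 2, 0, 0, 1, 6 and 3: the two small files sit alone in the lowest bin, while the other ten files fall in the upper three bins.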

Another important distribution is the Poisson Distribution. It describes both service requests and response times, and is a common model for simulation and analysis of device queuing and access times (for instance, how requests for disk access are handled by an operating system, and how long they take). It depends on the mean rate of occurrence and the number of events that can occur during that time (i.e., the number of servers available):

The red plot corresponds to a doubling of the rate of the black plot, while the blue plot corresponds to a doubling of the number of servers. We can see that response times grow more quickly when the requests arrive at a greater rate, and less quickly when there are more resources to service the requests.
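
For reference, here is a minimal sketch of the standard Poisson probability function, which gives the probability of exactly k events in an interval when the mean number of events per interval is known. It does not attempt to reproduce the plots described above, which also involve the number of servers; the rates 2.0 and 4.0 are arbitrary values chosen to show the effect of doubling the rate.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean count per interval is rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


# Hypothetical rates: 2 requests per interval, then double that.
for rate in (2.0, 4.0):
    probs = [poisson_pmf(k, rate) for k in range(10)]
    print(rate, [round(p, 3) for p in probs])
```

Doubling the rate shifts the most probable number of requests per interval to the right and spreads the probabilities over a wider range of counts.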

Finally, we will be interested in the exponential distribution:

This is not a probability distribution (although it is closely related to the inverse of the exponential probability distribution) because the area under the plot is infinite. However, it is extremely important in that it describes response times in contention-based environments like Ethernet networks. The red plot above represents an increase in the request rate of 20% over that of the black. You can see that for the left side of the plots, the response time is nearly a linear function of the request rate. But as the request rate increases, the response times increase exponentially. By regularly plotting your network response times, you can recognize when they are beginning to rise exponentially, and can upgrade your network resources accordingly.
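
The "nearly linear at first, then exponential" behavior can be illustrated with a small sketch. The model below is purely hypothetical: it assumes response time is proportional to exp(request_rate / scale), with base_time and scale invented for illustration, and compares it with its straight-line approximation near zero.

```python
import math


def modeled_response_time(request_rate, base_time=1.0, scale=1.0):
    """Hypothetical model: response time grows as exp(request_rate / scale)."""
    return base_time * math.exp(request_rate / scale)


def linear_approximation(request_rate, base_time=1.0, scale=1.0):
    """Straight-line approximation of the model, accurate only near zero."""
    return base_time * (1.0 + request_rate / scale)


for rate in (0.05, 0.1, 0.5, 1.0, 2.0, 4.0):
    exact = modeled_response_time(rate)
    approx = linear_approximation(rate)
    print(f"rate={rate:4.2f}  model={exact:7.3f}  linear={approx:6.3f}  "
          f"ratio={exact / approx:6.3f}")
```

For small rates the ratio stays close to 1, so the curve is hard to tell apart from a straight line; at larger rates the ratio climbs quickly, which is the point at which measured response times would begin to pull away from a linear trend.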

It is important to note that in any of these computations, larger data sets enable us to perform a much more reliable statistical analysis. For example, if we analyze 40 files instead of 12 from the same set of pictures, we obtain a mean file size of 866816 bytes with a standard deviation of 108265.5. By considering a larger data set, we find that our previous estimates of both the mean and the range were low.

It is also important to understand our data. The pictures we have been discussing were all taken indoors. A different set of pictures (this time 114 pictures taken outside) yields a mean of 1765376 bytes with a standard deviation of 480810.2. So with the understanding that JPEG files vary tremendously with the detail contained in the pictures, we know not to compare file sizes from two very different sets of subjects!
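
To see how much the choice of data set matters, the vacation estimate (40 pictures per day for 21 days) can be repeated with each of the three means quoted in this section. This is a sketch only: the labels and function name are hypothetical, and 1 MB is again assumed to be 1,048,576 bytes.

```python
# Mean file sizes in bytes, as quoted in the text.
MEAN_SIZES = {
    "12 indoor pictures": 796331,
    "40 indoor pictures": 866816,
    "114 outdoor pictures": 1765376,
}
PICTURES = 40 * 21  # 40 pictures per day for 21 days


def estimate_mb(mean_bytes, pictures=PICTURES):
    """Estimated storage in MB, assuming 1 MB = 1,048,576 bytes."""
    return mean_bytes * pictures / 2**20


for label, mean_bytes in MEAN_SIZES.items():
    print(f"{label:22s} {estimate_mb(mean_bytes):7.1f} MB")
```

The estimates come to roughly 638 MB, 694 MB and 1414 MB respectively, so a vacation spent photographing outdoors would need more than twice the storage predicted from the indoor pictures.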

We hope that you have learned a great deal from this text, and that it will help you to be successful in your future coursework, as well as in your careers.



©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.
