So far in this chapter we have been concerned with computing "snapshot" quantities: numbers that describe a single device, process or file. But it is often vitally important to give meaning to groups of numbers.
In the last section, we illustrated how multimedia applications require large capacity disk drives for data storage. Anyone working with digital photographs, music or video must be able to predict how much disk space will be required for any given application. Suppose you have taken a dozen pictures with a 3 megapixel digital camera (at high quality setting), and downloaded them to your computer. A list of the files, with their sizes in bytes, is:
806912 IMG_0760.jpg
626688 IMG_0761.jpg
634880 IMG_0762.jpg
868352 IMG_0763.jpg
872448 IMG_0764.jpg
884736 IMG_0765.jpg
774144 IMG_0766.jpg
806912 IMG_0767.jpg
835584 IMG_0768.jpg
815104 IMG_0769.jpg
831488 IMG_0770.jpg
798720 IMG_0771.jpg
Suppose you are preparing to go on vacation; you will be gone for three weeks, and anticipate taking 30 to 40 pictures each day. How much storage space will you need to take with you?
We can characterize a group of numbers by their mean, or average. This is
simply the sum of their values divided by how many values there are. In this example, the mean
file size is
How good an estimate the mean gives us depends on how close our individual data items (the 12 file sizes) are
to the average. But how do we quantify that, so that we will know something about how good our estimates are?
One way is to compute the range of the data in terms of the mean. The lowest data value was 626688;
the highest was 884736. The difference between the mean and the lowest is
We can express the range as a percentage of the mean by using the maximum percentage deviation, which is
equal to the maximum absolute deviation divided by the mean, times 100%:
The most important of these distributions is the Normal or Gaussian Distribution.
It describes data which is randomly distributed about the mean, and is familiar to most students as
the "bell curve". Here is the normal distribution corresponding to our data (scaled by a factor of 1000 for
readability):
It is symmetric about the mean, and the "ends" asymptotically approach zero. It describes the
probability with which any given random value will be included in the data set: values near the
mean are much more probable than values far away. The width of the curve is described using
the standard deviation, and it is useful to know that approximately 68.27%, or just over two thirds,
of the data items will fall in the range described by the mean plus or minus one
standard deviation (within the red lines above). Thus the standard deviation gives us a convenient measure of
the "spread" of a set of random data items. It can be computed using the formula
For the file sizes in our example, the standard deviation is then
We can see from this that most of our data is clustered together. The two files which had much smaller
sizes were images with less detail than the others. This is a characteristic of JPEG
(Joint Photographic Experts Group) files: because JPEG is a lossy compression method,
the sizes of the files it generates are closely related to the amount of detail in the original pictures.
Another important distribution is the Poisson Distribution. It describes both service requests
and response times, and is a common model for simulation and analysis of device queuing and
access times (for instance, how requests for disk access are handled by an operating system, and
how long they take). It depends on the mean rate of occurrence and the number of events that can
occur during that time (i.e., the number of servers available):
The red plot corresponds to a doubling of the rate of the black plot, while the blue plot
corresponds to a doubling of the number of servers. We can see that response times grow more quickly
when the requests arrive at a greater rate, and less quickly when there are more resources
to service the requests.
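As a point of reference, here is a minimal Python sketch of the textbook Poisson probability mass function, P(k) = rate^k * e^(-rate) / k!, which gives the probability of exactly k events in an interval when the mean number of events per interval is `rate`. The function name `poisson_pmf` and the two rates are illustrative assumptions; the sketch does not model the number of servers, so it is not necessarily the exact parameterization used for the plots.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean number per interval is rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

# Compare an illustrative rate with double that rate (both values are assumptions).
for rate in (2.0, 4.0):
    probs = [poisson_pmf(k, rate) for k in range(10)]
    print(rate, [round(p, 3) for p in probs])
```

Doubling the rate shifts the bulk of the probability toward larger event counts, which is the qualitative effect of a higher request rate.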
Finally, we will be interested in the exponential distribution:
This is not a probability distribution (although it is closely related to the inverse
of the exponential probability distribution) because the area under the plot is infinite.
However, it is extremely important in that it describes response times in contention-based
environments like Ethernet networks. The red plot above represents an increase in the
request rate of 20% over that of the black. You can see that for the left side of the plots,
the response time is nearly a linear function of the request rate. But as the request
rate increases, the response times increase exponentially. By regularly plotting your
network response times, you can recognize when they are beginning to rise exponentially, and
can upgrade your network resources accordingly.
It is important to note that in any of these computations, larger data sets enable us to perform a
much more reliable statistical analysis. For example, if we analyze 40 files instead of 12
from the same set of pictures, we obtain a mean file size of 866816 bytes with a standard
deviation of 108265.5. By considering a larger data set, we find that our previous estimates of both
the mean and the range were low.
It is also important to understand our data. The pictures we have
been discussing were all taken indoors. A different set of pictures (this time 114 pictures taken
outside) yields a mean of 1765376 bytes with a standard deviation of 480810.2. So with the understanding
that JPEG files vary tremendously with the detail contained in the pictures, we know not
to compare file sizes from two very different sets of subjects!
We hope that you have learned a great deal from this text, and that it will help you to be
successful in your future coursework, as well as in your careers.
©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely
reproduced provided that this
copyright notice is included.
Please send comments or suggestions to
the author.
(806912 + 626688 + 634880 + 868352 + 872448 + 884736 + 774144 + 806912 + 835584 +
815104 + 831488 + 798720) / 12 = 796330.6667
which we round up to 796331 bytes (since there must be a whole number of bytes). Now we can
predict that at 40 pictures per day for 21 days, we will need
796331 * 40 * 21 = 668918040 bytes
or approximately 638 MB of storage. Note that none of the files had a size that was equal to the
average, but that if we multiply the average by the number of pictures, we will get a pretty good
idea of how much storage we need.
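The mean and the storage estimate can be reproduced with a short Python sketch. The function names are our own, and the conversion assumes 1 MB = 2**20 bytes, which is the convention that reproduces the 638 MB figure.

```python
import math

# File sizes in bytes, copied from the listing in the text.
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]

def mean_size(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

def storage_estimate(values, pictures_per_day, days):
    """Estimated bytes needed, using the mean rounded up to a whole byte."""
    return math.ceil(mean_size(values)) * pictures_per_day * days

estimate = storage_estimate(sizes, 40, 21)
print(round(mean_size(sizes), 4))   # mean file size in bytes
print(estimate)                     # estimated bytes for the whole trip
print(round(estimate / 2**20))      # the same estimate in MB (assumes 1 MB = 2**20 bytes)
```

Changing the second argument to 30 gives the estimate for the lower end of the anticipated 30 to 40 pictures per day.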
796331 - 626688 = 169643
while the difference between the highest and the mean is
884736 - 796331 = 88405.
The larger of these differences is called the maximum absolute deviation. The range is then expressed as
796331 ± 169643,
that is, the mean plus or minus the maximum absolute deviation, since all of the data items fall within this range.
(169643 / 796331) * 100% = 21.303%
so that we can write the range as 796331 ± 21.3%. Note that the maximum percentage deviation has no units:
we divided the absolute deviation by the mean, so the units canceled.
This result indicates that our estimate of how much storage space we will need
should not be off by much more than 20%, and we can probably get by with three 256 MB flash cards for
our camera.
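The maximum absolute deviation and the maximum percentage deviation can be checked with a few lines of Python. This is only a sketch: the function names are our own, and the mean is taken as the rounded-up value 796331 used in the worked example.

```python
# File sizes in bytes, copied from the listing in the text.
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]
mean = 796331  # rounded-up mean from the worked example

def max_abs_deviation(values, center):
    """Largest distance between any data item and the center value."""
    return max(abs(v - center) for v in values)

def max_pct_deviation(values, center):
    """Maximum absolute deviation expressed as a percentage of the center."""
    return max_abs_deviation(values, center) / center * 100

dev = max_abs_deviation(sizes, mean)
pct = max_pct_deviation(sizes, mean)
print(dev)            # maximum absolute deviation in bytes
print(round(pct, 3))  # maximum percentage deviation (unitless)
```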
Distributions
There are several well-known functions which can help us to organize our data and make additional
predictions. These functions are called distributions because we will use them as models
of how various kinds of data are distributed.
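$$ \sigma = \left( \frac{\sum_{i} (x_i - \mu)^2}{n} \right)^{1/2} $$
where $\sigma$ is the standard deviation, the $x_i$ are the individual data items, $\mu$ is their mean and $n$ is the number of data items.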
$$
\Big( \big( (806912 - 796331)^2 + (626688 - 796331)^2 + (634880 - 796331)^2
+ (868352 - 796331)^2 + (872448 - 796331)^2 + (884736 - 796331)^2
+ (774144 - 796331)^2 + (806912 - 796331)^2 + (835584 - 796331)^2
+ (815104 - 796331)^2 + (831488 - 796331)^2 + (798720 - 796331)^2 \big) / 12 \Big)^{1/2}
= 80359.9
$$
This means that our range was significantly broader than the width of our normal distribution: our
estimate of how much storage we need is probably better than we at first thought. We can verify this
by examining a histogram of our data: we split the range into "bins", and plot the number
of data items in each bin:
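The standard deviation calculation and the binning step behind a histogram can be sketched together in Python. Several choices here are assumptions rather than part of the original example: the function names, the 50000-byte bin width, and the text-mode plot. The standard deviation divides by n and uses the rounded-up mean 796331, as in the worked example.

```python
import math

# File sizes in bytes, copied from the listing in the text.
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]
mean = 796331  # rounded-up mean from the worked example

def std_dev(values, center):
    """Square root of the mean squared deviation from center (divides by n)."""
    return math.sqrt(sum((v - center) ** 2 for v in values) / len(values))

def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

sigma = std_dev(sizes, mean)
within = [v for v in sizes if abs(v - mean) <= sigma]
normal_fraction = math.erf(1 / math.sqrt(2))  # share of a normal curve within one sigma

print(round(sigma, 1))
print(len(within), "of", len(sizes), "files lie within one standard deviation")
print(round(normal_fraction * 100, 2), "% expected for a normal distribution")

bins = histogram(sizes, 50000)  # the bin width is an arbitrary choice
for start, count in bins.items():
    print(f"{start:>7} to {start + 49999:>7}  {'#' * count}")
```

With only 12 data items, the fraction falling within one standard deviation can differ noticeably from the value expected for a normal distribution.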