I want to offer a way of thinking about statistical distributions that might help anybody, student or otherwise, who is struggling with the concept. That will lead into a discussion of why statistics, compared to other branches of mathematics, is underdeveloped with regard to theoretical foundations. I also want to point out a crucial but overlooked operation of the human mind.
We use statistics when wrestling with data, and data tend to fall into characteristic patterns. In the field of geometry, characteristic patterns are called “shapes.” Circles, squares, triangles, and so on are all well understood and have distinctive properties that we understand and exploit. In statistics, the characteristic patterns are called “distributions.” A distribution is the shape of your data. The best known is the normal distribution, but there are also the geometric distribution, the gamma distribution, the Bernoulli distribution, and so on. Just as we know that for all circles, the ratio of the circumference to the diameter is pi, we know that for all normal distributions, about 4.6% of the data points are two standard deviations or more away from the mean.
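To make the comparison with pi concrete, here is a minimal sketch that computes the share of a normal distribution lying two or more standard deviations from the mean, for a few arbitrary choices of mean and standard deviation. The function name `fraction_beyond` and the example parameters are my own illustration; the only library used is Python's standard `statistics` module.

```python
from statistics import NormalDist


def fraction_beyond(mu: float, sigma: float, k: float = 2.0) -> float:
    """Share of a normal(mu, sigma) distribution at least k standard
    deviations away from the mean, counting both tails."""
    dist = NormalDist(mu, sigma)
    lower_tail = dist.cdf(mu - k * sigma)
    upper_tail = 1.0 - dist.cdf(mu + k * sigma)
    return lower_tail + upper_tail


# Illustrative parameters only: any mean and any positive standard
# deviation give the same answer, just as every circle gives pi.
for mu, sigma in [(0.0, 1.0), (100.0, 15.0), (-3.5, 0.01)]:
    print(f"mean={mu}, sd={sigma}: {fraction_beyond(mu, sigma):.4f}")
```

Every line of output shows the same fraction, a little over four and a half percent, no matter how the distribution is shifted or stretched.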
So, when trying to make sense of statistical distributions, it can sometimes help to think about the “shape” of your data. Statistics is often a matter of assuming a certain distribution, fitting the available data to that distribution, and then filling in the blanks. It is as if we know there is a circle out there in the world, but we do not know its precise size and location. We start collecting tiny little pinpricks of data, sometimes getting a positive result and sometimes not. Soon enough, we are reasonably sure of our circle’s diameter and position, and the job is done. Understand, though, that if our shape actually was not a circle, but in fact was a hexagon, then we might easily delude ourselves. Similarly, if we assume the wrong statistical distribution, then from the small sample of data we collect, we might paint the wrong picture for ourselves.
Now, I am saying that the distribution is the shape of the data, but there is an additional twist: it depends what question you are asking. To ask a particular question is to look on a shape from a particular angle. Imagine a can of tomato soup. From one angle, that can appears rectangular. Depending on the specifics of your data-collecting apparatus, you’re going to need to look for a rectangle in order to find the can. But from another angle, the can is a circle. It depends on where you are coming at it from.
This is directly applicable to statistics. Let’s collect data: customer calls come in to our secretary at an average of five per hour, all independent of each other. Through the magic of hypothetical scenarios, the secretary can handle each call instantaneously, so we don’t have to worry about calls overlapping with each other. One call in seven, on average, results in a successful sale, while the others do not. If that’s our data, then what are the odds that any given ten-minute period will be free of calls? (This question is especially urgent to our secretary, who wishes to take a bathroom break.) What are the odds that two successive calls will both result in sales? Same data, different questions, and just as a soup can could be a rectangle or a circle, the first question calls for a Poisson distribution while the second calls for a binomial distribution.
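For anyone who wants to see the two questions worked out, here is a small sketch using only Python's standard library. It assumes exactly what the scenario states (five independent calls per hour on average, one sale in seven); the constant and function names are my own.

```python
import math

CALLS_PER_HOUR = 5.0          # average arrival rate from the scenario
SALE_PROBABILITY = 1.0 / 7.0  # one call in seven ends in a sale


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events when the expected count is `mean`."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


# Question 1: a ten-minute window with no calls at all.
# Ten minutes is one sixth of an hour, so we expect 5/6 of a call.
mean_calls_in_ten_minutes = CALLS_PER_HOUR * 10.0 / 60.0
p_quiet_window = poisson_pmf(0, mean_calls_in_ten_minutes)

# Question 2: two successive calls that both result in sales.
p_two_sales = binomial_pmf(2, 2, SALE_PROBABILITY)

print(f"No calls in ten minutes:  {p_quiet_window:.4f}")
print(f"Two sales in a row:       {p_two_sales:.4f}")
```

The quiet window comes out at about 43 percent (that is, e raised to the power of minus five sixths), and the back-to-back sales at about 2 percent (one in forty-nine). The inputs are the same in both cases; only the question changes which formula applies.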
Now, the data are records of historical fact: finite records, and often quite short. Distributions are an idealized picture of infinity, a description of what would happen if phenomena repeated for eternity. The problem with eternity is, our secretary doesn’t work those kinds of hours. Statistics lacks a full theoretical footing because statistics is not, ultimately, about theory. Statistics is about actual events, and those are all unique. Statistics is rooted in the real world, and that world has only one existence.
That last point is easy to overlook. The human mind generates simulations of plausible futures, and then it chooses an action. Those simulations are not real. Only one choice, ultimately, is made, and only one world, ultimately, exists. Our vivid imaginations, that is to say our simulations, often delude us into thinking that our theories are “real.” We think that if another sample were drawn from a Gaussian distribution it would have such-and-such characteristics, but hypothetical samples and the Gaussian distribution itself are merely phantoms of the mind. Only reality is real. There are no averages and there are no odds; statistics are sophisticated counting techniques and nothing more.