
Friday, April 27, 2007

Regression and Correlation Analysis

As you develop Cause & Effect diagrams based on data, you may wish to examine the degree of correlation between variables. A statistical measurement of correlation can be calculated using the least squares method to quantify the strength of the relationship between two variables. The output of that calculation is the Correlation Coefficient, or (r), which ranges between -1 and 1. A value of 1 indicates perfect positive correlation - as one variable increases, the second increases in a linear fashion. Likewise, a value of -1 indicates perfect negative correlation - as one variable increases, the second decreases. A value of zero indicates no linear correlation between the variables.

Before calculating the Correlation Coefficient, the first step is to construct a scatter diagram. Most spreadsheets, including Excel, can handle this task. Looking at the scatter diagram will give you a broad understanding of the correlation. Following is a scatter plot chart example based on an automobile manufacturer. In this case, the process improvement team is analyzing door closing efforts to understand what the causes could be. The Y-axis represents the width of the gap between the sealing flange of a car door and the sealing flange on the body - a measure of how tight the door is set to the body. The fishbone diagram indicated that variability in the seal gap could be a cause of variability in door closing efforts.

In this case, you can see a pattern in the data indicating a negative correlation (negative slope) between the two variables. In fact, the Correlation Coefficient is -0.78, indicating a strong negative relationship.

Simple Regression Analysis

While Correlation Analysis assumes no causal relationship between variables, Regression Analysis assumes that one variable is dependent upon: A) another single independent variable (Simple Regression), or B) multiple independent variables (Multiple Regression). Regression plots a line of best fit to the data using the least-squares method. You can see an example below of linear regression using the same car door scatter plot:

You can see that the data is clustered closely around the line, and that the line has a downward slope. The strong negative correlation is expressed by two related statistics: the r value, as stated before, is -0.78; the r2 value is therefore 0.61. R2, called the Coefficient of Determination, expresses how much of the variability in the dependent variable is explained by variability in the independent variable. You may find that a non-linear equation such as an exponential or power function provides a better fit, and a higher r2, than a linear equation.
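The r and r2 statistics described above can be computed directly with the least-squares (product-moment) formula. The numbers below are invented for illustration, since the article's actual gap and effort measurements aren't given; they simply show a negative relationship like the car-door example:

```python
import math

# Hypothetical seal-gap (mm) and door-closing-effort readings.
# These values are made up purely to illustrate the calculation.
gap    = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
effort = [9.0, 8.8, 7.5, 7.9, 6.4, 6.8, 5.1, 5.6]

def pearson_r(x, y):
    """Correlation Coefficient via the least-squares (product-moment) formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(gap, effort)
print(round(r, 2), round(r * r, 2))  # r is negative; r*r is the Coefficient of Determination
```

Note that squaring r gives r2 regardless of sign, which is why r2 alone cannot tell you the direction of the relationship.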

Multiple Regression Analysis
Multiple Regression Analysis uses a similar methodology as Simple Regression, but includes more than one independent variable. Econometric models are a good example, where the dependent variable of GNP may be analyzed in terms of multiple independent variables, such as interest rates, productivity growth, government spending, savings rates, consumer confidence, etc.
Many times historical data is used in multiple regression in an attempt to identify the most significant inputs to a process. The benefit of this type of analysis is that it can be done very quickly and relatively simply. However, there are several potential pitfalls:


The data may be inconsistent due to different measurement systems, calibration drift, different operators, or recording errors.

The range of the variables may be very limited, and can give a false indication of low correlation. For example, a process may have temperature controls because temperature has been found in the past to have an impact on the output. Using historical temperature data may therefore indicate low significance because the range of temperature is already controlled in tight tolerance.

There may be a time lag that influences the relationship - for example, temperature may be much more critical at an early point in the process than at a later point, or vice-versa. There also may be inventory effects that must be taken into account to make sure that all measurements are taken at a consistent point in the process.
Once again, it is critical to remember that correlation is not causality. As stated by Box, Hunter and Hunter: "Broadly speaking, to find out what happens when you change something, it is necessary to change it. To safely infer causality the experimenter cannot rely on natural happenings to choose the design for him; he must choose the design for himself and, in particular, must introduce randomization to break the links with possible lurking variables".1
Returning to our example of door closing efforts, you will recall that the door seal gap had an r2 of 0.61. Using multiple regression, and adding the additional variable "door weatherstrip durometer" (softness), the r2 rises to 0.66. So the durometer of the door weatherstrip added some explanatory power, but not much. Analyzed individually, durometer had much lower correlation with door closing efforts - only 0.41. This analysis was based on historical data, so as previously noted, the regression analysis only tells us what did have an impact on door efforts, not what could have an impact. If the range of durometer measurements were greater, we might have seen a stronger relationship with door closing efforts, and more variability in the output.
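The mechanics of fitting more than one independent variable can be sketched with NumPy's least-squares solver. All of the data below is invented (neither the gap/effort measurements nor the durometer readings from the door study are published), so the coefficients are illustrative only:

```python
import numpy as np

# Hypothetical data, made up purely to illustrate the mechanics.
seal_gap  = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
durometer = np.array([62., 60., 65., 58., 63., 61., 59., 64.])
effort    = np.array([9.0, 8.8, 7.5, 7.9, 6.4, 6.8, 5.1, 5.6])

# Design matrix with an intercept column: effort ~ b0 + b1*gap + b2*durometer
X = np.column_stack([np.ones_like(seal_gap), seal_gap, durometer])
coefs, *_ = np.linalg.lstsq(X, effort, rcond=None)

# R-squared: share of the variability in effort explained by the fit.
pred = X @ coefs
ss_res = np.sum((effort - pred) ** 2)
ss_tot = np.sum((effort - effort.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(coefs, round(r_squared, 2))
```

The same least-squares machinery handles any number of independent variables; you just add columns to the design matrix.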

Sample Sizes

The best way to figure this one out is to think about it backwards. Let's say you picked a specific number of people in the United States at random. What then is the chance that the people you picked do not accurately represent the U.S. population as a whole? For example, what is the chance that the percentage of those people you picked who said their favorite color was blue does not match the percentage of people in the entire U.S. who like blue best?

(Of course, our little mental exercise here assumes you didn't do anything sneaky like phrase your question in a way to make people more or less likely to pick blue as their favorite color. Like, say, telling people "You know, the color blue has been linked to cancer. Now that I've told you that, what is your favorite color?" That's called a leading question, and it's a big no-no in surveying.)

Common sense will tell you (if you listen...) that the chance that your sample is off the mark will decrease as you add more people to your sample. In other words, the more people you ask, the more likely you are to get a representative sample. This is easy so far, right?

Okay, enough with the common sense. It's time for some math. (insert smirk here) The formula that describes the relationship I just mentioned is basically this:

The margin of error in a sample = 1 divided by the square root of the number of people in the sample
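In code, that rule of thumb reads:

```python
import math

def margin_of_error(n):
    """Rough (95%-confidence) margin of error for a simple random sample of n people."""
    return 1 / math.sqrt(n)

print(margin_of_error(1600))  # 0.025, i.e. 2.5 percentage points
```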

How did someone come up with that formula, you ask? Like most formulas in statistics, this one can trace its roots back to pathetic gamblers who were so desperate to hit the jackpot that they'd even stoop to mathematics for an "edge." If you really want to know the gory details, the formula is derived from the standard deviation of the proportion of times that a researcher gets a sample "right," given a whole bunch of samples.

Which is mathematical jargon for..."Trust me. It works, okay?"

So a sample of 1,600 people gives you a margin of error of 2.5 percent, which is pretty darn good for a poll. (See Margin of Error for more details on that term, and on polls in general.) Now, remember that the size of the entire population doesn't matter here. You could have a nation of 250,000 people or 250 million and that won't affect how big your sample needs to be to come within your desired margin of error. The Math Gods just don't care.

Of course, sometimes you'll see polls with anywhere from 600 to 1,800 people, all promising the same margin of error. That's because often pollsters want to break down their poll results by the gender, age, race or income of the people in the sample. To do that, the pollster needs to have enough women, for example, in the overall sample to ensure a reasonable margin of error among just the women. And the same goes for young adults, retirees, rich people, poor people, etc. That means that in order to have a poll with a margin of error of five percent among many different subgroups, a survey will need to include many more than the minimum 400 people in the overall sample.
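Inverting the formula shows why subgroups inflate the required sample size. The 52 percent share of women used below is an assumed figure, just for illustration:

```python
import math

def sample_size_for(moe):
    # Invert moe = 1 / sqrt(n)  ->  n = 1 / moe**2
    return math.ceil(1 / moe ** 2)

overall = sample_size_for(0.05)  # 400 people for +/-5 points overall

# But to get +/-5 points among just the women, you need ~400 *women*.
# If women are an assumed 52% of respondents, the total sample must grow:
total_needed = math.ceil(sample_size_for(0.05) / 0.52)
print(overall, total_needed)
```

Repeat that logic for every subgroup you want to report on, and the required overall sample climbs quickly.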

Data Analysis

You wouldn't buy a car or a house without asking some questions about it first. So don't go buying into someone else's data without asking questions, either.

Okay, you're saying... but with data there are no tires to kick, no doors to slam, no basement walls to check for water damage. Just numbers, graphs and other scary statistical things that are causing you to have bad flashbacks to your last income tax return. What the heck can you ask about data?

Plenty. Here are a few standard questions you should ask any human beings who slap a pile of data in front of you and ask you to write about it.

  1. Where did the data come from? Always ask this one first. You always want to know who did the research that created the data you're going to write about.

    You'd be surprised - sometimes it turns out that the person who is feeding you a bunch of numbers can't tell you where they came from. That should be your first hint that you need to be very skeptical about what you are being told.

    Even if your data have an identifiable source, you still want to know what it is. You might have some extra questions to ask about a medical study on the effects of secondhand smoking if you knew it came from a bunch of researchers employed by a tobacco company instead of from, say, a team of research physicians from a major medical school, for example. Or if you knew a study about water safety came from a political interest group that had been lobbying Congress for a ban on pesticides.

    Just because a report comes from a group with a vested interest in its results doesn't guarantee the report is a sham. But you should always be extra skeptical when looking at research generated by people with a political agenda. At the least, they have plenty of incentive NOT to tell you about data they found that contradict their organization's position.

    Which brings us to the next question:

  2. Have the data been peer-reviewed? Major studies that appear in journals like the New England Journal of Medicine undergo a process called "peer review" before they are published. That means that professionals - doctors, statisticians, etc. - have looked at the study before it was published and concluded that the study's authors pretty much followed the rules of good scientific research and didn't torture their data like a middle ages infidel to make the numbers conform to their conclusions.

    Always ask if research was formally peer reviewed. If it was, you know that the data you'll be looking at are at least minimally reliable.

    And if it wasn't peer-reviewed, ask why. It may be that the research just wasn't interesting to enough people to warrant peer review. Or it could mean that the research had as much chance of standing up to professional scrutiny as a $500 mobile home has of standing up in a tornado.

  3. How were the data collected? This one is real important to ask, especially if the data were not peer-reviewed. If the data come from a survey, for example, you want to know that the people who responded to the survey were selected at random.

In 1996, the Orlando Sentinel released the results of a poll in which more than 90 percent of those people who responded said that Orlando's National Basketball Association team, the Orlando Magic, shouldn't re-sign its center, Shaquille O'Neal, for the amount of money he was asking. The results of that poll were widely reported as evidence that Shaq wasn't wanted in Orlando, and in fact, O'Neal signed with the Los Angeles Lakers a few days later.

    Unfortunately for Magic fans, that poll was about as trustworthy as one of those cheesy old "Magic 8 Balls." The survey was a call-in poll where anyone who wanted could call a telephone number at the paper and register his or her vote.

    This is what statisticians call a "self-selected sample." For all we know, two or three people who got laid off that morning and were ticked off at the idea of someone earning $100 million to play basketball could have flooded the Sentinel's phone lines, making it appear as though the people of Orlando despised Shaq.

    Another problem with data is "cherry-picking." This is the social-science equivalent of gerrymandering, where you draw up a legislative district so that all the people who are going to vote for your candidate are included in your district and everyone else is scattered among a bunch of other districts.

    Be on the lookout for cherry-picking, for example, in epidemiological (a fancy word for the study of disease that sometimes means: "We didn't go out and collect any data ourselves. We just used someone else's data and played 'connect the dots' with them in an attempt to find something interesting.") studies looking at illnesses in areas surrounding toxic-waste dumps, power lines, high school cafeterias, etc. It is all too easy for a lazy researcher to draw the boundaries of the area he or she is looking at to include several extra cases of the illness in question and exclude many healthy individuals in the same area.

    When in doubt, plot the subjects of a study on a map and look for yourself to see if the boundaries make sense.

  4. Be skeptical when dealing with comparisons. Researchers like to do something called a "regression," a process that compares one thing to another to see if they are statistically related. They will call such a relationship a "correlation." Always remember that a correlation DOES NOT mean causation.

    A study might find that an increase in the local birth rate was correlated with the annual migration of storks over the town. This does not mean that the storks brought the babies. Or that the babies brought the storks.

    Statisticians call this sort of thing a "spurious correlation," which is a fancy term for "total coincidence."

    People who want something from others often use regression studies to try to support their cause. They'll say something along the lines of "a study shows that a new police policy that we want led to a 20 percent drop in crime over a 10-year period in (some city)."

    That might be true, but the drop in crime could be due to something other than that new policy. What if, say, the average age of those cities' residents increased significantly over that 10 year period? Since crime is believed to be age-dependent (meaning the more young men you have in an area, the more crime you have), the aging of the population could potentially be the cause of the drop in crime.

    The policy change and the drop in crime might have been correlated. But that does not mean that one caused the other.

  5. Finally, be aware of numbers taken out of context. Again, data that are "cherry picked" to look interesting might mean something else entirely once it is placed in a different context.

    Consider the following example from Eric Meyer, a professional reporter now working at the University of Illinois:

    My personal favorite was a habit we used to have years ago, when I was working in Milwaukee. Whenever it snowed heavily, we'd call the sheriff's office, which was responsible for patrolling the freeways, and ask how many fender-benders had been reported that day. Inevitably, we'd have a lede that said something like, "A fierce winter storm dumped 8 inches of snow on Milwaukee, snarled rush-hour traffic and caused 28 fender-benders on county freeways" -- until one day I dared to ask the sheriff's department how many fender-benders were reported on clear, sunny days. The answer -- 48 -- made me wonder whether in the future we'd run stories saying, "A fierce winter snowstorm prevented 20 fender-benders on county freeways today." There may or may not have been more accidents per mile traveled in the snow, but clearly there were fewer accidents when it snowed than when it did not.

It is easy for people to go into brain-lock when they see a stack of papers loaded with numbers, spreadsheets and graphs. (And some sleazy sources are counting on it.) But your readers are depending upon you to make sense of that data for them.

Use what you've learned on this page to look at data with a more critical attitude. (That's critical, not cynical. There is a great deal of excellent data out there.) The worst thing you can do as a writer is to pass along someone else's word about data without any idea whether that person's worth believing or not.

Margin of Error and Confidence Interval

Margin of Error deserves better than the throw-away line it gets in the bottom of stories about polling data. Writers who don't understand margin of error, and its importance in interpreting scientific research, can easily embarrass themselves and their news organizations.

Check out the following story that moved in the summer of 1996 on a major news wire:

WASHINGTON (Reuter) - President Clinton, hit by bad publicity recently over FBI files and a derogatory book, has slipped against Bob Dole in a new poll released Monday but still maintains a 15 percentage point lead.

The CNN/USA Today/Gallup poll taken June 27-30 of 818 registered voters showed Clinton would beat his Republican challenger if the election were held now, 54 to 39 percent, with seven percent undecided. The poll had a margin of error of plus or minus four percentage points.

A similar poll June 18-19 had Clinton 57 to 38 percent over Dole.

Unfortunately for the readers of this story, it is wrong. There is no statistical basis for claiming that Clinton's lead over Dole has slipped.

Why? The margin of error. In this case, the CNN et al. poll had a margin of error of four percentage points. That means that if you repeated this poll 100 times, 95 of those times the percentage of people giving a particular answer would be within 4 points of the percentage who gave that same answer in this poll.

(WARNING: Math Geek Stuff!)
Why 95 times out of 100? In reality, the margin of error is what statisticians call a confidence interval. The math behind it is much like the math behind the standard deviation. So you can think of the margin of error at the 95 percent confidence interval as being equal to two standard deviations in your polling sample. Occasionally you will see surveys with a 99 percent confidence interval, which would correspond to 3 standard deviations and a much larger margin of error.
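For the truly curious, the textbook margin of error for a proportion makes the confidence level explicit; the 1/sqrt(n) shortcut from the Sample Sizes section is just the 95 percent case with the worst-case assumption that the true proportion is 50 percent. A sketch:

```python
import math

def moe(n, z=1.96):
    """Margin of error for a proportion; z = 1.96 for 95% confidence, 2.58 for 99%."""
    # Worst case p = 0.5 maximizes p*(1-p), hence the 0.25.
    return z * math.sqrt(0.25 / n)

n = 818  # sample size of the CNN/USA Today/Gallup poll quoted above
print(round(moe(n) * 100, 1))          # points of margin at 95% confidence
print(round(moe(n, z=2.58) * 100, 1))  # wider margin at 99% confidence
```

For 818 respondents this works out to roughly 3.4 points at 95 percent confidence; pollsters typically round that up, which is how the poll arrives at its stated "plus or minus four percentage points."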
(End of Math Geek Stuff!)

So let's look at this particular week's poll as a repeat of the previous week's (which it was). The percentage of people who say they support Clinton is within 4 points of the percentage who said they supported Clinton the previous week (54 percent this week to 57 last week). Same goes for Dole. So statistically, there is no change from the previous week's poll. Dole has made up no measurable ground on Clinton.

And reporting anything different is just plain wrong.

Don't overlook the fact that the margin of error is a 95 percent confidence interval, either. That means that for every 20 times you repeat this poll, statistics say that one time you'll get an answer that is completely off the wall.

You might remember that just after Dole resigned from the U.S. Senate, the CNN et al. poll had Clinton's lead down to six points. Reports attributed this surge by Dole to positive public reaction to his resignation. But the next week, Dole's surge was gone.

Perhaps there never was a surge. It very well could be that that week's poll was the one in 20 where the results lie outside the margin of error. Who knows? Just remember never to place too much faith in one week's poll or survey. No matter what you are writing about, only by looking at many surveys can you get an accurate look at what is going on.

Standard Deviation and Normal Distribution

I'll be honest. Standard deviation is a more difficult concept than the others we've covered. And unless you are writing for a specialized, professional audience, you'll probably never use the words "standard deviation" in a story. But that doesn't mean you should ignore this concept.

The standard deviation is kind of the "mean of the mean," and often can help you find the story behind the data. To understand this concept, it can help to learn about what statisticians call normal distribution of data.

A normal distribution of data means that most of the examples in a set of data are close to the "average," while relatively few examples tend to one extreme or the other.

Let's say you are writing a story about nutrition. You need to look at people's typical daily calorie consumption. Like most data, the numbers for people's typical consumption probably will turn out to be normally distributed. That is, for most people, their consumption will be close to the mean, while fewer people eat a lot more or a lot less than the mean.

When you think about it, that's just common sense. Not that many people are getting by on a single serving of kelp and rice. Or on eight meals of steak and milkshakes. Most people lie somewhere in between.

If you looked at normally distributed data on a graph, it would look something like this:

The x-axis (the horizontal one) is the value in question... calories consumed, dollars earned or crimes committed, for example. And the y-axis (the vertical one) is the number of datapoints for each value on the x-axis... in other words, the number of people who eat x calories, the number of households that earn x dollars, or the number of cities with x crimes committed.

Now, not all sets of data will have graphs that look this perfect. Some will have relatively flat curves, others will be pretty steep. Sometimes the mean will lean a little bit to one side or the other. But all normally distributed data will have something like this same "bell curve" shape.

The standard deviation is a statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are pretty tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, that tells you you have a relatively large standard deviation.

Computing the value of a standard deviation is complicated. But let me show you graphically what a standard deviation represents...

One standard deviation away from the mean in either direction on the horizontal axis (the red area on the above graph) accounts for somewhere around 68 percent of the people in this group. Two standard deviations away from the mean (the red and green areas) account for roughly 95 percent of the people. And three standard deviations (the red, green and blue areas) account for about 99.7 percent of the people.

If this curve were flatter and more spread out, the standard deviation would have to be larger in order to account for those 68 percent or so of the people. So that's why the standard deviation can tell you how spread out the examples in a set are from the mean.
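You can watch that 68-95-99.7 pattern emerge by simulating a normally distributed variable. The mean and spread below are arbitrary numbers chosen to echo the calorie example:

```python
import random
import statistics

# Simulate daily calorie counts as a normal distribution (made-up parameters),
# then count how many values land within 1, 2 and 3 standard deviations.
random.seed(42)
data = [random.gauss(2000, 250) for _ in range(100_000)]

mu = statistics.mean(data)
sd = statistics.pstdev(data)

for k in (1, 2, 3):
    share = sum(abs(x - mu) <= k * sd for x in data) / len(data)
    print(k, round(share, 3))
```

With a large simulated sample the three shares come out very close to 0.68, 0.95 and 0.997, matching the shaded areas described above.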

Why is this useful? Here's an example: If you are comparing test scores for different schools, the standard deviation will tell you how diverse the test scores are for each school.

Let's say Springfield Elementary has a higher mean test score than Shelbyville Elementary. Your first reaction might be to say that the kids at Springfield are smarter.

But a bigger standard deviation for one school tells you that there are relatively more kids at that school scoring toward one extreme or the other. By asking a few follow-up questions you might find that, say, Springfield's mean was skewed up because the school district sends all of the gifted education kids to Springfield. Or that Shelbyville's scores were dragged down because students who recently have been "mainstreamed" from special education classes have all been sent to Shelbyville.

In this way, looking at the standard deviation can help point you in the right direction when asking why data is the way it is.

The standard deviation can also help you evaluate the worth of all those so-called "studies" that seem to be released to the press everyday. A large standard deviation in a study that claims to show a relationship between eating Twinkies and killing politicians, for example, might tip you off that the study's claims aren't all that trustworthy.

Of course, you'll want to seek the advice of a trained statistician whenever you try to evaluate the worth of any scientific research. But if you know at least a little about standard deviation going in, that will make your interview much more productive.

Okay, because so many of you asked nicely...
Here is one formula for computing the standard deviation. A warning, this is for math geeks only!
Writers and others seeking only a basic understanding of stats don't need to read any more in this chapter. Remember, a decent calculator and stats program will calculate this for you...

Terms you'll need to know
x = one value in your set of data
avg (x) = the mean (average) of all values x in your set of data
n = the number of values x in your set of data

For each value x, subtract the overall avg (x) from x, then multiply that result by itself (otherwise known as determining the square of that value). Sum up all those squared values. Then divide that result by (n-1). Got it? Then, there's one more step... find the square root of that last number. That's the standard deviation of your set of data.

Now, remember how I told you this was one way of computing this? Sometimes, you divide by (n) instead of (n-1). The short version: (n-1) is used when your data are a sample and you want to estimate the spread of the whole population, while (n) is used when your data are the entire population. So don't try to go figuring out a standard deviation if you just learned about it on this page. Just be satisfied that you've now got a grasp on the basic concept.
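For the math geeks who'd rather read code than prose, here is the recipe as a short Python function, using the (n-1) divisor; the test scores are made up:

```python
import math

def sample_std_dev(values):
    """Squared deviations from the mean, summed, divided by (n-1), then square-rooted."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_devs) / (n - 1))

scores = [4, 8, 6, 5, 3, 7]  # made-up test scores
print(round(sample_std_dev(scores), 3))
```

In practice you'd let a stats program or Python's built-in statistics.stdev do this for you, but the function shows exactly where the number comes from.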

Simple Linear Correlation (Pearson r)

Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales (see Elementary Concepts), and it determines the extent to which values of the two variables are "proportional" to each other. The value of correlation (i.e., correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).

This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances will have important functional consequences on how the value of the correlation coefficient reacts to various specific arrangements of data (as we will later see).

How to Interpret the Values of Correlations

As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r2, the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary Concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions. There are, however, much more common and serious threats to the validity of information that a correlation coefficient can provide; they are briefly discussed in the following paragraphs.

Outliers. Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example. Note that, as shown in that illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).

Note that if the sample size is relatively small, then including or excluding specific data points that are not as clearly "outliers" as the one shown in the previous example may have a profound influence on the regression line (and the correlation coefficient). This is illustrated in the following example where we call the points being excluded "outliers;" one may argue, however, that they are not outliers but rather extreme values.

Typically, we believe that outliers represent a random error that we would like to be able to control. Unfortunately, there is no widely accepted method to remove outliers automatically (however, see the next paragraph), thus what we are left with is to identify any outliers by examining a scatterplot of each important correlation. Needless to say, outliers may not only artificially increase the value of a correlation coefficient, but they can also decrease the value of a "legitimate" correlation.

What is Correlation?

Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

The most widely-used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.


Descriptive Statistics

"True" Mean and Confidence Interval. Probably the most often used descriptive statistic is the mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. As mentioned earlier, usually we are interested in statistics (such as the mean) from our sample only to the extent that they allow us to infer information about the population. The confidence intervals for the mean give us a range of values around the mean where we expect the "true" (population) mean is located (with a given level of certainty, see also Elementary Concepts). For example, if the mean in your sample is 23, and the lower and upper limits of the p=.05 confidence interval are 19 and 27 respectively, then you can conclude that there is a 95% probability that the population mean is greater than 19 and lower than 27. If you set the p-level to a smaller value, then the interval would become wider, thereby increasing the "certainty" of the estimate, and vice versa; as we all know from the weather forecast, the more "vague" the prediction (i.e., the wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. The larger the sample size, the more reliable its mean. The larger the variation, the less reliable the mean (see also Elementary Concepts). The calculation of confidence intervals is based on the assumption that the variable is normally distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large, say n=100 or more.
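
A minimal sketch of the calculation, using the normal approximation appropriate for larger samples (for small n, the t distribution is more exact). The helper name `mean_ci` is our own invention, not a library function:

```python
# Sketch: normal-approximation confidence interval for the mean.
# Assumes a reasonably large sample; `mean_ci` is a hypothetical helper.
import math
from statistics import NormalDist, mean, stdev

def mean_ci(data, confidence=0.95):
    m = mean(data)
    se = stdev(data) / math.sqrt(len(data))         # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    return m - z * se, m + z * se

lo, hi = mean_ci(list(range(100)))
print(round(lo, 2), round(hi, 2))  # interval is centered on the sample mean
```

Raising `confidence` widens the interval, and increasing the sample size shrinks the standard error, which narrows it — exactly the trade-offs described above.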

Shape of the Distribution, Normality. An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution (see the animation below for an example of this distribution) (see also Elementary Concepts). Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0.
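
Both statistics can be computed from the central moments of the data. The sketch below is a from-scratch illustration (the function name is ours); it subtracts 3 from the raw kurtosis so that a perfectly normal distribution scores 0 on both measures, matching the convention used above:

```python
# Sketch: moment-based skewness and excess kurtosis.
# Both are approximately 0 for a normally distributed variable.
def skew_kurtosis(data):
    n = len(data)
    m = sum(data) / n
    devs = [x - m for x in data]
    m2 = sum(d ** 2 for d in devs) / n   # variance (second moment)
    m3 = sum(d ** 3 for d in devs) / n   # third moment: asymmetry
    m4 = sum(d ** 4 for d in devs) / n   # fourth moment: peakedness
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # subtract 3 so normal -> 0
    return skew, excess_kurtosis

# A symmetric sample has zero skewness; this flat one is also
# less peaked than a normal curve, so its excess kurtosis is negative.
print(skew_kurtosis([1, 2, 3, 4, 5]))
```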

More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilk W test). However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable).

The graph allows you to evaluate the normality of the empirical distribution because it also shows the normal curve superimposed over the histogram. It also allows you to examine various aspects of the distribution qualitatively. For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such cases, in order to understand the nature of the variable in question, you should look for a way to quantitatively identify the two sub-samples.

[Basic Statistics] Data Analysis

You wouldn't buy a car or a house without asking some questions about it first. So don't go buying into someone else's data without asking questions, either.

Okay, you're saying... but with data there are no tires to kick, no doors to slam, no basement walls to check for water damage. Just numbers, graphs and other scary statistical things that are causing you to have bad flashbacks to your last income tax return. What the heck can you ask about data?

Plenty. Here are a few standard questions you should ask any human beings who slap a pile of data in front of you and ask you to write about it.

  1. Where did the data come from? Always ask this one first. You always want to know who did the research that created the data you're going to write about.

    You'd be surprised - sometimes it turns out that the person who is feeding you a bunch of numbers can't tell you where they came from. That should be your first hint that you need to be very skeptical about what you are being told.

    Even if your data have an identifiable source, you still want to know what it is. You might have some extra questions to ask about a medical study on the effects of secondhand smoking if you knew it came from a bunch of researchers employed by a tobacco company instead of from, say, a team of research physicians from a major medical school. Or if you knew a study about water safety came from a political interest group that had been lobbying Congress for a ban on pesticides.

    Just because a report comes from a group with a vested interest in its results doesn't guarantee the report is a sham. But you should always be extra skeptical when looking at research generated by people with a political agenda. At the least, they have plenty of incentive NOT to tell you about data they found that contradict their organization's position.

    Which brings us to the next question:

  2. Have the data been peer-reviewed? Major studies that appear in journals like the New England Journal of Medicine undergo a process called "peer review" before they are published. That means that professionals - doctors, statisticians, etc. - have looked at the study before it was published and concluded that the study's authors pretty much followed the rules of good scientific research and didn't torture their data like a middle ages infidel to make the numbers conform to their conclusions.

    Always ask if research was formally peer reviewed. If it was, you know that the data you'll be looking at are at least minimally reliable.

    And if it wasn't peer-reviewed, ask why. It may be that the research just wasn't interesting to enough people to warrant peer review. Or it could mean that the research had as much chance of standing up to professional scrutiny as a $500 mobile home has of standing up in a tornado.

  3. How were the data collected? This one is really important to ask, especially if the data were not peer-reviewed. If the data come from a survey, for example, you want to know that the people who responded to the survey were selected at random.

    In 1997, the Orlando Sentinel released the results of a poll in which more than 90 percent of those people who responded said that Orlando's National Basketball Association team, the Orlando Magic, shouldn't re-sign its center, Shaquille O'Neal, for the amount of money he was asking. The results of that poll were widely reported as evidence that Shaq wasn't wanted in Orlando, and in fact, O'Neal signed with the Los Angeles Lakers a few days later.

    Unfortunately for Magic fans, that poll was about as trustworthy as one of those cheesy old "Magic 8 Balls." The survey was a call-in poll where anyone who wanted could call a telephone number at the paper and register his or her vote.

    This is what statisticians call a "self-selected sample." For all we know, two or three people who got laid off that morning and were ticked off at the idea of someone earning $100 million to play basketball could have flooded the Sentinel's phone lines, making it appear as though the people of Orlando despised Shaq.

    Another problem with data is "cherry-picking." This is the social-science equivalent of gerrymandering, where you draw up a legislative district so that all the people who are going to vote for your candidate are included in your district and everyone else is scattered among a bunch of other districts.

    Be on the lookout for cherry-picking, for example, in epidemiological (a fancy word for the study of disease that sometimes means: "We didn't go out and collect any data ourselves. We just used someone else's data and played 'connect the dots' with them in an attempt to find something interesting.") studies looking at illnesses in areas surrounding toxic-waste dumps, power lines, high school cafeterias, etc. It is all too easy for a lazy researcher to draw the boundaries of the area he or she is looking at to include several extra cases of the illness in question and exclude many healthy individuals in the same area.

    When in doubt, plot the subjects of a study on a map and look for yourself to see if the boundaries make sense.

  4. Be skeptical when dealing with comparisons. Researchers like to do something called a "regression," a process that compares one thing to another to see if they are statistically related. They will call such a relationship a "correlation." Always remember that a correlation DOES NOT mean causation.

    A study might find that an increase in the local birth rate was correlated with the annual migration of storks over the town. This does not mean that the storks brought the babies. Or that the babies brought the storks.

    Statisticians call this sort of thing a "spurious correlation," which is a fancy term for "total coincidence."

    People who want something from others often use regression studies to try to support their cause. They'll say something along the lines of "a study shows that a new police policy that we want led to a 20 percent drop in crime over a 10-year period in (some city)."

    That might be true, but the drop in crime could be due to something other than that new policy. What if, say, the average age of the city's residents increased significantly over that 10-year period? Since crime is believed to be age-dependent (meaning the more young men you have in an area, the more crime you have), the aging of the population could potentially be the cause of the drop in crime.

    The policy change and the drop in crime might have been correlated. But that does not mean that one caused the other.

  5. Finally, be aware of numbers taken out of context. Again, data that are "cherry picked" to look interesting might mean something else entirely once they are placed in a different context.

    Consider the following example from Eric Meyer, a professional reporter now working at the University of Illinois:

    My personal favorite was a habit we used to have years ago, when I was working in Milwaukee. Whenever it snowed heavily, we'd call the sheriff's office, which was responsible for patrolling the freeways, and ask how many fender-benders had been reported that day. Inevitably, we'd have a lede that said something like, "A fierce winter storm dumped 8 inches of snow on Milwaukee, snarled rush-hour traffic and caused 28 fender-benders on county freeways" -- until one day I dared to ask the sheriff's department how many fender-benders were reported on clear, sunny days. The answer -- 48 -- made me wonder whether in the future we'd run stories saying, "A fierce winter snowstorm prevented 20 fender-benders on county freeways today." There may or may not have been more accidents per mile traveled in the snow, but clearly there were fewer accidents when it snowed than when it did not.

It is easy for people to go into brain-lock when they see a stack of papers loaded with numbers, spreadsheets and graphs. (And some sleazy sources are counting on it.) But your readers are depending upon you to make sense of that data for them.

Use what you've learned on this page to look at data with a more critical attitude. (That's critical, not cynical. There is a great deal of excellent data out there.) The worst thing you can do as a writer is to pass along someone else's word about data without any idea whether that person's worth believing or not.

Process Sigma Calculator Assumptions

Understanding The Basic And Advanced Modes
The Basic Mode of the Sigma Calculator automatically adds a 1.5 Sigma shift to the process Sigma value that is calculated. Why is this done? It's done because it is the "standard" way that Sigma is reported (note: this may be different in your company, but it is done in this manner by Motorola, GE and many other companies). By doing so, the calculator result assumes that you are providing long-term data and it is providing short-term Sigma. The 1.5 Sigma shift is based on the assumption that over time, and with a sufficiently large number of samples, a realistic Sigma value is 1.5 Sigma less than that calculated to show the success of your project (i.e. that shown in this calculator and in reports to your company).

If you want to calculate the process Sigma using data other than long-term, you should switch to the Advanced Mode where you can change the process Sigma shift value from 1.5 to whatever you feel is appropriate.

Here are a couple of examples to help illustrate the calculations. A long-term 93% yield (e.g. 100 opportunities, 7 defects) equates to a process Sigma long-term value of 1.48 (with no Sigma shift) or a process Sigma short-term value of 2.98 (with a 1.5 Sigma shift). A long-term 99.7% yield (e.g. 1,000 opportunities, 3 defects) equates to a process Sigma long-term value of 2.75 (with no Sigma shift) or a process Sigma short-term value of 4.25 (with the 1.5 sigma shift).

Final Thought: When we talk about a Six Sigma process, we are referring to the process short-term (now). When we talk about the DPMO of the process, we are referring to long-term (the future). We state 3.4 defects per million opportunities as our goal. This means that we must have a 6 sigma process now in order to have, in the future (after the 1.5 sigma shift), a 4.5 sigma process -- which produces 3.4 defects per million opportunities.

Notice: Sigma with a capital "S" is used above to denote the process Sigma, which is different than the typical statistical reference to sigma with a small "s" which denotes the standard deviation.

Understanding The Formula
Defects Per Million Opportunities (DPMO) = ((Total Defects) / (Total Opportunities)) * 1,000,000
Defects (%) = ((Total Defects) / (Total Opportunities)) * 100
Yield (%) = 100 - (Defects Percentage)
process Sigma = NORMSINV(1-((Total Defects) / (Total Opportunities))) + 1.5

Alternatively,
process Sigma = 0.8406 + SQRT(29.37 - 2.221 * (ln(DPMO))).
Reference: Breyfogle, F., 1999. Implementing Six Sigma: Smarter Solutions Using Statistical Methods. 2nd ed. John Wiley & Sons.
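
The formulas above translate directly into code. The sketch below reimplements both the NORMSINV version (using the standard library's `NormalDist().inv_cdf`, the Python equivalent of Excel's NORMSINV) and Breyfogle's approximation; the function names are our own. Note that the 1.5 shift is an explicit term in the first formula but is already folded into the constants of the second:

```python
# Sketch: process Sigma from defect counts, two ways.
# Function names are illustrative, not from any library.
import math
from statistics import NormalDist

def process_sigma(defects, opportunities, shift=1.5):
    """Short-term Sigma from long-term data (Basic Mode default shift).
    Set shift=0 to reproduce the Advanced Mode with no shift."""
    long_term_yield = 1 - defects / opportunities
    return NormalDist().inv_cdf(long_term_yield) + shift

def process_sigma_approx(dpmo):
    """Breyfogle's closed-form approximation; the 1.5 shift is
    already built into the constants."""
    return 0.8406 + math.sqrt(29.37 - 2.221 * math.log(dpmo))

# The article's first example: 100 opportunities, 7 defects (93% yield)
print(round(process_sigma(7, 100), 2))        # short-term Sigma, about 2.98
print(round(process_sigma_approx(70_000), 2))  # same answer via the approximation
```

Both functions also recover the headline Six Sigma figure: 3.4 defects per million opportunities comes out at a short-term Sigma of about 6.0.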

Understanding Negative Sigma
Sigma value is simply a modified Z score (Table of the Standard Normal (z) Distribution). Sigma (with a capital "S") is not the same thing as the standard deviation of a process, referred to as sigma (with a lower case "s", or the Greek letter σ). Consequently, it is quite possible to get a negative sigma value. A negative sigma value means that most of your product or service (process) is completely outside your customer's specification range. A full discussion can be found on negative Sigma.

1.5 Sigma Process Shift Explanation

iSixSigma recently released a process sigma calculator which allows the operator to input process opportunities and defects and easily calculate the process sigma to determine how close (or far) a process is from 6 sigma. One of the caveats written in fine print refers to the calculator using a default process shift of 1.5 sigma. In an earlier poll, more than 50% of the quality professionals surveyed indicated that they were not aware of why a process may shift 1.5 sigma. My goal is to explain it here.

I'm not going to bore you with the hard core statistics. There's a whole statistical section dealing with this issue, and every green, black and master black belt learns the calculation process in class. If you didn't go to class (or you forgot!), the table of the standard normal distribution is used in calculating the process sigma. Most of these tables, however, end at a z value of about 3 (see the iSixSigma table for an example). In 1992, Motorola published a book (see chapter 6) entitled Six Sigma Producibility Analysis and Process Characterization, written by Mikel J. Harry and J. Ronald Lawson. In it is one of the only tables showing the standard normal distribution out to a z value of 6.

Using this table you'll find that 6 sigma actually translates to about 2 defects per billion opportunities, and 3.4 defects per million opportunities, which we normally define as 6 sigma, really corresponds to a sigma value of 4.5. Where does this 1.5 sigma difference come from? Motorola has determined, through years of process and data collection, that processes vary and drift over time - what they call the Long-Term Dynamic Mean Variation. This variation typically falls between 1.4 and 1.6 sigma.

After a process has been improved using the Six Sigma DMAIC methodology, we calculate the process standard deviation and sigma value. These are considered to be short-term values because the data only contains common cause variation -- DMAIC projects and the associated collection of process data occur over a period of months, rather than years. Long-term data, on the other hand, contains common cause variation and special (or assignable) cause variation. Because short-term data does not contain this special cause variation, it will typically be of a higher process capability than the long-term data. This difference is the 1.5 sigma shift. Given adequate process data, you can determine the factor most appropriate for your process.

In Six Sigma, The Breakthrough Management Strategy Revolutionizing The World's Top Corporations, Harry and Schroeder write:

"By offsetting normal distribution by a 1.5 standard deviation on either side, the adjustment takes into account what happens to every process over many cycles of manufacturing... Simply put, accommodating shift and drift is our 'fudge factor,' or a way to allow for unexpected errors or movement over time. Using 1.5 sigma as a standard deviation gives us a strong advantage in improving quality not only in industrial process and designs, but in commercial processes as well. It allows us to design products and services that are relatively impervious, or 'robust,' to natural, unavoidable sources of variation in processes, components, and materials."

Statistical Take Away: The reporting convention of Six Sigma requires the process capability to be reported in short-term sigma -- without the presence of special cause variation. Long-term sigma is determined by subtracting 1.5 sigma from our short-term sigma calculation to account for the process shift that is known to occur over time.

Thursday, April 26, 2007

Introduction of Statistics

Population vs Sample

The population includes all objects of interest whereas the sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (mu, sigma) while statistics are usually denoted using Roman letters (x, s).

There are several reasons why we don't work with populations. They are usually large, and it is often impossible to get data for every object we're studying. Sampling does not usually occur without cost, and the more items surveyed, the larger the cost.

We compute statistics, and use them to estimate parameters. The computation is the first part of the statistics course (Descriptive Statistics) and the estimation is the second part (Inferential Statistics).

Discrete vs Continuous

Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room.

Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. Since continuous variables are real numbers, we usually round them. This implies a boundary depending on the number of decimal places. For example: 64 is really anything 63.5 <= x < 64.5.

Levels of Measurement

There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. Data is classified according to the highest level into which it fits. Each additional level adds something the previous level didn't have.

  • Nominal is the lowest level. Only names are meaningful here.
  • Ordinal adds an order to the names.
  • Interval adds meaningful differences, but there is no meaningful zero point.
  • Ratio adds a zero so that ratios are meaningful.

Types of Sampling

There are five types of sampling: Random, Systematic, Convenience, Cluster, and Stratified.

  • Random sampling is analogous to putting everyone's name into a hat and drawing out several names. Each element in the population has an equal chance of being selected. While this is the preferred way of sampling, it is often difficult to do. It requires that a complete list of every element in the population be obtained. Computer-generated lists are often used with random sampling. You can generate random numbers using the TI-82 calculator.
  • Systematic sampling is easier to do than random sampling. In systematic sampling, the list of elements is "counted off". That is, every kth element is taken. This is similar to lining everyone up and numbering off "1,2,3,4; 1,2,3,4; etc". When done numbering, all people numbered 4 would be used.
  • Convenience sampling is very easy to do, but it's probably the worst technique to use. In convenience sampling, readily available data is used. That is, the first people the surveyor runs into.
  • Cluster sampling is accomplished by dividing the population into groups -- usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and each element in the selected clusters is used.
  • Stratified sampling also divides the population into groups called strata. However, this time it is by some characteristic, not geographically. For instance, the population might be separated into males and females. A sample is taken from each of these strata using either random, systematic, or convenience sampling.
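
The first, second, and last techniques above can be sketched with the standard library's `random` module. The population and strata below are made up purely for illustration (a numbered population of 100, split by parity as a stand-in for a real characteristic such as sex):

```python
# Sketch of random, systematic, and stratified sampling.
# The population and strata are illustrative, not real data.
import random

population = list(range(1, 101))   # 100 numbered elements

# Random sampling: every element has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every kth element after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split by a characteristic, then sample each stratum.
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = {name: random.sample(group, 5) for name, group in strata.items()}

print(sorted(simple))
print(systematic)
print(stratified)
```

Note what systematic sampling gives up: if the list has a hidden periodic pattern that lines up with k, the sample can be badly biased, which is one reason random sampling remains the preferred technique.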