Anyone who uses Six Sigma also applies the Central Limit Theorem (CLT for short), perhaps without knowing it. The CLT is a fundamental theorem of probability, and it is also widely used in statistics: it lets us estimate the population mean from sample means together with a confidence interval, it justifies the wide use of the normal distribution, and it underpins hypothesis testing, ANOVA, DoE, regression analysis, etc.
Let us see how it works in practice. By default, Fig. 1 shows a trimodal (clearly non-normal) distribution for a population of 30,000 elements. You can change the number of intervals and also the population data (using the form below Fig. 2). The histogram in Fig. 2 can take a different form each time, because new random samples are drawn from the fixed population every time the page is refreshed.
Each rectangle that appears is a new sample drawn from the population. The position of the rectangle is defined by the mean value of this sample. From the sample means the program builds the histogram shown in Fig. 2. You can set the sample size (n, how many items are in each sample), the number of samples (k) and the number of intervals in the histogram. For example, if you set the sample size to 16 and the number of samples to 100, the program will show the data of a sample of 16 items drawn from the population of 30,000 items, calculate its mean, add one rectangle to the histogram of means in Fig. 2, and then repeat this 100 times. The larger the sample size, the closer the distribution of means is to the normal distribution; the more samples, the more smoothly the histogram shows that shape. However, even with fixed settings, the histogram of means will have a different shape after every refresh, because different data will be drawn.
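The procedure described above can be sketched in a few lines of Python. This is only a rough illustration, not the program behind the page: the trimodal population is a hypothetical stand-in (three normal bumps at assumed positions 20, 50 and 80), and the function names are made up for this example.

```python
import random
import statistics


def make_trimodal_population(size=30_000, seed=0):
    """Hypothetical stand-in for the default data set: a mix of three normal bumps."""
    rng = random.Random(seed)
    centers = [20.0, 50.0, 80.0]  # assumed peak positions, not the page's real data
    return [rng.gauss(rng.choice(centers), 4.0) for _ in range(size)]


def sample_means(population, n=16, k=100, seed=1):
    """Draw k random samples of n items each and return the mean of every sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(k)]


def text_histogram(values, bins=10, width=40):
    """Count values per interval and render each interval as a row of '#' marks."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / step), bins - 1)  # the maximum goes into the last bin
        counts[idx] += 1
    peak = max(counts)
    lines = [
        f"{lo + i * step:8.2f} | {'#' * round(width * c / peak)} {c}"
        for i, c in enumerate(counts)
    ]
    return counts, lines


population = make_trimodal_population()
means = sample_means(population, n=16, k=100)
counts, lines = text_histogram(means, bins=10)

print("\n".join(lines))
print("population mean:     ", round(statistics.fmean(population), 2))
print("mean of sample means:", round(statistics.fmean(means), 2))
```

Changing the `seed` passed to `sample_means` plays the role of refreshing the page: the population stays fixed, but a different set of samples is drawn and the histogram changes shape.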
I encourage you to do multiple trials to see what actually happens with the histogram of the distribution of means under different parameters and for different populations.
An example of population data is available in XLSX format. The file contains 4 data sets, each with 30,000 rows: 1) a trimodal distribution (the default data set), 2) a chi-square distribution, 3) a uniform distribution and 4) a normal distribution. To change the population data, copy the data from the Excel file and paste it into the form field "Data in txt format" under Fig. 2. Please be aware that the program expects a dot as the decimal separator.
Histograms for population data can also be made in Minitab or in later versions of Excel. Anyone using older versions of Excel can use my Excel macro file to create histograms. The result of applying this macro to the presented population data is visible here.
Feeding you statistical equations is the last thing I want to do, but I have to show some specifics. The Central Limit Theorem says that regardless of the distribution of data in the population (normal, right- or left-skewed, chi-square, uniform, any kind with a finite standard deviation), the distribution of the sample means tends to a normal distribution as the sample size grows. This is the first piece of good news, because we know a great deal about the normal distribution. The second piece of good news is that if σ is the standard deviation of the population, the standard deviation s of the distribution of the means of samples of size n from this population equals σ / √n (the larger the n, the smaller s).
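The σ / √n relationship is easy to check numerically. The sketch below is an illustration under assumptions: the population is a made-up uniform data set of 30,000 values, and the helper name `sd_of_sample_means` is invented for this example.

```python
import math
import random
import statistics

rng = random.Random(42)

# Assumed population: 30,000 uniform values (one of the shapes mentioned above).
population = [rng.uniform(0.0, 100.0) for _ in range(30_000)]
sigma = statistics.pstdev(population)  # standard deviation of the whole population


def sd_of_sample_means(n, k=2000):
    """Draw k samples of size n and return the standard deviation of their means."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(k)]
    return statistics.stdev(means)


results = {}
for n in (4, 16, 64):
    observed = sd_of_sample_means(n)
    predicted = sigma / math.sqrt(n)
    results[n] = (observed, predicted)
    print(f"n={n:3d}  observed s={observed:.3f}  sigma/sqrt(n)={predicted:.3f}")
```

Each fourfold increase in n should roughly halve s, and the observed and predicted values should agree to within sampling noise.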
If we have a random variable X with any distribution (equation (1)), then the distribution of a new random variable X̄, whose values are the means of samples of size n, tends to the normal distribution (equation (2)). The standard deviation of the distribution of sample means is given by equation (3).
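The three relationships just described can be written out as below. This is a reconstruction from the surrounding definitions, so the notation is an assumption: D stands for an arbitrary distribution with mean μ and standard deviation σ, the normal distribution is parameterized by its mean and variance, and the symbol with a bar denotes the random variable of sample means.

```latex
\begin{align}
X &\sim D(\mu, \sigma^2) \tag{1} \\
\bar{X} &\overset{\text{approx.}}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n \tag{2} \\
s &= \frac{\sigma}{\sqrt{n}} \tag{3}
\end{align}
```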
The random variable is nothing but our measurements, for example, the height of every adult in a given country. Probably nobody would want to measure the entire population, but plenty of people would like to estimate its mean, for example clothing companies (in the end they need to know which sizes will be most in demand: those nearest to the mean).
The time has come for practical applications. The histogram of means (see Fig. 2) shows the sample means. Some of them are closer to the population mean, which is marked by a green line, and others are farther away. The question is: how good an estimator of the population mean is the mean of a single selected sample?
Thanks to the properties of the normal distribution, we know that about 95% of all data lies within +/- 2 s (where s is the standard deviation of the sample means; to be precise, within +/- 1.96 s). If I add to each mean a range of +/- 2 s (let us call it the confidence interval for fun), which in Fig. 3 is marked with a black horizontal segment, then it turns out that for about 95% of the means this interval contains the population mean. For example, if the mean of sample number 1 is m1, then the interval [m1 - 2s, m1 + 2s] most likely contains the population mean.
Statistically, we can put it as follows: based on the mean mi of any single n-element sample, we assume with 95% confidence that the population mean lies within the range [mi - 2s, mi + 2s]. For a sufficiently large sample size, this claim holds approximately regardless of the type of population distribution. Warning! There is a 5% probability that the population mean is outside this range (the 95% confidence interval).
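How often the interval really captures the population mean can be counted directly. The following sketch is only an illustration under assumptions: the population is a made-up skewed (exponential) data set, σ is treated as known, and n = 16 and k = 1000 are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(7)

# Assumed skewed population (exponential), standing in for "any" distribution.
population = [rng.expovariate(1.0) for _ in range(30_000)]
mu = statistics.fmean(population)      # population mean (the green line)
sigma = statistics.pstdev(population)  # population standard deviation, assumed known

n, k = 16, 1000
s = sigma / math.sqrt(n)  # standard deviation of the sample means

hits = 0
for _ in range(k):
    m = statistics.fmean(rng.sample(population, n))
    if m - 2 * s <= mu <= m + 2 * s:  # does [m - 2s, m + 2s] contain the population mean?
        hits += 1

coverage = hits / k
print(f"{hits} of {k} intervals contain the population mean ({coverage:.1%})")
```

The share of intervals that contain the population mean should come out close to 95%, with the remaining few percent being the samples whose interval misses it.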
There is also the issue of knowing the standard deviation of the population. If it is not known, we make the far-reaching assumption that the standard deviation calculated from the sample is a good estimator of it (between us, this assumption is not always valid). Another source of serious errors may be the sampling method. If, for example, all samples were taken only from the area represented by the left part of the histogram in Fig. 1, there would be no chance that the sample mean would be a good estimator of the population mean. We can call this the problem of a representative sample.
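The effect of an unrepresentative sample can be shown with a small simulation. This is a sketch under assumptions: the trimodal population is hypothetical (peaks assumed at 20, 50 and 80), and "the left part of the histogram" is modelled as the lowest third of the values.

```python
import random
import statistics

rng = random.Random(3)

# Assumed trimodal population, not the page's real data set.
centers = [20.0, 50.0, 80.0]
population = [rng.gauss(rng.choice(centers), 4.0) for _ in range(30_000)]
mu = statistics.fmean(population)

# Biased sampling frame: only the lowest third of the values can be reached.
left_part = sorted(population)[: len(population) // 3]

n, k = 16, 200
fair_means = [statistics.fmean(rng.sample(population, n)) for _ in range(k)]
biased_means = [statistics.fmean(rng.sample(left_part, n)) for _ in range(k)]

fair_error = abs(statistics.fmean(fair_means) - mu)
biased_error = abs(statistics.fmean(biased_means) - mu)

print(f"population mean:            {mu:.2f}")
print(f"error with fair sampling:   {fair_error:.2f}")
print(f"error with biased sampling: {biased_error:.2f}")
```

Taking more or larger samples does not repair the biased case: the means cluster tightly, but around the wrong value.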
Author: Adam Cetera (LeanSigma.pl)
Creation date: 2018-09-10
Modification date: 2021-10-01