Gaussian (Normal) Distributions: For populations we often assume a Gaussian distribution; that is, we assume that values are distributed randomly around a central value, as in the Gaussian curve below.

Because of its wide occurrence, familiarity, and mathematical exactness, other distributions are often transformed to give a Gaussian shape. Thus a log transformation will often convert a skewed distribution into a Gaussian. One should exercise caution, however: the Gaussian is seductive because of its convenience, but it is not always followed. If at all in doubt, one should test a distribution to be certain it is in fact Gaussian. Linear transformations are very useful in this respect. Thus one may plot data on "probability paper," or plot frequency in Probit units vs. log value. In each case deviation from linearity indicates a non-Gaussian distribution, and this deviation can in turn be analyzed statistically.
As it turns out, the Gaussian curve can be described by an equation containing just two parameters: the population mean (mu, \mu), a measure of centrality, and the population standard deviation (sigma, \sigma), a measure of precision or spread of the data:

y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}
Note the exponential form of this equation, accounting for the rapid decay of frequency on either side of the mean.
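As a quick sketch (standard library only), the Gaussian equation can be evaluated directly; the particular values used here are arbitrary illustrations:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Frequency of occurrence y for value x in a Gaussian population
    with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at the mean and decays rapidly (exponentially in the
# squared deviation) on either side:
peak = gaussian_pdf(0.0, 0.0, 1.0)           # 1/sqrt(2*pi), about 0.399
one_sigma_out = gaussian_pdf(1.0, 0.0, 1.0)  # smaller, and symmetric with x = -1
```

Note the symmetry about the mean: the same frequency is obtained at equal distances above and below \mu.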
[Note the meaning of parameter in statistics, which is a number that describes a population, such as the mean. For any given population it will be a constant. (For samples of a population the mean will be variable and thus is not considered a parameter, rather it is now a statistic.) In math, physics, etc. parameter has a somewhat different, but related meaning: it is a variable that is held constant for a particular calculation etc.]
So what is a population vs. a sample? Let's look at the Gaussian distribution for populations and for samples.
Population Mean vs. Sample Mean: in this case both are represented by the same mathematical equation, but with differing symbols.
For a population:

\mu = \frac{\sum_{i=1}^{N} x_i}{N}

where N is the total number of members of the population.
Typical Gaussian curves for populations are shown above for different values of sigma and mu.
For a sample:

\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}

where N is now the number of measurements in the sample.
Gaussian curves for large samples will closely approximate the population curves, while small samples will be "noisy" (not smooth).
(Because the Gaussian distribution curve is so common and important, it is often worth checking whether data are in fact Gaussian. Use probability paper, which plots percentile vs. standard deviation; Gaussian data give a straight-line plot. We won't worry about what to do if the data are not Gaussian; suffice it to say there are methods to deal with other distributions if needed.)
Let's look at the distribution of randomly arrayed data (following the Gaussian curve).
Population Standard Deviation. For measurement, this is a measure of the precision of the population of data:

\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
Two curves demonstrating differing precision are shown in Fig. 3.4 of your text (Skoog, West & Holler, 7th ed., pg. 26) {Overhead}. Note that you can "normalize" this distribution, as shown in the same figure, by plotting in terms of the standard deviation. (Probit units give essentially the same plot.)
For the Sample Standard Deviation a correction must be made, replacing the population number N with N - 1, the number of degrees of freedom. This is necessary because we are no longer dealing with a completely defined system, so we must take our measurements relative to something (one of the members of the set), and thus we divide by one fewer member. In other words, N - 1 represents the number of independent data in the sample set. The sample equation also differs because we no longer know the true mean, so we substitute the sample mean as the best estimate:

s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}
Note that sample statistics approach population parameters as N approaches infinity.
Sometimes it is useful to estimate the standard deviation of the distribution of sample means taken from a single population. For this purpose we can use the Standard Error of the Mean, estimated from the standard deviation of our sample:

s_{\bar{x}} = \frac{s}{\sqrt{N}}
This is an estimate of the standard deviation of x-bar we would see if we took a series of samples of size N from a single population.
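The sample statistics above can be sketched in a few lines of Python; the replicate measurements used below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

def sample_stats(data):
    """Sample mean, sample standard deviation (using N - 1 degrees of
    freedom), and standard error of the mean for a list of values."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    sem = s / math.sqrt(n)
    return mean, s, sem

# Hypothetical replicate measurements (illustration only):
mean, s, sem = sample_stats([10.1, 10.3, 9.8, 10.2, 10.0])
```

Note that dividing by N - 1 rather than N makes s slightly larger than the population formula would give, compensating for the use of \bar{x} in place of the unknown \mu.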
Standard deviations from different sets of data from the same population can be pooled to give a better estimate of s. For t sets of data:

s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{N_2} (x_j - \bar{x}_2)^2 + \cdots}{N_1 + N_2 + \cdots + N_t - t}}

Note that t, the number of data sets in the pool, is subtracted from the total number of measurements because one degree of freedom is lost for each set, giving the pooled degrees of freedom.
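A minimal sketch of the pooling calculation (the data sets are hypothetical):

```python
import math

def pooled_std(datasets):
    """Pooled sample standard deviation for several data sets drawn from
    the same population: squared deviations from each set's own mean are
    summed, then divided by (total measurements - number of sets)."""
    t = len(datasets)
    ss = 0.0
    n_total = 0
    for data in datasets:
        mean = sum(data) / len(data)
        ss += sum((x - mean) ** 2 for x in data)
        n_total += len(data)
    return math.sqrt(ss / (n_total - t))

# Two hypothetical sets of replicates from the same population:
s_pool = pooled_std([[10.1, 10.3, 9.8], [10.0, 10.2, 10.4, 9.9]])
```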

The Variance is the square of the standard deviation, s^2. This has the advantage that variances (unlike standard deviations) may be added or subtracted.

The Relative Standard Deviation (RSD) expresses the precision relative to the mean:

RSD = \frac{s}{\bar{x}}
Expressed in %, the RSD has a special name, the Coefficient of Variation:

CV = \frac{s}{\bar{x}} \times 100\%
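As a one-line sketch (the s and mean values plugged in are hypothetical):

```python
def coefficient_of_variation(s, mean):
    """Relative standard deviation expressed in percent (the CV)."""
    return 100.0 * s / mean

# Hypothetical s = 0.19 on a mean of 10.08 gives a CV just under 2%:
cv = coefficient_of_variation(0.19, 10.08)
```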
We need to look at the distribution of samples in a population. Different samples will generally have different means and standard deviations! However, there will be less variance among the sample means than among the individual population values. We can consider each sample mean to be an estimate of the population mean. And remember, what we want is the population mean and standard deviation. Thus we want to know how closely our samples reflect the population, and thus how confident we can be of our data.
Note this applies only in absence of bias and only if we can assume that s is approximately sigma.
The Confidence Limits (CL) for \mu from a single measurement x are:

CL for \mu = x \pm z\sigma

Note that for 95% confidence z \approx 2 (more exactly 1.96); for 99.7% confidence z = 3; etc. For N observations, substitute the standard error of the mean for the standard deviation:

CL for \mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}}
Note the rapid approach to the true limits (as 1/\sqrt{N}).
Confidence interval when the standard deviation is not known: use Student's t, defined as for z but substituting s for \sigma:

t = \frac{x - \mu}{s}
t values are tabulated in stats books etc.
Confidence limits are then:

CL for \mu = \bar{x} \pm \frac{ts}{\sqrt{N}}
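Both forms of confidence limit share the same shape, differing only in the factor used. This sketch computes the interval; the mean, spread, and N values are hypothetical, and the t value of 2.78 (95% confidence, 4 degrees of freedom) is taken from a standard t table:

```python
import math

def confidence_limits(mean, spread, n, factor):
    """Confidence limits for the population mean:
    mean +/- factor * spread / sqrt(n).
    Use factor = z with the population sigma as the spread (sigma known),
    or factor = t (from a t table, with N - 1 degrees of freedom) with
    the sample s as the spread (sigma unknown)."""
    half_width = factor * spread / math.sqrt(n)
    return mean - half_width, mean + half_width

# sigma known: z = 1.96 for 95% confidence (~2, as in the notes):
lo_z, hi_z = confidence_limits(10.08, 0.20, 5, 1.96)
# sigma unknown, small sample: t = 2.78 for 95% confidence, 4 degrees
# of freedom -- note the wider interval than with z:
lo_t, hi_t = confidence_limits(10.08, 0.19, 5, 2.78)
```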
The Null Hypothesis assumes two values are the same; we can then calculate the probability that the difference observed is due to random error. The critical value for rejecting the null hypothesis (remembering that under the null hypothesis \bar{x} = \mu) is calculated as:

\bar{x} - \mu = \pm \frac{ts}{\sqrt{N}}
Use the t test for equality of means (test the null hypothesis that there is no difference between the means; use the table of t values as above):

\bar{x}_1 - \bar{x}_2 = \pm t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}}
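A sketch of the two-sample t calculation, using the pooled standard deviation as above; the two data sets are hypothetical, and |t| would be compared against the tabulated value for N1 + N2 - 2 degrees of freedom:

```python
import math

def t_statistic(data1, data2):
    """t for the null hypothesis that two sample means are equal,
    using the pooled sample standard deviation."""
    n1, n2 = len(data1), len(data2)
    m1 = sum(data1) / n1
    m2 = sum(data2) / n2
    # Pooled s: squared deviations from each set's own mean,
    # over n1 + n2 - 2 degrees of freedom (2 sets):
    ss = sum((x - m1) ** 2 for x in data1) + sum((x - m2) ** 2 for x in data2)
    s_pool = math.sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s_pool * math.sqrt((n1 + n2) / (n1 * n2)))

# Hypothetical data; compare |t| to the table value for 4 degrees of freedom:
t = t_statistic([10.1, 10.3, 9.8], [10.6, 10.8, 10.5])
```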
To compare the variances of two data sets, use the F test (use the table of 5% values of F for 95% confidence). F is the ratio of the two variances, with the larger variance in the numerator so that F \geq 1:

F = \frac{s_1^2}{s_2^2} \quad (s_1 > s_2)
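A minimal sketch of the F calculation (the standard deviations plugged in are hypothetical):

```python
def f_statistic(s1, s2):
    """F for comparing two variances; the larger variance goes in the
    numerator so F >= 1. Compare to the tabulated 5% F value for
    95% confidence."""
    v1, v2 = s1 ** 2, s2 ** 2
    return v1 / v2 if v1 >= v2 else v2 / v1

# Hypothetical standard deviations from two methods:
F = f_statistic(0.12, 0.08)
```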
Note the F test can be used to test two questions: whether one method is more precise than another (a one-tailed test), and whether two methods simply differ in precision (a two-tailed test).
Statistical tests for rejecting outliers assume a normal distribution of the data. Of course, for small data sets (< 50) this can't be known with any certainty, so such tests should be used with caution. It has been said that the best reason for rejecting data is probably the analyst's intuition! Consider statistical tests a security blanket for the inexperienced, and a way to communicate to others why you felt justified in rejecting a data point. And remember, the only absolutely valid reason for rejecting a data point is that you know you messed up!
So what should you do with outliers? The following criteria can be used to evaluate them:
What we want to do here is find the minimum detectable difference between two populations of data: the population for the blank and the population for the concentration or amount of sample. So how do we detect whether the means of two data sets come from different populations? Use the t test. Thus (with subscript b referring to the blank determinations, so that we use s for the blanks):

\bar{x} - \bar{x}_b = \pm t\, s_b \sqrt{\frac{N + N_b}{N N_b}}
Notice that as the number of determinations increases, the limit decreases as the inverse of the square root. We get a rapid initial sensitivity enhancement (3 runs roughly double the sensitivity vs. 1 run), which then levels off (an infinity of runs only increases sensitivity by about 20% over 3 runs). This contributes to the sensitivity enhancements in Fourier Transform techniques such as are used in our NMR.
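The square-root factor in the detection-limit expression can be tabulated numerically; this is only a sketch of that factor (assuming the blank-comparison form of the t test above), not of a complete detection-limit calculation:

```python
import math

def detection_factor(n, n_b):
    """The sqrt((N + Nb)/(N * Nb)) factor multiplying t * s_b in the
    detection-limit expression; smaller means a lower (better) limit."""
    return math.sqrt((n + n_b) / (n * n_b))

# With many blank determinations (Nb large), the factor falls off
# roughly as 1/sqrt(N) in the number of sample runs N:
factors = [detection_factor(n, 100) for n in (1, 3, 10)]
```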
Two assumptions: (1) a linear relationship actually exists between the measured quantity and the amount of analyte (y = ax + b); (2) the deviations of the individual points from the line arise from error in the measured y values, i.e., the x values are essentially error-free.
For the least squares method, then, we want to minimize the sum of the squares of the residuals (the deviations of the points from the straight line).
The method not only provides the best-fit line; it also gives standard deviations for a (the slope) and b (the intercept on the y axis).
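The slope and intercept part of the least squares fit can be sketched as follows (this minimal version omits the standard deviations of a and b; the calibration data are hypothetical):

```python
def least_squares(xs, ys):
    """Best-fit slope a and intercept b for y = a*x + b, minimizing the
    sum of the squared residuals in y."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sums of squares/cross-products about the means:
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Hypothetical calibration data (exactly linear, so the fit is exact):
a, b = least_squares([0.0, 1.0, 2.0, 3.0], [0.1, 2.1, 4.1, 6.1])
```

Note that the fitted line always passes through the centroid of the data, (\bar{x}, \bar{y}), which is why b can be computed from the means once a is known.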
© R A Paselk
Last modified 6 September 2009