## Notes Week 3

### Chapter 12 - Statistics Review

Gaussian (Normal) Distributions: For populations we often assume a Gaussian distribution, that is, we assume the values are randomly distributed around a central value, as in the Gaussian curve below.

Recall that for a Gaussian distribution the mean (arithmetic average), median (central value), and mode (most frequent value) coincide. It is also useful to note that the area under the curve between ±1s (1 sigma) includes 68% of all values of the population, while ±2s (2 sigma) includes 95% and ±3s (3 sigma) includes 99.7% of all values; this is also known as the empirical rule or three-sigma rule. A plot demonstrating these relations is shown below:

Of course not all distributions are random about a central point. For example, the distribution of concentrations around an average value near zero will be skewed toward high values, since we cannot have concentrations below zero. This is similar to the situation we see for the speeds of gas molecules, cholesterol concentrations, and other quantities that cannot go below zero.
Because the Gaussian function is so widespread, familiar, and mathematically exact, other distributions are often transformed to give a Gaussian shape. Thus a log transformation will often convert a skewed distribution into a Gaussian. One should exercise caution, however: the Gaussian is seductive because of its convenience, but it is not always followed. If in any doubt, one should test a distribution to be certain it is in fact Gaussian. Linear transformations are very useful in this respect. Thus one may plot data on "probability paper," or plot frequency in Probit units (below) vs. log value. In either case, deviation from linearity indicates a non-Gaussian distribution, and this deviation can in turn be analyzed statistically.
Sigmoidal plot of normal distributions: If the values in a normal distribution are plotted vs. cumulative frequency instead of frequency, a sigmoidal plot results. (Ideally the curve should intersect 0 at 50%; artifacts in the data set used in preparing the figure account for the error seen in this representation.)
Probit Analysis gives a linear response from sigmoidal (Gaussian) dose-response curves. For probit analysis, divide the data into multiples of s ("Probit units") from the mean, but define the mean = 5; thus -1s = 4, +1s = 6, etc. The deviation expressed in standard-deviation units is called the NED (normal equivalent deviation). So for a 50% response (the mean value of the Gaussian) NED = 0, for a 15.9% response (50 - 68.2/2) NED = -1, and for an 84.1% response (50 + 68.2/2) NED = +1. So Probit = NED + 5.
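The conversion above can be sketched with Python's standard library: `statistics.NormalDist` supplies the inverse normal CDF, which turns a fractional response into an NED directly (the `probit` helper name is ours, not part of any library):

```python
from statistics import NormalDist

def probit(p):
    """Convert a fractional response p (0 < p < 1) to a probit unit:
    the normal equivalent deviation (NED) shifted so the mean maps to 5."""
    return NormalDist().inv_cdf(p) + 5

# 50% response sits at the mean: NED = 0, so probit = 5
print(round(probit(0.50), 2))   # 5.0
# 15.9% response is one sigma below the mean: NED = -1, probit ~ 4
print(round(probit(0.159), 2))
# 84.1% response is one sigma above the mean: NED = +1, probit ~ 6
print(round(probit(0.841), 2))
```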

## An Aside on Interpreting Statistics

[Examples modified from: Ulrich Hoffrage, Samuel Lindsey, Ralph Hertwig, and Gerd Gigerenzer. (22 Dec. 2000) "Communicating Statistical Information." Science 290, pp. 2261-2.]

A common situation, seen in news reports and even in the literature, involves interpreting how likely a particular event is given a prevalence and, for example, a rate of false positives.

For instance, let's assume a drug is used by 0.1% of the students at HSU, and that the standard test for this drug has a false positive rate of 2%. What is the likelihood that a student who tests positive in a random drug test is actually a user? Think about this for a few minutes and write down your answer, assuming there are 7,000 students at HSU.

Now let's solve this problem.

• First we need to determine how many users there are: Users = (0.001)(7,000) = 7
• Next, we determine the number of false positives: False +'s = non-users × rate = (7,000 - 7)(0.02) = 139.86 ≈ 140
• The total number of positive tests will then be: 140 + 7 = 147
• And the probability of a positive test indicating actual use is then: users / positives = 7/147 = 0.048, or about 5%!

As another example, let's consider the situation for HIV testing. If we assume a false positive rate of 0.01% and a sensitivity of 99.9%, what is the probability that you are HIV-positive if you test positive and you do not belong to a known risk group (assume a prevalence of 0.01% HIV in the general, non-risk-group population)?

Solving:

• First, the prevalence is 1 in 10,000
• Assuming a population of 10,000 for ease of calculation, positives = 1 true + 1 false = 2
• The probability of the positive test indicating the disease state is then 1/2, or 50%
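Both calculations above follow the same Bayes'-rule pattern, and can be checked with a short script (the function name and the rounding shown are our own choices, not from the source):

```python
def p_user_given_positive(prevalence, false_pos_rate, sensitivity=1.0):
    """Probability that a positive test reflects a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_pos_rate
    return true_pos / (true_pos + false_pos)

# Drug test: 0.1% prevalence, 2% false positives -> about 5%
print(round(p_user_given_positive(0.001, 0.02), 3))
# HIV test: 0.01% prevalence, 0.01% false positives, 99.9% sensitivity -> about 50%
print(round(p_user_given_positive(0.0001, 0.0001, 0.999), 2))
```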

Similar reasoning can be used to analyze the probability that an individual's disease state or symptoms are due to a particular exposure.

Another important consideration is the recognition that "hot spots" occur spontaneously as a result of probabilities and natural background rates. Random events do not occur evenly over a population; there is an expected "clumping," or noise. (Over a long sampling time the noise will even out, but over short spans, such as a human lifetime for rare phenomena, noise spikes will be common.)
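A quick simulation illustrates this clumping. Here we scatter 200 rare events uniformly over 100 equal regions (the event count, region count, and seed are arbitrary choices for illustration): even though the mean is 2 events per region, some regions come out far "hotter" purely by chance.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is reproducible

# Scatter 200 rare events uniformly over 100 regions (mean = 2 per region).
counts = Counter(random.randrange(100) for _ in range(200))
mean = 200 / 100
hottest = max(counts.values())
print(f"mean per region: {mean}, hottest region: {hottest}")
```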

As it turns out, the Gaussian curve can be described by an equation containing just two parameters: the population mean (mu, μ), a measure of centrality, and the population standard deviation (sigma, σ), a measure of the precision or spread of the data:

y = [1 / (σ√(2π))] exp[−(x − μ)² / 2σ²]

Note the exponential form of this equation, accounting for the rapid decay of frequency on either side of the mean.

[Note the meaning of parameter in statistics: a number that describes a population, such as the mean. For any given population it will be a constant. (For samples of a population the mean will vary, and thus is not considered a parameter; rather it is a statistic.) In math, physics, etc., parameter has a somewhat different but related meaning: a variable that is held constant for a particular calculation.]

So what is a population vs. a sample?

• Population is the complete set of all possible data for a given experiment, species, etc. For theoretical purposes it is an infinite set. As mentioned above, the characteristics of a population, such as the mean, will have fixed, constant values (why?). Population parameters are given Greek-letter symbols.
• Sample is a subgroup of the population, and thus may be more or less representative of the population. The characteristics of a sample are therefore variable; that is, they may differ from one sample to another. Sample statistics are given Roman-letter symbols. Note that a series of analytical samples, such as replicates, represents a single statistical sample!

So let's look at the Gaussian distribution for Populations and for Samples.

Population Mean vs. Sample Mean: both are represented by the same mathematical equation, but with differing symbols.

For a population:

μ = (Σ x_i) / N

Typical Gaussian curves for populations are shown above for different values of sigma and mu.

For a sample:

x̄ = (Σ x_i) / N

Gaussian curves for large samples will closely approximate the population curves, while small samples will be "noisy" (not smooth).

(Because the Gaussian distribution is so common and important, it is often worth checking whether data are in fact Gaussian. Use probability paper, which plots percentile vs. standard deviation: Gaussian data give a straight-line plot. We won't worry about what to do if the data are not Gaussian; suffice it to say there are methods to deal with other distributions if needed.)

Let's look at the distribution of randomly arrayed data (it follows the Gaussian curve).

### Standard Deviation

Population Standard Deviation. For measurement, this is a measure of the precision of the population of data:

σ = √[ Σ (x_i − μ)² / N ]

Two curves demonstrating differing precision are shown in Fig. 3.4 of your text (Skoog, West & Holler, 7th ed., pg 26) {Overhead}. Note that you can "normalize" this distribution, as shown in the same figure, by plotting in terms of the standard deviation. (Probit units give essentially the same plot.)

For the Sample Standard Deviation a correction must be made, replacing the population number N with N − 1, the number of degrees of freedom. This is necessary because we are no longer dealing with a defined system, so we have to take our measurements relative to something (one of the members of the set); thus we divide by one fewer member. In other words, N − 1 represents the number of independent data in the sample set. The sample equation also differs because we no longer know the true mean μ, so we substitute the sample mean x̄ as the best estimate:

s = √[ Σ (x_i − x̄)² / (N − 1) ]

Note that sample statistics approach population parameters as N approaches infinity.

Sometimes it is useful to estimate the standard deviation of the distribution of sample means taken from a single population. For this purpose we can use the Standard Error of the Mean, estimated from the standard deviation of our sample:

s_m = s / √N

This is an estimate of the standard deviation of x-bar we would see if we took a series of samples of size N from a single population.

Standard deviations from different sets of data from the same population can be pooled to give a better estimate of s. For t sets of data:

s_pooled = √[ (Σ_set1 (x_i − x̄₁)² + Σ_set2 (x_i − x̄₂)² + …) / (N_t − t) ]

Note that N_t is the total number of data points in the pool, so N_t − t gives us the pooled degrees of freedom.

### Variance:

s² = Σ (x_i − x̄)² / (N − 1)

The variance has the advantage that variances from independent sources of error may be added or subtracted.

### Relative Standard Deviation:

RSD = s / x̄

Expressed in %, the RSD has a special name, the Coefficient of Variation:

CV = (s / x̄) × 100%
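As a quick illustration of these definitions, the following sketch computes the sample mean, sample standard deviation, standard error of the mean, and CV for a small set of hypothetical replicate measurements (the data values are invented for the example):

```python
from statistics import mean, stdev
from math import sqrt

data = [10.1, 10.3, 9.8, 10.2, 10.0]  # hypothetical replicate measurements

xbar = mean(data)
s = stdev(data)              # sample standard deviation (N - 1 in the denominator)
sem = s / sqrt(len(data))    # standard error of the mean, s / sqrt(N)
cv = 100 * s / xbar          # coefficient of variation (RSD in %)
print(round(xbar, 2), round(s, 3), round(sem, 3), round(cv, 2))
```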

## Statistical Evaluation of Data

### (How good are my numbers?)

We need to look at the distribution of samples in a population. Different samples will generally have different means and standard deviations! However, there will be less variance in the sample means than in the individual population values. We can consider each sample mean to be an estimate of the population mean. And remember, what we want are the population mean and standard deviation. Thus we want to know how closely our samples reflect the population, and thus how confident we can be of our data.

### Confidence limits

Note this applies only in the absence of bias, and only if we can assume that s is approximately σ.

CL for μ = x̄ ± zσ

Note that for 95% confidence z ≈ 2 (more exactly 1.96), for 99.7% confidence z = 3, etc. For N observations, substitute the standard error of the mean for the standard deviation:

CL for μ = x̄ ± zσ/√N

Note the rapid approach to the true limits (as 1/√N).
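The z-based confidence limits can be sketched as follows; this is a minimal illustration using only the standard library, with the exact z value (1.96 for 95%) rather than the rounded 2, and hypothetical example numbers:

```python
from math import sqrt
from statistics import NormalDist

def confidence_limits(xbar, sigma, n, confidence=0.95):
    """CL for mu = xbar +/- z * sigma / sqrt(N).
    Valid only in the absence of bias and when s ~ sigma."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z value
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

lo, hi = confidence_limits(xbar=10.0, sigma=0.2, n=4, confidence=0.95)
print(round(lo, 3), round(hi, 3))  # 9.804 10.196
```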

### Confidence Interval for Samples

Confidence interval when the standard deviation is not known: use Student's t, defined as for z, but substituting s for σ:

t = (x̄ − μ) / (s/√N)

t values are tabulated in stats books etc.

Confidence limits are then:

CL for μ = x̄ ± t s/√N

## HYPOTHESIS TESTING

The null hypothesis assumes two values are the same; we can then calculate the probability that the difference observed is due to random error.

### Experimental mean vs. true mean

The critical value for rejecting the null hypothesis is calculated by comparing the experimental mean with the true value (which corresponds to μ): reject if

|x̄ − μ| > t s/√N

### Comparison of two experimental means (comparison of accuracy):

Use the t test for equality of means (test the null hypothesis that there is no difference between the means). Use a table of t values as above:

t = (x̄₁ − x̄₂) / [ s_pooled √(1/N₁ + 1/N₂) ]

### Comparison of Precision of Measurements (F-test)

(Use the table of 5% values of F for 95% confidence.) Compare the variances of the two data sets:

F = s₁² / s₂²

Note the F test can be used to test two questions:

1. Is one data set more precise than the other? In this case the variance of the supposedly more precise data set is placed in the denominator (bottom).
2. Is there a difference in the precision of the two methods? Always place the larger variance in the numerator (on top). Note that since the difference can go in either direction, the test is two-tailed, so the uncertainty doubles; that is, the tabulated 5% F values now correspond to 10% significance (90% confidence).
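A minimal sketch of the F calculation for the second question (larger variance on top); the standard deviations here are hypothetical, and the critical value must still come from an F table for your confidence level and degrees of freedom:

```python
def f_statistic(s1, s2):
    """F ratio of two sample variances, with the larger variance on top."""
    v1, v2 = s1**2, s2**2
    return max(v1, v2) / min(v1, v2)

# hypothetical standard deviations from two methods
F = f_statistic(0.12, 0.08)
print(round(F, 2))  # 2.25
# Compare with the tabulated critical F for the chosen confidence level;
# F < F_crit means no significant difference in precision.
```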

## Gross Errors and the Rejection of Outliers

### The Q Test

Statistical tests for rejecting outliers assume a normal distribution of the data. Of course, for small data sets (<50) this can't be known with any certainty, so such tests should be used with caution. It has been said that the best reason for rejecting data is probably the analyst's intuition! Consider statistical tests a security blanket for the inexperienced, and a way to communicate to others why you felt justified in rejecting a data point. And remember, the only absolutely valid reason for rejecting a data point is that you know you messed up!

So what should you do with outliers? You can use the following criteria to evaluate them:

• Reexamine your data and notes in the lab notebook to see if there is in fact reason to suspect the validity of the value (is it likely a gross error has been made?).
• If you can estimate the expected precision of your data, ask whether the point is really outside the expected range.
• Repeat the analysis if possible. Do(es) the new point(s) fit with the non-outlier points? If so, it is safer to reject the outlier; if not, the influence of the outlier is reduced anyway.
• If you can't reasonably reject the point, consider using the median instead of the mean, since outliers don't affect the median as much, and since it is a better measure of centrality for non-Gaussian distributions.
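The Q test itself compares the gap between the suspect point and its nearest neighbor to the total range of the data. A sketch (the replicate data are hypothetical, and the critical value must be looked up for your confidence level and N; the 0.64 used here is a commonly tabulated 90%-confidence value for N = 5, but check your own table):

```python
def q_test(data, q_crit):
    """Dixon's Q test for a single suspect value (assumes ~normal data).
    Q = (gap between suspect point and nearest neighbor) / (total range).
    q_crit must come from a table for the chosen confidence level and N."""
    xs = sorted(data)
    gap_low = xs[1] - xs[0]     # gap if the low point is the suspect
    gap_high = xs[-1] - xs[-2]  # gap if the high point is the suspect
    spread = xs[-1] - xs[0]
    q = max(gap_low, gap_high) / spread
    return q, q > q_crit  # True -> the suspect point may be rejected

# hypothetical replicates with one suspect high value
q, reject = q_test([10.1, 10.2, 10.0, 10.15, 10.9], q_crit=0.64)
print(round(q, 2), reject)
```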

### Estimating Detection Limits

What we want to do here is find the minimum detectable difference between two populations of data: the population for the blank and the population for the concentration or amount of sample. So how do we detect whether the means of two data sets come from different populations? Use the t-test. Thus (with subscript b referring to blank determinations, so that s is taken from the blanks):

(x̄ − x̄_b)_min = t s_b √(1/N + 1/N_b)

Notice that as the number of determinations increases, the limit decreases as the inverse of the square root. We get a rapid initial sensitivity enhancement (3 runs roughly double the sensitivity vs. 1 run), which then levels off (an infinity of runs only increases sensitivity by about 20% over 3 runs). This contributes to the sensitivity enhancements in Fourier Transform techniques such as are used in our NMR.

### Finding the "Best Fit" Straight Line for a Data Set: the Least Squares Method of Regression Analysis

Two assumptions:

• First: there is a linear relationship (y = ax + b).
• Second: all error is assumed to be in the measurements (y), with none in the x values.

For the least squares method we then want to minimize the squares of the residuals (the deviations of the points from the straight line).

The method not only provides the best-fit line, it also gives standard deviations for a (the slope) and b (the intercept on the y axis).
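The least-squares fit can be sketched directly from the two assumptions above, using the standard formulas a = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − a·x̄ (the calibration data below are hypothetical):

```python
from statistics import mean

def least_squares(xs, ys):
    """Best-fit slope a and intercept b for y = a*x + b,
    assuming all error is in the y values (as stated above)."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    a = sxy / sxx       # slope minimizing the squared residuals
    b = ybar - a * xbar  # intercept: the line passes through (xbar, ybar)
    return a, b

# hypothetical calibration data lying exactly on y = 2x + 1
a, b = least_squares([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
print(a, b)  # 2.0 1.0
```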

© R A Paselk

Last modified 6 September 2009