Chemometrics and Statistics

04. Similarity Analysis 2

Distribution Tests & Analysis of Variances

by Gerrit Renner

Initial Thoughts on Distributions

When working with data, it is important to know the distribution of the data.

Knowing the distribution, we can describe and apply models to the data.

E.g., knowing the data is normally distributed, we can apply t-tests to compare means.

Let's assume we have a dataset x with the following values:

 
x: [36.9, 38.0, 36.1, 34.1, 39.8,
    35.0, 35.9, 38.7, 38.4, 35.1,
    37.0, 36.7, 32.4, 38.6, 34.3,
    36.9, 35.5, 36.0, 37.5, 37.0,
    38.1, 36.7, 37.8, 38.3, 38.8]

What is the distribution of the data? (challenging)

Is the data normally distributed? (most common)

Testing for Normality: Chi-square Test

To determine if our dataset x follows a specific distribution, we can use statistical tests.

The Chi-square test helps us test if observed data fit an expected distribution.

Chi-square Test Formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies, and \( E_i \) are expected frequencies.

Let's apply the Chi-square test to see if x is normally distributed.

Step 1: Determine observed frequencies (e.g., counts within each range in x).

Step 2: Calculate expected frequencies based on a normal distribution.

Step 3: Compute \( \chi^2 \) and compare it with the critical value for our chosen significance level.

Step 1: Determine Observed Frequencies

Observed frequencies represent the actual counts of data points within specific ranges (bins).

For our dataset, we will divide the values into intervals and count how many data points fall into each interval.

Here is the dataset x:

x: [36.9, 38.0, 36.1, 34.1, 39.8,
    35.0, 35.9, 38.7, 38.4, 35.1,
    37.0, 36.7, 32.4, 38.6, 34.3,
    36.9, 35.5, 36.0, 37.5, 37.0,
    38.1, 36.7, 37.8, 38.3, 38.8]

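Example: Divide into intervals such as 32-34, 34-36, 36-38, 38-40, where each interval includes its lower bound and excludes its upper bound (so 36.0 counts towards 36-38).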

Observed Frequencies: [1, 6, 10, 8] (representing counts in each interval)
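The counting step can be sketched in a few lines of Python. This is a minimal illustration, assuming half-open intervals [low, high); the variable names `edges` and `observed` are chosen here and are not part of the slides.

```python
# Dataset x from the slides
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

# Interval edges for 32-34, 34-36, 36-38, 38-40
edges = [32, 34, 36, 38, 40]

# Assumption: half-open bins [low, high), so 38.0 counts towards 38-40
observed = [sum(low <= value < high for value in x)
            for low, high in zip(edges[:-1], edges[1:])]

print(observed)  # [1, 6, 10, 8]
```

Changing the bin convention (for example, including the upper bound instead) would shift values that lie exactly on an edge into the neighbouring interval.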

Step 2: Calculate Expected Frequencies

Expected frequencies represent the counts we would expect in each interval if the data followed a specified distribution (e.g., Normal).

To calculate expected frequencies, assume a normal distribution with the mean = 36.8 and standard deviation = 1.7.

Use the normal distribution to estimate the proportion of values that should fall in each interval (CDF).

For our dataset x with mean = 36.8 and standard deviation = 1.7:

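Interval: 32-34
CDF_34 = CDF((34 - 36.8) / 1.7) = 0.054
CDF_32 = CDF((32 - 36.8) / 1.7) = 0.003
Exp_Proportion = CDF_34 - CDF_32 = 0.051
Exp_Frequency = Exp_Proportion * 25 = 1.3
(The CDF values are computed with the unrounded mean and standard deviation, about 36.78 and 1.73.)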

Intervals | Exp. Proportion | Exp. Frequency
32-34     |      ~5.1%      |     ~1.3
34-36     |      ~27.0%     |     ~6.8
36-38     |      ~43.2%     |     ~10.8
38-40     |      ~21.1%     |     ~5.3

These values represent our expected frequencies for each interval.
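The expected frequencies can be reproduced with Python's standard library. This is a sketch under the assumption that the normal model uses the unrounded sample mean and the sample standard deviation (n - 1 denominator); `model` and `expected` are illustrative names.

```python
from statistics import NormalDist, mean, stdev

x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]
edges = [32, 34, 36, 38, 40]

# Assumption: normal model fitted with the unrounded sample statistics
model = NormalDist(mu=mean(x), sigma=stdev(x))

n = len(x)
expected = []
for low, high in zip(edges[:-1], edges[1:]):
    # Proportion of the normal distribution that falls into [low, high)
    proportion = model.cdf(high) - model.cdf(low)
    expected.append(proportion * n)
    print(f"{low}-{high}: {proportion:6.1%}  {proportion * n:5.1f}")
```

The expected frequencies do not add up to exactly 25, because a small part of the normal distribution lies below 32 and above 40.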

Are the observed frequencies similar to the expected ones?

Step 3: Calculate the Chi-square Statistic

The Chi-square statistic measures how well the observed frequencies match the expected frequencies.

Use the formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies and \( E_i \) are expected frequencies for each interval.

A higher Chi-square value suggests a larger deviation from the expected distribution.

For each interval, calculate
\((O_i - E_i)^2 / E_i\):

                                
Intervals | Observed | Expected | (O_i - E_i)^2 / E_i
32-34     |    1     |  ~1.3    |   0.07
34-36     |    6     |  ~6.8    |   0.09
36-38     |   10     | ~10.8    |   0.06
38-40     |    8     |  ~5.3    |   1.38

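Sum these values to get the Chi-square statistic: \( \chi^2 = 1.60 \), with \( df = 3 \) (number of intervals minus one) and \( p = 0.66 \). Strictly, each parameter estimated from the data (here the mean and the standard deviation) reduces \( df \) by one more.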
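The sum and the p-value can be checked with a short sketch. It uses the rounded expected frequencies from the table and a closed-form survival function that is valid only for df = 3; the helper name `chi2_sf_df3` is hypothetical. For other degrees of freedom, a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

observed = [1, 6, 10, 8]
expected = [1.3, 6.8, 10.8, 5.3]  # rounded expected frequencies from the table

# Contribution of each interval: (O_i - E_i)^2 / E_i
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)

def chi2_sf_df3(value):
    """P(X > value) for a chi-square distribution with exactly 3 degrees of freedom."""
    return (math.erfc(math.sqrt(value / 2))
            + math.sqrt(2 * value / math.pi) * math.exp(-value / 2))

p = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```

A large p-value means the observed counts are compatible with the expected ones; it does not prove that the data are normally distributed.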

Is this Chi-square value high or low? What does it suggest about our data?

Kolmogorov-Smirnov Test for Normality

The Kolmogorov-Smirnov (K-S) test is another way to test if a dataset follows a specified distribution.

Unlike the Chi-square test, the K-S test compares the cumulative distribution function (CDF) of the data with the CDF of the specified distribution.

The test statistic, \( D \), is the maximum difference between the empirical CDF of the data and the CDF of the expected distribution.

The K-S test statistic is calculated as: \[ D = \max_x | F_{observed}(x) - F_{expected}(x) | \] where \( F_{observed}(x) \) is the empirical CDF of the data, and \( F_{expected}(x) \) is the CDF of the specified distribution. \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( I(x_i \leq x) \) is the indicator function, i.e. 1 if \( x_i \leq x \) and 0 otherwise.

Step 1: Calculate the Empirical Cumulative Distribution Function (ECDF)

The ECDF represents the proportion of data points that are less than or equal to each value in the dataset.

For each value \( x_i \) in the dataset, the ECDF is calculated as: \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( n \) is the total number of data points, and \( I(x_i \leq x) \) is 1 if \( x_i \leq x \), and 0 otherwise.

For our dataset (sorted in ascending order):

x: [32.4, 34.1, 34.3, 35.0, 35.1,
    35.5, 35.9, 36.0, 36.1, 36.7,
    36.7, 36.9, 36.9, 37.0, 37.0,
    37.5, 37.8, 38.0, 38.1, 38.3,
    38.4, 38.6, 38.7, 38.8, 39.8]

Example ECDF Calculation:

Value (x) | Proportion <= x | ECDF
32.4      | 1/25            | 0.04
34.1      | 2/25            | 0.08
34.3      | 3/25            | 0.12
...       | ...             | ...
38.8      | 24/25           | 0.96
39.8      | 25/25           | 1.00

The ECDF provides the cumulative probability at each data point.
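The ECDF definition translates almost literally into code. The following is a minimal sketch; the function name `ecdf` is an illustrative choice.

```python
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]
n = len(x)

def ecdf(value):
    """Proportion of data points that are less than or equal to value."""
    return sum(xi <= value for xi in x) / n

for value in (32.4, 34.1, 34.3, 38.8, 39.8):
    print(f"{value:5.1f}  {ecdf(value):.2f}")
```

Repeated values such as 36.7 make the ECDF jump by 2/n at that point, because both copies satisfy the condition at once.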

Step 2: Calculate the Kolmogorov-Smirnov Test Statistic

The ECDF is a step function that increases by \( \frac{1}{n} \) at each data point.

The ECDF is a non-parametric way to visualize the distribution of the data.

It can be used to compare the distribution of two datasets or to check if a dataset follows a specific distribution.

D = max |F_obs(x) - F_exp(x)|
>> D = 0.121
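A sketch of how D could be computed is shown here. It assumes that the expected distribution is a normal distribution with the sample mean and the sample standard deviation of x; because the ECDF is a step function, the gap to the model CDF is checked just before and just after each step.

```python
from statistics import NormalDist, mean, stdev

x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

x_sorted = sorted(x)
n = len(x_sorted)

# Assumption: normal model with parameters estimated from the data
model = NormalDist(mean(x_sorted), stdev(x_sorted))

D = 0.0
for i, value in enumerate(x_sorted):
    f_exp = model.cdf(value)
    ecdf_before = i / n        # ECDF level just below this data point
    ecdf_after = (i + 1) / n   # ECDF level at this data point
    D = max(D, abs(ecdf_after - f_exp), abs(ecdf_before - f_exp))

print(f"D = {D:.3f}")
```

For this dataset the largest gap occurs just below the value 36.7, where the model CDF is clearly above the ECDF.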

Step 3: Calculate the p-value for the Kolmogorov-Smirnov Test

The \( p \)-value helps us determine if the observed differences between the ECDF of our data and the expected distribution are statistically significant.

Using the Kolmogorov-Smirnov test statistic \( D \) and the sample size \( n \), we can calculate the \( p \)-value.

Formula to approximate \( p \)-value: \[ p \approx Q_{KS} \left( \sqrt{n} \cdot D \right) \] where \( Q_{KS} \) is a function related to the distribution of \( D \) under the null hypothesis.

\[ Q_{KS}(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \lambda^2} \] Here, \( \lambda = \sqrt{n} \cdot D \).

If \( p < 0.05 \), we reject the null hypothesis, suggesting our data does not follow the expected distribution.

Example Calculation:

D = 0.121, n = 25
sqrt(n) * D = 5 * 0.121 = 0.605
p_value = 0.73

Since \( p = 0.73 \) is well above 0.05, we do not reject the null hypothesis: the data are consistent with the expected distribution.
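The series for \( Q_{KS} \) can be evaluated numerically with a truncated sum. This is only a sketch of the asymptotic formula; the function name `q_ks` and the cut-off of 100 terms are assumptions made here. For \( \lambda = 0.605 \) the truncated series evaluates to roughly 0.86, so the p-value quoted above may stem from a different approximation; the conclusion (p well above 0.05) is the same in both cases.

```python
import math

def q_ks(lam, terms=100):
    """Asymptotic Kolmogorov series Q_KS(lambda), truncated after `terms` terms."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k ** 2 * lam ** 2)
                   for k in range(1, terms + 1))

D, n = 0.121, 25
lam = math.sqrt(n) * D   # lambda = sqrt(n) * D
p = q_ks(lam)

print(f"lambda = {lam:.3f}, p = {p:.2f}")
```

The series converges quickly for moderate and large \( \lambda \) but slowly for very small \( \lambda \), where more terms would be needed.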

Moving from t-tests to ANOVA

We used the t-test to compare means of two groups.

ANOVA (Analysis of Variance) extends this to three or more groups.

Instead of just comparing means, ANOVA uses variances to uncover group differences.

This gives insight into whether observed differences are meaningful or due to random variation.

When Are Groups Separately Identifiable?

Group separability in ANOVA depends on two types of variance:

1. Between-Group Variance: High values indicate distinct group means.

2. Within-Group Variance: Low values indicate tight clustering around group means.

Total, Within-Group, and Between-Group Sum of Squares

Definition:
Sum of Squares = \( \sum (x_i - \bar{x})^2 \)

In ANOVA, Total Sum of Squares (T) is divided into two components:

1. Between-Group (B): Differences between group means. High values indicate distinct groups.

2. Within-Group (W): Differences within each group around its mean. Low values indicate tightly clustered data within groups.

In ANOVA, we analyze how much of T can be attributed to B, which helps determine if groups are meaningfully distinct.

To do so, we calculate variances \(\bar{B}\) and \(\bar{W}\) from \(B\) and \(W\) and compare them.

To get \( \bar{B} \) and \( \bar{W} \), we divide the sum of squares by the degrees of freedom.

The comparison is done using the F-statistic, which is the ratio of \(\bar{B}\) to \(\bar{W}\).

Steps in Detail: Calculating Total Sum of Squares

Total Sum of Squares \(T\) represents the overall spread of all data points from the overall mean \(\bar{X}\).

Formula: \[ T = \sum_{i=1}^{N} (x_i - \bar{X})^2 \] where \( N \) is the total number of data points.

Steps to Calculate:

Calculate the overall mean of all data points across all groups.

For each data point, find the squared distance from the overall mean.

Sum these squared distances.
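These three steps can be sketched as follows, assuming the three groups A, B and C from the worked example further below; the dictionary `groups` is an illustrative data layout.

```python
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

# Step 1: overall mean of all data points across all groups
all_points = [value for values in groups.values() for value in values]
grand_mean = sum(all_points) / len(all_points)

# Steps 2 and 3: squared distances from the overall mean, summed up
T = sum((value - grand_mean) ** 2 for value in all_points)

print(f"grand mean = {grand_mean:.1f}, T = {T:.1f}")
```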

Steps in Detail: Calculating Within-Group Sum of Squares

Within-Group Sum of Squares \(W\) measures how data points vary within each group around their group mean \(\bar{X}_j\).

Formula: \[ W = \sum_{j=1}^{G} \sum_{i=1}^{n_j} (x_{ij} - \bar{X}_j)^2 \] where \( G \) is the number of groups and \( n_j \) is the size of group \( j \).

Steps to Calculate:

Calculate the mean for each group.

For each data point, find the squared distance from its group mean.

Sum the squared distances within each group.

Sum these values across all groups.
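A minimal sketch of the within-group calculation, again assuming the three groups A, B and C from the worked example further below:

```python
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

W = 0.0
for name, values in groups.items():
    group_mean = sum(values) / len(values)
    # Squared distances of each point from its own group mean
    ss_group = sum((value - group_mean) ** 2 for value in values)
    print(f"{name}: mean = {group_mean:.1f}, SS = {ss_group:.1f}")
    W += ss_group

print(f"W = {W:.1f}")
```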

Steps in Detail: Calculating Between-Group Sum of Squares

Between-Group Sum of Squares \(B\) measures how each group's mean \(\bar{X}_j\) differs from the overall mean \(\bar{X}\).

Formula: \[ B = \sum_{j=1}^{G} n_j (\bar{X}_j - \bar{X})^2 \] where \( G \) is the number of groups and \( n_j \) is the size of each group.

Steps to Calculate:

Calculate the overall mean of all data points.

For each group, find the squared distance between the group mean and the overall mean.

Multiply each squared distance by the number of data points in that group.

Sum these values across all groups.
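The between-group calculation can be sketched in the same way, assuming the three groups A, B and C from the worked example further below:

```python
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

all_points = [value for values in groups.values() for value in values]
grand_mean = sum(all_points) / len(all_points)

B = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    # Squared distance of the group mean from the overall mean,
    # weighted by the number of data points in the group
    B += len(values) * (group_mean - grand_mean) ** 2

print(f"B = {B:.1f}")
```

Weighting by the group size ensures that T = W + B holds even when the groups have different sizes.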

Steps in Detail: Degrees of Freedom

Degrees of Freedom (df) are used to calculate variances and test statistics.

In ANOVA, we have three types of degrees of freedom:

1. Total (Tdf): \( N - 1 \) where \( N \) is the total number of data points.

2. Within-Group (Wdf): \( N - G \) where \( G \) is the number of groups.

3. Between-Group (Bdf): \( G - 1 \) where \( G \) is the number of groups.

Example: Calculating for Three Groups

Let's assume we investigate degradation rates using three different pH levels.

Here are the measured rates for each group:

A: [8, 9, 7, 10]
B: [15, 14, 16, 15, 14]
C: [23, 21, 24, 22, 23, 22]

Degrees of Freedom:

Tdf = N - 1 = 14
Wdf = N - G = 12
Bdf = G - 1 = 2

Calculated Sum of Squares:

T = 498.4
W = 13.3
B = 485.1

Calculated Variances:

T_mean = T / Tdf = 35.60
W_mean = W / Wdf = 1.11
B_mean = B / Bdf = 242.55

F-Statistic:

F = B_mean / W_mean = 218.84
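The whole example can be recomputed with a short script. This is a sketch, not a general ANOVA routine: the closed-form p-value used here holds only when the between-group degrees of freedom equal 2, as in this example. In practice a library function such as `scipy.stats.f_oneway` would typically be used.

```python
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

all_points = [value for values in groups.values() for value in values]
N, G = len(all_points), len(groups)
grand_mean = sum(all_points) / N

# Sums of squares
T = sum((value - grand_mean) ** 2 for value in all_points)
W = sum((value - sum(values) / len(values)) ** 2
        for values in groups.values() for value in values)
B = sum(len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values())

# Degrees of freedom and mean squares (variances)
Tdf, Wdf, Bdf = N - 1, N - G, G - 1
B_mean, W_mean = B / Bdf, W / Wdf
F = B_mean / W_mean

# Assumption: this closed form of the F survival function is valid only for Bdf = 2
p = (1 + 2 * F / Wdf) ** (-Wdf / 2)

print(f"T = {T:.1f}, W = {W:.1f}, B = {B:.1f}")
print(f"F = {F:.2f}, p = {p:.1e}")
```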

ANOVA Table for Example Data


Source      | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS)   | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Between (B) | 485.1               | 2                       | 242.55             | 218.84      | < 0.001
Within (W)  | 13.3                | 12                      | 1.11               |             |
---------------------------------------------------------------------------------------------------------
Total (T)   | 498.4               | 14                      | 35.60              |             |

ANOVA results are commonly summarized in a table.

The Within (W) row is often called the Error term; it represents the variance not explained by the grouping.

Interpretation:

The high F-statistic and low p-value suggest that at least one group mean differs significantly from the others, i.e., the pH level has a significant effect on the degradation rate.

Extending ANOVA to Multiple Factors

In a One-Way ANOVA, we analyze the effect of a single factor (variable) on the data.

However, when we want to study the effects of two or more factors, we need to use a Two-Way ANOVA (or higher).

For example, if we have two factors, like Sampling Location and Water Treatment Type, we can investigate:

The effect of each factor independently (main effects)

The interaction effect between factors (combined effect)

Example Data with Two Factors:

Factor 1 (Sampling Location): 
  >> River, Lake, Groundwater
Factor 2 (Water Treatment): 
  >> Untreated, Treated with Filter, Treated with UV

Example Measurements 
(e.g., pollutant concentration in mg/L):

Location    | Treatment  | Concentration (mg/L)
-----------------------------------------------
River       | Untreated  | 2.5, 2.8, 3.1
River       | Filter     | 1.8, 2.0, 2.2
River       | UV         | 1.4, 1.6, 1.8
Lake        | Untreated  | 3.0, 3.2, 3.4
Lake        | Filter     | 2.1, 2.3, 2.5
Lake        | UV         | 1.9, 2.0, 2.1
Groundwater | Untreated  | 1.5, 1.7, 1.8
Groundwater | Filter     | 1.1, 1.2, 1.3
Groundwater | UV         | 0.9, 1.0, 1.1
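As a first look at this table, the means per cell, per location and per treatment can be computed. These are the quantities from which main effects and interactions are judged. The data layout (a dictionary keyed by location and treatment) and the helper `avg` are illustrative assumptions, not a prescribed format.

```python
data = {
    ("River", "Untreated"): [2.5, 2.8, 3.1],
    ("River", "Filter"): [1.8, 2.0, 2.2],
    ("River", "UV"): [1.4, 1.6, 1.8],
    ("Lake", "Untreated"): [3.0, 3.2, 3.4],
    ("Lake", "Filter"): [2.1, 2.3, 2.5],
    ("Lake", "UV"): [1.9, 2.0, 2.1],
    ("Groundwater", "Untreated"): [1.5, 1.7, 1.8],
    ("Groundwater", "Filter"): [1.1, 1.2, 1.3],
    ("Groundwater", "UV"): [0.9, 1.0, 1.1],
}
locations = ["River", "Lake", "Groundwater"]
treatments = ["Untreated", "Filter", "UV"]

def avg(values):
    return sum(values) / len(values)

# Mean of each location/treatment combination (cell)
cell_means = {key: avg(values) for key, values in data.items()}

# Marginal means: average over all measurements that share one factor level
location_means = {
    loc: avg([v for (l, _), values in data.items() if l == loc for v in values])
    for loc in locations
}
treatment_means = {
    trt: avg([v for (_, t), values in data.items() if t == trt for v in values])
    for trt in treatments
}

for loc in locations:
    print(f"{loc:12s} {location_means[loc]:.2f}")
for trt in treatments:
    print(f"{trt:12s} {treatment_means[trt]:.2f}")
```

Differences between the location means or between the treatment means point to main effects; cell means that deviate from what the two marginal patterns predict together point to an interaction.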

How does multi-way ANOVA work?