Chemometrics & Statistics

04. Similarity Analysis 2

Distribution Tests & Analysis of Variances

by Gerrit Renner

Initial Thoughts on Distributions

When working with data, it is important to know the distribution of the data.

Knowing the distribution, we can describe and apply models to the data.

E.g., knowing the data is normally distributed, we can apply t-tests to compare means.

Let's assume we have a dataset x with the following values:

 
                            x: [36.9, 38.0, 36.1, 34.1, 39.8,
    35.0, 35.9, 38.7, 38.4, 35.1,
    37.0, 36.7, 32.4, 38.6, 34.3,
    36.9, 35.5, 36.0, 37.5, 37.0,
    38.1, 36.7, 37.8, 38.3, 38.8]
                            

What is the distribution of the data? (challenging)

Is the data normally distributed? (most common)

Testing for Normality: Chi-square Test

To determine if our dataset x follows a specific distribution, we can use statistical tests.

The Chi-square test helps us test if observed data fit an expected distribution.

Chi-square Test Formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies, and \( E_i \) are expected frequencies.

Let's apply the Chi-square test to see if x is normally distributed.

Step 1: Determine observed frequencies (e.g., counts within each range in x).

Step 2: Calculate expected frequencies based on a normal distribution.

Step 3: Compute \( \chi^2 \) and compare with the critical value for our confidence level.

Step 1: Determine Observed Frequencies

Observed frequencies represent the actual counts of data points within specific ranges (bins).

For our dataset, we will divide the values into intervals and count how many data points fall into each interval.

Here is the dataset x:

                                x: [36.9, 38.0, 36.1, 34.1, 39.8,
    35.0, 35.9, 38.7, 38.4, 35.1,
    37.0, 36.7, 32.4, 38.6, 34.3,
    36.9, 35.5, 36.0, 37.5, 37.0,
    38.1, 36.7, 37.8, 38.3, 38.8]
                            

Example: Divide into intervals such as 32-34, 34-36, 36-38, 38-40.

Observed Frequencies: [1, 6, 10, 8] (representing counts in each interval)
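
The counting step can be sketched in a few lines of Python. This is a minimal illustration that assumes half-open intervals [lower, upper), so a value such as 38.0 is counted in 38-40; the names `edges` and `observed` are chosen here for illustration only.

```python
# Dataset x from the text
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

# Interval edges for 32-34, 34-36, 36-38, 38-40
edges = [32, 34, 36, 38, 40]

# Count the values falling into each half-open interval [lower, upper)
observed = [
    sum(1 for v in x if lower <= v < upper)
    for lower, upper in zip(edges[:-1], edges[1:])
]

print(observed)  # [1, 6, 10, 8]
```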

Step 2: Calculate Expected Frequencies

Expected frequencies represent the counts we would expect in each interval if the data followed a specified distribution (e.g., Normal).

To calculate expected frequencies, assume a normal distribution with the mean = 36.8 and standard deviation = 1.7.

Use the normal distribution to estimate the proportion of values that should fall in each interval (CDF).

For our dataset x with mean ≈ 36.8 and standard deviation ≈ 1.7 (the numerical results below are based on the unrounded sample estimates, 36.78 and 1.73):

                            Interval: 32-34
CDF_34 = CDF((34-36.8)/1.7) = 0.054
CDF_32 = CDF((32-36.8)/1.7) = 0.003
Exp_Proportion = CDF_34 - CDF_32 = 0.05
Exp_Frequency = Exp_Proportion * 25 = 1.3
                        
Intervals | Exp. Proportion | Exp. Frequency
32-34     |      ~5.1%      |     ~1.3
34-36     |      ~27.0%     |     ~6.8
36-38     |      ~43.2%     |     ~10.8
38-40     |      ~21.1%     |     ~5.3
                

These values represent our expected frequencies for each interval.
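
As a rough check, the expected frequencies can be recomputed with Python's standard library. The sketch below assumes the normal model is parameterised with the unrounded sample mean and sample standard deviation of x; `model` and `expected` are illustrative names. Because only the range 32-40 is covered, the expected frequencies sum to slightly less than 25.

```python
from statistics import NormalDist, mean, stdev

x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]
edges = [32, 34, 36, 38, 40]

# Normal model using the sample mean and the sample (n - 1) standard deviation
model = NormalDist(mean(x), stdev(x))
n = len(x)

expected = []
for lower, upper in zip(edges[:-1], edges[1:]):
    # Proportion of the normal distribution that lies inside the interval
    proportion = model.cdf(upper) - model.cdf(lower)
    expected.append(proportion * n)
    print(f"{lower}-{upper}: {proportion:6.1%}  {proportion * n:5.1f}")
```

Small differences in the last digit compared with the table can arise from rounding.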

Are the observed frequencies similar to the expected ones?

Step 3: Calculate the Chi-square Statistic

The Chi-square statistic measures how well the observed frequencies match the expected frequencies.

Use the formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies and \( E_i \) are expected frequencies for each interval.

A higher Chi-square value suggests a larger deviation from the expected distribution.

For each interval, calculate
\((O_i - E_i)^2 / E_i\):

                                
Intervals | Observed | Expected | (O_i - E_i)^2 / E_i
32-34     |    1     |  ~1.3    |   0.07
34-36     |    6     |  ~6.8    |   0.09
36-38     |   10     | ~10.8    |   0.06
38-40     |    8     |  ~5.3    |   1.38
                                
                            

Sum these values to get the Chi-square statistic: \( \chi^2 = 1.60 \), \( df = 3 \), \( p = 0.66 \).
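
The sum can be reproduced with a short sketch. It uses the rounded expected frequencies from the table and df = number of intervals - 1, as in the text. The helper `chi2_sf_df3` is an illustrative closed form of the upper-tail probability that is valid for df = 3 only; for other degrees of freedom a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

observed = [1, 6, 10, 8]
expected = [1.3, 6.8, 10.8, 5.3]  # rounded values from the table

# Chi-square statistic: sum of (O - E)^2 / E over all intervals
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # as used in the text

def chi2_sf_df3(stat):
    """Upper-tail probability of the chi-square distribution for df = 3 only."""
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))

p = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.2f}")
```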

Is this Chi-square value high or low? What does it suggest about our data?

Kolmogorov-Smirnov Test for Normality

The Kolmogorov-Smirnov (K-S) test is another way to test if a dataset follows a specified distribution.

Unlike the Chi-square test, the K-S test compares the cumulative distribution function (CDF) of the data with the CDF of the specified distribution.

The test statistic, \( D \), is the maximum difference between the empirical CDF of the data and the CDF of the expected distribution.

The K-S test statistic is calculated as: \[ D = \max | F_{observed}(x) - F_{expected}(x) | \] where \( F_{observed}(x) \) is the empirical CDF of the data, and \( F_{expected}(x) \) is the CDF of the specified distribution.

The empirical CDF is defined as: \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( I(x_i \leq x) \) is the indicator function, i.e., 1 if \( x_i \leq x \) and 0 otherwise.

Step 1: Calculate the Empirical Cumulative Distribution Function (ECDF)

The ECDF represents the proportion of data points that are less than or equal to each value in the dataset.

For each value \( x_i \) in the dataset, the ECDF is calculated as: \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( n \) is the total number of data points, and \( I(x_i \leq x) \) is 1 if \( x_i \leq x \), and 0 otherwise.

For our dataset:

                                x: [32.4, 34.1, 34.3, 35.0, 35.1, 
    35.5, 35.9, 36.0, 36.1, 36.7, 
    36.7, 36.9, 36.9, 37.0, 37.0, 
    37.5, 37.8, 38.0, 38.1, 38.3, 
    38.4, 38.6, 38.7, 38.8, 39.8]
                            

Example ECDF Calculation:

Value (x) | Proportion <= x | ECDF
32.4      | 1/25            | 0.04
34.1      | 2/25            | 0.08
34.3      | 3/25            | 0.12
...       | ...             | ...
38.8      | 24/25           | 0.96
39.8      | 25/25           | 1.00
                                
                            

The ECDF provides the cumulative probability at each data point.
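
A minimal sketch of the ECDF definition in Python follows; the function name `ecdf` is illustrative. Note that tied values (e.g., 36.7 occurs twice) raise the ECDF by 2/n at once.

```python
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

def ecdf(data, value):
    """Proportion of data points that are less than or equal to value."""
    return sum(1 for v in data if v <= value) / len(data)

# Evaluate the ECDF at every distinct data point
for value in sorted(set(x)):
    print(f"{value:5.1f}  {ecdf(x, value):.2f}")
```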

Step 2: Calculate the Kolmogorov-Smirnov Test Statistic

The ECDF is a step function that increases by \( \frac{1}{n} \) at each data point.

The ECDF is a non-parametric way to visualize the distribution of the data.

It can be used to compare the distribution of two datasets or to check if a dataset follows a specific distribution.

                            D = max|F_obs(x) - F_exp(x)|
                            >> D = 0.121
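
The value of D can be reproduced with the sketch below. It assumes the expected distribution is a normal distribution with the sample mean and sample standard deviation of x. Because the ECDF is a step function, the gap to the model CDF is checked both just after and just before each step; `d_plus` and `d_minus` are illustrative names.

```python
from statistics import NormalDist, mean, stdev

x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

xs = sorted(x)
n = len(xs)
model = NormalDist(mean(xs), stdev(xs))  # assumed expected distribution

# Largest gap where the ECDF (after the step) lies above the model CDF
d_plus = max((i + 1) / n - model.cdf(v) for i, v in enumerate(xs))
# Largest gap where the model CDF lies above the ECDF (before the step)
d_minus = max(model.cdf(v) - i / n for i, v in enumerate(xs))

D = max(d_plus, d_minus)
print(f"D = {D:.3f}")
```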
                        

Step 3: Calculate the p-value for the Kolmogorov-Smirnov Test

The \( p \)-value helps us determine if the observed differences between the ECDF of our data and the expected distribution are statistically significant.

Using the Kolmogorov-Smirnov test statistic \( D \) and the sample size \( n \), we can calculate the \( p \)-value.

Formula to approximate \( p \)-value: \[ p \approx Q_{KS} \left( \sqrt{n} \cdot D \right) \] where \( Q_{KS} \) is a function related to the distribution of \( D \) under the null hypothesis.

\[ Q_{KS}(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \lambda^2} \] Here, \( \lambda = \sqrt{n} \cdot D \).

If \( p < 0.05 \), we reject the null hypothesis, suggesting our data does not follow the expected distribution.

Example Calculation:

                                D = 0.121, n = 25
sqrt(n) * D = 5 * 0.121 = 0.605
p_value = Q_KS(0.605) = 0.86
                            

Since \( p = 0.86 > 0.05 \), we do not reject the null hypothesis: the data are consistent with the expected distribution.
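
The series for \( Q_{KS} \) is easy to evaluate numerically. The sketch below truncates the sum after a fixed number of terms (an arbitrary choice; the terms shrink very quickly for values of \( \lambda \) like the one used here). Evaluating the series directly for \( \lambda = 0.605 \) gives roughly 0.86; software that uses an exact or corrected distribution of \( D \) can report a somewhat different value.

```python
import math

def q_ks(lam, terms=100):
    """Truncated alternating series for Q_KS(lambda)."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k**2 * lam**2)
                   for k in range(1, terms + 1))

D, n = 0.121, 25
lam = math.sqrt(n) * D   # lambda = sqrt(n) * D
p = q_ks(lam)

print(f"lambda = {lam:.3f}, p = {p:.2f}")
print("reject H0" if p < 0.05 else "do not reject H0")
```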

Moving from t-tests to ANOVA

We used the t-test to compare means of two groups.

ANOVA (Analysis of Variance) extends this to three or more groups.

Instead of just comparing means, ANOVA uses variances to uncover group differences.

This gives insight into whether observed differences are meaningful or due to random variation.

When Are Groups Separately Identifiable?

Group separability in ANOVA depends on two types of variance:

1. Between-Group Variance: High values indicate distinct group means.

2. Within-Group Variance: Low values indicate tight clustering around group means.

Total, Within-Group, and Between-Group Sum of Squares

Definition:
Sum of Squares = \( \sum (x_i - \bar{x})^2 \)

In ANOVA, Total Sum of Squares (T) is divided into two components:

1. Between-Group (B): Differences between group means. High values indicate distinct groups.

2. Within-Group (W): Differences within each group around its mean. Low values indicate tightly clustered data within groups.

In ANOVA, we analyze how much of T can be attributed to B, which helps determine if groups are meaningfully distinct.

To do so, we calculate variances \(\bar{B}\) and \(\bar{W}\) from \(B\) and \(W\) and compare them.

To get \( \bar{B} \) and \( \bar{W} \), we divide the sum of squares by the degrees of freedom.

The comparison is done using the F-statistic, which is the ratio of \(\bar{B}\) to \(\bar{W}\).

Steps in Detail: Calculating Total Sum of Squares

Total Sum of Squares \(T\) represents the overall spread of all data points from the overall mean \(\bar{X}\).

Formula: \[ T = \sum_{i=1}^{N} (x_i - \bar{X})^2 \] where \( N \) is the total number of data points.

Steps to Calculate:

Calculate the overall mean of all data points across all groups.

For each data point, find the squared distance from the overall mean.

Sum these squared distances.

Steps in Detail: Calculating Within-Group Sum of Squares

Within-Group Sum of Squares \(W\) measures how data points vary within each group around their group mean \(\bar{X}_j\).

Formula: \[ W = \sum_{j=1}^{G} \sum_{i=1}^{n_j} (x_{ij} - \bar{X}_j)^2 \] where \( G \) is the number of groups and \( n_j \) is the size of group \( j \).

Steps to Calculate:

Calculate the mean for each group.

For each data point, find the squared distance from its group mean.

Sum the squared distances within each group.

Sum these values across all groups.

Steps in Detail: Calculating Between-Group Sum of Squares

Between-Group Sum of Squares \(B\) measures how each group's mean \(\bar{X}_j\) differs from the overall mean \(\bar{X}\).

Formula: \[ B = \sum_{j=1}^{G} n_j (\bar{X}_j - \bar{X})^2 \] where \( G \) is the number of groups and \( n_j \) is the size of each group.

Steps to Calculate:

Calculate the overall mean of all data points.

For each group, find the squared distance between the group mean and the overall mean.

Multiply each squared distance by the number of data points in that group.

Sum these values across all groups.

Steps in Detail: Degrees of Freedom

Degrees of Freedom (df) are used to calculate variances and test statistics.

In ANOVA, we have three types of degrees of freedom:

1. Total (Tdf): \( N - 1 \) where \( N \) is the total number of data points.

2. Within-Group (Wdf): \( N - G \) where \( G \) is the number of groups.

3. Between-Group (Bdf): \( G - 1 \) where \( G \) is the number of groups.

Example: Calculating for Three Groups

Let's assume we investigate degradation rates using three different pH levels.

Here are the measured rates for each group:

A: [8, 9, 7, 10]
B: [15, 14, 16, 15, 14]
C: [23, 21, 24, 22, 23, 22]
                        

Degrees of Freedom:

Tdf = N - 1 = 14
Wdf = N - G = 12
Bdf = G - 1 = 2
                        

Calculated Sum of Squares:

T = 498.4
W = 13.3
B = 485.1
                        

Calculated Variances:

T_mean = 35.61
W_mean = 1.11
B_mean = 242.55
                        

F-Statistic:

F = 218.67
                    

ANOVA Table for Example Data


Source      | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS)   | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Between (B) | 485.1               | 2                       | 242.55             | 218.67      | < 0.001
Within (W)  | 13.3                | 12                      | 1.11               |             |
---------------------------------------------------------------------------------------------------------
Total (T)   | 498.4               | 14                      | 35.61              
                    

The ANOVA is commonly presented in a table to summarize the results.

The Within (W) is often called the Error term, representing variance not explained by group.

Interpretation:

The high F-statistic and low p-value suggest that the group means are significantly different. I.e., the pH level has a significant effect on the degradation rate.
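
The whole one-way calculation can be retraced with a short sketch using only built-in Python. The variable names follow the symbols in the text (T, W, B); the p-value is omitted because it needs the F-distribution, which is not in the standard library. If SciPy is available, `scipy.stats.f_oneway` should give the same F together with a p-value.

```python
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

def mean(values):
    return sum(values) / len(values)

all_values = [v for values in groups.values() for v in values]
N, G = len(all_values), len(groups)
grand_mean = mean(all_values)

# Sums of squares: total, within-group, between-group
T = sum((v - grand_mean) ** 2 for v in all_values)
W = sum((v - mean(values)) ** 2 for values in groups.values() for v in values)
B = sum(len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values())

# Variances (mean squares) and F-statistic
W_mean = W / (N - G)
B_mean = B / (G - 1)
F = B_mean / W_mean

print(f"T = {T:.1f}, W = {W:.1f}, B = {B:.1f}")
print(f"W_mean = {W_mean:.2f}, B_mean = {B_mean:.2f}, F = {F:.1f}")
```

The check T = W + B is a quick way to catch calculation mistakes; F comes out at roughly 219, in line with the table up to rounding.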

Extending ANOVA to Multiple Factors

In a One-Way ANOVA, we analyze the effect of a single factor (variable) on the data.

However, when we want to study the effects of two or more factors, we need to use a Two-Way ANOVA (or higher).

For example, if we have two factors, like Sampling Location and Water Treatment Type, we can investigate:

The effect of each factor independently (main effects)

The interaction effect between factors (combined effect)

Example Data with Two Factors:

Factor 1 (Sampling Location): 
  >> River, Lake, Groundwater
Factor 2 (Water Treatment): 
  >> Untreated, Treated with Filter, Treated with UV

Example Measurements 
(e.g., pollutant concentration in mg/L):

Location    | Treatment  | Concentration (mg/L)
-----------------------------------------------
River       | Untreated  | "2.5, 2.8, 3.1"
River       | Filter     | "1.8, 2.0, 2.2"
River       | UV         | "1.4, 1.6, 1.8"
Lake        | Untreated  | "3.0, 3.2, 3.4"
Lake        | Filter     | "2.1, 2.3, 2.5"
Lake        | UV         | "1.9, 2.0, 2.1"
Groundwater | Untreated  | "1.5, 1.7, 1.8"
Groundwater | Filter     | "1.1, 1.2, 1.3"
Groundwater | UV         | "0.9, 1.0, 1.1"

How does multi-way ANOVA work?

Adjustments for Multi-Way ANOVA: Sum of Squares

In Multi-Way ANOVA, we analyze multiple factors, which requires accounting for:

Main Effects: The impact of each factor alone (e.g., effect of Sampling Location or Treatment Type independently).

Consider Factors one by one:

Location    | Treatment  | Concentration (mg/L)
-----------------------------------------------
River       |            | "2.5, 2.8, 3.1"
River       |            | "1.8, 2.0, 2.2"
River       |            | "1.4, 1.6, 1.8"
Lake        |            | "3.0, 3.2, 3.4"
Lake        |            | "2.1, 2.3, 2.5"
Lake        |            | "1.9, 2.0, 2.1"
Groundwater |            | "1.5, 1.7, 1.8"
Groundwater |            | "1.1, 1.2, 1.3"
Groundwater |            | "0.9, 1.0, 1.1"

Location    | Concentration (mg/L)
-----------------------------------------------
River       | 2.5, 2.8, 3.1, 1.8, 2.0, 2.2, 1.4, 1.6, 1.8
Lake        | 3.0, 3.2, 3.4, 2.1, 2.3, 2.5, 1.9, 2.0, 2.1
Groundwater | 1.5, 1.7, 1.8, 1.1, 1.2, 1.3, 0.9, 1.0, 1.1

Treatment | Concentration (mg/L)
-----------------------------------------------
Untreated | 2.5, 2.8, 3.1, 3.0, 3.2, 3.4, 1.5, 1.7, 1.8
Filter    | 1.8, 2.0, 2.2, 2.1, 2.3, 2.5, 1.1, 1.2, 1.3
UV        | 1.4, 1.6, 1.8, 1.9, 2.0, 2.1, 0.9, 1.0, 1.1
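
Pooling by one factor at a time can be sketched as follows. The helper `main_effect_ss` is an illustrative name: it groups all observations by one factor, ignores the other, and then applies the between-group sum of squares formula from the one-way case.

```python
# (location, treatment) -> replicate concentrations in mg/L, from the example table
data = {
    ("River", "Untreated"): [2.5, 2.8, 3.1],
    ("River", "Filter"): [1.8, 2.0, 2.2],
    ("River", "UV"): [1.4, 1.6, 1.8],
    ("Lake", "Untreated"): [3.0, 3.2, 3.4],
    ("Lake", "Filter"): [2.1, 2.3, 2.5],
    ("Lake", "UV"): [1.9, 2.0, 2.1],
    ("Groundwater", "Untreated"): [1.5, 1.7, 1.8],
    ("Groundwater", "Filter"): [1.1, 1.2, 1.3],
    ("Groundwater", "UV"): [0.9, 1.0, 1.1],
}

values = [v for cell in data.values() for v in cell]
grand_mean = sum(values) / len(values)

def main_effect_ss(factor_index):
    """Pool observations by one factor (0 = location, 1 = treatment)
    and return the between-level sum of squares for that factor."""
    pooled = {}
    for key, cell in data.items():
        pooled.setdefault(key[factor_index], []).extend(cell)
    return sum(len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
               for obs in pooled.values())

ss_location = main_effect_ss(0)
ss_treatment = main_effect_ss(1)
print(f"SS Location  = {ss_location:.4f}")
print(f"SS Treatment = {ss_treatment:.4f}")
```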
                        

Adjustments for Multi-Way ANOVA: DF

In Multi-Way ANOVA, we analyze multiple factors, which requires accounting for:

Main Effects: The impact of each factor alone (e.g., effect of Sampling Location or Treatment Type independently).

Degrees of Freedom: Each factor and interaction adds df:

df_Factor A = Levels of Factor A - 1
df_Factor B = Levels of Factor B - 1

Example Degrees of Freedom Calculation:

Factors: Sampling Location, Treatment Type 
df_Factor A = 3 - 1 = 2
df_Factor B = 3 - 1 = 2

df_Model = df_Factor A + df_Factor B (+ df_Interaction, if included)
df_Total = Total observations - 1
df_Error = df_Total - df_Model

Error Term: Variance not explained by main or interaction effects, calculated as Within-Group Sum of Squares (W).
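
For the water example, the degrees of freedom can be tallied as in the sketch below. It assumes a balanced design with three replicates per cell and a model that includes the interaction, whose degrees of freedom are the product of the two main-effect values.

```python
levels_a = 3     # Sampling Location: River, Lake, Groundwater
levels_b = 3     # Treatment: Untreated, Filter, UV
n_per_cell = 3   # replicate measurements per Location/Treatment combination
N = levels_a * levels_b * n_per_cell

df_a = levels_a - 1
df_b = levels_b - 1
df_ab = df_a * df_b                      # interaction
df_total = N - 1
df_error = df_total - (df_a + df_b + df_ab)

print(df_a, df_b, df_ab, df_error, df_total)  # 2 2 4 18 26
```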

ANOVA Table for Multi-Way ANOVA

Source      | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS)   | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Location    | 6.943               | 2                       | 3.47148            | 103         | < 0.001
Treatment   | 4.9696              | 2                       | 2.4848             | 73.7        | < 0.001
---------------------------------------------------------------------------------------------------------
Error       | 0.6067              | 18                      | 0.0337              
Total       | 12.8329             | 26                      | 0.4936              
                        
                    

Which factors have a significant effect on the data?

Helpful Relationships in ANOVA

In ANOVA, we often encounter relationships between different components:

1. Relationship between Total, Within-Group, and Between-Group Sum of Squares:

\[ T = W + B + I\] where \( I \) is the interaction term.

2. Relationship between Degrees of Freedom:

\[ df_{T} = df_{W} + df_{B} + df_{I} \] where \( df_I \) is the interaction degrees of freedom.

We can use these relationships to check the consistency of our ANOVA results.

What is an Interaction in ANOVA?

An interaction in ANOVA occurs when the effect of one factor depends on the level of another factor.

Interactions reveal combined effects that are not captured by examining each factor alone.

In an interaction, the influence of one factor on the response variable changes based on the level of the other factor.

Example: Plant growth may depend on both Soil Type and Fertilizer Type.

Without an interaction: Fertilizer Type has the same effect on all Soil Types.

With an interaction: The effect of Fertilizer Type changes depending on Soil Type.
For example, Organic fertilizer may work better in Sandy soil, while Synthetic fertilizer is more effective in Clay soil.

Interactions highlight complex relationships and help identify when factors are dependent on one another.

Interpreting ANOVA Results: Main Effects and Interactions

1. No Significant Interaction, Significant Main Effects:

Each factor (e.g., Soil Type, Fertilizer Type) has a significant, independent effect.
No interaction means each factor’s effect is consistent across levels of the other.
Interpretation: Each factor impacts the outcome independently.

2. Significant Interaction, Significant Main Effects:

Both factors affect the outcome, with an interaction.
The effect of one factor depends on the level of the other.
Interpretation: Focus on interaction, as combined effects are more informative.

3. Significant Interaction, No Significant Main Effects:

Neither factor alone affects the outcome, but their interaction does.
Interpretation: Impact emerges only with both factors combined.

Calculating Interaction Sum of Squares

The Interaction Sum of Squares \( I \) shows how much the effect of one factor depends on the level of another.

Steps to Calculate:

Calculate the overall mean \( \bar{X} \) across all observations.

Find main effect means for each factor (\( \bar{X}_{A_i} \) and \( \bar{X}_{B_j} \)).

Calculate the expected mean for each cell, assuming no interaction:

\[ \text{Expected}_{A_iB_j} = \bar{X} + (\bar{X}_{A_i} - \bar{X}) + (\bar{X}_{B_j} - \bar{X}) \]

Find the observed mean for each cell \( \text{Observed Mean}_{A_iB_j} \).

Compute \( I \) by summing squared differences between observed and expected means, multiplied by the number of observations in each cell:

\[ I = \sum_{i,j} n_{A_iB_j} \left( \text{Obs. Mean}_{A_iB_j} - \text{Expected}_{A_iB_j} \right)^2 \]

Example Calculation

Suppose we have two factors, Soil Type (Clay, Sand) and Fertilizer Type (Organic, Synthetic), with observed plant growth:


Soil Type | Fertilizer Type | Obs. Mean
----------------------------------------
Clay      | Organic         | 20
Clay      | Synthetic       | 25
Sand      | Organic         | 15
Sand      | Synthetic       | 30
                        

Steps:

Calculate the overall mean \( \bar{X} \) across all observations.

                                
Total_mean = (20 + 25 + 15 + 30) / 4 = 22.5
                                
                            

Calculate the main effect means:

                                
Clay_mean = (20 + 25) / 2 = 22.5
Sand_mean = (15 + 30) / 2 = 22.5
Organic_mean = (20 + 15) / 2 = 17.5
Synthetic_mean = (25 + 30) / 2 = 27.5
                                
                            

Calculate expected means (assuming no interaction):

                                
Exp_Clay_Organic   = 22.5+(22.5-22.5)+(17.5-22.5) = 17.5
Exp_Clay_Synthetic = 22.5+(22.5-22.5)+(27.5-22.5) = 27.5
Exp_Sand_Organic   = 22.5+(22.5-22.5)+(17.5-22.5) = 17.5
Exp_Sand_Synthetic = 22.5+(22.5-22.5)+(27.5-22.5) = 27.5
                                
                            

Calculate \( I \):

                                
I = (20-17.5)^2 + (25-27.5)^2 
    + (15-17.5)^2 + (30-27.5)^2 = 4 * 6.25 = 25   (n = 1 per cell)
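
The same steps can be written as a short sketch. It assumes one observation per cell (`n_per_cell = 1`), since the example lists only cell means; with more replicates each squared difference would be multiplied by the cell size. Each of the four squared differences equals 2.5^2 = 6.25.

```python
# Observed cell means from the example table
cell_means = {
    ("Clay", "Organic"): 20, ("Clay", "Synthetic"): 25,
    ("Sand", "Organic"): 15, ("Sand", "Synthetic"): 30,
}
n_per_cell = 1  # assumption: one observation per cell

grand = sum(cell_means.values()) / len(cell_means)

def level_mean(index, level):
    """Main-effect mean for one level (index 0 = soil, 1 = fertilizer)."""
    vals = [m for key, m in cell_means.items() if key[index] == level]
    return sum(vals) / len(vals)

interaction_ss = 0.0
for (soil, fert), observed in cell_means.items():
    # Expected cell mean if the two factors acted purely additively
    expected = grand + (level_mean(0, soil) - grand) + (level_mean(1, fert) - grand)
    interaction_ss += n_per_cell * (observed - expected) ** 2

print(interaction_ss)  # 25.0
```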
                                
                            

Degrees of Freedom for Interaction

The degrees of freedom for the interaction term measure how many independent comparisons can be made for interactions between factor levels.

Formula for \( dfI \):

\[ dfI = (a - 1)(b - 1) \] where: \( a \) is the number of levels for Factor A, and \( b \) is the number of levels for Factor B.

Example For 2 Soil Types and 2 Fertilizer Types: \[ dfI = (2 - 1)(2 - 1) = 1 \]

ANOVA Table for Multi-Way ANOVA

Source      | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS)   | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Location    | 6.943               | 2                       | 3.47148            | 103         | < 0.001
Treatment   | 4.9696              | 2                       | 2.4848             | 73.7        | < 0.001
Loc * Treat | 0.3126              | 4                       | 0.07815            | 2.32        | 0.096
---------------------------------------------------------------------------------------------------------
Error       | 0.6067              | 18                      | 0.0337              
Total       | 12.8329             | 26                      | 0.4936              
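
The entries of this table can be retraced from the raw measurements. The sketch below assumes the balanced design of the example (three replicates in every cell) and computes the interaction from the difference between observed cell means and the additive expectation; p-values are left out because they require the F-distribution.

```python
data = {
    ("River", "Untreated"): [2.5, 2.8, 3.1],
    ("River", "Filter"): [1.8, 2.0, 2.2],
    ("River", "UV"): [1.4, 1.6, 1.8],
    ("Lake", "Untreated"): [3.0, 3.2, 3.4],
    ("Lake", "Filter"): [2.1, 2.3, 2.5],
    ("Lake", "UV"): [1.9, 2.0, 2.1],
    ("Groundwater", "Untreated"): [1.5, 1.7, 1.8],
    ("Groundwater", "Filter"): [1.1, 1.2, 1.3],
    ("Groundwater", "UV"): [0.9, 1.0, 1.1],
}

def mean(values):
    return sum(values) / len(values)

def pooled(index, level):
    """All observations sharing one level of a factor (0 = location, 1 = treatment)."""
    return [v for key, cell in data.items() if key[index] == level for v in cell]

all_values = [v for cell in data.values() for v in cell]
grand = mean(all_values)
locations = sorted({key[0] for key in data})
treatments = sorted({key[1] for key in data})

ss_loc = sum(len(pooled(0, a)) * (mean(pooled(0, a)) - grand) ** 2 for a in locations)
ss_trt = sum(len(pooled(1, b)) * (mean(pooled(1, b)) - grand) ** 2 for b in treatments)
# Interaction: observed cell mean vs. additive (no-interaction) expectation
ss_int = sum(
    len(cell) * (mean(cell) - (mean(pooled(0, a)) + mean(pooled(1, b)) - grand)) ** 2
    for (a, b), cell in data.items()
)
# Error: spread of the replicates around their own cell mean
ss_err = sum((v - mean(cell)) ** 2 for cell in data.values() for v in cell)
ss_total = sum((v - grand) ** 2 for v in all_values)

df_loc, df_trt = len(locations) - 1, len(treatments) - 1
df_int = df_loc * df_trt
df_err = len(all_values) - len(data)
ms_err = ss_err / df_err

table = {
    "Location": (ss_loc, df_loc),
    "Treatment": (ss_trt, df_trt),
    "Loc * Treat": (ss_int, df_int),
}
f_stats = {name: (ss / df) / ms_err for name, (ss, df) in table.items()}

for name, (ss, df) in table.items():
    print(f"{name:12s} SS = {ss:.4f}  df = {df}  F = {f_stats[name]:.2f}")
print(f"{'Error':12s} SS = {ss_err:.4f}  df = {df_err}")
```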
                        
                    

What does this ANOVA table tell us about the factors?

ANOVA Table for Multi-Way ANOVA

                        
Source        | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic  | p-value
-----------------------------------------------------------------------------------------------------------
Factor A      | 10.123              | 2                       | 5.0615           | 45.6         | < 0.001
Factor B      | 8.456               | 3                       | 2.8187           | 25.4         | < 0.001
Factor C      | 5.789               | 2                       | 2.8945           | 22.3         | < 0.001
A * B         | 2.567               | 6                       | 0.4278           | 3.4          | 0.005
A * C         | 1.987               | 4                       | 0.4968           | 3.9          | 0.003
B * C         | 2.234               | 6                       | 0.3723           | 3.0          | 0.010
A * B * C     | 0.876               | 12                      | 0.0730           | 1.8          | 0.081
-----------------------------------------------------------------------------------------------------------
Error         | 6.432               | 80                      | 0.0804           
Total         | 38.464              | 115                     | 0.3345           
                        
                    

What insights does this ANOVA table provide about the three factors and their interactions?

Interactive One-Way ANOVA Table

                        
Source      | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS)   | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Between (B) | 0.0                 | 2                       | 0.0                | 0.0         | 1.000
Within (W)  | 13.3                | 12                      | 1.11               |             |
---------------------------------------------------------------------------------------------------------
Total (T)   | 13.3                | 14                      | 0.95              
                        
                    

This experiment starts with three groups having similar means, yielding no significant difference.

Use the sliders to introduce an offset to one group’s mean and observe how the ANOVA table updates to show significance.

If p < 0.05, how to locate the group(s) with significant differences?

Post-Hoc Analysis: Understanding Group Differences

ANOVA shows that at least one group differs, but it doesn’t specify which ones.

Post-hoc analysis identifies specific group differences after a significant ANOVA result.

Tukey’s HSD* is our primary method for comparing all pairs of groups while controlling for error rates.

*HSD: Honestly Significant Difference

Purpose of Tukey’s HSD:

Detect where specific group differences lie.

Control for increased Type I error* in multiple comparisons.

When to Use: After a significant ANOVA result.

*Type I error: Incorrectly rejecting a true null hypothesis (false positive).

Tukey’s Honestly Significant Difference (HSD) Test

Purpose: To identify specific group differences by comparing all pairs of means.

Tukey’s HSD calculates a critical value for each pairwise difference to determine significance.

Formula: \[ HSD = q(g,N-g) \cdot \sqrt{\frac{\bar{W}}{n}} \]

where \( q(g,N-g) \) is the critical value of the Studentized range distribution for \( g \) groups and \( N - g \) error degrees of freedom (with \( N \) total observations) at the chosen significance level, \( \bar{W} \) is the mean square error, and \( n \) is the number of observations per group.

How it Works:

Calculates pairwise differences between group means.

Compares each difference to the HSD value; if it exceeds this, the pair is significantly different.

Benefit: Controls Type I error across multiple comparisons.

The critical value \( q(g,N-g) \) is based on the Studentized range distribution (similar to the t-distribution). Here is an example of such a table: Studentized Range Distribution Table

Example: Applying Tukey’s HSD in Environmental Analytics

Suppose we measured pollutant concentrations in three types of water sources:

Sources: River, Lake, Groundwater

Objective: Use Tukey's HSD to find which sources have significantly different pollutant levels after a significant ANOVA result.

                            
Mean Pollutant Concentrations (mg/L):
River:       3.2
Lake:        5.1
Groundwater: 4.0

Calculated HSD: 1.5

Comparisons:
River vs. Lake        
    ↳ Diff. = 1.9 > HSD → Significant

River vs. Groundwater 
    ↳ Diff. = 0.8 < HSD → Not Significant

Lake vs. Groundwater  
    ↳ Diff. = 1.1 < HSD → Not Significant
                            
                        

Interpretation: Pollutant levels are significantly higher in lakes compared to rivers.
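
The pairwise comparison itself is simple to sketch. The HSD value of 1.5 is taken as given from the example rather than computed, since that would require the critical value \( q \) from a table.

```python
from itertools import combinations

# Mean pollutant concentrations (mg/L) and the HSD value from the example
means = {"River": 3.2, "Lake": 5.1, "Groundwater": 4.0}
hsd = 1.5

results = {}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    results[(a, b)] = diff > hsd   # True means significantly different
    verdict = "Significant" if diff > hsd else "Not Significant"
    print(f"{a} vs. {b}: Diff. = {diff:.1f} -> {verdict}")
```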

Interpreting Tukey’s HSD Results

Each comparison tells us if a specific pair of groups has a significant difference.

Significant Difference: Difference exceeds HSD critical value.

No Significant Difference: Difference is less than HSD critical value.

In our example, only the comparison between River and Lake shows a significant difference.

Tukey's HSD allows us to focus on the most impactful differences and interpret how pollutant levels vary by source.

Why Use ANOVA + Post-Hoc Analysis Instead of Multiple t-Tests?

Problem with Multiple t-Tests: Each t-test has a 5% false-positive rate. When comparing multiple groups, this compounds, increasing the chance of finding a false difference.

Example with Three Groups: For groups A, B, and C, three pairwise comparisons (A vs. B, A vs. C, B vs. C) lead to an overall false-positive rate (assuming independent tests) of: \[ 1 - (1 - 0.05)^{m} = 1 - 0.95^{3} = 14.26\% \] where \( m = 3 \) is the number of pairwise comparisons. This means a 14.26% chance of at least one false positive across the tests.
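
How quickly this rate grows with the number of groups can be seen in a small sketch. It assumes independent tests at the same significance level; the function name `familywise_error` is illustrative.

```python
from math import comb

def familywise_error(groups, alpha=0.05):
    """Chance of at least one false positive across all pairwise tests,
    assuming the tests are independent."""
    m = comb(groups, 2)   # number of pairwise comparisons
    return 1 - (1 - alpha) ** m

for g in (3, 4, 5):
    print(f"{g} groups, {comb(g, 2)} comparisons: {familywise_error(g):.2%}")
```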

Solution with ANOVA + Post-Hoc: ANOVA controls Type I error by first checking for overall differences among groups.

If significant, post-hoc tests like Tukey’s HSD ensure accurate pairwise comparisons while keeping the overall false-positive rate at 5%.

Takeaway: ANOVA + post-hoc tests provide reliable multi-group comparison without inflating errors.

Seminar Materials

The concentration of a critical environmental chemical was measured in five rivers across two distinct region types: urban and landside.

These regions represent contrasting land uses, likely influencing pollution levels in nearby water bodies.

Measurements were collected seasonally over one year to capture potential seasonal variations.

Goal: Determine which factors significantly impact chemical concentration levels.

                            
Region Type | River    | Season | Diclofenac (µg/L)
---------------------------------------------------
Urban       | River1   | Winter | 2.5
Urban       | River1   | Spring | 2.1
Urban       | River1   | Summer | 3.0
Urban       | River1   | Fall   | 2.8
Urban       | River2   | Winter | 2.7
Urban       | River2   | Spring | 2.0
Urban       | River2   | Summer | 3.1
Urban       | River2   | Fall   | 2.9
Landside    | River3   | Winter | 1.5
Landside    | River3   | Spring | 1.7
Landside    | River3   | Summer | 1.8
Landside    | River3   | Fall   | 1.6
Landside    | River4   | Winter | 1.6
Landside    | River4   | Spring | 1.8
Landside    | River4   | Summer | 1.9
Landside    | River4   | Fall   | 1.7
Landside    | River5   | Winter | 2.2
Landside    | River5   | Spring | 2.1
Landside    | River5   | Summer | 2.0
Landside    | River5   | Fall   | 1.9
                            
                            

Which properties can be considered as factors in this dataset?

What are the limitations of this dataset?

Seminar Materials

In multi-way ANOVA, we examine "cells," which represent combinations of factor levels. For the factors Region Type, River, and Season, a cell like Urban-River1-Winter would indicate the chemical concentration measured in River1 during winter in an urban area.

To consider all three factors in this cell, we need multiple measurements.

If this is not the case, we may need to aggregate data, i.e., neglect one or more factors.
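
Whether a factor combination has replicates can be checked by counting measurements per cell. The sketch below rebuilds only the factor columns of the seminar table (concentrations are not needed for counting); `cell_counts` is an illustrative helper, and the check says nothing about which model is appropriate.

```python
from collections import Counter

seasons = ["Winter", "Spring", "Summer", "Fall"]
river_region = {
    "River1": "Urban", "River2": "Urban",
    "River3": "Landside", "River4": "Landside", "River5": "Landside",
}

# One row per measurement: (region type, river, season)
rows = [(region, river, season)
        for river, region in river_region.items()
        for season in seasons]

def cell_counts(*columns):
    """Count measurements per cell for the chosen columns
    (0 = region type, 1 = river, 2 = season)."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

for name, cols in [("Region x River x Season", (0, 1, 2)),
                   ("Region x Season", (0, 2)),
                   ("Region", (0,)),
                   ("Season", (2,))]:
    counts = cell_counts(*cols)
    print(f"{name}: {len(counts)} cells, min. {min(counts.values())} per cell")
```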

                            
                            
                            

Which factors can form a valid cell in this dataset?

All three, two, or only one of the factors?
