04. Similarity Analysis 2
When working with data, it is important to know the distribution of the data.
Knowing the distribution, we can describe and apply models to the data.
E.g., knowing the data is normally distributed, we can apply t-tests to compare
means.
Let's assume we have a dataset x with the following values:
x: [36.9, 38.0, 36.1, 34.1, 39.8,
35.0, 35.9, 38.7, 38.4, 35.1,
37.0, 36.7, 32.4, 38.6, 34.3,
36.9, 35.5, 36.0, 37.5, 37.0,
38.1, 36.7, 37.8, 38.3, 38.8]
What is the distribution of the data? (challenging)
Is the data normally distributed? (most common)
To determine if our dataset x follows a specific distribution, we can use
statistical tests.
The Chi-square test helps us test if observed data fit an expected
distribution.
Chi-square Test Formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies, and \( E_i \) are expected frequencies.
Let's apply the Chi-square test to see if x is normally distributed.
Step 1: Determine observed frequencies (e.g., counts within each range in x).
Step 2: Calculate expected frequencies based on a normal distribution.
Step 3: Compute \( \chi^2 \) and compare with the critical value for our confidence level.
Observed frequencies represent the actual counts of data points within specific ranges (bins).
For our dataset, we will divide the values into intervals and count how many data points fall into each interval.
Example: Divide into intervals such as 32-34, 34-36, 36-38, 38-40.
Observed Frequencies:
[1, 6, 10, 8] (representing counts in each interval)
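The counting step can be sketched in a few lines of Python. This is a minimal illustration, assuming each interval includes its lower edge and excludes its upper edge (the text does not state how boundary values such as 36.0 are assigned); the helper name `observed_frequencies` is made up for this example.

```python
# Dataset x from the text (25 values).
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]

# Bin edges for the intervals 32-34, 34-36, 36-38, 38-40.
edges = [32, 34, 36, 38, 40]


def observed_frequencies(values, edges):
    """Count values per bin; assumption: each bin is [lower, upper)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts


observed = observed_frequencies(x, edges)
print(observed)  # [1, 6, 10, 8]
```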
Expected frequencies represent the counts we would expect in each interval if the data followed a specified distribution (e.g., Normal).
To calculate expected frequencies, assume a normal distribution with mean = 36.8 and standard deviation = 1.7 (both estimated from x).
Use the normal distribution to estimate the proportion of values that should fall in each interval (CDF).
For our dataset x with mean = 36.8 and standard deviation = 1.7 (the values below were computed with the unrounded estimates, mean ≈ 36.78 and standard deviation ≈ 1.73):
Interval: 32-34
CDF_34 = CDF((34-36.8)/1.7) = 0.054
CDF_32 = CDF((32-36.8)/1.7) = 0.003
Exp_Proportion = CDF_34 - CDF_32 = 0.051
Exp_Frequency = Exp_Proportion * 25 = 1.3
Intervals | Exp. Proportion | Exp. Frequency
32-34 | ~5.1% | ~1.3
34-36 | ~27.0% | ~6.8
36-38 | ~43.2% | ~10.8
38-40 | ~21.1% | ~5.3
These values represent our expected frequencies for each interval.
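The expected frequencies can be reproduced approximately with Python's standard library. The sketch below assumes the normal model uses the sample mean and the sample standard deviation (n - 1 in the denominator) of x; small differences from the rounded table values are to be expected.

```python
from statistics import NormalDist, mean, stdev

# Dataset x from the text (25 values).
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]
edges = [32, 34, 36, 38, 40]

# Assumption: parameters are estimated from x (sample mean, sample std).
model = NormalDist(mu=mean(x), sigma=stdev(x))

# Proportion of the normal distribution inside each interval: CDF(hi) - CDF(lo).
proportions = [model.cdf(hi) - model.cdf(lo) for lo, hi in zip(edges, edges[1:])]
# Expected frequency = proportion * number of data points.
expected = [p * len(x) for p in proportions]

for lo, hi, p, e in zip(edges, edges[1:], proportions, expected):
    print(f"{lo}-{hi}: proportion = {p:.3f}, expected frequency = {e:.1f}")
```

Note that the four proportions do not sum to 1, because the tails below 32 and above 40 are not covered by any interval.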
Are the observed frequencies similar to the expected ones?
The Chi-square statistic measures how well the observed frequencies match the expected frequencies.
Use the formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \( O_i \) are observed frequencies and \( E_i \) are expected frequencies for each interval.
A higher Chi-square value suggests a larger deviation from the expected distribution.
For each interval, calculate
\((O_i - E_i)^2 / E_i\):
Intervals | Observed | Expected | (O_i - E_i)^2 / E_i
32-34 | 1 | ~1.3 | 0.07
34-36 | 6 | ~6.8 | 0.09
36-38 | 10 | ~10.8 | 0.06
38-40 | 8 | ~5.3 | 1.38
Sum these values to get the Chi-square statistic: \( \chi^2 = 1.60 \), \( df = 3 \), \( p = 0.66 \).
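As a quick check of the arithmetic, the statistic can be computed from the rounded observed and expected frequencies in the table. The p-value helper `chi2_sf_df3` is a hypothetical name for this sketch; it uses a closed-form expression that holds only for 3 degrees of freedom, so it is not a general chi-square routine.

```python
import math

# Observed and (rounded) expected frequencies from the table above.
observed = [1, 6, 10, 8]
expected = [1.3, 6.8, 10.8, 5.3]

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i over all intervals.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)


def chi2_sf_df3(value):
    """Upper-tail probability of a chi-square variable with df = 3.

    Closed form valid only for 3 degrees of freedom.
    """
    return (math.erfc(math.sqrt(value / 2))
            + math.sqrt(2 * value / math.pi) * math.exp(-value / 2))


p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2f}")
```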
Is this Chi-square value high or low? What does it suggest about our data?
The Kolmogorov-Smirnov (K-S) test is another way to test if a dataset follows a specified distribution.
Unlike the Chi-square test, the K-S test compares the cumulative distribution function (CDF) of the data with the CDF of the specified distribution.
The test statistic, \( D \), is the maximum difference between the empirical CDF of the data and the CDF of the expected distribution.
The K-S test statistic is calculated as: \[ D = \max_x | F_{observed}(x) - F_{expected}(x) | \] where \( F_{observed}(x) \) is the empirical CDF (ECDF) of the data, and \( F_{expected}(x) \) is the CDF of the specified distribution.
The empirical CDF is defined as: \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( I(x_i \leq x) \) is the indicator function, i.e. 1 if \( x_i \leq x \) and 0 otherwise.
The ECDF represents the proportion of data points that are less than or equal to each value in the dataset.
For each value \( x_i \) in the dataset, the ECDF is calculated as: \[ F_{observed}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x) \] where \( n \) is the total number of data points, and \( I(x_i \leq x) \) is 1 if \( x_i \leq x \), and 0 otherwise.
For our dataset:
x: [32.4, 34.1, 34.3, 35.0, 35.1,
35.5, 35.9, 36.0, 36.1, 36.7,
36.7, 36.9, 36.9, 37.0, 37.0,
37.5, 37.8, 38.0, 38.1, 38.3,
38.4, 38.6, 38.7, 38.8, 39.8]
Example ECDF Calculation:
Value (x) | Proportion <= x | ECDF
32.4 | 1/25 | 0.04
34.1 | 2/25 | 0.08
34.3 | 3/25 | 0.12
... | ... | ...
38.8 | 24/25 | 0.96
39.8 | 25/25 | 1.00
The ECDF provides the cumulative probability at each data point.
The ECDF is a step function that increases by \( \frac{1}{n} \) at each data point.
The ECDF is a non-parametric way to visualize the distribution of the data.
It can be used to compare the distribution of two datasets or to check if a dataset follows a specific distribution.
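The ECDF definition translates almost literally into code. The function name `ecdf` is chosen for this sketch only.

```python
# Dataset x from the text (25 values).
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]


def ecdf(values, t):
    """Proportion of data points that are less than or equal to t."""
    return sum(1 for v in values if v <= t) / len(values)


for t in (32.4, 34.1, 34.3, 38.8, 39.8):
    print(f"ECDF({t}) = {ecdf(x, t):.2f}")
```

Values that occur twice in the dataset (such as 36.7) make the ECDF jump by 2/n at that point.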
D = max(|F_obs(x) - F_exp(x)|)
>> D = 0.121
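A possible way to compute D for our dataset is sketched below. It assumes the expected distribution is a normal distribution with the sample mean and sample standard deviation of x. Because the ECDF is a step function, the largest gap can occur just before or just after a step, so both sides are checked at every data point.

```python
from statistics import NormalDist, mean, stdev

# Dataset x from the text (25 values).
x = [36.9, 38.0, 36.1, 34.1, 39.8,
     35.0, 35.9, 38.7, 38.4, 35.1,
     37.0, 36.7, 32.4, 38.6, 34.3,
     36.9, 35.5, 36.0, 37.5, 37.0,
     38.1, 36.7, 37.8, 38.3, 38.8]


def ks_statistic(values, cdf):
    """Maximum absolute difference between the ECDF and a given CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, v in enumerate(xs):
        f = cdf(v)
        # ECDF just after the step at v is (i + 1) / n, just before it is i / n.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


# Assumption: parameters are estimated from x (sample mean, sample std).
model = NormalDist(mu=mean(x), sigma=stdev(x))
D = ks_statistic(x, model.cdf)
print(f"D = {D:.3f}")
```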
The \( p \)-value helps us determine if the observed differences between the ECDF of our data and the expected distribution are statistically significant.
Using the Kolmogorov-Smirnov test statistic \( D \) and the sample size \( n \), we can calculate the \( p \)-value.
Formula to approximate \( p \)-value: \[ p \approx Q_{KS} \left( \sqrt{n} \cdot D \right) \] where \( Q_{KS} \) is a function related to the distribution of \( D \) under the null hypothesis.
\[ Q_{KS}(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \lambda^2} \] Here, \( \lambda = \sqrt{n} \cdot D \).
If \( p < 0.05 \), we reject the null hypothesis, suggesting our data does not follow the expected distribution.
Example Calculation:
D = 0.121, n = 25
sqrt(n) * D = 0.121 * 5 = 0.605
p_value = 0.73
Since \( p = 0.73 > 0.05 \), we do not reject the null hypothesis: the data are consistent with the expected distribution.
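The series for \( Q_{KS} \) can be evaluated directly by truncating the infinite sum. This is only a sketch of the asymptotic approximation; for a small sample such as n = 25 its result need not coincide with a p-value obtained by other means (for example an exact small-sample computation in statistical software). The function name `q_ks` and the truncation at 100 terms are choices made for this illustration.

```python
import math


def q_ks(lam, terms=100):
    """Truncated series 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    # Clamp to [0, 1]; the truncated series behaves poorly for very small lam.
    return min(1.0, max(0.0, 2 * total))


n, D = 25, 0.121
lam = math.sqrt(n) * D
p_value = q_ks(lam)

print(f"lambda = {lam:.3f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```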
We used the t-test to compare means of two groups.
ANOVA (Analysis of Variance) extends this to three or more groups.
Instead of just comparing means, ANOVA uses variances to uncover group differences.
This gives insight into whether observed differences are meaningful or due to random variation.
Group separability in ANOVA depends on two types of variance:
1. Between-Group Variance: High values indicate distinct group means.
2. Within-Group Variance: Low values indicate tight clustering around group means.
Definition:
Sum of Squares = \( \sum (x_i - \bar{x})^2 \)
In ANOVA, Total Sum of Squares (T) is divided into two components:
1. Between-Group (B): Differences between group means. High values indicate distinct groups.
2. Within-Group (W): Differences within each group around its mean. Low values indicate tightly clustered data within groups.
In ANOVA, we analyze how much of T can be attributed to B, which helps determine if groups are meaningfully distinct.
To do so, we calculate variances \(\bar{B}\) and \(\bar{W}\) from \(B\) and \(W\) and compare them.
To get \( \bar{B} \) and \( \bar{W} \), we divide the sum of squares by the degrees of freedom.
The comparison is done using the F-statistic, which is the ratio of \(\bar{B}\) to \(\bar{W}\).
Total Sum of Squares \(T\) represents the overall spread of all data points from the overall mean \(\bar{X}\).
Formula: \[ T = \sum_{i=1}^{N} (x_i - \bar{X})^2 \] where \( N \) is the total number of data points.
Steps to Calculate:
Calculate the overall mean of all data points across all groups.
For each data point, find the squared distance from the overall mean.
Sum these squared distances.
Within-Group Sum of Squares \(W\) measures how data points vary within each group around their group mean \(\bar{X}_j\).
Formula: \[ W = \sum_{j=1}^{G} \sum_{i=1}^{n_j} (x_{ij} - \bar{X}_j)^2 \] where \( G \) is the number of groups and \( n_j \) is the size of group \( j \).
Steps to Calculate:
Calculate the mean for each group.
For each data point, find the squared distance from its group mean.
Sum the squared distances within each group.
Sum these values across all groups.
Between-Group Sum of Squares \(B\) measures how each group's mean \(\bar{X}_j\) differs from the overall mean \(\bar{X}\).
Formula: \[ B = \sum_{j=1}^{G} n_j (\bar{X}_j - \bar{X})^2 \] where \( G \) is the number of groups and \( n_j \) is the size of each group.
Steps to Calculate:
Calculate the overall mean of all data points.
For each group, find the squared distance between the group mean and the overall mean.
Multiply each squared distance by the number of data points in that group.
Sum these values across all groups.
Degrees of Freedom (df) are used to calculate variances and test statistics.
In ANOVA, we have three types of degrees of freedom:
1. Total (Tdf): \( N - 1 \) where \( N \) is the total number of data points.
2. Within-Group (Wdf): \( N - G \) where \( G \) is the number of groups.
3. Between-Group (Bdf): \( G - 1 \) where \( G \) is the number of groups.
Let's assume we investigate degradation rates using three different pH levels.
Here are the measured rates for each group:
A: [8, 9, 7, 10]
B: [15, 14, 16, 15, 14]
C: [23, 21, 24, 22, 23, 22]
Degrees of Freedom:
Tdf = N - 1 = 14
Wdf = N - G = 12
Bdf = G - 1 = 2
Calculated Sum of Squares:
T = 498.4
W = 13.3
B = 485.1
Calculated Variances:
T_mean = T / Tdf = 35.60
W_mean = W / Wdf = 1.11
B_mean = B / Bdf = 242.55
F-Statistic:
F = B_mean / W_mean = 218.84
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Between (B) | 485.1 | 2 | 242.55 | 218.84 | < 0.001
Within (W) | 13.3 | 12 | 1.11 | |
---------------------------------------------------------------------------------------------------------
Total (T) | 498.4 | 14 | 35.60
ANOVA results are commonly presented in a table like this one.
The Within (W) row is often called the Error term, representing variance not explained by the grouping.
Interpretation:
The high F-statistic and low p-value suggest that the group means are significantly different. I.e., the pH level has a significant effect on the degradation rate.
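The one-way ANOVA quantities for the pH example can be recomputed with a short script. This is a plain-Python sketch that follows the formulas for T, W, and B given earlier; it does not compute the p-value, which would require the F-distribution (not available in the standard library).

```python
from statistics import mean

# Degradation rates for the three pH groups from the text.
groups = {
    "A": [8, 9, 7, 10],
    "B": [15, 14, 16, 15, 14],
    "C": [23, 21, 24, 22, 23, 22],
}

all_values = [v for g in groups.values() for v in g]
N, G = len(all_values), len(groups)
grand_mean = mean(all_values)

# Total, within-group, and between-group sums of squares.
T = sum((v - grand_mean) ** 2 for v in all_values)
W = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
B = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Mean squares (sum of squares divided by degrees of freedom) and F.
B_mean = B / (G - 1)
W_mean = W / (N - G)
F = B_mean / W_mean

print(f"T = {T:.1f}, W = {W:.1f}, B = {B:.1f}")
print(f"B_mean = {B_mean:.2f}, W_mean = {W_mean:.2f}, F = {F:.2f}")
```

The identity T = W + B is a convenient sanity check for such a calculation.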
In a One-Way ANOVA, we analyze the effect of a single factor (variable) on
the data.
However, when we want to study the effects of two or more factors, we need to use a
Two-Way ANOVA (or higher).
For example, if we have two factors, like Sampling Location and Water Treatment Type, we can investigate:
The effect of each factor independently (main effects)
The interaction effect between factors (combined effect)
Example Data with Two Factors:
Factor 1 (Sampling Location):
>> River, Lake, Groundwater
Factor 2 (Water Treatment):
>> Untreated, Treated with Filter, Treated with UV
Example Measurements
(e.g., pollutant concentration in mg/L):
Location | Treatment | Concentration (mg/L)
-----------------------------------------------
River | Untreated | "2.5, 2.8, 3.1"
River | Filter | "1.8, 2.0, 2.2"
River | UV | "1.4, 1.6, 1.8"
Lake | Untreated | "3.0, 3.2, 3.4"
Lake | Filter | "2.1, 2.3, 2.5"
Lake | UV | "1.9, 2.0, 2.1"
Groundwater | Untreated | "1.5, 1.7, 1.8"
Groundwater | Filter | "1.1, 1.2, 1.3"
Groundwater | UV | "0.9, 1.0, 1.1"
How does multi-way ANOVA work?
In Multi-Way ANOVA, we analyze multiple factors, which requires accounting for:
Main Effects: The impact of each factor alone (e.g., effect of Sampling Location or Treatment Type independently).
Consider Factors one by one:
Location | Treatment | Concentration (mg/L)
-----------------------------------------------
River | | "2.5, 2.8, 3.1"
River | | "1.8, 2.0, 2.2"
River | | "1.4, 1.6, 1.8"
Lake | | "3.0, 3.2, 3.4"
Lake | | "2.1, 2.3, 2.5"
Lake | | "1.9, 2.0, 2.1"
Groundwater | | "1.5, 1.7, 1.8"
Groundwater | | "1.1, 1.2, 1.3"
Groundwater | | "0.9, 1.0, 1.1"
Location| Concentration (mg/L)
-----------------------------------------------
River | 2.5, 2.8, 3.1, 1.8, 2.0, 2.2, 1.4, 1.6, 1.8
Lake | 3.0, 3.2, 3.4, 2.1, 2.3, 2.5, 1.9, 2.0, 2.1
Groundwater | 1.5, 1.7, 1.8, 1.1, 1.2, 1.3, 0.9, 1.0, 1.1
Consider Factors one by one:
Location | Treatment | Concentration (mg/L)
-----------------------------------------------
. | Untreated | "2.5, 2.8, 3.1"
. | Filter | "1.8, 2.0, 2.2"
. | UV | "1.4, 1.6, 1.8"
. | Untreated | "3.0, 3.2, 3.4"
. | Filter | "2.1, 2.3, 2.5"
. | UV | "1.9, 2.0, 2.1"
. | Untreated | "1.5, 1.7, 1.8"
. | Filter | "1.1, 1.2, 1.3"
. | UV | "0.9, 1.0, 1.1"
Treatment| Concentration (mg/L)
-----------------------------------------------
Untr.| 2.5, 2.8, 3.1, 3.0, 3.2, 3.4, 1.5, 1.7, 1.8
Filt.| 1.8, 2.0, 2.2, 2.1, 2.3, 2.5, 1.1, 1.2, 1.3
UV | 1.4, 1.6, 1.8, 1.9, 2.0, 2.1, 0.9, 1.0, 1.1
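Pooling the measurements by one factor at a time, as in the tables above, can be sketched as follows. The helper `main_effect_ss` is a hypothetical name; it pools all observations that share a level of the chosen factor and then applies the between-group sum of squares formula from the one-way case.

```python
from statistics import mean

# Measurements from the text, keyed by (location, treatment).
data = {
    ("River", "Untreated"): [2.5, 2.8, 3.1],
    ("River", "Filter"): [1.8, 2.0, 2.2],
    ("River", "UV"): [1.4, 1.6, 1.8],
    ("Lake", "Untreated"): [3.0, 3.2, 3.4],
    ("Lake", "Filter"): [2.1, 2.3, 2.5],
    ("Lake", "UV"): [1.9, 2.0, 2.1],
    ("Groundwater", "Untreated"): [1.5, 1.7, 1.8],
    ("Groundwater", "Filter"): [1.1, 1.2, 1.3],
    ("Groundwater", "UV"): [0.9, 1.0, 1.1],
}


def main_effect_ss(data, factor_index):
    """Pool observations by one factor and return (pooled groups, SS)."""
    pooled = {}
    for key, values in data.items():
        pooled.setdefault(key[factor_index], []).extend(values)
    grand = mean([v for values in data.values() for v in values])
    ss = sum(len(vals) * (mean(vals) - grand) ** 2 for vals in pooled.values())
    return pooled, ss


by_location, ss_location = main_effect_ss(data, 0)
by_treatment, ss_treatment = main_effect_ss(data, 1)

print(f"SS Location  = {ss_location:.3f}")
print(f"SS Treatment = {ss_treatment:.3f}")
```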
Beyond the main effects, Multi-Way ANOVA also requires accounting for:
Degrees of Freedom: Each factor and interaction adds df:
df_Factor A = Levels of Factor A - 1
df_Factor B = Levels of Factor B - 1
Example Degrees of Freedom Calculation:
Factors: Sampling Location, Treatment Type
df_Factor A = 3 - 1 = 2
df_Factor B = 3 - 1 = 2
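df_Factors = df_Factor A + df_Factor B = 4
df_Total = Total observations - 1 = 27 - 1 = 26
df_Error = df_Total - df of all effects in the model (26 - 4 = 22 with main effects only; the table below lists 18 because it already accounts for the interaction, df = 4, introduced later)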
Error Term: Variance not explained by main or interaction effects, calculated as Within-Group Sum of Squares (W).
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Location | 6.943 | 2 | 3.47148 | 103 | < 0.001
Treatment | 4.9696 | 2 | 2.4848 | 73.7 | < 0.001
---------------------------------------------------------------------------------------------------------
Error | 0.6067 | 18 | 0.0337
Total | 12.8319 | 26 | 0.4935
Which factors have a significant effect on the data?
In ANOVA, we often encounter relationships between different components:
1. Relationship between Total, Within-Group, and Between-Group Sum of Squares:
\[ T = W + B + I\] where \( I \) is the interaction term.
2. Relationship between Degrees of Freedom:
\[ df_{T} = df_{W} + df_{B} + df_{I} \] where \( df_I \) is the interaction degrees of freedom.
We can use these relationships to check the consistency of our ANOVA results.
An interaction in ANOVA occurs when the effect of one factor depends on the level of another factor.
Interactions reveal combined effects that are not captured by examining each factor alone.
In an interaction, the influence of one factor on the response variable changes based on the level of the other factor.
Example: Plant growth may depend on both Soil Type and Fertilizer Type.
Without an interaction: Fertilizer Type has the same effect on all Soil Types.
With an interaction: The effect of Fertilizer Type changes depending on Soil Type.
For example, Organic fertilizer may work better in Sandy soil, while Synthetic fertilizer is
more effective in Clay soil.
Interactions highlight complex relationships and help identify when factors are dependent on one another.
1. No Significant Interaction, Significant Main Effects:
Each factor (e.g., Soil Type, Fertilizer Type) has a significant, independent effect.
No interaction means each factor’s effect is consistent across levels of the other.
Interpretation: Each factor impacts the outcome independently.
2. Significant Interaction, Significant Main Effects:
Both factors affect the outcome, with an interaction.
The effect of one factor depends on the level of the other.
Interpretation: Focus on interaction, as combined effects are more
informative.
3. Significant Interaction, No Significant Main Effects:
Neither factor alone affects the outcome, but their interaction does.
Interpretation: Impact emerges only with both factors combined.
The Interaction Sum of Squares \( I \) shows how much the effect of one factor depends on the level of another.
Steps to Calculate:
Calculate the overall mean \( \bar{X} \) across all observations.
Find main effect means for each factor (\( \bar{X}_{A_i} \) and \( \bar{X}_{B_j} \)).
Calculate the expected mean for each cell, assuming no interaction:
\[ \text{Expected}_{A_iB_j} = \bar{X} + (\bar{X}_{A_i} - \bar{X}) + (\bar{X}_{B_j} - \bar{X}) \]
Find the observed mean for each cell \( \text{Observed Mean}_{A_iB_j} \).
Compute \( I \) by summing squared differences between observed and expected means, multiplied by the number of observations in each cell:
\[ I = \sum_{i,j} n_{A_iB_j} \left( \text{Obs. Mean}_{A_iB_j} - \text{Expected}_{A_iB_j} \right)^2 \]
Suppose we have two factors, Soil Type (Clay, Sand) and Fertilizer Type (Organic, Synthetic), with observed plant growth:
Soil Type | Fertilizer Type | Obs. Mean
----------------------------------------
Clay | Organic | 20
Clay | Synthetic | 25
Sand | Organic | 15
Sand | Synthetic | 30
Steps:
Calculate the overall mean \( \bar{X} \) across all observations.
Total_mean = (20 + 25 + 15 + 30) / 4 = 22.5
Calculate the main effect means:
Clay_mean = (20 + 25) / 2 = 22.5
Sand_mean = (15 + 30) / 2 = 22.5
Organic_mean = (20 + 15) / 2 = 17.5
Synthetic_mean = (25 + 30) / 2 = 27.5
Calculate expected means (assuming no interaction):
Exp_Clay_Organic = 22.5 + (22.5-22.5) + (17.5-22.5) = 17.5
Exp_Clay_Synthetic = 22.5 + (22.5-22.5) + (27.5-22.5) = 27.5
Exp_Sand_Organic = 22.5 + (22.5-22.5) + (17.5-22.5) = 17.5
Exp_Sand_Synthetic = 22.5 + (22.5-22.5) + (27.5-22.5) = 27.5
Calculate \( I \) (taking one observation per cell, n = 1):
I = (20-17.5)^2 + (25-27.5)^2
+ (15-17.5)^2 + (30-27.5)^2 = 25
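The steps for the interaction sum of squares can be retraced in code. The sketch assumes one observation per cell (the text only gives cell means) and a balanced design, so that the overall mean is the mean of the four cell means; the variable names are chosen for this example.

```python
from statistics import mean

# Observed cell means from the text, keyed by (soil, fertilizer).
cell_means = {
    ("Clay", "Organic"): 20,
    ("Clay", "Synthetic"): 25,
    ("Sand", "Organic"): 15,
    ("Sand", "Synthetic"): 30,
}
n_per_cell = 1  # assumption: the text does not state the cell size

grand = mean(list(cell_means.values()))

# Main effect means: average over the cells that share a factor level.
soil_mean = {s: mean([m for (soil, _), m in cell_means.items() if soil == s])
             for s in ("Clay", "Sand")}
fert_mean = {f: mean([m for (_, fert), m in cell_means.items() if fert == f])
             for f in ("Organic", "Synthetic")}

# Expected cell mean without interaction: grand + soil effect + fertilizer effect.
expected = {(s, f): grand + (soil_mean[s] - grand) + (fert_mean[f] - grand)
            for (s, f) in cell_means}

# Interaction sum of squares: n * (observed - expected)^2 summed over cells.
interaction_ss = sum(n_per_cell * (cell_means[c] - expected[c]) ** 2
                     for c in cell_means)

print(expected)
print(interaction_ss)
```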
The degrees of freedom for the interaction term measure how many independent comparisons can be made for interactions between factor levels.
Formula for \( dfI \):
\[ dfI = (a - 1)(b - 1) \] where: \( a \) is the number of levels for Factor A, and \( b \) is the number of levels for Factor B.
Example For 2 Soil Types and 2 Fertilizer Types: \[ dfI = (2 - 1)(2 - 1) = 1 \]
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Location | 6.943 | 2 | 3.47148 | 103 | < 0.001
Treatment | 4.9696 | 2 | 2.4848 | 73.7 | < 0.001
Loc * Treat | 0.3126 | 4 | 0.07815 | 2.32 | 0.096
---------------------------------------------------------------------------------------------------------
Error | 0.6067 | 18 | 0.0337
Total | 12.8319 | 26 | 0.4935
What does this ANOVA table tell us about the factors?
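The entries of this two-way table can be recomputed from the raw measurements. The following sketch assumes a balanced design (three replicates in every cell) and uses the expected-cell-mean formula for the interaction; p-values are omitted because they require the F-distribution.

```python
from statistics import mean

# Measurements from the text, keyed by (location, treatment).
data = {
    ("River", "Untreated"): [2.5, 2.8, 3.1],
    ("River", "Filter"): [1.8, 2.0, 2.2],
    ("River", "UV"): [1.4, 1.6, 1.8],
    ("Lake", "Untreated"): [3.0, 3.2, 3.4],
    ("Lake", "Filter"): [2.1, 2.3, 2.5],
    ("Lake", "UV"): [1.9, 2.0, 2.1],
    ("Groundwater", "Untreated"): [1.5, 1.7, 1.8],
    ("Groundwater", "Filter"): [1.1, 1.2, 1.3],
    ("Groundwater", "UV"): [0.9, 1.0, 1.1],
}

values = [v for cell in data.values() for v in cell]
grand = mean(values)


def level_values(index, level):
    """Pool all observations that share one level of one factor."""
    return [v for key, cell in data.items() if key[index] == level for v in cell]


def factor_ss(index):
    """Between-level sum of squares for the factor at position `index`."""
    levels = {key[index] for key in data}
    return sum(len(level_values(index, lv))
               * (mean(level_values(index, lv)) - grand) ** 2
               for lv in levels)


T = sum((v - grand) ** 2 for v in values)
W = sum((v - mean(cell)) ** 2 for cell in data.values() for v in cell)
ss_loc, ss_trt = factor_ss(0), factor_ss(1)

# Interaction: observed cell mean versus the mean expected without interaction.
ss_int = 0.0
for (loc, trt), cell in data.items():
    expected = (grand + (mean(level_values(0, loc)) - grand)
                + (mean(level_values(1, trt)) - grand))
    ss_int += len(cell) * (mean(cell) - expected) ** 2

# Degrees of freedom and F-statistics.
df_loc = len({k[0] for k in data}) - 1
df_trt = len({k[1] for k in data}) - 1
df_int = df_loc * df_trt
df_err = len(values) - len(data)  # observations minus number of cells
ms_err = W / df_err

F_loc = (ss_loc / df_loc) / ms_err
F_trt = (ss_trt / df_trt) / ms_err
F_int = (ss_int / df_int) / ms_err

print(f"Location:    SS = {ss_loc:.4f}, df = {df_loc}, F = {F_loc:.1f}")
print(f"Treatment:   SS = {ss_trt:.4f}, df = {df_trt}, F = {F_trt:.1f}")
print(f"Interaction: SS = {ss_int:.4f}, df = {df_int}, F = {F_int:.2f}")
print(f"Error:       SS = {W:.4f}, df = {df_err}")
print(f"Total:       SS = {T:.4f}, df = {len(values) - 1}")
```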
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-value
-----------------------------------------------------------------------------------------------------------
Factor A | 10.123 | 2 | 5.0615 | 45.6 | < 0.001
Factor B | 8.456 | 3 | 2.8187 | 25.4 | < 0.001
Factor C | 5.789 | 2 | 2.8945 | 22.3 | < 0.001
A * B | 2.567 | 6 | 0.4278 | 3.4 | 0.005
A * C | 1.987 | 4 | 0.4968 | 3.9 | 0.003
B * C | 2.234 | 6 | 0.3723 | 3.0 | 0.010
A * B * C | 0.876 | 12 | 0.0730 | 1.8 | 0.081
-----------------------------------------------------------------------------------------------------------
Error | 6.432 | 80 | 0.0804
Total | 38.464 | 115 | 0.3345
What insights does this ANOVA table provide about the three factors and their interactions?
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-value
---------------------------------------------------------------------------------------------------------
Between (B) | 0.0 | 2 | 0.0 | 0.0 | 1.000
Within (W) | 13.3 | 12 | 1.11 | |
---------------------------------------------------------------------------------------------------------
Total (T) | 13.3 | 14 | 0.95
This example starts with three groups that have the same mean, so B = 0 and the ANOVA shows no significant difference.
Adding an offset to one group’s mean (via the sliders in the interactive version) increases B and F, and the ANOVA table updates to show significance.
If p < 0.05, how to locate the group(s) with significant differences?
ANOVA shows that at least one group differs, but it doesn’t specify which ones.
Post-hoc analysis identifies specific group differences after a significant ANOVA result.
Tukey’s HSD* is our primary method for comparing all pairs of groups while
controlling for error rates.
*HSD: Honestly Significant Difference
Purpose of Tukey’s HSD:
Detect where specific group differences lie.
Control for increased Type I error* in multiple comparisons.
When to Use: After a significant ANOVA result.
*Type I error: Incorrectly rejecting a true null hypothesis (false positive).
Purpose: To identify specific group differences by comparing all pairs of means.
Tukey’s HSD calculates a critical value for each pairwise difference to determine significance.
Formula: \[ HSD = q(g,N-g) \cdot \sqrt{\frac{\bar{W}}{n}} \]
where \( q(g,N-g) \) is the critical value based on the number of groups \( g \) and total observations \( N \), \( \bar{W} \) is the mean square error, and \( n \) is the number of observations per group.
How it Works:
Calculates pairwise differences between group means.
Compares each difference to the HSD value; if it exceeds this, the pair is significantly different.
Benefit: Controls Type I error across multiple comparisons.
The critical value \( q(g,N-g) \) is based on the Studentized range distribution (similar to the t-distribution). Here is an example of such a table: Studentized Range Distribution Table
Suppose we measured pollutant concentrations in three types of water sources:
Sources: River, Lake, Groundwater
Objective: Use Tukey's HSD to find which sources have significantly different pollutant levels after a significant ANOVA result.
Mean Pollutant Concentrations (mg/L):
River: 3.2
Lake: 5.1
Groundwater: 4.0
Calculated HSD: 1.5
Comparisons:
River vs. Lake
↳ Diff. = 1.9 > HSD → Significant
River vs. Groundwater
↳ Diff. = 0.8 < HSD → Not Significant
Lake vs. Groundwater
↳ Diff. = 1.1 < HSD → Not Significant
Interpretation: Pollutant levels are significantly higher in lakes compared to rivers.
Each comparison tells us if a specific pair of groups has a significant difference.
Significant Difference: Difference exceeds HSD critical value.
No Significant Difference: Difference is less than HSD critical value.
In our example, only the comparison between River and Lake shows a significant difference.
Tukey's HSD allows us to focus on the most impactful differences and interpret how pollutant levels vary by source.
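The pairwise comparison against the HSD value can be written compactly. In this sketch the HSD of 1.5 is taken from the example as given; the helper `tukey_hsd` only restates the formula and expects the critical value q to be looked up in a Studentized range table, since the standard library cannot compute it. Both names are illustrative.

```python
import math
from itertools import combinations


def tukey_hsd(q_crit, ms_within, n_per_group):
    """HSD = q * sqrt(MS_within / n); q_crit must come from a table."""
    return q_crit * math.sqrt(ms_within / n_per_group)


# Group means (mg/L) and the HSD value stated in the example.
means = {"River": 3.2, "Lake": 5.1, "Groundwater": 4.0}
hsd = 1.5

# Compare every pair of group means against the HSD threshold.
comparisons = {}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    comparisons[(a, b)] = diff > hsd
    verdict = "Significant" if diff > hsd else "Not Significant"
    print(f"{a} vs. {b}: diff = {diff:.1f} -> {verdict}")

significant = [pair for pair, flag in comparisons.items() if flag]
```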
Problem with Multiple t-Tests: Each t-test has a 5% false-positive rate. When comparing multiple groups, this compounds, increasing the chance of finding a false difference.
Example with Three Groups: For groups A, B, and C, the three pairwise comparisons (A vs. B, A vs. C, B vs. C) lead to an overall false-positive rate of: \[ 1 - (1 - 0.05)^{m} = 1 - 0.95^{3} \approx 14.26\% \] where \( m = 3 \) is the number of comparisons, treated as independent. This means a 14.26% chance of at least one false positive across the tests.
Solution with ANOVA + Post-Hoc:
ANOVA controls Type I error by first checking for overall differences among groups.
If significant, post-hoc tests like Tukey’s HSD ensure accurate pairwise comparisons while
keeping the overall false-positive rate at 5%.
Takeaway: ANOVA + post-hoc tests provide reliable multi-group comparison without inflating errors.
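How quickly the false-positive rate of uncorrected pairwise t-tests grows with the number of groups can be illustrated numerically. The sketch treats the comparisons as independent, which is a simplification, and the function names are made up for this example.

```python
def n_pairwise(g):
    """Number of pairwise comparisons among g groups: g * (g - 1) / 2."""
    return g * (g - 1) // 2


def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


for g in (3, 4, 5, 10):
    m = n_pairwise(g)
    print(f"{g} groups -> {m} comparisons -> "
          f"{familywise_error_rate(m) * 100:.2f}% overall false-positive rate")
```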
The concentration of a critical environmental chemical was measured in five rivers across two distinct region types: urban and landside.
These regions represent contrasting land uses, likely influencing pollution levels in nearby water bodies.
Measurements were collected seasonally over one year to capture potential seasonal variations.
Goal: Determine which factors significantly impact chemical concentration levels.
Region Type | River | Season | Diclofenac (µg/L)
---------------------------------------------------
Urban | River1 | Winter | 2.5
Urban | River1 | Spring | 2.1
Urban | River1 | Summer | 3.0
Urban | River1 | Fall | 2.8
Urban | River2 | Winter | 2.7
Urban | River2 | Spring | 2.0
Urban | River2 | Summer | 3.1
Urban | River2 | Fall | 2.9
Landside | River3 | Winter | 1.5
Landside | River3 | Spring | 1.7
Landside | River3 | Summer | 1.8
Landside | River3 | Fall | 1.6
Landside | River4 | Winter | 1.6
Landside | River4 | Spring | 1.8
Landside | River4 | Summer | 1.9
Landside | River4 | Fall | 1.7
Landside | River5 | Winter | 2.2
Landside | River5 | Spring | 2.1
Landside | River5 | Summer | 2.0
Landside | River5 | Fall | 1.9
Which properties can be considered as factors in this dataset?
What are the limitations of this dataset?
In multi-way ANOVA, we examine "cells," which represent combinations of factor levels. For the factors Region Type, River, and Season, a cell like Urban-River1-Winter would indicate the chemical concentration measured in River1 during winter in an urban area.
To consider all three factors in this cell, we need multiple measurements.
If this is not the case, we may need to aggregate data, i.e., neglect one or more factors.
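One way to inspect the cells is to count how many measurements fall into each combination of factor levels. The sketch below rebuilds the table as a list of records and reports the smallest cell size for every subset of factors. It only counts cells that actually occur in the data, so it does not detect that each river belongs to exactly one region type; the helper name `min_cell_size` is chosen for this illustration.

```python
from collections import Counter
from itertools import combinations

seasons = ["Winter", "Spring", "Summer", "Fall"]

# River -> (region type, concentrations in season order), from the table.
rivers = {
    "River1": ("Urban", [2.5, 2.1, 3.0, 2.8]),
    "River2": ("Urban", [2.7, 2.0, 3.1, 2.9]),
    "River3": ("Landside", [1.5, 1.7, 1.8, 1.6]),
    "River4": ("Landside", [1.6, 1.8, 1.9, 1.7]),
    "River5": ("Landside", [2.2, 2.1, 2.0, 1.9]),
}

records = [
    {"Region Type": region, "River": river, "Season": season, "Diclofenac": conc}
    for river, (region, concs) in rivers.items()
    for season, conc in zip(seasons, concs)
]


def min_cell_size(records, factors):
    """Smallest number of measurements in any observed cell."""
    counts = Counter(tuple(r[f] for f in factors) for r in records)
    return min(counts.values())


factor_names = ["Region Type", "River", "Season"]
for k in (3, 2, 1):
    for factors in combinations(factor_names, k):
        size = min_cell_size(records, factors)
        print(f"{' x '.join(factors)}: smallest cell has {size} measurement(s)")
```

A smallest cell size of 1 means that the within-cell variance cannot be estimated for that combination of factors.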
Which factors can form a valid cell in this dataset?
All three, two, or only one of the factors?