Chemometrics and Statistics

01. Data Characterization 1

Basic Statistical Measures

by Gerrit Renner

Introduction to Data Characterization

You are asked to analyze diclofenac in a river.

Therefore, you collect five samples from a sampling location.

You obtained the following concentration data:

sample id	concentration [mg/L]
#1	0.5
#2	0.7
#3	0.6
#4	0.8
#5	0.4

>> What's the river's true diclofenac concentration?

Compounds in rivers are often not homogeneously distributed. Therefore, the concentration of diclofenac can vary significantly depending on the sampling location.

>> What's the sample's true diclofenac concentration?

The true value cannot be determined exactly from a finite number of samples.
However, we can estimate the true concentration using the arithmetic mean.

\[\begin{equation} \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \end{equation}\]

The Arithmetic Mean

Let's have a quick look at the equation:

\[\begin{equation} \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \end{equation}\]

Where \(n\) is the number of samples and \(x_i\) is the concentration of the \(i\)-th sample.

mean = (0.5 + 0.7 + 0.6 + 0.8 + 0.4) / 5

>> mean = 0.6
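As a minimal sketch, the same calculation can be written in Python. The variable names are chosen for illustration only; the values are the five concentrations from the table.

```python
# Concentrations of the five samples in mg/L (values from the table)
concentrations = [0.5, 0.7, 0.6, 0.8, 0.4]

n = len(concentrations)           # number of samples
mean = sum(concentrations) / n    # arithmetic mean

print(f"mean = {mean:.1f} mg/L")  # prints: mean = 0.6 mg/L
```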

But why is this only an estimate of the true value?

Let's do a theoretical experiment:

We repeat the sampling (n=5) and the analysis 200 times.

Let's increase n.

We repeat the sampling (n=50) and the analysis 200 times.

>> The arithmetic mean is an estimate of the true value.
>> The more samples you take, the more precise the estimate becomes.
>> I.e., the true value would be obtained if you could take an infinite number of samples (\(n \rightarrow \infty\)).

\[\mu = \frac{1}{N} \sum_{i=1}^{N} x_i\] where \(N\) is the number of all possible samples of a population, which is often infinite. In practice, we use a finite number of samples \(n\) and assume: \[\mu \approx \bar{x}\]
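The theoretical experiment can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the "true" concentration of 0.6 mg/L, a normally distributed scatter of 0.15 mg/L for single samples, and the fixed random seed. The point is only to show that the 200 means scatter less when each mean is based on more samples.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 0.6  # assumed true concentration in mg/L (hypothetical)
TRUE_SD = 0.15   # assumed scatter of single samples in mg/L (hypothetical)


def simulate_means(n, repeats=200):
    """Draw `repeats` sample sets of size n and return the mean of each set."""
    means = []
    for _ in range(repeats):
        samples = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
        means.append(statistics.mean(samples))
    return means


means_n5 = simulate_means(5)
means_n50 = simulate_means(50)

# The scatter of the 200 means shows how reliable a single mean is.
spread_n5 = statistics.stdev(means_n5)
spread_n50 = statistics.stdev(means_n50)

print(f"n = 5 : the means scatter with s = {spread_n5:.3f} mg/L")
print(f"n = 50: the means scatter with s = {spread_n50:.3f} mg/L")
```

With these assumptions, the means for n=50 should cluster noticeably closer to the assumed true value than the means for n=5.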

The Median

Besides the arithmetic mean, the median is another measure of central tendency.

Let's look at an example to understand the difference between the mean and the median.

How is the median defined, and how do we calculate it?

>> The median is the value that separates the higher half from the lower half of a data set, i.e., it is the middle value of the data set once it is sorted in ascending or descending order.

Let's calculate the median for the following data set:

data = [0.5, 0.6, 0.7, 0.8, 0.4]

First, we need to sort the data set:

data_sorted = sort(data)

>> data_sorted = [0.4, 0.5, 0.6, 0.7, 0.8]

Then, we can calculate the median by extracting the middle value:

median = 0.6

What if the data set has an even number of values?

>> When the data set has an even number of values, the median is the average of the two middle values.

Let's calculate the median for the following data set:

data = [0.5, 0.7, 0.9, 0.4]

First, we need to sort the data set:

data_sorted = sort(data)

>> data_sorted = [0.4, 0.5, 0.7, 0.9]

Then, we can calculate the median by averaging the two middle values:

median = (0.5 + 0.7) / 2

>> median = 0.6
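Both cases (odd and even number of values) can be covered by one small function. This is a minimal sketch; the function name `median` is chosen for illustration, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd number of values: take the single middle value
        return ordered[middle]
    # even number of values: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


median_odd = median([0.5, 0.6, 0.7, 0.8, 0.4])
median_even = median([0.5, 0.7, 0.9, 0.4])

print(f"odd  n: median = {median_odd:.1f}")   # prints: odd  n: median = 0.6
print(f"even n: median = {median_even:.1f}")  # prints: even n: median = 0.6
```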

The Mode

Besides the mean and the median, the mode is another measure of central tendency.

Let's look at an example to understand the difference between them.

How to calculate the mode? How is the mode defined?

>> The mode is the value that appears most frequently in a data set.

Let's calculate the mode for the following data set:

data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6]

We can calculate the mode by counting the frequency of each value:

>> 0.4: 1
>> 0.5: 1
>> 0.6: 2 <-- mode
>> 0.7: 1
>> 0.8: 1
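The frequency count can be sketched with `collections.Counter` from the Python standard library. The variable names are illustrative.

```python
from collections import Counter

data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6]

counts = Counter(data)  # maps each value to its frequency
mode_value, mode_count = counts.most_common(1)[0]

for value in sorted(counts):
    print(f"{value}: {counts[value]}")
print(f"mode = {mode_value} (appears {mode_count} times)")
```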

What if we have continuous data?

data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]

>> When we have continuous data, we can define intervals and count how many values fall into each interval; the interval with the highest count is the mode. Alternatively, we may use Pearson's empirical relation between mean, median, and mode.

Let's calculate the mode for the given data set using the interval approach:

intervals = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We can calculate the mode by counting the frequency of each interval:

>> 0.4-0.5: 1
>> 0.5-0.6: 1
>> 0.6-0.7: 2 <-- mode
>> 0.7-0.8: 1
>> 0.8-0.9: 1
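The interval approach can be sketched as follows. The interval start (0.4) and width (0.1) follow the example; treating each interval as half-open, [lower, upper), is an assumption of this sketch.

```python
from collections import Counter

data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]
bin_start = 0.4  # lower edge of the first interval
bin_width = 0.1  # width of every interval

# Map each value to the index of the interval [lower, upper) it falls into.
bin_counts = Counter(int((x - bin_start) / bin_width) for x in data)

modal_bin, modal_count = bin_counts.most_common(1)[0]
lower = bin_start + modal_bin * bin_width
upper = lower + bin_width

# prints: modal interval: 0.6-0.7 (2 values)
print(f"modal interval: {lower:.1f}-{upper:.1f} ({modal_count} values)")
```

Note that the result depends on where the intervals start and how wide they are, so the choice of intervals should be reported together with the mode.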

We can also group the data by rounding before counting the frequencies:

data_rounded = round(data, 1)

>> data_rounded = [0.5, 0.6, 0.7, 0.8, 0.4, 0.7]

>> 0.4: 1
>> 0.5: 1
>> 0.6: 1
>> 0.7: 2 <-- mode
>> 0.8: 1

As an alternative, we can use Pearson's empirical relation, which approximates the mode from the median and the mean for unimodal, moderately skewed data:

median(data)

>> 0.6503

mean(data)

>> 0.6354

mode = 3 * median - 2 * mean

>> mode = 0.6801
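The same estimate can be reproduced with the Python standard library. This is a sketch of the empirical relation only; it yields an approximation, not an exact mode.

```python
import statistics

data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]

median = statistics.median(data)
mean = statistics.mean(data)

# Pearson's empirical relation: mode is approximately 3 * median - 2 * mean
mode_estimate = 3 * median - 2 * mean

# prints: median = 0.6503, mean = 0.6354, mode = 0.6801
print(f"median = {median:.4f}, mean = {mean:.4f}, mode = {mode_estimate:.4f}")
```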

What if we have multiple modes?

data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6, 0.7]

>> When a data set has multiple modes, it is called multimodal, i.e., the mode is a set of values.

Let's calculate the mode for the given dataset:

>> 0.4: 1
>> 0.5: 1
>> 0.6: 2 <-- mode
>> 0.7: 2 <-- mode
>> 0.8: 1

If all values appear with the same frequency, the data set has no mode (it is called uniform), unless we group the values into sufficiently wide intervals.
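For multimodal data, `statistics.multimode` (available from Python 3.8) returns all most frequent values. The sketch below also shows the uniform case: when every value occurs equally often, the function returns all of them, which has to be interpreted as "no mode".

```python
import statistics

bimodal = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6, 0.7]
all_unique = [0.5, 0.6, 0.7, 0.8, 0.4]

# All values that share the highest frequency, in order of first appearance
modes_bimodal = statistics.multimode(bimodal)
modes_unique = statistics.multimode(all_unique)

print(modes_bimodal)  # prints: [0.6, 0.7]

# Every value appears once, so all of them are returned: no distinct mode
has_no_mode = len(modes_unique) == len(set(all_unique))
print(has_no_mode)    # prints: True
```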

Range of the Data

Besides the measures of central tendency, we need measures of dispersion to describe the spread of a data set. A very simple measure is the range.

The range is defined as: \[ \text{range} = \text{max}(data) - \text{min}(data) \]

Example:

data = [0.5, 0.6, 0.7, 0.8, 0.4]

data_max = max(data)

>> data_max = 0.8

data_min = min(data)

>> data_min = 0.4

data_range = data_max - data_min

>> data_range = 0.4

The range is a very simple measure of the spread of the data set. However, it is very sensitive to outliers.

Here is an example for such an outlier:

data = [0.5, 0.6, 0.7, 0.8, 0.4, 3.1]

data_max = max(data)

>> data_max = 3.1

data_min = min(data)

>> data_min = 0.4

data_range = data_max - data_min

>> data_range = 2.7

The large range suggests a large spread of the data set, which is misleading in this case: five of the six values lie within 0.4 mg/L of each other.
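The effect of a single outlier on the range can be sketched as follows. The helper name `value_range` is chosen for illustration.

```python
def value_range(values):
    """Return the difference between the largest and the smallest value."""
    return max(values) - min(values)


data = [0.5, 0.6, 0.7, 0.8, 0.4]
data_with_outlier = data + [3.1]  # same data plus one outlier

range_clean = value_range(data)
range_outlier = value_range(data_with_outlier)

print(f"range without outlier: {range_clean:.1f}")   # prints: range without outlier: 0.4
print(f"range with outlier:    {range_outlier:.1f}")  # prints: range with outlier:    2.7
```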

A typical application of the range is normalization, e.g., min-max normalization: \[ \text{normalized} = \frac{data - \text{min}(data)}{\text{max}(data) - \text{min}(data)} \]

Example:

data = [0.5, 0.6, 0.7, 0.8, 0.4]

data_normalized = (data - min(data)) / (max(data) - min(data))

>> data_normalized = [0.25, 0.5, 0.75, 1.0, 0.0]

Example with outlier:

data = [0.5, 0.6, 0.7, 0.8, 3.1]

data_normalized = (data - min(data)) / (max(data) - min(data))

>> data_normalized = [0.00, 0.04, 0.08, 0.12, 1.0]
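Both examples can be reproduced with a short function. This is a minimal sketch; the function name and the guard against a zero range are additions for illustration and are not part of the original example.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Guard added for this sketch: a zero range cannot be normalized
        raise ValueError("all values are identical; the range is zero")
    return [(x - lo) / (hi - lo) for x in values]


normalized = min_max_normalize([0.5, 0.6, 0.7, 0.8, 0.4])
normalized_outlier = min_max_normalize([0.5, 0.6, 0.7, 0.8, 3.1])

print([round(x, 2) for x in normalized])
print([round(x, 2) for x in normalized_outlier])
```

In the second case, the outlier compresses the four regular values into the lowest 12% of the normalized scale.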

Variance & Standard Deviation

More advanced measures of the spread of a data set are the variance and the standard deviation.

>> The variance \(\sigma^2\) describes the average squared deviation from the expected (i.e., true) value \(\mu\).
>> The standard deviation \(\sigma\) is the square root of the variance \(\sigma^2\).

\[\begin{equation} \text{variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \sigma^2 \end{equation}\]

However, in practice the true value \(\mu\) is usually unknown. We therefore estimate the variance with the sample variance \(s^2\), which uses the arithmetic mean \(\bar{x}\) instead of \(\mu\): \[\begin{equation} \text{sample variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = s^2 \end{equation}\]

\[s^2=\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\] Why do we divide by \(n-1\) instead of \(n\)?

>> To answer this question, we need to understand the concept of degrees of freedom (df).
>> df is the number of independent values in your data set, i.e., values that do not influence each other.
>> However, whenever we calculate a quantity from the data, we lose df because the calculation adds a dependency like y=f(x).

Let's have a look at an example:

data = [0.5, 0.6, 0.7, 0.8, 0.4]

data_mean = mean(data)

>> data_mean = 0.6

The mean is calculated from the data. This adds a dependency to the data set, because we now have an equation that the data must fulfill.

\[data\_mean = \frac{1}{5} \left(data_1 + data_2 + data_3 + data_4 + data_5\right)\] \[data_1 = data\_mean \cdot 5 - data_2 - data_3 - data_4 - data_5\] Once the mean is known, \(data_1\) can be calculated from the other four values using the equation above, so one value is no longer independent. Therefore, the degrees of freedom are reduced by one (\(df = n-1 = 4\)).
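The dependency can be checked numerically. The sketch below reconstructs the first value from the mean and the remaining four values, and shows the equivalent view that the deviations from the mean always sum to zero. Variable names are illustrative.

```python
data = [0.5, 0.6, 0.7, 0.8, 0.4]
n = len(data)
data_mean = sum(data) / n

# Once the mean is fixed, the first value follows from the other four.
data_1 = data_mean * n - sum(data[1:])

# Equivalent view: the deviations from the mean sum to zero,
# so only n - 1 of them can vary freely.
deviations = [x - data_mean for x in data]
deviation_sum = sum(deviations)

print(f"reconstructed data_1 = {data_1:.3f}")
print(f"sum of deviations is zero within rounding: {abs(deviation_sum) < 1e-12}")
```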

\[s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\]

>> In most real systems, we use the sample standard deviation \(s\) to estimate the data's spread. As this function uses an estimate of the true value (the mean \(\bar{x}\)), one degree of freedom is lost, and we divide by \(n-1\).

Let's calculate the variance and the standard deviation:

data = [0.5, 0.6, 0.7, 0.8, 0.4]

n = length(data)

data_mean = mean(data)

data_variance = sum((data - data_mean).^2) / (n-1)

>> data_variance = 0.025

data_std = sqrt(data_variance)

>> data_std = 0.158
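The calculation can be sketched in Python, once written out step by step and once cross-checked against `statistics.variance`, which also divides by \(n-1\).

```python
import math
import statistics

data = [0.5, 0.6, 0.7, 0.8, 0.4]
n = len(data)
data_mean = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1
data_variance = sum((x - data_mean) ** 2 for x in data) / (n - 1)
data_std = math.sqrt(data_variance)

# Cross-check against the standard library (also uses n - 1)
library_variance = statistics.variance(data)

print(f"s^2 = {data_variance:.3f}, s = {data_std:.3f}")  # prints: s^2 = 0.025, s = 0.158
```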

Interpretation: The average concentration is 0.6 mg/L with a spread (standard deviation) of 0.158 mg/L.

>> Note: This is not the uncertainty of the average concentration!

>> The spread describes the precision of the data set and is proportional to the uncertainty of the mean.

>> The spread can give you feedback on your measurement system/method and the quality of your data.
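The proportionality between the spread and the uncertainty of the mean is commonly expressed by the standard error of the mean, \(s / \sqrt{n}\). The sketch below assumes independent measurements; the standard error is not introduced in this section and is shown only to illustrate the distinction made in the note above.

```python
import math
import statistics

data = [0.5, 0.6, 0.7, 0.8, 0.4]

s = statistics.stdev(data)        # spread of the single values
sem = s / math.sqrt(len(data))    # standard error of the mean (assumes independent values)

# prints: s = 0.158 mg/L, standard error of the mean = 0.071 mg/L
print(f"s = {s:.3f} mg/L, standard error of the mean = {sem:.3f} mg/L")
```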

The Interquartile Range

Another measure of the spread of the data set is the interquartile range (IQR). The IQR covers the middle 50% of the data set, i.e., from the 25th to the 75th percentile.

Some helpful terms for the IQR:

>> Percentile / Quantile: The value below which a given percentage of observations fall. E.g., the 13th percentile or 0.13 quantile is the value below which 13% of the data fall.

>> Quartile: The 25th, 50th, and 75th percentiles are called the first, second, and third quartiles, respectively.

How to calculate the IQR?

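\[ q = (n+1) \cdot p \] Where \(q\) is the quantile position in the sorted data, \(n\) is the number of data points, and \(p\) is the quantile expressed as a fraction (e.g., 0.25 for the 25th percentile). Note that this is one of several common conventions, so software may return slightly different values.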

Let's calculate the IQR for the given dataset:

data = [0.573, 0.712, 0.594, 0.631, 0.682,...
				0.800, 0.471, 0.499, 0.538, 0.697,...
				0.612, 0.556, 0.644, 0.524, 0.528,...
				0.728, 0.656, 0.507, 0.512, 0.552,...
				0.668, 0.803, 0.750, 0.774]

First, we need to sort the data:

data = [0.471, 0.499, 0.507, 0.512, 0.524,...
				0.528, 0.538, 0.552, 0.556, 0.573,...
				0.594, 0.612, 0.631, 0.644, 0.656,...
				0.668, 0.682, 0.697, 0.712, 0.728,...
				0.750, 0.774, 0.800, 0.803]

Next, we calculate the quantile positions q:

q25 = (24 + 1) * 0.25

q75 = (24 + 1) * 0.75

>> q25 = 6.25, q75 = 18.75

>> Now we know that the 25th percentile is between the 6th and 7th value and the 75th percentile is between the 18th and 19th value. To calculate the IQR, we need to interpolate the values:

\[ x_p = x_{\lfloor q\rfloor} + (x_{\lceil q\rceil} - x_{\lfloor q\rfloor}) \cdot (q - \lfloor q\rfloor)\] Where \(x_p\) is the quantile value, \(x_{\lfloor q\rfloor}\) is the lower neighboring value, \(x_{\lceil q\rceil}\) is the upper neighboring value, and \(q\) is the quantile position. \(\lfloor \rfloor\) is the floor function (rounds down) and \(\lceil \rceil\) is the ceiling function (rounds up).

q25 = 0.528 + (0.538 - 0.528) * (6.25 - 6)

q75 = 0.697 + (0.712 - 0.697) * (18.75 - 18)

>> q25 = 0.530, q75 = 0.708

Finally, we can calculate the IQR:

IQR = q75 - q25

>> IQR = 0.178

Interpretation: The middle 50% of the data set is between 0.530 and 0.708 mg/L with a spread of 0.178 mg/L.
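The whole procedure (sort, compute the position \(q = (n+1) \cdot p\), interpolate) can be sketched as one function. The function name `quantile` and the clamping of positions outside 1..n are assumptions of this sketch and are not part of the procedure described above.

```python
import math


def quantile(sorted_values, p):
    """Quantile using the position q = (n + 1) * p and linear interpolation."""
    n = len(sorted_values)
    q = (n + 1) * p
    q = min(max(q, 1), n)          # assumption: clamp positions outside 1..n
    lower = math.floor(q)          # 1-based position of the lower neighbor
    upper = math.ceil(q)           # 1-based position of the upper neighbor
    x_lower = sorted_values[lower - 1]
    x_upper = sorted_values[upper - 1]
    return x_lower + (x_upper - x_lower) * (q - lower)


data = sorted([0.573, 0.712, 0.594, 0.631, 0.682, 0.800, 0.471, 0.499,
               0.538, 0.697, 0.612, 0.556, 0.644, 0.524, 0.528, 0.728,
               0.656, 0.507, 0.512, 0.552, 0.668, 0.803, 0.750, 0.774])

q25 = quantile(data, 0.25)
q75 = quantile(data, 0.75)
iqr = q75 - q25

# prints: q25 = 0.53050, q75 = 0.70825, IQR = 0.17775
print(f"q25 = {q25:.5f}, q75 = {q75:.5f}, IQR = {iqr:.5f}")
```

Other tools may use a different quantile convention by default, so small deviations from these values are to be expected when comparing results.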

Seminar Material

Case 1:
You want to compare two analytical methods for the determination of the chloride concentration in surface water. You have the following data:

data_1 = [43.13, 45.12, 44.98, 40.23, 42.11,...
					41.98, 43.12, 44.01, 42.98, 43.12]

data_2 = [39.12, 45.12, 44.98, 43.23, 44.78,...
					43.56, 44.61, 45.01, 46.02, 49.73]

Which method is more precise?