01. Data Characterization 1
You are asked to analyze diclofenac in a river.
Therefore, you collect five samples from a sampling location.
You obtained the following concentration data:
| sample id | concentration [mg/L] |
|---|---|
| #1 | 0.5 |
| #2 | 0.7 |
| #3 | 0.6 |
| #4 | 0.8 |
| #5 | 0.4 |
>> What's the river's true diclofenac concentration?
>> What's the sample's true diclofenac concentration?
The true value cannot be determined exactly from a finite number of samples. Instead, we estimate it with the arithmetic mean:
\[\begin{equation} \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \end{equation}\]
Where \(n\) is the number of samples and \(x_i\) is the concentration of the \(i\)-th sample.
mean = (0.5 + 0.7 + 0.6 + 0.8 + 0.4) / 5
>> mean = 0.6
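The calculation above can be sketched in Python. The variable names are illustrative; the values are the five concentrations from the table.

```python
# Concentrations of the five river samples in mg/L (from the table above)
concentrations = [0.5, 0.7, 0.6, 0.8, 0.4]

n = len(concentrations)
mean = sum(concentrations) / n  # arithmetic mean: (1/n) * sum of x_i

print(f"mean = {mean:.1f} mg/L")
```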
But why is this only an estimate of the true value?
Let's do a theoretical experiment:
We repeat the sampling (n=5) and the analysis 200 times.
Let's increase n.
We repeat the sampling (n=50) and the analysis 200 times.
\[\text{mean} = \frac{1}{N} \sum_{i=1}^{N} x_i = \bar{x} = \mu\]
where \(N\) is the number of all samples of a population, which is often infinite. In practice, we use a finite number of samples \(n\), assuming:
\[\mu \approx \bar{x}\]
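The theoretical experiment can be tried out with a small simulation. This is only a sketch: the "true" population is assumed to be normally distributed, and `TRUE_MEAN`, `TRUE_SD`, and `simulate_sample_means` are hypothetical names and values chosen for illustration.

```python
import random
import statistics

# Assumed "true" population for the thought experiment (hypothetical values)
TRUE_MEAN = 0.6   # mg/L
TRUE_SD = 0.16    # mg/L

def simulate_sample_means(n, repeats=200, seed=1):
    """Draw `repeats` samples of size n and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means

means_n5 = simulate_sample_means(n=5)
means_n50 = simulate_sample_means(n=50)

# Each sample mean is only an estimate; the estimates scatter around the
# true mean, and the scatter shrinks as n grows.
spread_n5 = statistics.stdev(means_n5)
spread_n50 = statistics.stdev(means_n50)
print(f"n=5:  spread of the 200 means = {spread_n5:.3f}")
print(f"n=50: spread of the 200 means = {spread_n50:.3f}")
```

Plotting `means_n5` and `means_n50` as histograms should show the narrower distribution for the larger n.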
Next to the arithmetic mean, the median is another measure of central tendency.
Let's look at an example to understand the difference between the mean and the median.
How to calculate the median? How is the median defined?
Let's calculate the median for the following data set:
data = [0.5, 0.6, 0.7, 0.8, 0.4]
First, we need to sort the data set:
data_sorted = sort(data)
>> data_sorted = [0.4, 0.5, 0.6, 0.7, 0.8]
Then, we can calculate the median by extracting the middle value:
median = 0.6
What if the data set has an even number of values?
Let's calculate the median for the following data set:
data = [0.5, 0.7, 0.9, 0.4]
First, we need to sort the data set:
data_sorted = sort(data)
>> data_sorted = [0.4, 0.5, 0.7, 0.9]
Then, we can calculate the median by averaging the two middle values:
median = (0.5 + 0.7) / 2
>> median = 0.6
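Both cases (odd and even number of values) can be covered by one small function. A minimal sketch in Python; the function name `median` is our own choice here, and the standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Return the median: the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: there is a single middle value
        return ordered[mid]
    # even n: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median([0.5, 0.6, 0.7, 0.8, 0.4])
even_result = median([0.5, 0.7, 0.9, 0.4])
print(f"odd n:  median = {odd_result:.1f}")
print(f"even n: median = {even_result:.1f}")
```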
Next to the mean and the median, the mode is another measure of central tendency.
Let's look at an example to understand the difference between them.
How to calculate the mode? How is the mode defined?
Let's calculate the mode for the following data set:
data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6]
We can calculate the mode by counting the frequency of each value:
>> 0.4: 1
>> 0.5: 1
>> 0.6: 2 <-- mode
>> 0.7: 1
>> 0.8: 1
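The frequency count can be sketched in Python with `collections.Counter` (one possible way to do it, not the only one):

```python
from collections import Counter

data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6]

counts = Counter(data)  # maps each value to its number of occurrences
for value in sorted(counts):
    print(f"{value}: {counts[value]}")

# The mode is the value with the highest frequency
mode, frequency = counts.most_common(1)[0]
print(f"mode = {mode} (appears {frequency} times)")
```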
What if we have continuous data?
data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]
Let's calculate the mode for the given dataset using the interval approach:
intervals = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
We can calculate the mode by counting the frequency of each interval:
>> 0.4-0.5: 1
>> 0.5-0.6: 1
>> 0.6-0.7: 2 <-- mode
>> 0.7-0.8: 1
>> 0.8-0.9: 1
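A minimal Python sketch of the interval approach. We assume half-open intervals [lower, upper) and an extra bin edge at 0.9 so that the value 0.8324 also falls into an interval; the names `edges` and `modal_interval` are illustrative.

```python
data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]
# Bin edges; the 0.9 edge is an assumption so that 0.8324 is covered
edges = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Count the values falling into each half-open interval [lower, upper)
counts = {}
for lower, upper in zip(edges[:-1], edges[1:]):
    counts[(lower, upper)] = sum(lower <= x < upper for x in data)

# The modal interval is the one with the highest count
modal_interval = max(counts, key=counts.get)
for (lower, upper), count in counts.items():
    marker = "  <-- mode" if (lower, upper) == modal_interval else ""
    print(f"{lower}-{upper}: {count}{marker}")
```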
We can also bin the data by rounding and then calculate the mode:
data_rounded = round(data, 1)
>> data_rounded = [0.5, 0.6, 0.7, 0.8, 0.4, 0.7]
>> 0.4: 1
>> 0.5: 1
>> 0.6: 1
>> 0.7: 2 <-- mode
>> 0.8: 1
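The rounding approach, sketched in Python. Rounding to one decimal corresponds to bins of width 0.1 centered on each tenth, so the result can differ from bins that start at each tenth.

```python
from collections import Counter

data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]

# Rounding to one decimal groups nearby values into the same bin
data_rounded = [round(x, 1) for x in data]
counts = Counter(data_rounded)

mode, frequency = counts.most_common(1)[0]
print(data_rounded)
print(f"mode = {mode} (appears {frequency} times)")
```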
As an alternative, we can use Pearson's mode skewness approach:
median(data)
>> 0.6503
mean(data)
>> 0.6354
mode = 3 * median - 2 * mean
>> mode = 0.6801
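The same estimate in Python, using the standard library for the median and the mean. Keep in mind that this relation is empirical, so the result is only a rough estimate of the mode.

```python
import statistics

data = [0.5265, 0.6324, 0.7263, 0.8324, 0.4265, 0.6682]

median = statistics.median(data)
mean = statistics.mean(data)

# Empirical relation: mode is approximately 3 * median - 2 * mean
mode_estimate = 3 * median - 2 * mean

print(f"median = {median:.4f}, mean = {mean:.4f}, mode estimate = {mode_estimate:.4f}")
```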
What if we have multiple modes?
data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6, 0.7]
Let's calculate the mode for the given dataset:
>> 0.4: 1
>> 0.5: 1
>> 0.6: 2 <-- mode
>> 0.7: 2 <-- mode
>> 0.8: 1
If all values appear with the same frequency, the data set is called uniform (i.e., it has no mode), unless we group the values into sufficiently wide intervals.
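Multiple modes can be found in Python with `statistics.multimode` (available since Python 3.8). A short sketch:

```python
import statistics

data = [0.5, 0.6, 0.7, 0.8, 0.4, 0.6, 0.7]

# multimode returns all values that share the highest frequency
modes = statistics.multimode(data)
print(modes)

# If every value occurs equally often, all values are returned,
# which signals that there is no distinct mode.
no_mode = statistics.multimode([0.4, 0.5, 0.6])
print(no_mode)
```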
Next to the measures of central tendency, we need measures of dispersion to describe the spread of a data set. A very simple approach for this is the range.
The range is defined as:
\[
\text{range} = \text{max}(data) - \text{min}(data)
\]
Example:
data = [0.5, 0.6, 0.7, 0.8, 0.4]
data_max = max(data)
>> data_max = 0.8
data_min = min(data)
>> data_min = 0.4
data_range = data_max - data_min
>> data_range = 0.4
The range is a very simple measure of the spread of the data set. However, it is very sensitive to outliers.
Here is an example for such an outlier:
data = [0.5, 0.6, 0.7, 0.8, 0.4, 3.1]
data_max = max(data)
>> data_max = 3.1
data_min = min(data)
>> data_min = 0.4
data_range = data_max - data_min
>> data_range = 2.7
The large range would suggest a large spread of the data set, which is not true in this case.
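Both range examples can be reproduced with a short Python sketch (the helper name `data_range` is our own):

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

clean = [0.5, 0.6, 0.7, 0.8, 0.4]
with_outlier = clean + [3.1]  # same data plus one outlier

range_clean = data_range(clean)
range_outlier = data_range(with_outlier)
print(f"range without outlier: {range_clean:.1f}")
print(f"range with outlier:    {range_outlier:.1f}")
```

A single extreme value is enough to change the range, because only the minimum and the maximum enter the calculation.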
One typical use of the range is normalization, e.g., min-max normalization:
\[ \text{normalized} = \frac{data - \text{min}(data)}{\text{max}(data) - \text{min}(data)} \]
Example:
data = [0.5, 0.6, 0.7, 0.8, 0.4]
data_normalized = (data - min(data)) / (max(data) - min(data))
>> data_normalized = [0.25, 0.5, 0.75, 1.0, 0.0]
Example with outlier:
data = [0.5, 0.6, 0.7, 0.8, 3.1]
data_normalized = (data - min(data)) / (max(data) - min(data))
>> data_normalized = [0.00, 0.04, 0.08, 0.12, 1.0]
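A minimal Python sketch of min-max normalization. The function name `min_max_normalize` and the error handling for constant data are our own additions.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # the range is zero, so the formula would divide by zero
        raise ValueError("min-max normalization is undefined for constant data")
    return [(x - lo) / (hi - lo) for x in values]

normalized = min_max_normalize([0.5, 0.6, 0.7, 0.8, 0.4])
normalized_outlier = min_max_normalize([0.5, 0.6, 0.7, 0.8, 3.1])

print([round(v, 2) for v in normalized])
print([round(v, 2) for v in normalized_outlier])
```

With the outlier, the remaining values are squeezed into a small part of the interval from 0 to 1.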
A more advanced way to describe the spread of a data set is the variance and the standard deviation.
\[\begin{equation} \text{variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \sigma^2 \end{equation}\]
However, in practice we mostly estimate the variance with the sample variance \(s^2\): the true value \(\mu\) is unknown, so we use the arithmetic mean \(\bar{x}\) instead.
\[\begin{equation}
\text{sample variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = s^2
\end{equation}\]
Why do we divide by \(n-1\) instead of \(n\)?
>> To answer this question, we need to understand the concept of degrees of freedom (df).
>> df is the number of independent values in your data set. This means that the values do not influence each other.
>> However, when we do some calculations, we lose df because we add dependencies like y=f(x).
Let's have a look at an example:
data = [0.5, 0.6, 0.7, 0.8, 0.4]
data_mean = mean(data)
>> data_mean = 0.6
We can see that the mean depends on the data. This adds a dependency to the data set, as we now have an equation that must be fulfilled by the data.
\[data\_mean = \frac{1}{5} \left(data_1 + data_2 + data_3 + data_4 + data_5\right)\]
\[data_1 = data\_mean \cdot 5 - data_2 - data_3 - data_4 - data_5\]
We can calculate the value of \(data_1\) using the equation above, which shows that at least one value depends on the other values. Therefore, we reduce the degrees of freedom by one (\(df = n-1 = 4\)).
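The lost degree of freedom can be checked numerically. In this sketch (variable names are illustrative) we rebuild the first value from the mean and the other four values, and we also confirm that the deviations from the mean sum to zero.

```python
data = [0.5, 0.6, 0.7, 0.8, 0.4]
n = len(data)
data_mean = sum(data) / n

# Reconstruct the first value from the mean and the remaining n - 1 values
reconstructed_first = data_mean * n - sum(data[1:])

# Equivalent view: the deviations from the mean always sum to zero,
# so only n - 1 of them can vary freely
deviation_sum = sum(x - data_mean for x in data)

print(f"reconstructed first value: {reconstructed_first:.3f}")
print(f"sum of deviations is zero: {abs(deviation_sum) < 1e-12}")
```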
\[s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\]
Let's calculate the variance and the standard deviation:
data = [0.5, 0.6, 0.7, 0.8, 0.4]
data_mean = mean(data)
n = length(data)
data_variance = sum((data - data_mean)^2) / (n-1)
>> data_variance = 0.025
data_std = sqrt(data_variance)
>> data_std = 0.158
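The same calculation as a Python sketch, written out step by step and cross-checked against `statistics.variance` and `statistics.stdev`, which also use the \(n-1\) denominator.

```python
import math
import statistics

data = [0.5, 0.6, 0.7, 0.8, 0.4]
n = len(data)
data_mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1
data_variance = sum((x - data_mean) ** 2 for x in data) / (n - 1)
# Standard deviation: square root of the variance (same unit as the data)
data_std = math.sqrt(data_variance)

print(f"variance = {data_variance:.3f}, std = {data_std:.3f}")

# Cross-check against the standard library
assert math.isclose(data_variance, statistics.variance(data))
assert math.isclose(data_std, statistics.stdev(data))
```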
Interpretation: The average concentration is 0.6 mg/L with an average spread of 0.158 mg/L.
>> Note: This is not the uncertainty of the average concentration!
>> The spread describes the precision of the data set and is proportional to the uncertainty of the mean.
>> The spread can give you feedback on your measurement system/method and the quality of your data.
Another measure of the spread of the data set is the interquartile range (IQR). The IQR covers the middle 50% of the data set, i.e., from the 25th to the 75th percentile.
Some helpful terms for the IQR:
>> Percentile / Quantile: The value below which a given percentage of observations fall. E.g., the 13th percentile or 0.13 quantile is the value below which 13% of the data fall.
>> Quartile: The 25th, 50th, and 75th percentiles are called the first, second, and third quartiles, respectively.
How to calculate the IQR?
\[ q = (n+1) \cdot p \]
where \(q\) is the quantile position, \(n\) is the number of data points, and \(p\) is the quantile.
Let's calculate the IQR for the given dataset:
data = [0.573, 0.712, 0.594, 0.631, 0.682,...
0.800, 0.471, 0.499, 0.538, 0.697,...
0.612, 0.556, 0.644, 0.524, 0.528,...
0.728, 0.656, 0.507, 0.512, 0.552,...
0.668, 0.803, 0.750, 0.774]
First, we need to sort the data:
data = [0.471, 0.499, 0.507, 0.512, 0.524,...
0.528, 0.538, 0.552, 0.556, 0.573,...
0.594, 0.612, 0.631, 0.644, 0.656,...
0.668, 0.682, 0.697, 0.712, 0.728,...
0.750, 0.774, 0.800, 0.803]
Next, we calculate the quantile positions q:
q25 = (24 + 1) * 0.25
q75 = (24 + 1) * 0.75
>> q25 = 6.25, q75 = 18.75
Now we know that the 25th percentile lies between the 6th and 7th value and the 75th percentile lies between the 18th and 19th value. To calculate the IQR, we need to interpolate the values:
\[ x_p = x_{\lfloor q\rfloor} + (x_{\lceil q\rceil} - x_{\lfloor q\rfloor}) \cdot (q - \lfloor q\rfloor)\]
where \(x_p\) is the quantile value, \(x_{\lfloor q\rfloor}\) is the lower value, \(x_{\lceil q\rceil}\) is the upper value, and \(q\) is the quantile position. \(\lfloor \rfloor\) is the floor function (rounds down) and \(\lceil \rceil\) is the ceiling function (rounds up).
q25 = 0.528 + (0.538 - 0.528) * (6.25 - 6)
q75 = 0.697 + (0.712 - 0.697) * (18.75 - 18)
>> q25 = 0.5305, q75 = 0.70825
Finally, we can calculate the IQR:
IQR = q75 - q25
>> IQR = 0.17775
Interpretation: The middle 50% of the data set lies between about 0.53 and 0.71 mg/L, a spread of about 0.18 mg/L.
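The IQR steps (sort, quantile position, interpolation) can be sketched in Python. The helper `quantile` is our own; clamping the position to the range 1 to n is an assumption for edge cases that the example does not need. `statistics.quantiles` with `method="exclusive"` follows the same \((n+1) \cdot p\) rule and serves as a cross-check.

```python
import math
import statistics

def quantile(sorted_values, p):
    """Quantile via the position rule q = (n + 1) * p with linear interpolation."""
    n = len(sorted_values)
    q = min(max((n + 1) * p, 1), n)          # 1-based position, clamped (assumption)
    lower = sorted_values[math.floor(q) - 1]  # value at the floor position
    upper = sorted_values[math.ceil(q) - 1]   # value at the ceiling position
    return lower + (upper - lower) * (q - math.floor(q))

data = [0.573, 0.712, 0.594, 0.631, 0.682,
        0.800, 0.471, 0.499, 0.538, 0.697,
        0.612, 0.556, 0.644, 0.524, 0.528,
        0.728, 0.656, 0.507, 0.512, 0.552,
        0.668, 0.803, 0.750, 0.774]

data_sorted = sorted(data)
q25 = quantile(data_sorted, 0.25)
q75 = quantile(data_sorted, 0.75)
iqr = q75 - q25
print(f"q25 = {q25:.5f}, q75 = {q75:.5f}, IQR = {iqr:.5f}")

# Cross-check: the standard library uses the same rule with method="exclusive"
q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
assert math.isclose(q1, q25) and math.isclose(q3, q75)
```

Other software may use a different position rule by default, so quartiles for small data sets can differ slightly between tools.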
data_1 = [43.13, 45.12, 44.98, 40.23, 42.11,...
41.98, 43.12, 44.01, 42.98, 43.12]
data_2 = [39.12, 45.12, 44.98, 43.23, 44.78,...
43.56, 44.61, 45.01, 46.02, 49.73]
Which method is more precise?
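One way to approach this question is to compute both measures of spread for each data set. The sketch below assumes that `data_1` and `data_2` are repeated measurements from two methods, and it uses `statistics.quantiles` with `method="exclusive"`, i.e., the \((n+1) \cdot p\) position rule. The two measures need not agree when a data set contains extreme values, so it is worth comparing them before answering.

```python
import statistics

data_1 = [43.13, 45.12, 44.98, 40.23, 42.11,
          41.98, 43.12, 44.01, 42.98, 43.12]
data_2 = [39.12, 45.12, 44.98, 43.23, 44.78,
          43.56, 44.61, 45.01, 46.02, 49.73]

def iqr(values):
    """Interquartile range based on the (n + 1) * p position rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

# Standard deviation reacts strongly to extreme values, the IQR does not
s_1, s_2 = statistics.stdev(data_1), statistics.stdev(data_2)
iqr_1, iqr_2 = iqr(data_1), iqr(data_2)

print(f"data_1: s = {s_1:.3f}, IQR = {iqr_1:.3f}")
print(f"data_2: s = {s_2:.3f}, IQR = {iqr_2:.3f}")
```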