Modeling Black Carbon Pollution Sources in Europe using Linear Regression
(2025) MASK11 20251 Mathematical Statistics
- Abstract
- The aim of this thesis was to ascertain how much the emissions of black carbon (BC)
from each European country systematically deviate from their expected values obtained by inverting
a transport model. The model used is based on meteorological observations as well as black carbon
values obtained in the past at different locations. This was done by fitting a linear regression model
to a data set of BC observations and the transport model's predictions of its amount.
The data consisted of hourly weight measurements of black carbon at 12 different measurement
stations across Europe (our original dependent variable, Y) and of output from a transport model covering 95
countries around the world, which answered the following question for each country, site and time: how
much of the black carbon emitted from a given country would end up at the given measurement site at
the corresponding time?
These data did not appear to fulfil all the assumptions required for a linear regression model
to be analysed reliably. We therefore used transformations as well as several kinds of regularized
regression to address this. We also had to account for correlation within the response variable as
well as collinearity among the predictors.
After transforming the data, we selected the variables that were the most influential in our model
by conducting stepwise forward selection and Lasso together with k-fold cross-validation.
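The Lasso step with k-fold cross-validation can be illustrated with a minimal sketch. It is not the thesis code: the coordinate-descent solver, the omission of an intercept and of standardisation, the interleaved fold assignment, the candidate penalties and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; return 0.0 if it is within the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, penalty, sweeps=200):
    """Minimise (1/2n)*RSS + penalty*sum(|b_j|) by coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Average product of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


def cv_error(X, y, penalty, folds=4):
    """Mean squared prediction error over interleaved folds (hypothetical split)."""
    n = len(X)
    total = 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        beta = lasso([X[i] for i in train], [y[i] for i in train], penalty)
        for i in range(f, n, folds):
            prediction = sum(a * b for a, b in zip(X[i], beta))
            total += (y[i] - prediction) ** 2
    return total / n


# Made-up data: the response depends only on the first predictor.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1], [2, 1], [-2, 1], [2, -1], [-2, -1]]
y = [2 * row[0] for row in X]

fit = lasso(X, y, penalty=0.1)          # second coefficient is shrunk to zero
candidates = [0.01, 0.1, 1.0]           # hypothetical penalty grid
best = min(candidates, key=lambda lam: cv_error(X, y, lam))
```

Predictors whose coefficients are exactly zero at the chosen penalty are the ones the Lasso drops; the remaining ones would be treated as influential.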
The number of influential parameters turned out to be 8, of which the Black Sea and its surrounding
areas clearly caused the largest divergence from the transport model considered. Beyond
that, a significant part of the difference was caused by (ordered from the largest to the smallest
absolute value of the coefficient estimate): Ukraine, Serbia, Poland, Romania, Croatia, Italy, Bosnia and
Herzegovina, and lastly biomass burning and Hungary.
The conclusion we drew from these results was that the reported values of black carbon
pollution were not accurate for those areas.
- Popular Abstract
- Black carbon (BC), commonly known as soot, has devastating effects on global warming, human
health and agriculture. It is thus crucial to keep monitoring how much of it is in the air we breathe
and what its biggest sources are.
However, since it consists of fine particles, this is quite hard to do. Therefore a transport model and an
inverse model are constructed to explain how much countries or areas of the world contributed to the
measured amount of black carbon at measuring sites located all over Europe. The transport model is
based on meteorological research and the amount of emissions that each country or area reported.
Unfortunately, the amounts calculated by the transport model do not add up to the measured
quantities at those stations, and it is therefore necessary to check which of the contributions deviate
from observations and on what scale.
The aim of this project is to answer that question by finding the best linear regression
model and analysing it. Linear regression models, however, require our data to satisfy certain
assumptions. The first is independence within our response variable (observations of black carbon
at each station s at times t). Since the data form a time series, the correlation between consecutive
observations is high. We combat this by taking daily averages of our hourly observations.
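The daily averaging can be sketched as follows. This is only an illustration under assumptions: the function name, the (timestamp, value) input layout and the readings are hypothetical and are not taken from the thesis data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def daily_means(hourly):
    """Average hourly (timestamp, value) readings into one value per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in hourly:
        by_day[stamp.date()].append(value)
    return {day: fmean(values) for day, values in sorted(by_day.items())}


# Hypothetical readings for a single station (made-up numbers).
readings = [
    (datetime(2020, 1, 1, 0), 0.50),
    (datetime(2020, 1, 1, 1), 0.70),
    (datetime(2020, 1, 2, 0), 0.20),
    (datetime(2020, 1, 2, 1), 0.40),
]
daily = daily_means(readings)
```

Averaging reduces the number of observations and weakens the hour-to-hour dependence; it does not by itself guarantee that consecutive daily values are independent.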
Independence is also checked among our predictors (estimated contributions of BC from each
country that, according to the transport model, would end up at station s at time t). As expected,
countries close to each other do not fulfil that requirement. This is accounted for by grouping highly
correlated countries together, adding up their estimated contributions of BC for a given site
s at time t.
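One possible way to group correlated predictors is sketched below. The correlation threshold of 0.9, the transitive merging rule and the column names A, B and C are assumptions for illustration; the abstract does not state which rule was actually used.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def group_correlated(columns, threshold=0.9):
    """Sum together columns linked by pairwise |correlation| above the threshold."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return {
        "+".join(members): [sum(vals) for vals in zip(*(columns[m] for m in members))]
        for members in groups.values()
    }


# Hypothetical contributions from three source regions at four time points.
columns = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.1, 2.0, 3.2, 3.9],
    "C": [4.0, 1.0, 3.0, 2.0],
}
grouped = group_correlated(columns)
```

Here A and B are merged into a single predictor "A+B", while C stays on its own.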
Another assumption is that the difference between the values predicted by the linear model and the actual
observations is normally distributed with mean zero and common variance. Checking that assumption
after fitting the model is a standard validation procedure. While our residuals seem to be normally
distributed with mean zero, they do not have the same variance: errors for higher observations have
larger variance. Therefore, we first perform the selection of influential variables on a model with a
log-transformed response variable, which works well, but we cannot analyse that model appropriately
in the end. We therefore try another method on the already subsetted data set. Namely, we give larger weights to
observations that carry more information, that is, those with less variance.
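The weighting idea corresponds to weighted least squares. Below is a minimal sketch for a single predictor, assuming weights inversely proportional to an assumed error variance; how the weights were actually chosen in the thesis is not stated here, and the data are made up.

```python
def weighted_least_squares(x, y, w):
    """Fit y ~ a + b*x by minimising sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up data lying exactly on y = 1 + 2x, with variance assumed to grow with x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
assumed_variance = [0.1, 0.4, 0.9, 1.6]
weights = [1.0 / v for v in assumed_variance]

intercept, slope = weighted_least_squares(x, y, weights)
```

Observations with a small assumed variance pull the fitted line towards themselves more strongly than noisy ones.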
That does not work as intended, so we go back to the original model with the subsetted variables and
estimate how much the influential variables are "off" using 95% confidence intervals.
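A 95% confidence interval for a regression coefficient can be sketched as follows for the one-predictor case. The sketch uses a normal quantile as an approximation; with few observations a Student t quantile would normally be used instead. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def slope_confidence_interval(x, y, level=0.95):
    """Approximate confidence interval for the slope of y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(rss / (n - 2) / sxx)        # standard error of the slope
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for level=0.95
    return slope - z * std_error, slope + z * std_error


# Made-up modelled contributions (x) and observations (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
low, high = slope_confidence_interval(x, y)
```

The interval gives a range of plausible values for the coefficient rather than a single point estimate.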
As it turns out, it is the following eight variables that either have wrong reported values or are
not accounted for appropriately by the transport model: the Black Sea with surrounding areas, Ukraine,
Serbia, Poland, Romania, Croatia, Italy, Bosnia and Herzegovina, and lastly biomass burning and
Hungary. They are listed in decreasing order of how much they influence the difference
between the transport model's estimates and real-world observations of BC.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9207728
- author
- Jedrzejak, Urszula Barbara LU
- supervisor
- Ted Kronvall LU
- organization
- course
- MASK11 20251
- year
- 2025
- type
- M2 - Bachelor Degree
- subject
- keywords
- Black Carbon, pollution, linear regression
- language
- English
- id
- 9207728
- date added to LUP
- 2025-07-02 15:56:11
- date last changed
- 2025-07-02 15:56:11
@misc{9207728,
  abstract = {{The aim achieved through this thesis was to ascertain how much the pollutions of black carbon (BC) from each European country systematically deviate from their expected values obtained by inverting a transport model. The model used is based on meteorological observations as well as black carbon values obtained in the past at different locations. It was achieved by fitting a linear regression model to a data set concerning observations of BC and the transport models’ predictions of its amount. The data used consisted of hourly weight measurements of black carbon at 12 different measurement stations across Europe (our original dependent variable, Y) and a transport model used on 95 countries around the world that answered the following question for each country, site and time: how much of black carbon emitted from a given country would end up at the given measurement site at a given corresponding time? This data did not seem to be fulfilling all of the requirements needed for a linear regression model to be reasonably analysed. Therefore we had to use transforms as well as regularized regression of different kinds to combat that. We also had to account for collinearity within response variable as well as within the predictors. After transforming the data, we selected the variables that were the most influential in our model by conducting stepwise forward selection and Lasso together with k-fold cross-validation. The number of influential parameters turned out to be 8, of which Black Sea and its surrounding areas were definitely causing the most divergence from the considered transport model. Other than that, a significant amount of difference was caused by (ordered from the biggest to the smallest absolute value of its coefficient estimate): Ukraine, Serbia, Poland, Romania, Croatia, Italy, Bosnia and Herzegovina, and lastly biomass burning and Hungary. The conclusions that we drew from those results were that the reported value of black carbon pollution was not true for those areas.}},
  author = {{Jedrzejak, Urszula Barbara}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Modeling Black Carbon Pollution Sources in Europe using Linear Regression}},
  year = {{2025}},
}