Bias correction of diagnostics data from IoT devices
(2025) In Master's Thesis in Mathematical Sciences, MASM02 20251, Mathematical Statistics
- Abstract
- Selection bias in non-probability samples presents a significant challenge in IoT device diagnostics data analysis, where incomplete or systematically missing data can compromise the validity of analytical insights. Our project proposes a comprehensive statistical framework to correct for that bias, using real-world data from Axis Communications AB. The project explores three primary methodologies: pseudo-weights based on selection models such as logistic regression, random forest, and the light gradient boosting machine; raking adjustment aligned with known categorical proportions of the auxiliary variables; and a Bayesian model-based method that accounts for uncertainty in the selection process. Two sampling scenarios are simulated to evaluate method performance, one dependent solely on the auxiliary variables and another incorporating both the auxiliary and response variables. Empirical results show that raking adjustments yield the most accurate correction when the categorical proportions of the auxiliary variables in the population are available, while pseudo-weights are effective when enough information on the auxiliary variables in the population is given. The Bayesian model-based method, although sensitive to the assumptions of the model, demonstrates substantial correction in cases of severe selection bias. Our project contributes a practical bias correction toolkit for improving the representativeness and reliability of inferences drawn from non-random IoT datasets.
- Popular Abstract
- In today’s world of smart devices (such as security cameras, sensors, and connected electronics), massive amounts of data are continuously collected to support diagnostics and decision-making. However, not all data are captured equally. Some devices may not report all the information due to technical, regional, or policy limitations, leading to what is known as selection bias. This bias means that the collected sample does not represent the full population correctly, and that any conclusions drawn from it could be misleading.
Our thesis addresses this challenge by developing statistical correction methods that help fix the bias in Internet of Things (IoT) device diagnostic data. Using real-world data from Axis Communications AB, we propose and compare three approaches:
• Pseudo-weight: Estimating the likelihood that each data point is included in the sample dataset using selection models, then reweighting the data accordingly.
• Raking Adjustment: Refining the correction results using categorical proportions of the auxiliary variables in the population.
• Bayesian Model-Based Method: Considering the relationship between the selection process and the auxiliary and response variables at the method level.
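As a rough illustration of the pseudo-weight idea in the list above, the following Python sketch fits a selection model and reweights a biased sample. It is not the thesis implementation: the logistic regression is fitted with plain gradient ascent, the population, the inclusion pattern, and the response values are invented toy data, and all function names (`fit_selection_model`, `pseudo_weights`, and so on) are hypothetical.

```python
import math

def inclusion_probability(beta, x):
    """Logistic model: estimated P(selected = 1 | x); beta[0] is the intercept."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_selection_model(X, selected, lr=0.5, n_iter=5000):
    """Fit the logistic selection model by gradient ascent on the mean log-likelihood."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for x, s in zip(X, selected):
            err = s - inclusion_probability(beta, x)
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * x[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def pseudo_weights(beta, X_sample):
    """Pseudo-weight = inverse of the estimated inclusion probability."""
    return [1.0 / inclusion_probability(beta, x) for x in X_sample]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy population (assumption): one binary auxiliary variable x.
# 60 devices have x = 0 and 40 have x = 1; the response depends on x.
X_pop = [[0.0]] * 60 + [[1.0]] * 40
y_pop = [1.0 + 2.0 * x[0] for x in X_pop]

# Biased selection (assumption): 6 of 60 devices with x = 0 report,
# but 20 of 40 devices with x = 1 report.
selected = [1] * 6 + [0] * 54 + [1] * 20 + [0] * 20

beta = fit_selection_model(X_pop, selected)
X_s = [x for x, s in zip(X_pop, selected) if s]
y_s = [y for y, s in zip(y_pop, selected) if s]
w = pseudo_weights(beta, X_s)

population_mean = sum(y_pop) / len(y_pop)
naive_mean = sum(y_s) / len(y_s)        # over-represents x = 1
corrected_mean = weighted_mean(y_s, w)  # reweighted towards the population

print(population_mean, naive_mean, corrected_mean)
```

In this toy setting selection depends only on the auxiliary variable, so the reweighted mean recovers the population mean. When selection also depends on the response, which is the second scenario described in the abstract, weights of this kind would generally not remove all of the bias.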
Simulated sampling experiments show that when categorical proportions of the auxiliary variables in the population are available, as well as population information on some of the auxiliary variables, the raking adjustment provides the most accurate corrections. When population information on the auxiliary variables is known, the pseudo-weight correction method is effective. The Bayesian model-based method, which is more sensitive to model assumptions, performs well in cases where the selection mechanism is uncertain.
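The raking adjustment mentioned above is commonly implemented as iterative proportional fitting: weights are rescaled one auxiliary variable at a time until the weighted category shares match the known population proportions. The Python sketch below shows that general technique under stated assumptions. The variable names (`region`, `model`), the sample, and the target proportions are invented for illustration, and the functions `rake` and `weighted_share` are hypothetical, not taken from the thesis.

```python
def rake(sample, margins, n_iter=200, tol=1e-12):
    """Iterative proportional fitting.

    sample:  list of dicts mapping variable name -> category
    margins: dict mapping variable name -> {category: population proportion}
    Returns one weight per sample unit; the weights sum to 1.
    Assumes every category with a target proportion occurs in the sample.
    """
    weights = [1.0 / len(sample)] * len(sample)
    for _ in range(n_iter):
        max_change = 0.0
        for var, target in margins.items():
            # Current weighted share of each category of this variable.
            totals = {}
            for unit, w in zip(sample, weights):
                totals[unit[var]] = totals.get(unit[var], 0.0) + w
            # Rescale so the weighted shares match the population shares.
            factors = {c: target[c] / totals[c] for c in totals}
            weights = [w * factors[unit[var]] for unit, w in zip(sample, weights)]
            max_change = max(max_change, max(abs(f - 1.0) for f in factors.values()))
        if max_change < tol:
            break
    return weights

def weighted_share(sample, weights, var, category):
    """Weighted share of units whose variable `var` equals `category`."""
    hit = sum(w for unit, w in zip(sample, weights) if unit[var] == category)
    return hit / sum(weights)

# Toy sample (assumption): device model A and region EU are over-represented.
sample = (
    [{"region": "EU", "model": "A"}] * 4
    + [{"region": "EU", "model": "B"}] * 1
    + [{"region": "US", "model": "A"}] * 3
    + [{"region": "US", "model": "B"}] * 2
)

# Known population proportions for each auxiliary variable (assumption).
margins = {
    "region": {"EU": 0.5, "US": 0.5},
    "model": {"A": 0.4, "B": 0.6},
}

weights = rake(sample, margins)
print(weighted_share(sample, weights, "region", "EU"))
print(weighted_share(sample, weights, "model", "A"))
```

Raking needs only the marginal proportions of each auxiliary variable, not their joint distribution, which is why it applies when categorical proportions in the population are the only information available.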
Overall, the project contributes practical methods for improving the reliability of analyses of IoT datasets in which data are missing through a non-random mechanism. These methods allow for more accurate, fair, and representative insights, supporting better product evaluation and smarter decisions in connected technology environments.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9206833
- author
- Wang, Xiaofei LU and Ye, Yizhong LU
- supervisor
- organization
- course
- MASM02 20251
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Non-Probability Sampling, Pseudo-weights, Raking Adjustment, Bayesian Model-Based Method, Bias Correction, Survey Statistics
- publication/series
- Master's Thesis in Mathematical Sciences
- report number
- LUNFMS-3135-2025
- ISSN
- 1404-6342
- other publication id
- 2025:E87
- language
- English
- id
- 9206833
- date added to LUP
- 2025-06-30 11:23:53
- date last changed
- 2025-06-30 11:23:53
@misc{9206833,
  abstract     = {{Selection bias in non-probability samples presents a significant challenge in IoT device diagnostics data analysis, where incomplete or systematically missing data can compromise the validity of analytical insights. Our project proposes a comprehensive statistical framework to correct for that bias, using real-world data from Axis Communications AB. The project explores three primary methodologies: pseudo-weights based on selection models such as logistic regression, random forest, and the light gradient boosting machine; raking adjustment aligned with known categorical proportions of the auxiliary variables; and a Bayesian model-based method that accounts for uncertainty in the selection process. Two sampling scenarios are simulated to evaluate method performance, one dependent solely on the auxiliary variables and another incorporating both the auxiliary and response variables. Empirical results show that raking adjustments yield the most accurate correction when the categorical proportions of the auxiliary variables in the population are available, while pseudo-weights are effective when enough information on the auxiliary variables in the population is given. The Bayesian model-based method, although sensitive to the assumptions of the model, demonstrates substantial correction in cases of severe selection bias. Our project contributes a practical bias correction toolkit for improving the representativeness and reliability of inferences drawn from non-random IoT datasets.}},
  author       = {{Wang, Xiaofei and Ye, Yizhong}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Bias correction of diagnostics data from IoT devices}},
  year         = {{2025}},
}