Bias correction of diagnostics data from IoT devices

Wang, Xiaofei; Ye, Yizhong


Wang, Xiaofei (LU) and Ye, Yizhong (LU) (2025). In Master's Theses in Mathematical Sciences, MASM02 20251
Mathematical Statistics

Abstract: Selection bias in non-probability samples presents a significant challenge in IoT device diagnostics data analysis, where incomplete or systematically missing data can compromise the validity of analytical insights. Our project proposes a comprehensive statistical framework to correct for that bias, using real-world data from Axis Communications AB. The project explores three primary methodologies: pseudo-weights based on selection models such as logistic regression, random forest, and the light gradient boosting machine; raking adjustment aligned with known categorical proportions of the auxiliary variables; and a Bayesian model-based method that accounts for uncertainty in the selection process. Two sampling scenarios are simulated, one dependent solely on the auxiliary variables and another incorporating both the auxiliary and response variables, to evaluate method performance. Empirical results show that raking adjustments yield the most accurate correction results when the categorical proportions of the auxiliary variables in the population are available, while pseudo-weights are effective given enough information about the auxiliary variables in the population. The Bayesian model-based method, although sensitive to the assumptions of the model, demonstrates substantial correction in cases of severe selection bias. Our project contributes a practical bias correction toolkit for improving the representativeness and reliability of inferences drawn from non-random IoT datasets.
Popular Abstract: In today’s world of smart devices, such as security cameras, sensors, and connected electronics, massive amounts of data are continuously collected to support diagnostics and decision-making. However, not all data are captured equally. Some devices may not report all the information due to technical, regional, or policy limitations, leading to what is known as selection bias. This bias means that the collected sample does not represent the full population correctly, and that any conclusions drawn from it could be misleading.
Our thesis addresses this challenge by developing statistical correction methods that help fix the bias in Internet of Things (IoT) device diagnostic data. Using real-world data from Axis Communications AB, we propose and compare three approaches:
• Pseudo-weight: Estimating the likelihood that each data point is included in the sample dataset using selection models, then reweighting the data accordingly.
• Raking Adjustment: Refining the correction results using categorical proportions of the auxiliary variables in the population.
• Bayesian Model-Based Method: Explicitly modeling the relationship between the selection process and the auxiliary and response variables.
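The pseudo-weight idea in the list above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy population, the single binary auxiliary variable, and the function names `fit_selection_model` and `pseudo_weighted_mean` are assumptions made for illustration. The selection model is a plain logistic regression fitted by gradient ascent, and each sampled unit is weighted by the inverse of its estimated inclusion probability.

```python
import math

def fit_selection_model(x, s, lr=1.0, steps=2000):
    """Fit P(s=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, si in zip(x, s):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += si - p
            g1 += (si - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def pseudo_weighted_mean(x_sample, y_sample, b0, b1):
    """Weight each sampled unit by 1 / estimated inclusion probability."""
    # 1 / sigmoid(z) equals 1 + exp(-z)
    weights = [1.0 + math.exp(-(b0 + b1 * xi)) for xi in x_sample]
    return sum(w * y for w, y in zip(weights, y_sample)) / sum(weights)

# Toy population (assumed): 100 units with x=0 and 100 units with x=1.
# Units with x=1 report far less often, so the raw sample is biased.
x_pop = [0] * 100 + [1] * 100
s_pop = [1] * 50 + [0] * 50 + [1] * 10 + [0] * 90   # 1 = unit is in the sample
y_pop = [1.0 if xi == 0 else 3.0 for xi in x_pop]   # true population mean is 2.0

x_sample = [xi for xi, si in zip(x_pop, s_pop) if si == 1]
y_sample = [yi for yi, si in zip(y_pop, s_pop) if si == 1]

naive_mean = sum(y_sample) / len(y_sample)
b0, b1 = fit_selection_model(x_pop, s_pop)
corrected_mean = pseudo_weighted_mean(x_sample, y_sample, b0, b1)

print(f"naive sample mean:    {naive_mean:.3f}")
print(f"pseudo-weighted mean: {corrected_mean:.3f}")
```

In this toy setup the naive sample mean underestimates the population mean of 2.0 because units with x=1 are under-represented, while the pseudo-weighted mean recovers it. The sketch assumes the auxiliary variable is known for every unit in the population, which is the kind of population information the pseudo-weight approach relies on.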
Simulated sampling experiments show that the raking adjustment provides the most accurate corrections when the categorical proportions of the auxiliary variables in the population are available, along with population information on some of the auxiliary variables. The pseudo-weight correction method is effective when population information on the auxiliary variables is known. The Bayesian model-based method, although more sensitive to model assumptions, performs well in cases with uncertainty in the selection mechanism.
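As a rough illustration of the raking adjustment discussed above, the following Python sketch applies iterative proportional fitting to a toy sample. The variable names (`region`, `model`), the sample counts, and the population proportions are hypothetical and not taken from the thesis data; the function `rake` is likewise an assumed helper, not the authors' code.

```python
def rake(rows, margins, max_iter=200, tol=1e-12):
    """Iteratively rescale weights so weighted category shares match the target margins."""
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        largest_shift = 0.0
        for var, target in margins.items():
            total = sum(weights)
            # Current weighted total per category of this variable
            current = {}
            for row, w in zip(rows, weights):
                current[row[var]] = current.get(row[var], 0.0) + w
            # Scale each category to its target share of the total weight
            factors = {lvl: target[lvl] * total / current[lvl] for lvl in current}
            weights = [w * factors[row[var]] for row, w in zip(rows, weights)]
            largest_shift = max(largest_shift,
                                max(abs(f - 1.0) for f in factors.values()))
        if largest_shift < tol:
            break
    return weights

def weighted_share(rows, weights, var, level):
    """Weighted proportion of units whose `var` equals `level`."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == level)
    return hit / sum(weights)

# Hypothetical sample of 100 devices with two categorical auxiliary variables
sample = (
    [{"region": "EU", "model": "A"}] * 40
    + [{"region": "EU", "model": "B"}] * 20
    + [{"region": "US", "model": "A"}] * 30
    + [{"region": "US", "model": "B"}] * 10
)

# Assumed known population proportions for each auxiliary variable
margins = {
    "region": {"EU": 0.5, "US": 0.5},
    "model": {"A": 0.4, "B": 0.6},
}

weights = rake(sample, margins)
print(weighted_share(sample, weights, "region", "EU"))
print(weighted_share(sample, weights, "model", "A"))
```

Only the marginal proportions of each variable are needed here, not their joint distribution, which is why raking fits the situation where categorical proportions of the auxiliary variables in the population are available.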
Overall, the project contributes practical methods for improving the reliability of analyses based on IoT datasets with non-random missing-data mechanisms. These methods allow for more accurate, fair, and representative insights, supporting better product evaluation and smarter decisions in connected technology environments.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9206833

author

Wang, Xiaofei (LU) and Ye, Yizhong (LU)

supervisor

Johan Lindström (LU)

organization

Mathematical Statistics

course

MASM02 20251

year

2025

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Non-Probability Sampling, Pseudo-weights, Raking Adjustment, Bayesian Model-Based Method, Bias Correction, Survey Statistics

publication/series

Master's Theses in Mathematical Sciences

report number

LUNFMS-3135-2025

ISSN

1404-6342

other publication id

2025:E87

language

English

id

9206833

date added to LUP

2025-06-30 11:23:53

date last changed

2025-11-28 16:23:31

@misc{9206833,
  abstract     = {{Selection bias in non-probability samples presents a significant challenge in IoT device diagnostics data analysis, where incomplete or systematically missing data can compromise the validity of analytical insights. Our project proposes a comprehensive statistical framework to correct for that bias, using real-world data from Axis Communications AB. The project explores three primary methodologies: pseudo-weights based on selection models such as logistic regression, random forest, and the light gradient boosting machine; raking adjustment aligned with known categorical proportions of the auxiliary variables; and a Bayesian model-based method that accounts for uncertainty in the selection process. Two sampling scenarios are simulated, one dependent solely on the auxiliary variables and another incorporating both the auxiliary and response variables, to evaluate method performance. Empirical results show that raking adjustments yield the most accurate correction results when the categorical proportions of the auxiliary variables in the population are available, while pseudo-weights are effective given enough information about the auxiliary variables in the population. The Bayesian model-based method, although sensitive to the assumptions of the model, demonstrates substantial correction in cases of severe selection bias. Our project contributes a practical bias correction toolkit for improving the representativeness and reliability of inferences drawn from non-random IoT datasets.}},
  author       = {{Wang, Xiaofei and Ye, Yizhong}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Bias correction of diagnostics data from IoT devices}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
