The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time
(2025) In Transactions on Machine Learning Research 2025-September.
- Abstract
Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as "present" if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.
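The labeling process described in the abstract can be illustrated with a minimal sketch. It is not the paper's formal model: it assumes that "sufficiently covers" means a segment overlaps at least a fraction `gamma` of some event's duration, that accuracy is the fraction of the timeline where the segment label agrees with the true event activity, and that events do not overlap each other. All function names and the parameter `gamma` are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds, offset > onset


def overlap(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def fixed_segments(total: float, seg_len: float) -> List[Interval]:
    """Split [0, total] into consecutive segments of length seg_len.

    The last segment is truncated if seg_len does not divide total.
    """
    n = math.ceil(total / seg_len)
    return [(i * seg_len, min((i + 1) * seg_len, total)) for i in range(n)]


def oracle_segments(events: List[Interval], total: float) -> List[Interval]:
    """Segments whose boundaries coincide with the true onsets and offsets."""
    bounds = sorted({0.0, total, *[t for ev in events for t in ev]})
    return list(zip(bounds[:-1], bounds[1:]))


def weak_labels(events: List[Interval], segs: List[Interval],
                gamma: float) -> List[bool]:
    """Assumed criterion: a segment is 'present' if it covers at least a
    fraction gamma of the duration of some event."""
    return [
        any(overlap(seg, ev) >= gamma * (ev[1] - ev[0]) for ev in events)
        for seg in segs
    ]


def label_accuracy(events: List[Interval], segs: List[Interval],
                   labels: List[bool], total: float) -> float:
    """Fraction of the timeline where the segment label matches the truth.

    Assumes the events do not overlap each other.
    """
    correct = 0.0
    for seg, present in zip(segs, labels):
        active = sum(overlap(seg, ev) for ev in events)
        correct += active if present else (seg[1] - seg[0]) - active
    return correct / total


if __name__ == "__main__":
    total, gamma = 10.0, 0.5
    events = [(3.0, 5.0)]  # one toy event, e.g. a birdsong from 3 s to 5 s
    for name, segs in [("fixed", fixed_segments(total, 2.0)),
                       ("oracle", oracle_segments(events, total))]:
        labels = weak_labels(events, segs, gamma)
        acc = label_accuracy(events, segs, labels, total)
        print(name, "annotations:", len(segs), "accuracy:", acc)
```

In this toy setup, 2-second fixed segments need 5 annotations and reach an accuracy of 0.8, while the oracle segmentation needs 3 annotations and reaches 1.0. This illustrates the kind of gap the abstract refers to, but it is a single hand-picked example and not a result from the paper.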
- author
- Martinsson, John LU ; Virtanen, Tuomas ; Sandsten, Maria LU and Mogren, Olof
- organization
- Mathematical Statistics
- Sentio: Integrated Sensors and Adaptive Technology for Sustainable Products and Manufacturing
- LU Profile Area: Light and Materials
- LU Profile Area: Natural and Artificial Cognition
- LTH Profile Area: Nanoscience and Semiconductor Technology
- LTH Profile Area: AI and Digitalization
- LTH Profile Area: Engineering Health
- NanoLund: Centre for Nanoscience
- ELLIIT: the Linköping-Lund initiative on IT and mobile communication
- eSSENCE: The e-Science Collaboration
- Statistical Signal Processing Group (research group)
- publishing date
- 2025
- type
- Contribution to journal
- publication status
- published
- in
- Transactions on Machine Learning Research
- volume
- 2025-September
- external identifiers
- scopus:105017875586
- ISSN
- 2835-8856
- language
- English
- LU publication?
- yes
- id
- 4c6e0aae-5f8b-439c-988a-fe9bfcc213f2
- alternative location
- https://openreview.net/pdf?id=tTw8wXBQ18
- date added to LUP
- 2025-12-05 12:00:48
- date last changed
- 2025-12-05 12:02:07
@article{4c6e0aae-5f8b-439c-988a-fe9bfcc213f2,
abstract = {{Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as "present" if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.}},
author = {{Martinsson, John and Virtanen, Tuomas and Sandsten, Maria and Mogren, Olof}},
issn = {{2835-8856}},
language = {{eng}},
journal = {{Transactions on Machine Learning Research}},
title = {{The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time}},
url = {{https://openreview.net/pdf?id=tTw8wXBQ18}},
volume = {{2025-September}},
year = {{2025}},
}