Robust Statistical Jump Models with Feature Selection

Persson, Jonatan

Persson, Jonatan (LU) (2023). In Master's Theses in Mathematical Sciences, MASM02 20231
Mathematical Statistics

Abstract: Cluster analysis is a large area of statistics and machine learning. It concerns the design of algorithms that let computers automatically sort a set of observations into groups in a reasonable way, without any prior information about which observations belong to which group. It is part of the larger field of unsupervised learning within machine learning.

Many of these algorithms are designed with a specific task in mind. One example is the class of statistical jump models, introduced by Bemporad et al. [2018] and developed further by Nystrup et al. [2021]. These models are quite flexible, and one of their uses is the clustering of time-series data.

In this thesis we modify these jump models so that the measure of distance between observations can be chosen more freely. This opens up the possibility of designing clustering algorithms for time-series data that are more resilient to data containing many outliers, or that cluster categorical time-series data.
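The idea of a jump model with a freely chosen distance can be illustrated with a small sketch. This record does not describe the thesis's actual algorithm, so everything here is an assumption made for illustration: the objective is taken to be the sum of per-observation distances to a state representative plus a fixed penalty for each change of state, each state is represented by a medoid (an observation from the data), and the names `fit_states`, `fit_jump_model`, and `jump_penalty` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fit_states(loss: List[List[float]], jump_penalty: float) -> List[int]:
    """Dynamic program: find the state path that minimises the total loss
    plus jump_penalty for every change of state between consecutive steps."""
    n_obs, n_states = len(loss), len(loss[0])
    cost = [row[:] for row in loss]
    # Backward pass: cost[t][k] = cheapest cost of steps t..end given s_t = k.
    for t in range(n_obs - 2, -1, -1):
        for k in range(n_states):
            cost[t][k] += min(
                cost[t + 1][j] + (jump_penalty if j != k else 0.0)
                for j in range(n_states)
            )
    # Forward pass: read off the optimal path.
    path = [min(range(n_states), key=lambda k: cost[0][k])]
    for t in range(1, n_obs):
        prev = path[-1]
        path.append(min(
            range(n_states),
            key=lambda j: cost[t][j] + (jump_penalty if j != prev else 0.0),
        ))
    return path


def fit_jump_model(
    obs: Sequence[T],
    distance: Callable[[T, T], float],
    n_states: int,
    jump_penalty: float,
    n_iter: int = 10,
) -> Tuple[List[int], List[T]]:
    """Alternate between fitting the state path and updating the medoids.
    Assumption: medoids stand in for model parameters so that any distance
    function can be used; this is not taken from the thesis."""
    n = len(obs)
    # Assumption: initialise with observations from evenly spaced time blocks.
    centers = [obs[(2 * i + 1) * n // (2 * n_states)] for i in range(n_states)]

    def assign(cs: List[T]) -> List[int]:
        return fit_states([[distance(y, c) for c in cs] for y in obs], jump_penalty)

    for _ in range(n_iter):
        states = assign(centers)
        new_centers = []
        for k in range(n_states):
            members = [y for y, s in zip(obs, states) if s == k]
            if not members:
                new_centers.append(centers[k])  # keep the medoid of an empty state
                continue
            # Medoid: the member with the smallest total distance to the others.
            new_centers.append(
                min(members, key=lambda m: sum(distance(m, y) for y in members))
            )
        if new_centers == centers:
            break
        centers = new_centers
    return assign(centers), centers


# Toy series with one outlier (9.0) inside the first regime.
series = [0.1, 9.0, 0.0, 0.2, 0.1, 5.0, 5.2, 4.9, 5.1, 5.0]
states, centers = fit_jump_model(
    series, distance=lambda a, b: abs(a - b), n_states=2, jump_penalty=3.0
)
print(states)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(centers)  # [0.1, 5.0]
```

In this toy run the absolute-value distance and the jump penalty together keep the outlier in the first state. Because `distance` is an ordinary function, a mismatch count between categories could be passed instead for categorical series.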
Popular Abstract (translated from Swedish): A well-studied topic in statistics and machine learning is cluster analysis. Research in this field aims to design models and algorithms that allow machines and computers to automatically divide data into groups in a natural way, without human involvement.

A common example is data from the financial market, e.g. daily returns of various securities portfolios. Such data form a time series in which the returns fluctuate over time. We can imagine that the market is in different states over time, e.g. a volatile state with high risk or a calmer state in which market risk is lower. A clustering problem would then be to divide this time series into different market states.

Many of the most widely used clustering algorithms do not handle this task particularly well, however, because they are not designed for time series. Some algorithms have nevertheless proven to work very well for clustering time series, e.g. statistical jump models, developed by Bemporad et al. in 2018 and developed further by Nystrup et al. in 2021.

The problem with the statistical jump models developed by Nystrup et al. is that they are built for one particular way of measuring the distance between observations. This distance measure is not always suitable. For example, the data may consist of categories rather than numbers. In such cases the aforementioned jump model fails. In this master's project we have addressed exactly this problem and developed a method that lets us choose the distance measure more freely. This enables a wider range of applications than before.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9125947

author: Persson, Jonatan (LU)
supervisor: Erik Lindström (LU)
organization: Mathematical Statistics
course: MASM02 20231
year: 2023
type: H2 - Master's Degree (Two Years)
subject: Mathematics and Statistics
keywords: Clustering, Jump, Feature selection, Robust
publication/series: Master's Theses in Mathematical Sciences
report number: LUNFMS-3121-2023
ISSN: 1404-6342
other publication id: 2023:E54
language: English
id: 9125947
date added to LUP: 2023-06-19 11:12:20
date last changed: 2023-06-29 12:53:37

@misc{9125947,
  abstract     = {{A large area in statistics and machine learning is cluster analysis. This field of research concerns the design of algorithms that allow computers to automatically categorize a set of observations into different groups in a reasonable way, without any prior information about which observations belong to which group. It is a part of the larger field of unsupervised learning within machine learning.

Many of these algorithms are designed with a specific task in mind. One example is the class of so-called statistical jump models, developed by Bemporad et al. [2018] and further by Nystrup et al. [2021]. These models are quite flexible, and one of their uses is the clustering of time-series data.

In this thesis we have made a modification of these jump models that allows us to more freely choose how to measure distance between different observations. This opens up the possibility of designing clustering algorithms for time-series data that are more resilient to data containing many outliers, or that cluster categorical time-series data.}},
  author       = {{Persson, Jonatan}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Robust Statistical Jump Models with Feature Selection}},
  year         = {{2023}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
