Clustering and Anomaly Detection in Financial Trading Data

Norlander, Erik

Norlander, Erik (LU) (2019) In Master's Theses in Mathematical Sciences, FMSM01 20191
Mathematical Statistics

Abstract: In this thesis we propose a new form of Variational Autoencoder called the Conditional Latent Space Variational Autoencoder, or CL-VAE. By conditioning on a known label in a dataset, we can decide which points are mapped to which prior distribution. This makes the latent space easier to interpret and separates the classes further. It also partly relieves the tug-of-war between reconstruction loss and KL divergence, because we are no longer trying to map all the data to one simple prior distribution but instead give every class its own.

With this method, we can customize the latent space for a specific task such as clustering or anomaly detection. We can send in any kind of data, numerical or categorical, and the points will be projected onto a more easily understandable structure. This is a big advantage over other dimensionality reduction algorithms such as PCA, which only deals with continuous variables.

The method is applied to trading data from Handelsbanken Capital Markets, a Swedish investment bank. We show that it can be used in modeling the trading behavior of the traders at the bank by performing clustering and anomaly detection in the latent space. CL-VAE outperforms the regular VAE on all our metrics and seems to prepare the data for analysis in a straightforward and interpretable manner. We also discuss the issue of unsupervised anomaly detection at length and use a new form of metric for such problems called the EM-MV measure.

Finally, the result is a system that can be used to model trading behavior and perform clustering and anomaly detection on the transformed data. We have performed the analysis by conditioning on the traders, but the model is not limited to that label. Instead, we can condition on counterparties, instruments, portfolios or any other label in the dataset.

Popular Abstract: Recently, financial crime has become a major issue for financial institutions. Whether it is money laundering, insider trading or a related crime, it strongly undermines trust in the financial system and its stability. As methods of committing crimes become more sophisticated, so do the methods for detecting them. Like many difficult problems these days, this one can be approached with machine learning.

We have a dataset from Handelsbanken Capital Markets consisting of trades made by the traders at the bank. It has a very large number of features, many of them categorical, which rules out traditional techniques for reducing the number of features, so a new approach is needed.

To detect strange behavior, we want to separate what would be considered normal from what is anomalous. This can be done with an algorithm called Isolation Forest. It repeatedly divides the dataset into smaller and smaller chunks until each trade sits alone in its own chunk, and then measures how many divisions were needed to get there. If only a few divisions were needed, the data point was easy to separate and is therefore considered anomalous.
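
The splitting idea can be illustrated with a minimal sketch. It is not the thesis implementation and not a full Isolation Forest (which also subsamples the data and normalizes the scores); it only counts random splits until a point is alone. The function names, the toy 2-D data and the parameters `n_trees` and `max_depth` are assumptions made for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a random feature and a random threshold within its observed range.
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # simplification: stop if this feature cannot be split
    threshold = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < threshold:
        same_side = [row for row in data if row[dim] < threshold]
    else:
        same_side = [row for row in data if row[dim] >= threshold]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=100, seed=0):
    """Average the split count over many random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (hypothetical): a dense cloud of "normal" points plus one far-away point.
data_rng = random.Random(1)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = normal + [outlier]

depth_outlier = mean_depth(outlier, data)
depth_typical = sum(mean_depth(p, data) for p in normal[:10]) / 10

print(f"mean splits to isolate the outlier: {depth_outlier:.2f}")
print(f"mean splits to isolate a typical point: {depth_typical:.2f}")
```

The far-away point needs fewer splits on average than the points inside the cloud, which is exactly the signal used to flag it as anomalous.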

An issue with this approach is that it is very dependent on the shape of the data. That is, if the shapes are irregular and unpredictable it will have a harder time telling normal and anomalous behavior apart. Therefore, we must prepare the data in a helpful way for this algorithm.

This is where the Variational Autoencoder (VAE) comes in. It is actually a generative model, meaning that it estimates the distribution of some data and can then generate new samples that look like they came from the original dataset. To do this, the VAE takes an input sample, compresses it into a smaller space called the latent space, and then tries to reconstruct the data from the latent space using neural networks.

This would be true for the simpler, regular autoencoder as well, but the VAE has a statistical spin to it. Instead of just training the algorithm to recreate the input data, you also want the samples in the latent space to be close to some probability distribution that you choose yourself. For simplicity, you usually choose a Normal distribution. An issue with this is that there is only so much room in a Normal distribution. The classes that form often overlap and create strange shapes that can be hard to interpret.
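
For readers who prefer a formula, the two competing goals described above correspond to the two terms of the standard VAE training objective. The notation below is the conventional one and is an assumption here, since the record itself gives no formulas: x is an input, z its latent representation, q_phi the encoder, p_theta the decoder and p(z) the chosen prior.

```latex
\mathcal{L}(\theta, \phi; x)
  = \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction loss}}
  + \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{closeness to the prior}},
  \qquad p(z) = \mathcal{N}(0, I)
```

The first term rewards latent points that are distinct enough to reconstruct the input, while the second pulls every latent point toward the same Normal distribution, which is why the classes end up crowded together.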

Instead, we propose the CL-VAE, which stands for Conditional Latent Space VAE. This approach gives each trader their own Normal distribution, which can then move around freely in the latent space as the algorithm learns where it fits in relation to all other traders. With this approach we find that traders who work in similar markets end up close to each other and far away from the other categories that form. The trades also end up having predictable shapes that are easier for the Isolation Forest algorithm to work with, as seen in Figure 1.
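
A small numerical sketch can make the effect of a per-trader prior concrete. It assumes, purely for illustration, that each trader's prior is a unit-variance Normal distribution with its own mean and that the encoder outputs a diagonal Gaussian; the record does not state the exact form used in the thesis. The trader names, the hand-picked means and the encoder output are hypothetical, and in the approach described above the means would be learned rather than fixed.

```python
import math

def kl_to_unit_gaussian(mu, log_var, prior_mean):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(prior_mean, I) )."""
    return 0.5 * sum(
        math.exp(lv) + (m - p) ** 2 - 1.0 - lv
        for m, lv, p in zip(mu, log_var, prior_mean)
    )

# Hypothetical per-trader prior means in a 2-D latent space (fixed by hand here).
class_means = {"trader_a": [-3.0, 0.0], "trader_b": [3.0, 0.0]}

# Hypothetical encoder output for one trade made by trader_b.
mu, log_var = [2.8, 0.1], [0.0, 0.0]

kl_standard = kl_to_unit_gaussian(mu, log_var, [0.0, 0.0])            # single shared prior
kl_own = kl_to_unit_gaussian(mu, log_var, class_means["trader_b"])    # the trader's own prior
kl_other = kl_to_unit_gaussian(mu, log_var, class_means["trader_a"])  # another trader's prior

print(f"KL to shared N(0, I) prior: {kl_standard:.3f}")
print(f"KL to own class prior:      {kl_own:.3f}")
print(f"KL to other class prior:    {kl_other:.3f}")
```

Under these assumptions the penalty for sitting near the trader's own prior is small, while the same latent point would be penalized heavily under a single shared prior or under another trader's prior. This is one way to read the claim that giving every class its own prior eases the conflict with reconstruction.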

By using this method, a system can be developed that learns the regular behavior of a trader and scores new trades on their "degree of abnormality" as they come in.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8983838

author: Norlander, Erik (LU)
supervisor: Alexandros Sopasakis (LU)
organization: Mathematical Statistics
alternative title: Using the Conditional Latent Space Variational Autoencoder
course: FMSM01 20191
year: 2019
type: H2 - Master's Degree (Two Years)
subject: Mathematics and Statistics
keywords: Variational Autoencoder, Generative Models, Latent Space, Dimensionality Reduction, Unsupervised Learning, Anomaly Detection, Clustering, Gaussian Mixture Models, Isolation Forest
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3370-2019
ISSN: 1404-6342
other publication id: 2019:E35
language: English
id: 8983838
date added to LUP: 2019-10-08 14:14:39
date last changed: 2019-10-08 14:14:39

@misc{8983838,
  abstract     = {{In this thesis we propose a new form of Variational Autoencoder called the Conditional Latent Space Variational Autoencoder, or CL-VAE. By conditioning on a known label in a dataset, we can decide which points are mapped to which prior distribution. This makes the latent space easier to interpret and separates the classes further. It also partly relieves the tug-of-war between reconstruction loss and KL divergence, because we are no longer trying to map all the data to one simple prior distribution but instead give every class its own.

With this method, we can customize the latent space for a specific task such as clustering or anomaly detection. We can send in any kind of data, numerical or categorical, and the points will be projected onto a more easily understandable structure. This is a big advantage over other dimensionality reduction algorithms such as PCA, which only deals with continuous variables.

The method is applied to trading data from Handelsbanken Capital Markets, a Swedish investment bank. We show that it can be used in modeling the trading behavior of the traders at the bank by performing clustering and anomaly detection in the latent space. CL-VAE outperforms the regular VAE on all our metrics and seems to prepare the data for analysis in a straightforward and interpretable manner. We also discuss the issue of unsupervised anomaly detection at length and use a new form of metric for such problems called the EM-MV measure.

Finally, the result is a system that can be used to model trading behavior and perform clustering and anomaly detection on the transformed data. We have performed the analysis by conditioning on the traders, but the model is not limited to that label. Instead, we can condition on counterparties, instruments, portfolios or any other label in the dataset.}},
  author       = {{Norlander, Erik}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Clustering and Anomaly Detection in Financial Trading Data}},
  year         = {{2019}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
