
Clustering and Anomaly Detection in Financial Trading Data

Norlander, Erik LU (2019) In Master's Theses in Mathematical Sciences FMSM01 20191
Mathematical Statistics
Abstract
In this thesis we propose a new form of Variational Autoencoder called the Conditional Latent Space Variational Autoencoder, or CL-VAE. By conditioning on a known label in a dataset, we can decide which points are mapped to which prior distribution. This makes the latent space more interpretable and separates the classes further. It also mitigates the tug-of-war between the reconstruction loss and the KL divergence, since we are no longer trying to map all the data to a single simple prior distribution but instead give every class its own.

With this method, we can customize the latent space for a specific task such as clustering or anomaly detection. This means that we can feed in any kind of data, be it numerical or categorical, and the points will be projected onto a more easily understandable structure. This is a big advantage over other dimensionality reduction algorithms such as PCA, which handle only continuous variables.

The method is applied to trading data from Handelsbanken Capital Markets, a Swedish investment bank. We show that it can be used in modeling the trading behavior of the traders at the bank by performing clustering and anomaly detection in the latent space. CL-VAE outperforms the regular VAE on all our metrics and seems to prepare the data for analysis in a straightforward and interpretable manner. We also discuss the issue of unsupervised anomaly detection at length and use a new form of metric for such problems called the EM-MV measure.

Finally, the result is a system that can be used to model trading behavior and perform clustering and anomaly detection on the transformed data. We have performed the analysis by conditioning on the traders, but the model is not limited to that label. Instead, we can condition on counterparties, instruments, portfolios or any other label in the dataset.
Popular Abstract
Recently, financial crime has become a major issue for financial institutions. Whether it is money laundering, insider trading or a related crime, it strongly undermines the trust in and stability of the financial system. As the methods of committing these crimes become more sophisticated, so must the methods for detecting them. Like many difficult problems these days, this one can be approached with machine learning.

We have a dataset from Handelsbanken Capital Markets consisting of trades made by the traders at the bank. It has a very large number of features, many of them categorical, which means that no traditional technique for reducing the number of features was applicable, so a new approach had to be found.

In order to detect strange behavior we want to separate what would be considered normal from what is anomalous. This can be done with an algorithm called Isolation Forest. It repeatedly splits the dataset into smaller and smaller chunks, and once every trade has been isolated, it measures how many splits were needed for each point. If a point required only a few splits, it was easy to separate and is therefore considered anomalous.
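As a concrete illustration, here is a minimal sketch of that idea using scikit-learn's IsolationForest. The arrays are invented stand-ins, since the real trade features are confidential:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy stand-in for the trade features: 200 "normal" points drawn from one
# distribution, plus two points placed far away from the rest.
rng = np.random.default_rng(0)
normal_trades = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal_trades, outliers])

# Points that take fewer random splits to isolate get lower scores,
# i.e. they are considered more anomalous.
forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = forest.fit_predict(X)    # -1 = anomaly, 1 = normal
scores = forest.score_samples(X)  # lower = more anomalous
```

The two far-away points end up among the lowest-scoring samples and are labeled -1, mirroring the "few splits means anomalous" intuition above.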

An issue with this approach is that it is very dependent on the shape of the data. That is, if the shapes are irregular and unpredictable, the algorithm has a harder time telling normal and anomalous behavior apart. Therefore, we must prepare the data in a way that helps the algorithm.

This is where the Variational Autoencoder (VAE) comes in. It is actually a generative model, meaning that it estimates the distribution of some data so that new samples can be generated that look like they came from the original dataset. To do this, the VAE takes an input sample, compresses it into a smaller space called the latent space, and then tries to reconstruct the original data from that latent representation, using neural networks for both steps.

This would be true for the simpler, regular autoencoder as well, but the VAE adds a statistical twist. Instead of just training the network to recreate the input data, you also want the samples in the latent space to lie close to some probability distribution that you choose yourself; for simplicity, this is usually a Normal distribution. One issue with this is that there is only so much room in a single Normal distribution: the classes that do form often overlap and create strange shapes that are hard to interpret.
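That statistical twist can be made concrete. Below is a sketch of the standard VAE objective, assuming a Gaussian encoder that outputs a mean and a log-variance, and using squared error as the reconstruction term; the function names are ours, not taken from the thesis code:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence KL(N(mu, sigma^2) || N(0, I)) per sample."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_hat, mu, log_var):
    # The reconstruction term pulls toward faithful decoding, while the KL
    # term pulls every encoding toward the single N(0, I) prior -- the
    # "tug of war" described in the text.
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when the encoder outputs the prior itself (mu = 0, log_var = 0) and grows as encodings drift away from it, which is why all classes get squeezed into the same region of the latent space.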

Instead, we propose the CL-VAE, which stands for Conditional Latent Space VAE. This approach gives each trader their own Normal distribution, which can then move around freely in the latent space as the algorithm learns where that trader fits in relation to all the others. With this approach we find that traders who work in similar markets end up close to each other and far away from the other groups that form. The trades also take on predictable shapes that are easier for the Isolation Forest algorithm to work with, as seen in Figure 1.
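A hypothetical sketch of the per-class prior idea: each trader label has its own prior mean in the latent space (with unit variance, for simplicity), and the KL term pulls each encoding toward the prior of its own label. All names and numbers below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def kl_to_class_prior(mu, log_var, prior_mean):
    """KL(N(mu, sigma^2) || N(prior_mean, I)), summed over latent dimensions."""
    return 0.5 * np.sum(
        np.exp(log_var) + (mu - prior_mean) ** 2 - 1.0 - log_var, axis=-1
    )

# One prior mean per trader label; during training these means can drift
# apart, so similar traders cluster together and dissimilar ones separate.
prior_means = {
    "trader_A": np.array([0.0, 0.0]),
    "trader_B": np.array([4.0, 0.0]),
}

# An encoding near trader_B's prior pays a small KL penalty under that
# prior and a large one under trader_A's.
mu, log_var = np.array([3.8, 0.1]), np.zeros(2)
```

Note that with prior_mean fixed at zero this reduces to the standard VAE KL term above, which is why the CL-VAE can be seen as the plain VAE with the single shared prior replaced by one movable prior per label.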

By using this method, a system can be developed that learns the regular behavior of a trader and, as new trades come in, scores them on their "degree of abnormality".
author: Norlander, Erik LU
alternative title: Using the Conditional Latent Space Variational Autoencoder
course: FMSM01 20191
year: 2019
type: H2 - Master's Degree (Two Years)
keywords: Variational Autoencoder, Generative Models, Latent Space, Dimensionality Reduction, Unsupervised Learning, Anomaly Detection, Clustering, Gaussian Mixture Models, Isolation Forest
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3370-2019
ISSN: 1404-6342
other publication id: 2019:E35
language: English
id: 8983838
date added to LUP: 2019-10-08 14:14:39
date last changed: 2019-10-08 14:14:39
@misc{8983838,
  abstract     = {In this thesis we propose a new form of Variational Autoencoder called the Conditional Latent Space Variational Autoencoder or CL-VAE. By conditioning on a known label in a dataset we can decide what points are being mapped to what prior distribution. This makes the latent space more understandable and separates the classes further. It also subverts the tug-of-war effect between reconstruction loss and KL-divergence somewhat. This is because we're not trying to map all the data to one simple prior distribution, but rather giving every class its own.

With this method, we can customize the latent space for a specific task like clustering or anomaly detection. This means that we can send in any kind of data, be it numerical or categorical, and the points will be projected to some more easily understandable structure. This is a big advantage over other dimensionality reduction algorithms like PCA that only deals with continuous variables.

The method is applied to trading data from Handelsbanken Capital Markets, a Swedish investment bank. We show that it can be used in modeling the trading behavior of the traders at the bank by performing clustering and anomaly detection in the latent space. CL-VAE outperforms the regular VAE on all our metrics and seems to prepare the data for analysis in a straightforward and interpretable manner. We also discuss the issue of unsupervised anomaly detection at length and use a new form of metric for such problems called the EM-MV measure.

Finally, the result is a system that can be used in order to model trading behavior and perform clustering and anomaly detection on the transformed data. We have performed the analysis by conditioning on the traders but the model is not limited to that label. Instead, we can condition on counter parties, instruments, portfolios or any other label in the dataset.},
  author       = {Norlander, Erik},
  issn         = {1404-6342},
  keyword      = {Variational Autoencoder,Generative Models,Latent Space,Dimensionality Reduction,Unsupervised Learning,Anomaly Detection,Clustering,Gaussian Mixture Models,Isolation Forest.},
  language     = {eng},
  note         = {Student Paper},
  series       = {Master's Theses in Mathematical Sciences},
  title        = {Clustering and Anomaly Detection in Financial Trading Data},
  year         = {2019},
}