Clustering and Anomaly Detection in Financial Trading Data
(2019) In Master's Theses in Mathematical Sciences, FMSM01 2019-1, Mathematical Statistics
 Abstract
In this thesis we propose a new form of Variational Autoencoder called the Conditional Latent Space Variational Autoencoder, or CLVAE. By conditioning on a known label in a dataset, we can decide which points are mapped to which prior distribution. This makes the latent space more interpretable and separates the classes further. It also partly avoids the tug-of-war effect between the reconstruction loss and the KL divergence, since we are not trying to map all the data to one simple prior distribution, but rather giving every class its own.
With this method, we can customize the latent space for a specific task like clustering or anomaly detection. This means that we can send in any kind of data, be it numerical or categorical, and the points will be projected onto a more easily understandable structure. This is a big advantage over other dimensionality reduction algorithms like PCA, which only deals with continuous variables.
The method is applied to trading data from Handelsbanken Capital Markets, a Swedish investment bank. We show that it can be used in modeling the trading behavior of the traders at the bank by performing clustering and anomaly detection in the latent space. CLVAE outperforms the regular VAE on all our metrics and seems to prepare the data for analysis in a straightforward and interpretable manner. We also discuss the issue of unsupervised anomaly detection at length and use a new form of metric for such problems called the EMMV measure.
Finally, the result is a system that can be used to model trading behavior and perform clustering and anomaly detection on the transformed data. We have performed the analysis by conditioning on the traders, but the model is not limited to that label. Instead, we can condition on counterparties, instruments, portfolios or any other label in the dataset.

Popular Abstract
Recently, financial crime has become a major issue for financial institutions. Whether it is money laundering, insider trading or related crimes, it strongly undermines the trust in and stability of the financial system. As the methods of committing crimes become more sophisticated, so must the methods for detecting them. Like many difficult problems these days, this one can be approached with machine learning.
We have a dataset from Handelsbanken Capital Markets consisting of trades made by the traders at the bank. There is a very large number of features, many of them categorical, which means that traditional techniques for reducing the number of features are not applicable, so a new approach must be used.
In order to detect strange behavior, we want to separate what would be considered normal from what is anomalous. This can be done with an algorithm called Isolation Forest. It tries to divide the dataset into as small chunks as possible. This process continues for each point, and once each trade has been isolated you can count how many splits had to be made. If only a few splits were needed, the data point was easy to separate, and it will therefore be considered anomalous.
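The splitting idea can be sketched in a few lines of plain Python. This is a toy version for illustration only, not the thesis implementation or a full Isolation Forest; the function names and the made-up data are our own:

```python
import random

def isolation_depth(point, data, rng, max_depth=12):
    """Number of random splits needed to isolate `point` inside `data`.

    Each step picks a random feature and a random split value between the
    current min and max, then keeps only the side that contains `point`.
    Anomalous points tend to be isolated after very few splits.
    """
    current = data
    depth = 0
    n_features = len(point)
    while depth < max_depth and len(current) > 1:
        f = rng.randrange(n_features)
        lo = min(row[f] for row in current)
        hi = max(row[f] for row in current)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        keep_left = point[f] < split
        current = [row for row in current if (row[f] < split) == keep_left]
        depth += 1
    return depth

def anomaly_score(point, data, n_trees=50, seed=0):
    """Average isolation depth over many random trees; lower = more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees
```

On a tight cluster of points plus one far-away outlier, the outlier receives a much lower average depth than any point inside the cluster, which is exactly the signal a real Isolation Forest scores on.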
An issue with this approach is that it is very dependent on the shape of the data: if the shapes are irregular and unpredictable, the algorithm will have a harder time telling normal and anomalous behavior apart. Therefore, we must prepare the data in a way that helps the algorithm.
This is where the Variational Autoencoder (VAE) comes in. It is actually a generative model, meaning that you can estimate the distribution of some data and then generate new samples that look like they came from the original dataset. In order to do this, the VAE takes an input sample, reduces it to a smaller space called the latent space, and then tries to reconstruct the data from the latent space using neural networks.
This would be true for the simpler, regular autoencoder as well, but the VAE has a statistical spin to it. Instead of just training the algorithm to recreate the input data, you also want the samples in the latent space to lie close to some probability distribution that you choose yourself. For simplicity, you usually choose a Normal distribution. An issue with this is that there is only so much room in a single Normal distribution: the classes that do form often overlap and create strange shapes that can be hard to interpret.
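The two competing objectives can be written down concretely. Below is a minimal sketch in plain Python of the per-sample loss, assuming a squared-error reconstruction term and the standard closed-form KL divergence between a diagonal Gaussian and a standard Normal prior; in the actual model, neural networks produce `mu`, `log_var` and the reconstruction, and these names are ours:

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Zero exactly when the encoder outputs the prior itself
    (mu = 0, variance = 1), positive otherwise.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus the KL term pulling the latent code
    toward the chosen prior -- the tug-of-war described above."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return reconstruction + kl_to_standard_normal(mu, log_var)
```

Making the reconstruction better typically requires spreading codes out, while the KL term squeezes all of them toward the same Normal, which is why all classes compete for room around the origin.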
Instead, we propose the CLVAE, which stands for Conditional Latent Space VAE. This approach gives each trader their own Normal distribution, which can then move around freely in the latent space as the algorithm learns where that trader fits in relation to all the others. With this approach we find that traders who work in similar markets end up close to each other, far away from the other categories that form. The trades also end up in predictable shapes that are easier for the Isolation Forest algorithm to work with, as seen in Figure 1.
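In loss terms, the only change a per-class prior requires is letting the prior mean depend on the label. A hedged sketch of that KL term, assuming unit-variance class priors (the symbol names are illustrative; in the actual model the class means can themselves be learned as the distributions "move around" during training):

```python
import math

def kl_to_class_prior(mu, log_var, class_mu):
    """KL( N(mu, diag(exp(log_var))) || N(class_mu, I) ).

    Each label (here, each trader) is pulled toward its own prior mean
    instead of everyone competing for room around a single origin.
    """
    return 0.5 * sum((m - c) ** 2 + math.exp(lv) - lv - 1.0
                     for m, c, lv in zip(mu, class_mu, log_var))
```

A latent code sitting exactly on its own class mean with unit variance incurs zero penalty, while drifting toward another class's mean costs roughly half the squared distance between the means, which is what pushes the classes apart in the latent space.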
By using this method, a system can be developed that learns the regular behavior of a trader, and as new trades come in they are scored on their "degree of abnormality".
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/studentpapers/record/8983838
 author
Norlander, Erik (LU)
 supervisor

Alexandros Sopasakis (LU)
 organization
 alternative title
 Using the Conditional Latent Space Variational Autoencoder
 course
FMSM01 2019-1
 year
 2019
 type
H2 - Master's Degree (Two Years)
 subject
 keywords
 Variational Autoencoder, Generative Models, Latent Space, Dimensionality Reduction, Unsupervised Learning, Anomaly Detection, Clustering, Gaussian Mixture Models, Isolation Forest.
 publication/series
 Master's Theses in Mathematical Sciences
 report number
LUTFMA-3370-2019
 ISSN
1404-6342
 other publication id
 2019:E35
 language
 English
 id
 8983838
 date added to LUP
2019-10-08 14:14:39
 date last changed
2019-10-08 14:14:39