LUP Student Papers

LUND UNIVERSITY LIBRARIES

Self-Supervised Learning for Tabular Data: Analysing VIME and introducing Mix Encoder

Svensson, Max LU (2024) FYSK03 20232
Department of Physics
Abstract
We introduce Mix Encoder, a novel self-supervised learning framework for deep tabular data models based on Mixup [1]. Mix Encoder uses linear interpolations of samples with associated pretext tasks to form useful pre-trained representations. We further analyze the viability of tabular self-supervised learning by introducing VIME [2], an established representation learning framework for tabular data structures, to scarce healthcare datasets. We demonstrate that Mix Encoder outperforms VIME and a normal MLP in classifying breast cancer tabular data as well as show that both self-supervised learning frameworks can grant deep tabular models increased performance. Finally, we demonstrate that the combination of both representations, VIME and Mix, can yield even higher performance on certain datasets, such as early classification of diabetes.
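
To make the two pretext-task styles named in the abstract concrete, below is a minimal PyTorch sketch: VIME-style corruption with mask estimation and feature reconstruction, following [2], alongside Mixup-style interpolation [1] with regression of the mixing coefficient as one plausible Mix Encoder pretext target. The encoder architecture, the head shapes, the choice of coefficient regression as the Mix pretext task, and the hyperparameters (p_mask, alpha) are illustrative assumptions, not the thesis's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Shared MLP encoder mapping raw tabular features to a representation."""
    def __init__(self, n_features, d_repr=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, d_repr), nn.ReLU(),
            nn.Linear(d_repr, d_repr), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

def vime_corrupt(x, p_mask=0.3):
    """VIME-style corruption [2]: mask entries with probability p_mask and
    fill them with values drawn from other rows (the empirical marginals)."""
    mask = (torch.rand_like(x) < p_mask).float()
    shuffled = x[torch.randperm(x.size(0))]
    return (1 - mask) * x + mask * shuffled, mask

def mixup_interpolate(x, alpha=0.2):
    """Mixup-style interpolation [1]: blend each row with a random partner;
    lam serves here as a hypothetical pretext regression target."""
    lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0), 1))
    partner = x[torch.randperm(x.size(0))]
    return lam * x + (1 - lam) * partner, lam

# Toy pre-training step on a single unlabeled batch.
n_features = 8
x = torch.randn(64, n_features)           # stand-in for unlabeled tabular data
enc = Encoder(n_features)
mask_head = nn.Linear(32, n_features)     # VIME: which entries were corrupted?
recon_head = nn.Linear(32, n_features)    # VIME: reconstruct original features
lam_head = nn.Linear(32, 1)               # Mix: regress the mixing coefficient

x_tilde, mask = vime_corrupt(x)
z = enc(x_tilde)
loss_vime = (F.binary_cross_entropy_with_logits(mask_head(z), mask)
             + F.mse_loss(recon_head(z), x))

x_mix, lam = mixup_interpolate(x)
loss_mix = F.mse_loss(lam_head(enc(x_mix)), lam)

(loss_vime + loss_mix).backward()         # pre-train the shared encoder

After pre-training, the pretext heads are discarded and the encoder is fine-tuned on the labeled downstream task; training one shared encoder on both pretext losses, as above, mirrors the VIME + Mix combination the abstract reports.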
Popular Abstract
Having a meaningful conversation with a machine can be an incredible, yet slightly eerie, experience. Being able to tell it to write a short poem can make you crack a smile, but at the same time, it can make you cringe. With the recent breakthrough of self-supervision, a framework that largely removes humans as a necessary part of a machine's training, humans are now getting used to the prospect of machines aiding us in various writing tasks. But we have yet to feel the profound impact of self-supervision across other, possibly more urgent, domains.

It is easy to think that the solution to crossing intellectual domains is to just keep training our machine learning models on more and more text. After all, this is what made them successful, right? Perhaps surprisingly, this is no longer the belief among experts, who now understand that text is limited, not only in content but as a medium. Painfully obvious statements, which are not obvious to machines, are rarely written down, and such a thing as common sense is often not discovered by reading but by experiencing. Forcefully pursuing this path as we enter concepts even more foreign than creative writing might be utterly pointless.

Important domains such as economics, science, and healthcare are mostly based on tabular presentations of data that have so far been inaccessible to the self-supervised approach. New studies, applying to tabular data the ideas that have made language models so successful, hint at great improvements in speed and performance over previous models, both of which are of utmost importance in healthcare. The studies take the main idea of self-supervision, namely removing the need for manually labeled data, and instead let the machines understand the data by augmenting what they are given.

This type of augmentation is most easily explained in the context of vision. A model can be made to rotate an image of a dog and later compare it with the original, non-rotated image. It is different, yes, but not as different as an image of a cat would have been, much like turning our heads sideways does not fundamentally change the world around us. If the model understands this, it has found some understanding of how our world functions and can, hopefully, apply it effectively. With this type of training, the size of the data at hand becomes less important, such that possibly life-saving predictions on both very small and very large datasets can gain a significant performance boost.

Following these studies, it can now be said that some sort of nuanced understanding can be found in tabular data. The implications of this can resonate across all large tabular-dominated domains, sparking fascinating new discoveries in important societal issues, such as cancer research. In this project, we apply this new approach to tabular cancer treatment data, confident that taking this journey further, by introducing novel augmentations such as mixing, will prove once more that we can teach machines not only to process information, but to dig deeper and reveal its secrets.
author: Svensson, Max LU
course: FYSK03 20232
year: 2024
type: M2 - Bachelor Degree
keywords: Machine Learning, Self-supervised learning, AI, Physics, Medicine
language: English
id: 9148424
date added to LUP: 2024-02-19 09:56:04
date last changed: 2024-02-19 09:56:04
@misc{9148424,
  abstract     = {{We introduce Mix Encoder, a novel self-supervised learning framework for deep tabular data models based on Mixup [1]. Mix Encoder uses linear interpolations of samples with associated pretext tasks to form useful pre-trained representations. We further analyze the viability of tabular self-supervised learning by introducing VIME [2], an established representation learning framework for tabular data structures, to scarce healthcare datasets. We demonstrate that Mix Encoder outperforms VIME and a normal MLP in classifying breast cancer tabular data as well as show that both self-supervised learning frameworks can grant deep tabular models increased performance. Finally, we demonstrate that the combination of both representations, VIME and Mix, can yield even higher performance on certain datasets, such as early classification of diabetes.}},
  author       = {{Svensson, Max}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Self-Supervised Learning for Tabular Data: Analysing VIME and introducing Mix Encoder}},
  year         = {{2024}},
}