
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Impact of model architecture and data distribution on self-supervised federated learning

Bergdahl, Karin LU (2022) In Master's Theses in Mathematical Sciences FMAM05 20221
Mathematics (Faculty of Engineering)
Abstract
Data is a crucial resource for machine learning, but in many settings, such as in healthcare or on mobile devices, there are obstacles that make it difficult to utilize the available data. The data is often distributed across many clients and private, meaning that central storage is inadvisable. Further, image data is often unlabelled, and external labelling is impossible due to its private nature. This project aims to train and examine a self-supervised representation encoder on distributed, unlabelled image data. Using federated averaging, we create a federated implementation of the contrastive learning framework SimCLR and compare its performance to the traditional centralized version. Within the SimCLR framework, we test two encoder architectures (ResNet-18 and AlexNet). The encoders are trained in two federated settings: i.i.d., where all clients have data from the same distribution, and non-i.i.d., where the client data distributions are completely disjoint. We also create a non-federated implementation trained on the same data to measure the impact of federation on SimCLR. The quality of the representations is measured by the accuracy of a linear classifier trained on a small labelled data set. We find that the best federated encoder reaches an average classifier accuracy of 67.0 % in the i.i.d. setting, only a small drop from the non-federated implementation, which reaches 69.0 %. However, the encoders trained in the non-i.i.d. setting have a lower average accuracy of 62.3 %. So, while a federated model can perform on the level of a central one, unbalanced data distributions remain a challenge in real-world federated applications.
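The aggregation rule named in the abstract, federated averaging, combines the locally trained client encoders into one global model by taking a weighted average of their parameters, where each client's weight is its share of the total training data. A minimal sketch of that averaging step is below; the function name and the flat list-of-floats parameter representation are illustrative assumptions, not taken from the thesis code.

```python
# Sketch of federated averaging (FedAvg): the server averages client model
# parameters, weighting each client by how many local samples it trained on.
# Parameter names and data layout here are illustrative, not from the thesis.

def fedavg(client_weights, client_sizes):
    """Return the sample-weighted average of client model parameters.

    client_weights: list of dicts mapping parameter name -> list of floats
    client_sizes:   list of local training-set sizes, one per client
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        averaged[name] = [
            sum(w[name][i] * n / total
                for w, n in zip(client_weights, client_sizes))
            for i in range(len(client_weights[0][name]))
        ]
    return averaged

# Two toy clients with a single one-parameter "layer"; the second client
# holds three times as much data, so its parameters count three times as much.
clients = [{"w": [1.0]}, {"w": [3.0]}]
sizes = [1, 3]
print(fedavg(clients, sizes))  # -> {'w': [2.5]}
```

In a real federated SimCLR round, each client would run a few epochs of contrastive training locally, send its encoder weights to the server, and receive this average back as the starting point for the next round.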
author: Bergdahl, Karin
organization: Mathematics (Faculty of Engineering)
course: FMAM05 20221
year: 2022
type: H2 - Master's Degree (Two Years)
keywords: machine learning, federated learning, self-supervised learning
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3487-2022
ISSN: 1404-6342
other publication id: 2022:E64
language: English
id: 9097019
date added to LUP: 2022-08-16 13:19:33
date last changed: 2022-08-16 13:19:33
@misc{9097019,
  abstract     = {{Data is a crucial resource for machine learning, but in many settings, such as in healthcare or on mobile devices, there are obstacles that make it difficult to utilize the available data. The data is often distributed across many clients and private, meaning that central storage is inadvisable. Further, image data is often unlabelled, and external labelling is impossible due to its private nature. This project aims to train and examine a self-supervised representation encoder on distributed, unlabelled image data. Using federated averaging, we create a federated implementation of the contrastive learning framework SimCLR and compare its performance to the traditional centralized version. Within the SimCLR framework, we test two encoder architectures (ResNet-18 and AlexNet). The encoders are trained in two federated settings: i.i.d., where all clients have data from the same distribution, and non-i.i.d., where the client data distributions are completely disjoint. We also create a non-federated implementation trained on the same data to measure the impact of federation on SimCLR. The quality of the representations is measured by the accuracy of a linear classifier trained on a small labelled data set. We find that the best federated encoder reaches an average classifier accuracy of 67.0 % in the i.i.d. setting, only a small drop from the non-federated implementation, which reaches 69.0 %. However, the encoders trained in the non-i.i.d. setting have a lower average accuracy of 62.3 %. So, while a federated model can perform on the level of a central one, unbalanced data distributions remain a challenge in real-world federated applications.}},
  author       = {{Bergdahl, Karin}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Impact of model architecture and data distribution on self-supervised federated learning}},
  year         = {{2022}},
}