
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Impact of model architecture and data distribution on self-supervised federated learning

Bergdahl, Karin LU (2022) In Master's Theses in Mathematical Sciences FMAM05 20221
Mathematics (Faculty of Engineering)
Abstract
Data is a crucial resource for machine learning, but in many settings, such as in healthcare or on mobile devices, there are obstacles that make it difficult to utilize the available data. The data is often distributed across many clients and private, meaning that central storage is inadvisable. Further, image data is often unlabelled, and external labelling is impossible due to its private nature. This project aims to train and examine a self-supervised representation encoder on distributed, unlabelled image data. Using federated averaging, we create a federated implementation of the contrastive learning framework SimCLR and compare its performance to the traditional centralized version. Within the SimCLR framework, we test two encoder architectures (ResNet-18 and AlexNet). The encoders are trained in two federated settings: i.i.d., where all clients have data from the same distribution, and non-i.i.d., where the client data distributions are completely disjoint. We also create a non-federated implementation trained on the same data to measure the impact of federation on SimCLR. The quality of the representations is measured by the accuracy of a linear classifier trained on a small labelled data set. We find that the best federated encoder reaches an average classifier accuracy of 67.0 % in the i.i.d. setting, only a small drop from the non-federated implementation, which reaches 69.0 %. However, the encoders trained in the non-i.i.d. setting have a lower average accuracy of 62.3 %. So, while a federated model can perform on the level of a central one, unbalanced data distributions remain a challenge in real-world federated applications.
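The aggregation rule named in the abstract, federated averaging, combines the locally trained client encoders into one global model by taking a weighted average of their parameters, where each client's weight is its share of the total training data. A minimal sketch of that averaging step is below; the function name and the flat list-of-floats parameter representation are illustrative assumptions, not taken from the thesis code.

```python
# Sketch of federated averaging (FedAvg): the server averages client model
# parameters, weighting each client by how many local samples it trained on.
# Parameter names and data layout here are illustrative, not from the thesis.

def fedavg(client_weights, client_sizes):
    """Return the sample-weighted average of client model parameters.

    client_weights: list of dicts mapping parameter name -> list of floats
    client_sizes:   list of local training-set sizes, one per client
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        averaged[name] = [
            sum(w[name][i] * n / total
                for w, n in zip(client_weights, client_sizes))
            for i in range(len(client_weights[0][name]))
        ]
    return averaged

# Two toy clients with a single one-parameter "layer"; the second client
# holds three times as much data, so its parameters count three times as much.
clients = [{"w": [1.0]}, {"w": [3.0]}]
sizes = [1, 3]
print(fedavg(clients, sizes))  # -> {'w': [2.5]}
```

In a real federated SimCLR round, each client would run a few epochs of contrastive training locally, send its encoder weights to the server, and receive this average back as the starting point for the next round.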
author: Bergdahl, Karin
organization: Mathematics (Faculty of Engineering)
course: FMAM05 20221
year: 2022
type: H2 - Master's Degree (Two Years)
keywords: machine learning, federated learning, self-supervised learning
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3487-2022
ISSN: 1404-6342
other publication id: 2022:E64
language: English
id: 9097019
date added to LUP: 2022-08-16 13:19:33
date last changed: 2022-08-16 13:19:33
@misc{9097019,
  abstract     = {{Data is a crucial resource for machine learning, but in many settings, such as in healthcare or on mobile devices, there are obstacles that make it difficult to utilize the available data. The data is often distributed across many clients and private, meaning that central storage is inadvisable. Further, image data is often unlabelled, and external labelling is impossible due to its private nature. This project aims to train and examine a self-supervised representation encoder on distributed, unlabelled image data. Using federated averaging, we create a federated implementation of the contrastive learning framework SimCLR and compare its performance to the traditional centralized version. Within the SimCLR framework, we test two encoder architectures (ResNet-18 and AlexNet). The encoders are trained in two federated settings: i.i.d., where all clients have data from the same distribution, and non-i.i.d., where the client data distributions are completely disjoint. We also create a non-federated implementation trained on the same data to measure the impact of federation on SimCLR. The quality of the representations is measured by the accuracy of a linear classifier trained on a small labelled data set. We find that the best federated encoder reaches an average classifier accuracy of 67.0 % in the i.i.d. setting, only a small drop from the non-federated implementation, which reaches 69.0 %. However, the encoders trained in the non-i.i.d. setting have a lower average accuracy of 62.3 %. So, while a federated model can perform on the level of a central one, unbalanced data distributions remain a challenge in real-world federated applications.}},
  author       = {{Bergdahl, Karin}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Impact of model architecture and data distribution on self-supervised federated learning}},
  year         = {{2022}},
}