Supply chain attacks in open source projects

Stussi, Oliver; Uhler Brand, David

Stussi, Oliver ^LU and Uhler Brand, David ^LU (2022) EITM01 20222
Department of Electrical and Information Technology

Abstract: The space of open source supply chain attacks is ever evolving and growing. There
is extensive previous work identifying and collecting open source supply chain
attacks, as well as identifying patterns in these attacks and showing that machine
learning models may be able to detect these patterns.

The aim of this thesis is to develop such a system and study its efficacy in
detecting attacks. To achieve this, packages from the npm Registry, PyPI, and
RubyGems originating from three previous data sets were combined into one data
set and manually labeled. UniXcoder was used to generate embeddings of the
source code, which were then fed to the Markov Clustering Algorithm to create
clusters of attacks. Unknown files were compared against representative embeddings
of these clusters to classify them as either malicious or benign. Two different
methods for cluster generation and three different cluster optimization metrics
were explored. The best performing approach achieved an F1 score of 0.85,
outperforming a similar approach within the field. This approach seems to have no
major differences in performance between obfuscated and unobfuscated attacks,
nor did the programming language of the attacks seem to impact performance
significantly.
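
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the clusters are given by hand rather than produced by the Markov Clustering Algorithm, the vectors are toy 3-dimensional stand-ins for UniXcoder embeddings, and the use of a cluster mean as the representative embedding, cosine similarity as the comparison, and a threshold of 0.9 are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a cluster; an assumed choice of representative embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(embedding, representatives, threshold=0.9):
    """Label a file malicious if it is close enough to any attack-cluster representative.

    The threshold value is hypothetical, not taken from the thesis.
    """
    best = max(
        (cosine_similarity(embedding, rep) for rep in representatives),
        default=0.0,
    )
    label = "malicious" if best >= threshold else "benign"
    return label, best


# Toy attack clusters (hypothetical data standing in for clustered embeddings).
attack_clusters = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
]
representatives = [centroid(cluster) for cluster in attack_clusters]

near_attack = classify([1.0, 0.05, 0.0], representatives)
far_from_attacks = classify([0.0, 0.0, 1.0], representatives)
print(near_attack[0], far_from_attacks[0])
```

In this sketch a file that resembles no known attack cluster falls through to "benign", so the threshold directly trades missed attacks against false alarms.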
Popular Abstract: The growth in popularity of open-source software has not only drawn the attention
of benign actors. Rather, it has given rise to a new genre of attack, the open source
supply chain attack. The basis of the attack is that instead of attacking a target
directly, you attack one of the open source projects they rely on.

While this might seem counterproductive, as one assumes that the projects
one relies on are properly vetted, the reality is that certain large projects still rely
on niche projects. These niche projects are then much easier attack targets.
This field is receiving more and more attention as these attacks become more
common. Studies have already shown the theoretical possibility of machine
learning algorithms categorizing and detecting these kinds of attacks.

This thesis aims to quantify the efficacy of such approaches and study which
approaches are more effective and which attacks are easier to detect. To achieve
this, we leverage state-of-the-art machine learning algorithms to first convert the
source code to a data structure that is easier to compare, called a tensor.
The similarity between tensors is determined, and based on these similarities,
relationships are detected by clustering together those that are more similar.

New files are then matched against these clusters, and based on how similar a
new file is to a cluster it is deemed malicious or not. With this rather simple
approach, we manage to achieve an F1-score of roughly 0.85, which is better than
other approaches within the same field of study. The F1-score is a scale from 0 to 1
where 1 is a perfect model that always categorizes correctly.

Overall, the system labels a malicious file correctly 79% of the time and
non-malicious files 93% of the time. Of the attack types considered, exfiltration
(extracting users' data) was the best performing, with 86% correctly identified. This
is not surprising, as it is also the most common form of attack. The two worst
performing types were financial gain and dropper, with 50% and 62.5% correct,
respectively. However, since there were only two financial gain samples in the test set
and seven in total in our data set, we cannot be certain of these results. As for
obfuscation, the results showed that in general, the more obfuscation was used, the
easier the attack was to detect. This might seem paradoxical, as attempting to hide code
somehow makes it easier to detect. It could be attributed to non-malicious code
rarely, if ever, being obfuscated; obfuscated code thus shares few similarities
with ordinary code, making it stand out more.
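
As a worked illustration of the F1-score mentioned above, the sketch below computes it from confusion counts as the harmonic mean of precision and recall. The counts are hypothetical: they assume 100 malicious and 100 non-malicious test files, a class balance the source does not state, chosen only so that the 79% and 93% rates can be plugged in. The result should not be read as a reconstruction of the thesis's test set.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1-score: harmonic mean of precision and recall, in the range 0 to 1."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, assuming 100 malicious and 100 non-malicious files:
# 79 malicious files flagged, 21 missed, and 7 non-malicious files wrongly flagged.
example_f1 = f1_score(true_positives=79, false_positives=7, false_negatives=21)
print(round(example_f1, 2))
```

Because the F1-score ignores correctly labeled non-malicious files, it rewards a detector for finding attacks without raising false alarms, rather than for the sheer number of benign files it passes.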

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9105730

author

Stussi, Oliver ^LU and Uhler Brand, David ^LU

supervisor

Christian Gehrmann ^LU
Emil Wåreus ^LU

organization

Department of Electrical and Information Technology

course

EITM01 20222

year

2022

type

H2 - Master's Degree (Two Years)

subject

Technology and Engineering

report number

LU/LTH-EIT 2022-904

language

English

id

9105730

date added to LUP

2023-03-02 11:41:27

date last changed

2023-03-02 11:41:27

@misc{9105730,
  abstract     = {{The space of open source supply chain attacks is ever evolving and growing. There
is extensive previous work identifying and collecting open source supply chain
attacks, as well as identifying patterns in these attacks and proving that machine
learning models may be able to detect these patterns.

The aim of this thesis is to develop such a system and study its efficacy in
detecting attacks. To achieve this, packages from the npm Registry, PyPi, and
RubyGems originating from three previous data sets were combined into one data
set and manually labeled. UniXcoder was used to generate embeddings of the
source code, these were then fed to the Markov Clustering Algorithm to create
clusters of attacks. Unknown files were compared against representative embeddings
of these clusters to classify them as either malicious or benign. Two different
methods for cluster generation and three different cluster optimization metrics
were explored. The best performing approach achieved a F1 score of 0.85,
outperforming a similar approach within the field. This approach seems to have no
major differences in performance between obfuscated or un-obfuscated attacks.
Neither did the programming language of attacks seem to impact performance
significantly.}},
  author       = {{Stussi, Oliver and Uhler Brand, David}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Supply chain attacks in open source projects}},
  year         = {{2022}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES