Predicting Antibody Developability: Machine Learning Meets Therapeutic Antibodies
(2025) In Master’s Theses in Mathematical Sciences FMAM05 20251 Mathematics (Faculty of Engineering)
- Abstract
- Antibody developability refers to an antibody’s suitability for clinical use, including properties such as solubility, stability, and aggregation. These traits are traditionally assessed through experimental screening, which is time-consuming and resource heavy. Machine learning offers a promising alternative for early prediction of developability, though many existing models are still in early stages.
This work compares multiple machine learning strategies for predicting protein solubility, a key developability factor. Five datasets were used: four consisting of non-antibody protein sequences expressed in E. coli with solubility labels, and one independent antibody dataset without labels. Three existing models (NetSolP, SWI, and ProteinSol) were evaluated using standard performance metrics, and new models were developed by leveraging feature extraction from SWI and ProteinSol to explore potential improvements.
Developed approaches included logistic regression for direct solubility prediction, models that first classified a sample’s likely dataset of origin before applying a corresponding solubility model, clustering-based methods with cluster-specific classifiers, and multi-layer perceptrons to test the benefits of deeper architectures.
Overall, the models achieved similar performance, with no single approach consistently outperforming others. Simpler models like logistic regression often performed on par with more complex models such as multi-layer perceptrons. Results varied by dataset, with the lowest performance observed on the largest and most diverse dataset, PDBSol, suggesting that high variability in sequence data may reduce prediction reliability.
- Popular Abstract
- What if we could fast-track the development of life-saving medicines, cutting down the time and cost required to bring them to patients? Antibodies are proteins found naturally in the body that help fight off disease, and scientists have learned how to turn them into powerful medicines. These therapeutic antibodies have been used to treat cancer, autoimmune conditions, and even COVID-19. However, before any antibody can be developed into a medicine, it has to pass a series of tests to make sure it dissolves well, stays stable, and doesn't clump or break down. These tests are expensive and time-consuming.
In this thesis, we explored whether artificial intelligence (AI) could help predict a key trait of antibodies: solubility, which affects how suitable an antibody is to be used as a drug. We tested existing tools and built new models using five datasets, trying both simple methods like logistic regression and more advanced ones like neural networks.
The surprising result? Simple models performed similarly to the complex ones. No single method stood out across all datasets, and the most challenging results came from the largest and most diverse dataset. This shows that the type and quality of data have a big impact on how well AI models perform.
While the models aren't perfect yet, this work highlights how AI could help scientists sort through large numbers of antibody candidates much faster, which can save time, money, and potentially speed up the development of new treatments.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9195501
- author
- Höjding, Josephine LU and Björkhem, William LU
- supervisor
- organization
- course
- FMAM05 20251
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- machine learning, deep learning, antibody, antibody developability
- publication/series
- Master’s Theses in Mathematical Sciences
- report number
- LUTFMA-3582-2025
- ISSN
- 1404-6342
- other publication id
- 2025:E33
- language
- English
- id
- 9195501
- date added to LUP
- 2025-06-19 09:54:28
- date last changed
- 2025-06-19 09:54:28
@misc{9195501,
  abstract     = {{Antibody developability refers to an antibody’s suitability for clinical use, including properties such as solubility, stability, and aggregation. These traits are traditionally assessed through experimental screening, which is time-consuming and resource heavy. Machine learning offers a promising alternative for early prediction of developability, though many existing models are still in early stages. This work compares multiple machine learning strategies for predicting protein solubility, a key developability factor. Five datasets were used: four consisting of non-antibody protein sequences expressed in E. coli with solubility labels, and one independent antibody dataset without labels. Three existing models (NetSolP, SWI, and ProteinSol) were evaluated using standard performance metrics, and new models were developed by leveraging feature extraction from SWI and ProteinSol to explore potential improvements. Developed approaches included logistic regression for direct solubility prediction, models that first classified a sample’s likely dataset of origin before applying a corresponding solubility model, clustering-based methods with cluster-specific classifiers, and multi-layer perceptrons to test the benefits of deeper architectures. Overall, the models achieved similar performance, with no single approach consistently outperforming others. Simpler models like logistic regression often performed on par with more complex models such as multi-layer perceptrons. Results varied by dataset, with the lowest performance observed on the largest and most diverse dataset, PDBSol, suggesting that high variability in sequence data may reduce prediction reliability.}},
  author       = {{Höjding, Josephine and Björkhem, William}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master’s Theses in Mathematical Sciences}},
  title        = {{Predicting Antibody Developability: Machine Learning Meets Therapeutic Antibodies}},
  year         = {{2025}},
}