HMMProfiler: A Set of Pipelines for Identifying and Quantifying Functional Regions in Antibody Repertoires from Next-Generation Sequencing
(2025) BINP52 20242 Degree Projects in Bioinformatics
- Abstract
- Next-generation sequencing (NGS) has transformed the study of antibody repertoires by enabling high-throughput, in-depth analysis of immune diversity and antigen-driven selection. When paired with in vitro display technologies, NGS offers unparalleled resolution in characterizing antibody libraries, but also presents substantial computational challenges due to the highly variable nature of antibody sequences. To address these challenges, we developed HMMProfiler, a pair of modular, JavaScript-based pipelines for NGS analysis that significantly improve on the legacy AbGenesis workflow for antibody sequence annotation in speed, usability, and integration with modern application platforms. Developed to operate in a Node.js environment, these pipelines leverage command-line tools such as SeqKit and HMMER to perform quality filtering, six-frame translation, domain detection via profile Hidden Markov Models (HMMs), and robust domain-wise profile counting. All parameters are user-configurable via structured JSON files, ensuring reproducibility and ease of customization. Benchmarking with FASTQ datasets demonstrated that Pipeline 1 processes up to 5 million reads with linear runtime scaling and generates accurate, high-resolution domain profiles. Pipeline 2 enables rapid matching of new sequences to precomputed profiles using domain-specific Levenshtein distances, with clear discrimination between closely related and divergent sequences. Compared to AbGenesis, the new system offers faster runtimes and improved domain annotation sensitivity. Its architecture also supports seamless integration with Bionamic’s front-end tools, providing interactive outputs and real-time user feedback. While limitations include the computational cost of Levenshtein comparisons and the lack of structural motif recognition, the pipelines are structured for future expansion, including GPU acceleration, motif search capabilities, and machine learning integration.
Together, these tools represent a scalable, extensible foundation for high-throughput antibody repertoire analysis and broader applications in synthetic biology and immunodiagnostics.
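To make the domain-specific Levenshtein matching described for Pipeline 2 more concrete, the following is a minimal sketch in plain JavaScript. It is not taken from HMMProfiler: the function names (`levenshtein`, `domainDistances`, `closestProfile`), the profile shape (one amino-acid string per domain name), and the choice to sum per-domain distances are all assumptions made for illustration.

```javascript
// Illustrative sketch only; names and data shapes are hypothetical,
// not the HMMProfiler implementation.

// Levenshtein distance via dynamic programming, keeping two rows in memory.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Distance per domain; a domain missing from the query is compared
// against the empty string (an assumption of this sketch).
function domainDistances(query, profile) {
  const out = {};
  for (const domain of Object.keys(profile)) {
    out[domain] = levenshtein(query[domain] ?? "", profile[domain]);
  }
  return out;
}

// Pick the stored profile with the smallest summed domain distance.
function closestProfile(query, profiles) {
  let best = null;
  for (const [id, profile] of Object.entries(profiles)) {
    const distances = domainDistances(query, profile);
    const total = Object.values(distances).reduce((sum, d) => sum + d, 0);
    if (best === null || total < best.total) best = { id, total, distances };
  }
  return best;
}

// Made-up example data.
const profiles = {
  p1: { CDR1: "GFTFSSYA", CDR3: "ARDRGW" },
  p2: { CDR1: "GYTFTSYG", CDR3: "AKDLLLY" },
};
const match = closestProfile({ CDR1: "GFTFSSYA", CDR3: "ARDRGY" }, profiles);
console.log(match.id, match.distances);
```

Each comparison costs time proportional to the product of the two string lengths, which is consistent with the abstract listing the computational cost of Levenshtein comparisons as a limitation.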
- Popular Abstract
- Streamlining Antibody Analysis: A Fast and Flexible Pipeline for Next-Generation Sequencing Data
In the world of modern medicine and biotechnology, understanding the diversity and function of antibodies is critical for developing new therapies and diagnostics. However, analyzing the massive amounts of genetic data generated by next-generation sequencing (NGS) can be slow, complicated, and sometimes inaccurate. To tackle this challenge, we developed a powerful new set of computational pipelines that speeds up and improves the process of identifying and comparing antibody sequences, making it easier for researchers to uncover meaningful biological insights.
Our pipelines stand out by combining the strengths of tried-and-true bioinformatics tools with the flexibility and speed of modern programming. Using a modular approach, the system breaks down huge sequencing datasets into manageable chunks and filters them based on quality. It then translates DNA sequences into proteins, searches for important functional domains using customizable Hidden Markov Models (HMMs), and extracts precise protein fragments for detailed analysis. Unlike previous tools, our pipelines allow users to supply their own models, making them adaptable to a wide variety of antibodies and related protein families. The results are clear, organized files that highlight unique protein domain patterns and their frequency across millions of sequences.
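The translation step mentioned above can be illustrated with a short six-frame translation sketch. The abstract attributes this step to external command-line tools such as SeqKit, so the plain JavaScript below is only a stand-in to show the idea; the function names (`translate`, `reverseComplement`, `sixFrameTranslate`) and the frame labels are hypothetical.

```javascript
// Illustrative sketch only; not the HMMProfiler code, which the abstract
// says delegates this step to external command-line tools.

// Standard genetic code laid out in T, C, A, G base order.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Translate one reading frame; codons with unknown bases become "X",
// stop codons become "*", and a trailing partial codon is ignored.
function translate(dna) {
  let protein = "";
  for (let i = 0; i + 3 <= dna.length; i += 3) {
    const i1 = BASES.indexOf(dna[i]);
    const i2 = BASES.indexOf(dna[i + 1]);
    const i3 = BASES.indexOf(dna[i + 2]);
    protein += i1 < 0 || i2 < 0 || i3 < 0 ? "X" : AMINO[16 * i1 + 4 * i2 + i3];
  }
  return protein;
}

// Reverse complement; anything other than A, C, G, T maps to "N".
function reverseComplement(dna) {
  const comp = { A: "T", T: "A", C: "G", G: "C" };
  return dna.split("").reverse().map((b) => comp[b] ?? "N").join("");
}

// Three forward frames (+1..+3) and three reverse-strand frames (-1..-3).
function sixFrameTranslate(dna) {
  const seq = dna.toUpperCase();
  const rc = reverseComplement(seq);
  const frames = {};
  for (let offset = 0; offset < 3; offset++) {
    frames[`+${offset + 1}`] = translate(seq.slice(offset));
    frames[`-${offset + 1}`] = translate(rc.slice(offset));
  }
  return frames;
}

const frames = sixFrameTranslate("atggcc");
console.log(frames);
```

Translating all six frames matters because a sequencing read may come from either strand and start at any codon position, so the frame containing the antibody domain is not known in advance.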
One of the key advantages of our approach is its seamless integration with the existing Bionamic platform, providing real-time responsiveness and interactive outputs through a user-friendly interface. This makes it not only faster but also easier to use, allowing researchers to quickly identify patterns of clonal expansion and somatic mutations that are crucial for understanding immune responses and selecting therapeutic candidates. The pipeline’s ability to compare new sequences against a growing database of known profiles further enriches biological interpretation, helping to distinguish meaningful binders from irrelevant or “sticky” sequences.
While our pipelines already outperform legacy tools like AbGenesis in speed and accuracy, there’s still room to grow. Future improvements could bring in machine learning to predict antibody properties directly from sequences, or add structural motif recognition to reveal how protein shapes influence function. By expanding beyond antibodies to other immune receptors or synthetic protein libraries, this tool promises to become a versatile resource across immunology, virology, and synthetic biology. Ultimately, these new pipelines bridge the gap between raw genetic data and actionable biological insights, accelerating antibody discovery and opening new avenues for research and therapy development.
Master’s Degree Project in Bioinformatics, 60 credits, 2025
Department of Biology, Lund University
Advisor: Anders Carlsson, Bionamic AB
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9212574
- author
- Burchak, Alexa
- supervisor
- organization
- course
- BINP52 20242
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- language
- English
- id
- 9212574
- date added to LUP
- 2025-09-17 11:55:06
- date last changed
- 2025-09-17 11:55:06
@misc{9212574, abstract = {{Next-generation sequencing (NGS) has transformed the study of antibody repertoires by enabling high-throughput, in-depth analysis of immune diversity and antigen-driven selection. When paired with in vitro display technologies, NGS offers unparalleled resolution in characterizing antibody libraries, but also presents substantial computational challenges due to the highly variable nature of antibody sequences. To address these challenges, we developed HMMProfiler, a pair of modular, JavaScript-based pipelines for NGS analysis that significantly improve on the legacy AbGenesis workflow for antibody sequence annotation in speed, usability, and integration with modern application platforms. Developed to operate in a Node.js environment, these pipelines leverage command-line tools such as Seqkit and HMMER to perform quality filtering, six-frame translation, domain detection via profile Hidden Markov Models (HMMs), and robust domain-wise profile counting. All parameters are user-configurable via structured JSON files, ensuring reproducibility and ease of customization. Benchmarking with FASTQ datasets demonstrated that Pipeline 1 processes up to 5 million reads with linear runtime scaling and generates accurate, high-resolution domain profiles. Pipeline 2 enables rapid matching of new sequences to precomputed profiles using domain-specific Levenshtein distances, with clear discrimination between closely related and divergent sequences. Compared to AbGenesis, the new system offers faster runtimes and improved domain annotation sensitivity. Its architecture also supports seamless integration with Bionamic’s front-end tools, providing interactive outputs and real-time user feedback. While limitations include the computational cost of Levenshtein comparisons and lack of structural motif recognition, the pipelines are structured for future expansion, including GPU acceleration, motif search capabilities, and machine learning integration. 
Together, these tools represent a scalable, extensible foundation for high-throughput antibody repertoire analysis and broader applications in synthetic biology and immunodiagnostics.}}, author = {{Burchak, Alexa}}, language = {{eng}}, note = {{Student Paper}}, title = {{HMMProfiler: A Set of Pipelines for Identifying and Quantifying Functional Regions in Antibody Repertoires from Next-Generation Sequencing}}, year = {{2025}}, }