HMMProfiler: A Set of Pipelines for Identifying and Quantifying Functional Regions in Antibody Repertoires from Next-Generation Sequencing

Burchak, Alexa

Burchak, Alexa (2025) BINP52 20242
Degree Projects in Bioinformatics

Abstract: Next-generation sequencing (NGS) has transformed the study of antibody repertoires by enabling high-throughput, in-depth analysis of immune diversity and antigen-driven selection. When paired with in vitro display technologies, NGS offers unparalleled resolution in characterizing antibody libraries, but also presents substantial computational challenges due to the highly variable nature of antibody sequences. To address these challenges, we developed HMMProfiler, a pair of modular, JavaScript-based pipelines for NGS analysis that significantly improve on the legacy AbGenesis workflow for antibody sequence annotation in speed, usability, and integration with modern application platforms. Developed to operate in a Node.js environment, these pipelines leverage command-line tools such as Seqkit and HMMER to perform quality filtering, six-frame translation, domain detection via profile Hidden Markov Models (HMMs), and robust domain-wise profile counting. All parameters are user-configurable via structured JSON files, ensuring reproducibility and ease of customization.
Benchmarking with FASTQ datasets demonstrated that Pipeline 1 processes up to 5 million reads with linear runtime scaling and generates accurate, high-resolution domain profiles. Pipeline 2 enables rapid matching of new sequences to precomputed profiles using domain-specific Levenshtein distances, with clear discrimination between closely related and divergent sequences. Compared to AbGenesis, the new system offers faster runtimes and improved domain annotation sensitivity. Its architecture also supports seamless integration with Bionamic’s front-end tools, providing interactive outputs and real-time user feedback. While limitations include the computational cost of Levenshtein comparisons and lack of structural motif recognition, the pipelines are structured for future expansion, including GPU acceleration, motif search capabilities, and machine learning integration. Together, these tools represent a scalable, extensible foundation for high-throughput antibody repertoire analysis and broader applications in synthetic biology and immunodiagnostics.
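
The domain-specific Levenshtein matching described for Pipeline 2 can be illustrated with a minimal JavaScript sketch. This is not the thesis code: the record shape (an object mapping domain names such as `CDR1` and `CDR3` to amino-acid strings), the function names `domainDistances` and `closestProfile`, and the choice of summing per-domain distances into one score are all assumptions made for illustration.

```javascript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: one distance per domain present in both records.
function domainDistances(query, profile) {
  const out = {};
  for (const domain of Object.keys(query)) {
    if (domain in profile) {
      out[domain] = levenshtein(query[domain], profile[domain]);
    }
  }
  return out;
}

// Hypothetical helper: pick the stored profile with the smallest summed distance.
// Summing is an assumption; per-domain thresholds would be an equally valid design.
function closestProfile(query, profiles) {
  let best = null;
  for (const p of profiles) {
    const perDomain = domainDistances(query, p.domains);
    const total = Object.values(perDomain).reduce((sum, d) => sum + d, 0);
    if (best === null || total < best.total) {
      best = { id: p.id, perDomain, total };
    }
  }
  return best;
}
```

Each comparison costs time proportional to the product of the two sequence lengths, which is consistent with the computational cost of Levenshtein comparisons that the abstract lists as a limitation.
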
Popular Abstract: Streamlining Antibody Analysis: A Fast and Flexible Pipeline for Next-Generation Sequencing Data

In the world of modern medicine and biotechnology, understanding the diversity and function of antibodies is critical for developing new therapies and diagnostics. However, analyzing the massive amounts of genetic data generated by next-generation sequencing (NGS) can be slow, complicated, and sometimes inaccurate. To tackle this challenge, we developed a powerful new set of computational pipelines that speeds up and improves the process of identifying and comparing antibody sequences, making it easier for researchers to uncover meaningful biological insights.

Our pipelines stand out by combining the strengths of tried-and-true bioinformatics tools with the flexibility and speed of modern programming. Using a modular approach, the system breaks down huge sequencing datasets into manageable chunks and filters them based on quality. It then translates DNA sequences into proteins, searches for important functional domains using customizable Hidden Markov Models (HMMs), and extracts precise protein fragments for detailed analysis. Unlike previous tools, our pipeline allows users to supply their own models, making it adaptable to a wide variety of antibodies and related protein families. The results are clear, organized files that highlight unique protein domain patterns and their frequency across millions of sequences.
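
Two of the steps named above, translating DNA into protein in all six reading frames and counting how often each domain pattern occurs, can be sketched in a few lines of JavaScript. This is an illustrative sketch only, not the thesis implementation: the function names, the use of `X` for codons containing ambiguous bases, and the representation of a profile as domain sequences joined with `|` are assumptions.

```javascript
// Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
const BASES = "TCAG";
const AMINO_ACIDS =
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Translate an uppercase DNA string from its first base; trailing partial codons are dropped.
function translate(dna) {
  let protein = "";
  for (let i = 0; i + 3 <= dna.length; i += 3) {
    const a = BASES.indexOf(dna[i]);
    const b = BASES.indexOf(dna[i + 1]);
    const c = BASES.indexOf(dna[i + 2]);
    // Codons with a non-ACGT base (e.g. N) become "X" (an assumption of this sketch).
    protein += a < 0 || b < 0 || c < 0 ? "X" : AMINO_ACIDS[a * 16 + b * 4 + c];
  }
  return protein;
}

function reverseComplement(dna) {
  const pair = { A: "T", T: "A", C: "G", G: "C" };
  return [...dna].reverse().map((base) => pair[base] || "N").join("");
}

// Returns six protein strings: forward frames 0-2, then reverse-complement frames 0-2.
function sixFrames(dna) {
  const seq = dna.toUpperCase();
  const rc = reverseComplement(seq);
  const forward = [0, 1, 2].map((offset) => translate(seq.slice(offset)));
  const reverse = [0, 1, 2].map((offset) => translate(rc.slice(offset)));
  return forward.concat(reverse);
}

// Hypothetical profile counting: a profile is the ordered combination of domain sequences.
// Returns [profileKey, count] pairs sorted by descending count.
function countProfiles(records, domainOrder) {
  const counts = new Map();
  for (const rec of records) {
    const key = domainOrder.map((d) => rec[d] || "").join("|");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((x, y) => y[1] - x[1]);
}
```

In the pipelines themselves, domain detection between these two steps is delegated to HMMER with user-supplied models; that step is omitted here because it depends on external tools.
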

One of the key advantages of our approach is its seamless integration with the existing Bionamic platform, providing real-time responsiveness and interactive outputs through a user-friendly interface. This makes it not only faster but also easier to use, allowing researchers to quickly identify patterns of clonal expansion and somatic mutations that are crucial for understanding immune responses and selecting therapeutic candidates. The pipeline’s ability to compare new sequences against a growing database of known profiles further enriches biological interpretation, helping to distinguish meaningful binders from irrelevant or “sticky” sequences.

While our pipelines already outperform legacy tools like AbGenesis in speed and accuracy, there’s still room to grow. Future improvements could bring in machine learning to predict antibody properties directly from sequences, or add structural motif recognition to reveal how protein shapes influence function. By expanding beyond antibodies to other immune receptors or synthetic protein libraries, this tool promises to become a versatile resource across immunology, virology, and synthetic biology. Ultimately, these new pipelines bridge the gap between raw genetic data and actionable biological insights, accelerating antibody discovery and opening new avenues for research and therapy development.

Master’s Degree Project in Bioinformatics, 60 credits, 2025
Department of Biology, Lund University
Advisor: Anders Carlsson, Bionamic AB

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9212574

author: Burchak, Alexa
supervisor: Anders Carlsson
organization: Degree Projects in Bioinformatics
course: BINP52 20242
year: 2025
type: H2 - Master's Degree (Two Years)
subject: Biology and Life Sciences
language: English
id: 9212574
date added to LUP: 2025-09-17 11:55:06
date last changed: 2025-09-17 11:55:06

@misc{9212574,
  abstract     = {{Next-generation sequencing (NGS) has transformed the study of antibody repertoires by enabling high-throughput, in-depth analysis of immune diversity and antigen-driven selection. When paired with in vitro display technologies, NGS offers unparalleled resolution in characterizing antibody libraries, but also presents substantial computational challenges due to the highly variable nature of antibody sequences. To address these challenges, we developed HMMProfiler, a pair of modular, JavaScript-based pipelines for NGS analysis that significantly improve on the legacy AbGenesis workflow for antibody sequence annotation in speed, usability, and integration with modern application platforms. Developed to operate in a Node.js environment, these pipelines leverage command-line tools such as Seqkit and HMMER to perform quality filtering, six-frame translation, domain detection via profile Hidden Markov Models (HMMs), and robust domain-wise profile counting. All parameters are user-configurable via structured JSON files, ensuring reproducibility and ease of customization. Benchmarking with FASTQ datasets demonstrated that Pipeline 1 processes up to 5 million reads with linear runtime scaling and generates accurate, high-resolution domain profiles. Pipeline 2 enables rapid matching of new sequences to precomputed profiles using domain-specific Levenshtein distances, with clear discrimination between closely related and divergent sequences. Compared to AbGenesis, the new system offers faster runtimes and improved domain annotation sensitivity. Its architecture also supports seamless integration with Bionamic’s front-end tools, providing interactive outputs and real-time user feedback. While limitations include the computational cost of Levenshtein comparisons and lack of structural motif recognition, the pipelines are structured for future expansion, including GPU acceleration, motif search capabilities, and machine learning integration. 
Together, these tools represent a scalable, extensible foundation for high-throughput antibody repertoire analysis and broader applications in synthetic biology and immunodiagnostics.}},
  author       = {{Burchak, Alexa}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{HMMProfiler: A Set of Pipelines for Identifying and Quantifying Functional Regions in Antibody Repertoires from Next-Generation Sequencing}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
