A Comprehensive Mutation Analysis and Disease Association Tool

Cai, Huixin

Cai, Huixin (2025) BINP51 20242
Degree Projects in Bioinformatics

Abstract: With the development of high-throughput sequencing technology, large-scale human genome variation data have become increasingly available, offering new opportunities for biomedical research. Extracting meaningful information from these data has therefore become a key task in related research fields. However, quickly identifying potentially pathogenic sites within complex mutation information and establishing their association with diseases remains a key challenge in precision medicine research. To address this issue, this project developed a mutation-disease association mining tool based on VCF files. It integrates annotation, text mining, classification and visualization into a single analysis process that runs from clinical gene variation data to disease prediction visualization.

The tool first annotates the variants in the VCF file with VEP, screens mutations at the gene level, and retrieves research literature on the related genes from the PubMed database through E-utilities. It then uses PubTator to extract associations between diseases and genes or mutations, and groups the diseases according to the MeSH tree classification system to improve the structural clarity and interpretability of the results. To enhance the reliability of the results, the tool supports validation against GWAS data. Finally, the analysis results are stored in an SQL database built by the tool and integrated into an interactive Streamlit visualization interface. Upon querying, the system presents the relationship network between a gene and its associated diseases as a force-directed graph and displays the structured result data as a list, which can be downloaded in CSV format for further analysis.
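As a rough illustration of the grouping, storage and export steps described above, the following minimal sketch groups gene-disease records by the top-level branch of their MeSH tree number, stores them in an in-memory SQLite database, and exports a CSV string. It is not the thesis implementation: the function names, the table layout, the toy records, and the tree numbers are all assumptions made for this example, and the branch table is only a small hand-written subset of the MeSH disease category.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hand-written subset of top-level MeSH disease branches (category C).
MESH_BRANCHES = {
    "C04": "Neoplasms",
    "C10": "Nervous System Diseases",
    "C14": "Cardiovascular Diseases",
}


def top_level_branch(tree_number: str) -> str:
    """Map a MeSH tree number such as 'C04.588.180' to its top-level branch label."""
    prefix = tree_number.split(".")[0]
    return MESH_BRANCHES.get(prefix, "Other")


def group_by_branch(records):
    """Group (gene, disease, tree_number) records by top-level MeSH branch."""
    groups = defaultdict(list)
    for gene, disease, tree_number in records:
        groups[top_level_branch(tree_number)].append((gene, disease))
    return dict(groups)


def store_and_export(records) -> str:
    """Store records in an in-memory SQLite table and return them as CSV text."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; the tool's real database layout is not described in the abstract.
    conn.execute(
        "CREATE TABLE association "
        "(gene TEXT, disease TEXT, tree_number TEXT, branch TEXT)"
    )
    conn.executemany(
        "INSERT INTO association VALUES (?, ?, ?, ?)",
        [(g, d, t, top_level_branch(t)) for g, d, t in records],
    )
    rows = conn.execute(
        "SELECT gene, disease, branch FROM association ORDER BY gene, disease"
    ).fetchall()
    conn.close()

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["gene", "disease", "branch"])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy records with placeholder names and illustrative tree numbers (not real results).
TOY_RECORDS = [
    ("GENE_A", "disease 1", "C04.588.180"),
    ("GENE_A", "disease 2", "C10.228.140"),
    ("GENE_B", "disease 3", "C14.280.067"),
    ("GENE_B", "disease 4", "C23.550.291"),
]

if __name__ == "__main__":
    print(group_by_branch(TOY_RECORDS))
    print(store_and_export(TOY_RECORDS))
```

Grouping on the first segment of the tree number is the simplest choice; a real pipeline would likely need to handle diseases that carry several tree numbers and therefore belong to more than one branch.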

This tool provides an integrated, full-process analysis pipeline. By introducing literature-based text mining, it improves the timeliness of the results, which can promptly reflect the latest research progress. The tool is versatile and extensible, and can provide data support and visualization methods for disease mechanism research, genetic counseling, and the discovery of new disease biomarkers.
Popular Abstract: Scientists can now read our genetic code (DNA) faster and more cheaply than ever, creating massive amounts of data. The big challenge is figuring out which tiny changes in this code might actually cause disease. In this project we built a new tool to help. It finds the clues, which are the genes and mutations, in a person's genetic data file, then automatically searches the latest medical research papers to see what is known about those specific genes and mutations, and whether anything links them to diseases. It then groups any linked diseases using a medical classification system, making the results clearer and easier to understand. For extra confidence, it can check its findings against large genetic studies (GWAS) to see whether other researchers found similar links. Finally, it presents the results visually, allowing users to explore interactive networks showing how specific genes connect to different diseases.

This tool provides an all-in-one solution. It handles everything, from analyzing raw genetic data to showing clear, visual results, in one seamless process. By checking the latest scientific papers, it provides up-to-date information. Moreover, the visualizations and organized disease groupings make complex genetic links much easier to grasp. The tool can therefore support researchers studying how diseases work, doctors advising patients on genetic risks, and scientists hunting for new biological signs of disease (biomarkers).

This project cuts through the complexity of genetic data, using the latest research to quickly pinpoint potential disease-causing mutations and show their connections clearly. It's a major step forward for precision medicine.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9212578

author

Cai, Huixin

supervisor

Ping Chen

organization

Degree Projects in Bioinformatics

course

BINP51 20242

year

2025

type

H2 - Master's Degree (Two Years)

subject

Biology and Life Sciences

language

English

id

9212578

date added to LUP

2025-09-17 12:14:52

date last changed

2025-09-17 12:14:52

@misc{9212578,
  abstract     = {{With the development of high-throughput sequencing technology, the acquisition of large-scale human genome variation data has become increasingly prevalent, offering new opportunities for biomedical research. Therefore, obtaining meaningful information from high-throughput data has become a key task in related scientific research fields. However, how to quickly identify potential pathogenic sites from complex mutation information and establish their association with diseases is still a key challenge in current precision medicine research. To address this issue, this project has developed a mutation-disease association mining tool based on VCF files, which integrates multiple functions such as annotation, text mining, classification and visualization, and constructs an integrated analysis process from gene variation clinical data to disease prediction visualization.

The tool first annotates the variants in the VCF through VEP, screens mutations at the gene level, and obtains research literature on related genes through E-utilities based on the PubMed database. Subsequently, it extracts the association information between diseases and genes or mutations using PubTator, and groups diseases in combination with the MeSH tree classification system to improve the structural clarity and interpretability of the results. To enhance the reliability of the results, the tool supports validation with GWAS data. Finally, the analysis results are stored in the SQL database built by the tool and integrated into the interactive Streamlit visualization interface. Upon querying, the system presents the relationship network between the gene and diseases through a force-directed graph, and displays the structured result data in the form of a list, which can be downloaded in CSV format for further analysis.

This tool provides an integrated full-process analysis pipeline. By introducing literature-based text mining methods, it significantly improves the timeliness of the results and can promptly reflect the latest scientific research progress. This tool has good versatility and extensibility, and can provide strong data support and visualization methods for disease mechanism research, genetic counseling, and the discovery of new disease biomarkers.}},
  author       = {{Cai, Huixin}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{A Comprehensive Mutation Analysis and Disease Association Tool}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
