
LUP Student Papers

LUND UNIVERSITY LIBRARIES

The Geographic Coordinate Prediction of Ancestral Origin for Individuals Worldwide with Less Markers

Yu, Yuexin (2022) BINP51 20212
Degree Projects in Bioinformatics
Abstract
Ancestry informative markers (AIMs) can be used to infer ancestry and thereby narrow down, geographically, the pool of possible perpetrators of a crime. Commercial kits that include AIMs are in high demand in criminal investigations for inferring ancestral geographic origin from an individual's genotype. However, accurate inference has required more than 10,000 SNPs, which makes it extremely costly. A cost-effective kit is limited to several hundred AIMs. AIM-based methods also require accurate biogeographical tools that predict geographic locations as precisely as possible from a small AIM set. Previously, the Geographic Population Structure (GPS) algorithm predicted place of origin from genetic data (i.e., biogeographical ancestry, BGA), placing 83% of individuals worldwide in their country of origin. However, the GPS algorithm employs over 100,000 AIMs to classify individuals into several reference populations, and the requirement to genotype thousands of AIMs on a microarray limits its forensic application. Therefore, in this study, forensic biogeographical AIM sets of two sizes (300 AIMs and 6,805 AIMs) were selected, by feature selection using target permutation, for different forensic investigation purposes. A biogeographical prediction pipeline was also constructed based on the selected AIM sets and a Random Forest regression model. The 300-AIM set can be applied in most forensic investigations because of its small size, and it assigns 85% of individuals to the correct continent. The 6,805-AIM set is less applicable in forensics, but it predicts the geographic coordinates of individuals with almost the accuracy achieved using more than 100,000 AIMs.
Popular Abstract
A machine learning pipeline that can predict the geographic coordinate of a criminal suspect with several hundreds of markers

It is now possible to infer personally identifying characteristics, such as sex or biogeographical ancestry, from DNA found at a crime scene. Such inferences are aided by DNA profiling, for example with short tandem repeats (STRs), or by the biogeographical ancestry of the donor. STRs can usually identify the correct suspect from blood and saliva samples. However, the STR method relies heavily on a reference sample from a known person, which means it depends on a huge database of human samples. What if the suspect is not in the database? Instead of identifying the suspect directly, DNA intelligence tools based on machine learning are now employed to filter and narrow down the pool of potential suspects. Ancestry is one characteristic that can be used in perpetrator inference. Ancestry informative markers (AIMs) can predict ancestry by classifying individuals into distinct populations. Commercial kits that include AIMs are in high demand in criminal investigations for inferring ancestral geographic origin from an individual's genotype. However, accurate inference has required more than 10,000 AIMs, which makes it extremely costly. A cost-effective kit is limited to several hundred AIMs. AIM-based methods also require accurate biogeographical tools that predict geographic locations as precisely as possible from a small AIM set. Previously, the Geographic Population Structure (GPS) algorithm predicted place of origin from genetic data, placing 83% of individuals worldwide in their country of origin. However, the GPS algorithm employs over 100,000 AIMs to classify individuals into several reference populations. This large number of AIMs limits the forensic application of GPS, because thousands of AIMs must be genotyped on a microarray.

In this study, a biogeographical prediction pipeline was developed using selected AIM sets and Random Forest (RF) regression modeling. The AIM sets were selected by feature selection using target permutation, which measures how much a single feature's importance in the model decreases when the values of the dependent variable are randomly shuffled. Two AIM sets of different sizes (300 AIMs and 6,805 AIMs) were selected. The pipeline first calculates the admixture proportions of the given individuals with respect to several reference populations, using whichever of the selected AIM sets the user chooses. The pipeline then predicts the geographic coordinates of the given individuals from their admixture profiles, using the model trained for these individuals.
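The feature selection step described above can be illustrated with a minimal sketch. It is not the thesis code: it assumes scikit-learn's `RandomForestRegressor`, and the function names `target_permutation_scores` and `select_top_features`, the scoring rule (fraction of shuffled runs beaten by the real importance), and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def target_permutation_scores(X, y, n_permutations=20, n_estimators=100, seed=0):
    """Score features by comparing real importances with shuffled-target ones.

    X: (n_samples, n_features) genotype matrix (e.g. 0/1/2 allele counts).
    y: (n_samples,) or (n_samples, n_outputs) target, e.g. coordinates.
    Returns, per feature, the fraction of shuffled runs whose importance
    is lower than the importance obtained with the real target.
    """
    rng = np.random.default_rng(seed)

    def importances(target, random_state):
        model = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state
        )
        model.fit(X, target)
        return model.feature_importances_

    # Importance of each feature when the model sees the real target.
    actual = importances(y, seed)

    # Importances when the target is shuffled, which breaks any real
    # relationship between features and target (rows are permuted together).
    null = np.array(
        [importances(rng.permutation(y), seed + i + 1) for i in range(n_permutations)]
    )

    return (null < actual).mean(axis=0)


def select_top_features(scores, k):
    """Return the indices of the k highest-scoring features."""
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    # Synthetic data only: 10 markers, of which the first two drive the target.
    rng = np.random.default_rng(1)
    X = rng.integers(0, 3, size=(200, 10)).astype(float)
    y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)
    scores = target_permutation_scores(X, y, n_permutations=10, n_estimators=50)
    print(select_top_features(scores, 2))
```

In this sketch a marker is kept only if its importance under the real target consistently exceeds what it achieves by chance under shuffled targets; other scoring rules over the same real and shuffled importances are equally plausible.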

The selected 300-AIM set can be applied in most forensic investigations because of its small size, and it assigns 85% of individuals to the correct continent. The 6,805-AIM set is less applicable in forensics, but it predicts the geographic coordinates of individuals with almost the accuracy achieved using more than 100,000 AIMs.
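The coordinate prediction step of the pipeline (admixture proportions in, latitude and longitude out) can be sketched as follows. This is an assumption-laden illustration rather than the thesis implementation: it uses scikit-learn's `RandomForestRegressor` with a two-column target, the helper names `fit_coordinate_model` and `predict_coordinates` are hypothetical, and the admixture proportions and reference-population coordinates in the demo are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_coordinate_model(admixture, coords, n_estimators=200, seed=0):
    """Fit a Random Forest that maps admixture profiles to coordinates.

    admixture: (n_samples, n_reference_populations) proportions per individual.
    coords:    (n_samples, 2) array of [latitude, longitude].
    """
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    model.fit(admixture, coords)  # a 2-column target gives a multi-output model
    return model


def predict_coordinates(model, admixture):
    """Return an (n_samples, 2) array of predicted [latitude, longitude]."""
    return model.predict(admixture)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Synthetic admixture proportions over 3 hypothetical reference populations.
    adm = rng.dirichlet(np.ones(3), size=120)
    # Hypothetical [lat, lon] locations for the reference populations.
    centroids = np.array([[50.0, 10.0], [0.0, 20.0], [35.0, 105.0]])
    coords = adm @ centroids
    model = fit_coordinate_model(adm[:100], coords[:100], n_estimators=50)
    print(predict_coordinates(model, adm[100:])[:3])
```

Treating latitude and longitude as plain regression targets ignores the wrap-around of longitude at ±180°, which is a simplification of this sketch.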

Master’s Degree Project in Bioinformatics 45 credits 2022
Department of Biology, Lund University

Advisor: Eran Elhaik
Advisor's Department: Department of Biology, Lund University
author: Yu, Yuexin
supervisor: Eran Elhaik
course: BINP51 20212
year: 2022
type: H2 - Master's Degree (Two Years)
language: English
id: 9102875
date added to LUP: 2022-11-07 11:24:56
date last changed: 2022-11-07 11:24:56
@misc{9102875,
  abstract     = {{The perpetrator inference in the crime scene can be narrowed down in geography using ancestry informative markers (AIMs) to infer ancestry. The commercial kits including AIMs are in high demand in criminal investigations to infer the ancestral geographic origin from the individual’s genotype. However, more than 10,000 SNPs were used to carry out accurate inferences, causing extremely high costs. A cost-effective kit limits the size of AIMs to several hundreds. In addition, AIMs-based-methods also requires accurate biogeographical tools to predict geographic locations as precisely as possible on a small size AIMs set. Previously, the prediction of human’s place of origin by genetic data (i.e., BGA) has been successfully realized with an 83% worldwide individuals’ placement in their country of origin, done by Geographic Population Structure (GPS) algorithm. However, the GPS algorithm employs over 100,000 AIMs in the classification of individuals into several reference populations. This limited the forensic application due to the requirement of thousands of AIMs genotyped on a microarray. Therefore, in this study, forensic biogeographical AIMs sets in two sizes (300 AIMs and 6,805 AIMs) were selected, by feature selection using target permutation, for different forensic investigation purposes. A biogeographical prediction pipeline was also constructed based on selected AIMs sets and Random Forest regression model. 300 AIMs set can be widely applied to most of forensic investigation due to its small number of AIMs, and can predict 85% individuals into the correct continents. 6,805 AIMs set is less applicable in forensic, but almost predict the geographic coordinates of individuals in the accuracy of the prediction using more than 100,000 AIMs.}},
  author       = {{Yu, Yuexin}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{The Geographic Coordinate Prediction of Ancestral Origin for Individuals Worldwide with Less Markers}},
  year         = {{2022}},
}