Fuzzy Matching and Merging of Family Trees using a Graph Database

Lundberg, Hampus LU (2015) EITM01 20141
Department of Electrical and Information Technology
Abstract
The association for computer-aided genealogy research of Sweden (DIS) is investigating the possibility of a nationwide genealogical database (RGD) of Sweden's historical population. The finished database should contain basic information about individuals' names, births, marriages and deaths, and about individuals' relationships such as father, mother, husband/wife and children. This will make up a kind of graph of the ancestry of Sweden's historical population, where persons are connected to each other according to ancestry. The idea is that different genealogists should add already finished pedigrees (family trees). The problem is that the same pedigree can be inserted by two different genealogists. RGD has to find these duplicates and merge them. In this thesis a small-scale test application for RGD is built using the graph database Neo4j and the document search tool Lucene. The focus is on finding and merging duplicated pedigrees. The test application was able to upload files containing multiple pedigrees and merge them into a graph stored in Neo4j. This merging process had linear time complexity in relation to the number of families merged. A family in this thesis means a family unit containing father, mother and children. Storing the biggest available file (20000 persons and 6500 families) in an empty database and then inserting the second biggest (10000 persons and 3000 families) took about 2 minutes and 30 seconds. The result was about 1200 family merges. This was done on a Lenovo N500 laptop with a dual-core T3200 CPU (2 GHz) and 3 GB of RAM. The accuracy of the algorithm was compared to that of a previously made application. Both applications were tested on the same data set. The result of the tests was the overlap of families merged by both applications. Precision and recall were then calculated for the test application under the assumption that the previous application produced all the correct family merges. Two precision and recall scores were estimated. The better of the two gave a precision of 97% and a recall of 81%. The F-score for the system was 88%.
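The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that each family merge is represented as a pair of family identifiers and that the previous application's merges are treated as the reference set, as the abstract states. The function names and the sample identifiers are hypothetical and are not taken from the thesis. The reported F-score of 88% is consistent with the balanced F1 measure (the harmonic mean of precision and recall), which is what the sketch computes.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted, reference):
    """Score predicted merges against a reference set treated as correct.

    Both arguments are iterables of hashable merge identifiers, here
    (family_a, family_b) pairs. This representation is an assumption.
    """
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)  # merges found by both applications
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical merges: three of the four predicted merges are in the reference.
reference_merges = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}
predicted_merges = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A9", "B9")}
print(precision_recall_f1(predicted_merges, reference_merges))

# The figures reported in the abstract: precision 97%, recall 81%.
print(round(f_score(0.97, 0.81), 2))  # 0.88
```

Plugging the reported precision and recall into the F1 formula gives 2 · 0.97 · 0.81 / (0.97 + 0.81) ≈ 0.88, matching the F-score stated above.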
Popular Abstract (translated from Swedish)
The popularity of genealogy has grown in recent years. This is mainly because the rise of the internet has made it considerably easier. Today private individuals can do genealogical research from home, whereas they previously needed help from archivists and professional genealogists. Large amounts of genealogical data have been digitized and published online by various international companies and non-profit organizations, for example the company Ancestry and the Mormon church (Church of Latter Day Saints). In Sweden there is the association for computer-aided genealogy (DIS), which is the world's oldest association of its kind. DIS is currently working on taking the next step in the digitization of genealogy. This project is about building a nationwide genealogical database (RGD) covering Sweden's entire historical population. RGD is intended to contain basic information about individuals, such as name, sex, date of birth, parish of birth, date of death and parish of death, as well as individuals' relationships such as father, mother and children. The information is intended to come from several different researchers, with all of the research merged into a single database. Since what is stored in the database is many individuals and their family ties, the database can be thought of as a giant graph of Sweden's historical population.
author
Lundberg, Hampus LU
supervisor
organization
course
EITM01 20141
year
type
H2 - Master's Degree (Two Years)
subject
keywords
Fuzzy, Graph, Matching, Graph Database, Neo4j, Lucene, Family Trees
report number
LU/LTH-EIT 2015-432
language
English
id
5154052
date added to LUP
2015-03-17 15:34:36
date last changed
2015-03-17 15:34:36
@misc{5154052,
  abstract     = {The association for computer-aided genealogy research of Sweden (DIS) is investigating the possibility of a nationwide genealogical database (RGD) of Sweden's historical population. The finished database should contain basic information about individuals' names, births, marriages and deaths, and about individuals' relationships such as father, mother, husband/wife and children. This will make up a kind of graph of the ancestry of Sweden's historical population, where persons are connected to each other according to ancestry. The idea is that different genealogists should add already finished pedigrees (family trees). The problem is that the same pedigree can be inserted by two different genealogists. RGD has to find these duplicates and merge them. In this thesis a small-scale test application for RGD is built using the graph database Neo4j and the document search tool Lucene. The focus is on finding and merging duplicated pedigrees. The test application was able to upload files containing multiple pedigrees and merge them into a graph stored in Neo4j. This merging process had linear time complexity in relation to the number of families merged. A family in this thesis means a family unit containing father, mother and children. Storing the biggest available file (20000 persons and 6500 families) in an empty database and then inserting the second biggest (10000 persons and 3000 families) took about 2 minutes and 30 seconds. The result was about 1200 family merges. This was done on a Lenovo N500 laptop with a dual-core T3200 CPU (2 GHz) and 3 GB of RAM. The accuracy of the algorithm was compared to that of a previously made application. Both applications were tested on the same data set. The result of the tests was the overlap of families merged by both applications. Precision and recall were then calculated for the test application under the assumption that the previous application produced all the correct family merges. Two precision and recall scores were estimated. The better of the two gave a precision of 97% and a recall of 81%. The F-score for the system was 88%.},
  author       = {Lundberg, Hampus},
  keyword      = {Fuzzy,Graph,Matching,Graph Database,Neo4j,Lucene,Family Trees},
  language     = {eng},
  note         = {Student Paper},
  title        = {Fuzzy Matching and Merging of Family Trees using a Graph Database},
  year         = {2015},
}