Using Natural Language Processing to Identify Similar Patent Documents

Navrozidis, Jakob; Jansson, Hannes

Navrozidis, Jakob (LU) and Jansson, Hannes (LU) (2020) In LU-CS-EX EDAM05 20192
Department of Computer Science

Abstract: The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked.
In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords.
We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained, word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process.
Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.
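
Illustration (not taken from the thesis): the abstract names two building blocks, ranking documents by the cosine similarity of their numerical representations and scoring the ranking with mean average precision (MAP). The sketch below shows one common way to compute both, using only the Python standard library. The function names, the toy two-dimensional vectors, and the document ids are hypothetical. The average-precision normalisation (dividing by the total number of relevant documents) is an assumption, since the record does not state which variant the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k that hold a relevant document.

    Assumption: normalised by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: hypothetical 2-d "embeddings" standing in for model output.
doc_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
query_vec = [1.0, 0.1]

ranking = rank_by_similarity(query_vec, doc_vecs)
ap = average_precision(ranking, {"A", "B"})
```

In a real system the vectors would come from an embedding model such as SBERT and would have several hundred dimensions; the ranking and scoring logic stays the same.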

Please use this URL to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9008699

author: Navrozidis, Jakob (LU) and Jansson, Hannes (LU)
supervisor:
organization: Department of Computer Science
course: EDAM05 20192
year: 2020
type: H2 - Master's Degree (Two Years)
subject: Technology and Engineering
keywords: natural language processing, patent search, document similarity, word embeddings, sentence embeddings, machine learning
publication/series: LU-CS-EX
report number: 2020-05
ISSN: 1650-2884
language: English
id: 9008699
date added to LUP: 2020-08-17 09:14:36
date last changed: 2020-08-17 09:14:36

@misc{9008699,
  abstract     = {{The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked.
In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords.
We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained, word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process.
Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.}},
  author       = {{Navrozidis, Jakob and Jansson, Hannes}},
  issn         = {{1650-2884}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{LU-CS-EX}},
  title        = {{Using Natural Language Processing to Identify Similar Patent Documents}},
  year         = {{2020}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
