LUP Student Papers

LUND UNIVERSITY LIBRARIES

Using Natural Language Processing to Identify Similar Patent Documents

Navrozidis, Jakob LU and Jansson, Hannes LU (2020) In LU-CS-EX EDAM05 20192
Department of Computer Science
Abstract
The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked.
In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords.
We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained, word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process.
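As a rough illustration of the comparison step described above, the sketch below ranks documents by cosine similarity to a query vector. It is not the thesis implementation: the function names (`cosine_similarity`, `rank_documents`), the document ids, and the three-dimensional toy vectors are all assumptions made for this example, standing in for real embeddings produced by a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones would come from an embedding model.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

ranking = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, documents of different lengths can be compared without their embedding magnitudes dominating the score. A clustering step, as mentioned above, would simply restrict `doc_vecs` to the members of the cluster nearest the query before ranking.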
Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.
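For readers unfamiliar with the metric, here is a minimal sketch of how mean average precision can be computed from ranked relevance judgments. The exact variant used in the thesis is not stated in this abstract; this version assumes binary relevance and normalizes by the number of relevant documents that appear in each ranked list. The function names and the toy judgments are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance is a list of booleans, one per returned document,
    in rank order (True means the document was judged relevant).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


# Hypothetical judgments for two test queries.
judgments = [
    [True, False, True],  # relevant documents at ranks 1 and 3
    [False, True],        # relevant document at rank 2
]
map_score = mean_average_precision(judgments)
print(f"MAP = {map_score:.4f}")
```

A score close to 1 means relevant documents tend to appear at the top of every result list, which is why the metric suits a search tool where users inspect only the first few hits.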
author
Navrozidis, Jakob LU and Jansson, Hannes LU
supervisor
organization
course
EDAM05 20192
year
type
H2 - Master's Degree (Two Years)
subject
keywords
natural language processing, patent search, document similarity, word embeddings, sentence embeddings, machine learning
publication/series
LU-CS-EX
report number
2020-05
ISSN
1650-2884
language
English
id
9008699
date added to LUP
2020-08-17 09:14:36
date last changed
2020-08-17 09:14:36
@misc{9008699,
  abstract     = {{The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked.
In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords.
We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained, word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process.
Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.}},
  author       = {{Navrozidis, Jakob and Jansson, Hannes}},
  issn         = {{1650-2884}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{LU-CS-EX}},
  title        = {{Using Natural Language Processing to Identify Similar Patent Documents}},
  year         = {{2020}},
}