Using Natural Language Processing to Identify Similar Patent Documents
(2020) In LU-CS-EX EDAM05 20192
Department of Computer Science
- Abstract
- The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked.
In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords.
We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using a patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process.
Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.
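The two measures named in the abstract, cosine similarity for comparing document vectors and mean average precision (MAP) for evaluation, can be illustrated with a minimal sketch. This is not the thesis implementation: the toy two-dimensional vectors, the document ids, the relevance set, and all function names are hypothetical, and the average precision variant used here (dividing by the number of relevant documents) is an assumption about the metric rather than something the record confirms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: real embeddings would have hundreds of dimensions.
doc_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
query_vec = [1.0, 0.1]
relevant = {"A", "B"}  # documents a human assessor marked as relevant

ranking = rank_by_similarity(query_vec, doc_vecs)
ap = average_precision(ranking, relevant)
print(ranking, round(ap, 4))
```

In this toy run the relevant documents land at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. MAP is then the mean of such values over all test queries.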
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9008699
- author
- Navrozidis, Jakob (LU) and Jansson, Hannes (LU)
- supervisor
- organization
- course
- EDAM05 20192
- year
- 2020
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- natural language processing, patent search, document similarity, word embeddings, sentence embeddings, machine learning
- publication/series
- LU-CS-EX
- report number
- 2020-05
- ISSN
- 1650-2884
- language
- English
- id
- 9008699
- date added to LUP
- 2020-08-17 09:14:36
- date last changed
- 2020-08-17 09:14:36
@misc{9008699,
  abstract     = {{The search for prior art documents is an important, but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic since inventions often are described using abstract and general terms in the patent applications. In addition, synonyms must be taken into account and formulated manually. This means a risk of relevant documents being overlooked. In this Master’s thesis, we investigated the use of natural language processing (NLP) on a huge database of patent applications. The aim was to create a tool that can find similar documents by comparing the title and abstract of a provided document with existing documents in the database, thus removing the need to manually extract keywords. We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained word embeddings and sentence embeddings. We also developed a web application, which allows the user to perform a search using a patent application number or a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated the use of clustering as a way to limit the search domain and speed up the process. Patent associates helped us to evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) outperformed the others, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.}},
  author       = {{Navrozidis, Jakob and Jansson, Hannes}},
  issn         = {{1650-2884}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{LU-CS-EX}},
  title        = {{Using Natural Language Processing to Identify Similar Patent Documents}},
  year         = {{2020}},
}