Evaluation of methods for full-text search in patents

Lagerquist, Otto; Toreheim, Ebba

Lagerquist, Otto (LU) and Toreheim, Ebba (2021) EITM01 20211
Department of Electrical and Information Technology

Abstract: In this thesis we have evaluated methods for doing full-text searches in patent documents. The aim of patent searches is to find evidence and relevant documents when an invalidity search is done on a patent.

With three different language models, BOW, SPECTER and SBERT, we have evaluated the results of two different text segmentation methods, greedy sentence split and paragraph split, and two different clustering methods, Euclidean and spherical. We have found that spherical clustering outperforms Euclidean clustering and that both segmentation methods work well for finding relevant parts of documents, each with its own advantages and drawbacks.

The configurations were evaluated in four stages, where the first three were automatic and the last one was a manual evaluation by employees at AWA and Lund University. We conclude that our methods have great potential, but more testing on a better-engineered test set, as well as more data from the manual evaluation, is needed to draw further conclusions.
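The two segmentation methods named in the abstract can be illustrated with a minimal sketch. This record does not describe how the thesis implements them, so everything below is an assumption: the function names, the regex-based sentence boundary, and the `max_words` budget are hypothetical. Here "greedy sentence split" is read as packing consecutive sentences into a segment until a word budget would be exceeded, and "paragraph split" as cutting on blank lines.

```python
import re


def paragraph_split(text):
    """Split on blank lines; each non-empty paragraph becomes one segment."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def greedy_sentence_split(text, max_words=50):
    """Greedily pack consecutive sentences into segments.

    A segment is closed as soon as adding the next sentence would push it
    past max_words (an assumed budget). A single sentence longer than
    max_words is kept whole as its own segment.
    """
    # Naive sentence boundary: whitespace following '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        segments.append(" ".join(current))
    return segments
```

Paragraph splitting follows the author's own layout but yields segments of very uneven length, while the greedy variant keeps segments under a fixed size at the cost of sometimes cutting across topic boundaries.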
Popular Abstract: Every day the amount of publicly available information increases, and with it the need to navigate this jungle of information quickly and efficiently. Wouldn't it be nice if there was a method that understood what you would like to find and then picked out the most relevant texts for you? Well, that's exactly what we've been trying to find, and we can tell you that the results are promising.

How many times have you had to go back and change your search query because you did not find what you searched for? How many times have you found tons of documents but none that seemed to contain precisely what you were looking for? We believe there is a solution to these problems. The language models of today are actually quite capable of capturing the meaning of shorter texts. Combine them with a method for cutting up long documents into smaller parts and we are almost good to go.

The scenario described is especially troublesome for patent attorneys in their search for prior art, for instance when doing an invalidity search. The most common method today is searching in a patent database using keywords, which makes it easy to miss relevant documents due to synonyms and language variability between authors. Patent documents are also very long, and sometimes all it takes is a short relevant sentence in a long, otherwise irrelevant, document to prove that something is already known.

We believe that the work in our thesis is a good step towards a tool that could help patent attorneys in their work. By doing searches based on claims we've managed to find shorter passages of patent documents that were deemed relevant by professionals. Even though this might not do the whole job for the patent attorneys, we see great potential for an application that gives a list of potentially relevant documents that otherwise might have been missed.

In our thesis we have tried different methods to segment the long patents into smaller texts, which can then be represented by embeddings created by our language models. Of course this leads to a very large number of texts, but by using clustering we can efficiently limit the number of texts we have to search to find the right hits.
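The idea of using clustering to limit the search can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the helper names, the toy two-dimensional embeddings, and the fixed centroids are all hypothetical. "Euclidean" assignment picks the centroid at the smallest distance, while "spherical" assignment normalizes the vectors and picks the centroid with the highest cosine similarity, so only direction matters.

```python
import math


def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_euclidean(v, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(v, centroids[i])),
    )


def nearest_spherical(v, centroids):
    """Index of the centroid with the highest cosine similarity."""
    unit = normalize(v)
    return max(
        range(len(centroids)),
        key=lambda i: dot(unit, normalize(centroids[i])),
    )


def search(query, segments, centroids, top_k=3):
    """Rank only the segments that share the query's cluster.

    segments is a list of (text, embedding) pairs; the embeddings would
    come from a language model, here they are supplied by the caller.
    """
    target = nearest_spherical(query, centroids)
    unit_query = normalize(query)
    candidates = [
        (dot(unit_query, normalize(emb)), text)
        for text, emb in segments
        if nearest_spherical(emb, centroids) == target
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:top_k]]
```

With centroids `[[0.1, 0.0], [0.0, 3.0]]`, the vector `[0.1, 0.2]` is assigned to the first centroid by Euclidean distance but to the second by cosine similarity, which shows how the two assignment rules can disagree when vector lengths differ.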

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9058646

author: Lagerquist, Otto (LU) and Toreheim, Ebba
supervisor: Fredrik Edman (LU)
organization: Department of Electrical and Information Technology
course: EITM01 20211
year: 2021
type: H2 - Master's Degree (Two Years)
subject: Technology and Engineering
keywords: natural language processing, full-text patent search, legal tech, document similarity, sentence embeddings, clustering
report number: LU/LTH-EIT 2021-837
language: English
id: 9058646
date added to LUP: 2021-08-12 10:32:28
date last changed: 2021-08-12 10:32:28

@misc{9058646,
  abstract     = {{In this thesis we have evaluated methods for doing full-text searches in patent documents. The aim of patent searches is to find evidence and relevant documents when an invalidity search is done on a patent. 

With three different language models, BOW, SPECTER and SBERT, we have evaluated the results of two different text segmentation methods, greedy sentence split and paragraph split, and two different clustering methods, Euclidean and spherical. We have found that spherical clustering outperforms Euclidean clustering and that both segmentation methods work well for finding relevant parts of documents, each with its own advantages and drawbacks.

The configurations were evaluated in four stages, where the first three were automatic and the last one was a manual evaluation by employees at AWA and Lund University. We conclude that our methods have great potential but more testing on a better engineered test set as well as more data from the manual evaluation is needed to draw further conclusions.}},
  author       = {{Lagerquist, Otto and Toreheim, Ebba}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Evaluation of methods for full-text search in patents}},
  year         = {{2021}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
