DOI: https://doi.org/10.15368/theses.2018.159
Available at: https://digitalcommons.calpoly.edu/theses/1976

Date of Award

3-2019

Degree Name

MS in Computer Science

Department/Program

Computer Science

Advisor

Lubomir Stanchev

Abstract

Document retrieval systems recover documents from a dataset and order them according to their perceived relevance to a user’s search query. This is a difficult task for machines because a semantic gap exists between the literal meaning of the terms in a user’s query and the user’s true intentions. Despite the ambiguity that arises from this lack of context, users still expect the set of documents returned by a search engine to be both highly relevant to their query and properly ordered. This thesis focuses on document retrieval systems that order documents from unstructured, textual corpora using text queries. The main goal of the study is to enhance the Okapi BM25 document retrieval model. The research hypothesizes that the structure of text inside documents and queries holds valuable semantic information that can be incorporated into the Okapi BM25 model to increase its performance. The modifications used to augment Okapi BM25 account for a term’s part of speech, the proximity between a pair of related terms, the proximity of a term with respect to its location in a document, and query expansion. The study resulted in 87 modifications, all of which were validated using open source corpora. The top-scoring modification from the validation phase was then tested on the Lisa corpus, where it performed 10.25% better than Okapi BM25 as measured by mean average precision. Compared against two industry-standard search engines, Lucene and Solr, the top-scoring modification outperforms these systems by up to 21.78% and 23.01%, respectively.

Included in

Computer and Systems Architecture Commons, Data Storage Systems Commons

Master's Theses

Relevance Analysis for Document Retrieval
