Computer Science and Software Engineering

Combining Parts of Speech, Term Proximity, and Query Expansion for Document Retrieval

Lubomir Stanchev, California Polytechnic State University, San Luis Obispo
Eric LaBouve, California Polytechnic State University, San Luis Obispo

Recommended Citation

Postprint version. Published in 13th IEEE International Conference on Semantic Computing Proceedings: Newport Beach, CA, January 30, 2019, pages 150-153.

The definitive version is available at https://doi.org/10.1109/ICOSC.2019.8665507.

Abstract

Document retrieval systems recover documents from a database and order them according to their perceived relevance to a user's search query. This is a difficult task for machines because there is a semantic gap between the literal meaning of the terms in a query and the user's true intentions. The main goal of this study is to modify the Okapi BM25 document retrieval system to improve search results for textual queries over unstructured, textual corpora. This research hypothesizes that Okapi BM25 does not take full advantage of the structure of text inside documents, and that this structure holds valuable semantic information that can be used to increase the model's accuracy. Okapi BM25 is augmented with modifications that account for a term's part of speech, the proximity between a pair of related terms, the position of a term within a document, and query expansion. The study produced 87 modifications, all of which were validated using open source corpora. The top-scoring modification from the validation set was then tested on the LISA corpus, where it performed 10.25% better than Okapi BM25 as measured by mean average precision.
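
For readers unfamiliar with the baseline and the metric named in the abstract, the following is a minimal sketch of standard Okapi BM25 scoring and mean average precision. It is not the authors' implementation and includes none of the 87 modifications. The function names, the parameter defaults k1=1.2 and b=0.75, and the smoothed IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query.

    query: list of terms; docs: list of term lists.
    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant (assumed) that never goes negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A modification of the kind the abstract describes would plausibly adjust the per-term contribution inside the loop (for example, weighting by part of speech or position), but how the paper does so is not stated on this page.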

Disciplines

Computer Sciences

Copyright

2019 IEEE.

Publisher statement

Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


URL: https://digitalcommons.calpoly.edu/csse_fac/267
