Postprint version. Published in 9th IEEE International Conference on Semantic Computing Proceedings: Anaheim, CA, February 7, 2015, pages 93-100.
NOTE: At the time of publication, the author Lubomir Stanchev was not yet affiliated with Cal Poly.
The definitive version is available at https://doi.org/10.1109/ICOSC.2015.7050785.
Given a set of documents and an input query that is expressed in a natural language, the problem of document search is retrieving the most relevant documents. Unlike most existing systems that perform document search based on keywords matching, we propose a search method that considers the meaning of the words in the query and the document. As a result, our algorithm can return documents that have no words in common with the input query as long as the documents are relevant. For example, a document that contains the words “Ford”, “Chrysler” and “General Motors” multiple times is surely relevant for the query “car” even if the word “car” does not appear in the document. Our semantic search algorithm is based on a similarity graph that contains the degree of semantic similarity between terms, where a term can be a word or a phrase. We experimentally validate our algorithm on the Cranfield benchmark that contains 1400 documents and 225 natural language queries. The benchmark also contains the relevant documents for every query as determined by human judgment. We show that our semantic search algorithm produces a higher value for the mean average precision (MAP) score than a keywords matching algorithm. This shows that our approach can improve the quality of the result because the meaning of the words and phrases in the documents and the queries is taken into account.
Number of Pages
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.