Postprint version. Published in 2017 IEEE 11th International Conference on Semantic Computing (ICSC), February 1, 2017, pages 1-8.
The definitive version is available at https://doi.org/10.1109/ICSC.2017.13.
Given a set of documents and an input query expressed in natural language, the document search problem is to retrieve all relevant documents, ordered by their degree of relevance. Semantic document search fetches not only documents that contain words from the input query, but also documents that are semantically relevant. For example, the query "friendly pets" will also consider documents that contain the words "dog" and "cat", among others. One way to implement semantic search is to use a probabilistic graph in which the input query is connected to the documents through paths that contain semantically similar words and phrases. We use WordNet to initially populate the graph. Each edge in the graph is labeled with the conditional probability that the destination node is relevant given that the source node is relevant. Our semantic document search algorithm works in two phases. In the first phase, we find all documents in the graph that are close to the input query and create a bounded subgraph that includes the query, the found documents, and the paths that connect them. In the second phase, we simulate multiple random walks. Each random walk starts at the input query and continues until a document is reached, a jump outside the bounded subgraph is made, or the number of allowed jumps is exhausted. This allows us to rank the documents by the number of random walks that terminate in them. We experimentally validated the algorithm on the Cranfield benchmark, which contains 1400 documents and 225 natural language queries. We show that our algorithm achieves a higher mean average precision (MAP) than both a keyword-based search algorithm and a previously published algorithm that relies on a variation of the probabilistic graph.
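The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper, and several details are assumptions made only for illustration: the graph is a dictionary of dictionaries of edge probabilities, "close to the query" is taken to mean within `max_hops` edges, and each jump is sampled in proportion to the edge labels. The names `bounded_subgraph`, `rank_documents`, `max_hops`, `walks`, and `max_jumps` are hypothetical, as is the toy graph.

```python
import random
from collections import Counter, deque


def bounded_subgraph(graph, query, documents, max_hops):
    """Phase 1 (assumed form): collect nodes within max_hops edges of the query."""
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        # Do not expand past the hop limit or through a document node.
        if depth[node] == max_hops or node in documents:
            continue
        for neighbor in graph.get(node, {}):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)


def rank_documents(graph, query, documents, max_hops=3,
                   walks=10000, max_jumps=5, seed=0):
    """Phase 2: simulate random walks and rank documents by walk terminations."""
    rng = random.Random(seed)
    inside = bounded_subgraph(graph, query, documents, max_hops)
    hits = Counter()
    for _ in range(walks):
        node = query
        for _ in range(max_jumps):          # stop when the jumps are exhausted
            edges = graph.get(node, {})
            if not edges:                   # dead end: nothing to jump to
                break
            neighbors = list(edges)
            # Assumption: jump probability is proportional to the edge label.
            weights = [edges[n] for n in neighbors]
            node = rng.choices(neighbors, weights=weights)[0]
            if node not in inside:          # jumped outside the bounded subgraph
                break
            if node in documents:           # the walk terminates in a document
                hits[node] += 1
                break
    return [doc for doc, _ in hits.most_common()]


# Hypothetical toy graph: a query, two intermediate terms, two documents.
example_graph = {
    "friendly pets": {"pets": 0.9, "friendly": 0.1},
    "pets": {"doc_dog": 0.8, "doc_cat": 0.2},
    "friendly": {"doc_cat": 1.0},
}
example_docs = {"doc_dog", "doc_cat"}
ranking = rank_documents(example_graph, "friendly pets", example_docs)
```

In this toy graph, a walk reaches `doc_dog` with probability 0.9 × 0.8 = 0.72 and `doc_cat` with probability 0.9 × 0.2 + 0.1 = 0.28, so the count of terminating walks approximates those values as the number of walks grows.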
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.