DOI: https://doi.org/10.15368/theses.2012.90
Available at: https://digitalcommons.calpoly.edu/theses/773
Date of Award
6-2012
Degree Name
MS in Computer Science
Department/Program
Computer Science
Advisor
Franz Kurfess
Abstract
This thesis examines the suitability of suffix trees for full-text indexing and retrieval. Suffix trees are typically built at the character level, where the tree records which characters follow one another. When suffix trees are instead built over words rather than characters, the resulting tree effectively indexes every word and every sequence of words that occurs in any of the documents. Ukkonen's algorithm is adapted to build word-level suffix trees, but the primary focus is on developing algorithms for searching the suffix tree for exact and approximate (fuzzy) matches to arbitrary query strings. A proof-of-concept implementation is built and compared to a Lucene index for retrieval over a subset of the Reuters RCV1 data set.
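To illustrate the word-level idea, the following is a minimal sketch and not the thesis's implementation. It builds an uncompressed suffix trie by inserting every word-level suffix naively, which takes quadratic time and space in document length, whereas Ukkonen's algorithm constructs a compressed suffix tree in linear time. The names `Node`, `build_word_suffix_trie`, and `find_phrase` are hypothetical, and whitespace tokenization with lowercasing is an assumption made only for this example. Approximate matching is not covered.

```python
class Node:
    """One trie node; each edge to a child is labeled with a single word."""

    def __init__(self):
        self.children = {}     # word -> Node
        self.postings = set()  # (doc_id, start_index) of suffixes passing through


def build_word_suffix_trie(docs):
    """Insert every word-level suffix of every document (naive construction)."""
    root = Node()
    for doc_id, text in docs.items():
        words = text.lower().split()  # assumed tokenization: whitespace only
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.postings.add((doc_id, start))
    return root


def find_phrase(root, query):
    """Return (doc_id, start_index) pairs where the exact word sequence occurs."""
    node = root
    for word in query.lower().split():
        node = node.children.get(word)
        if node is None:
            return set()
    return node.postings


docs = {1: "the quick brown fox", 2: "a quick brown dog"}
root = build_word_suffix_trie(docs)
print(find_phrase(root, "quick brown"))
```

Because every suffix is inserted, any word sequence that occurs in a document corresponds to a path from the root, so an exact phrase lookup takes one dictionary step per query word regardless of collection size.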