Available at: http://digitalcommons.calpoly.edu/theses/1462
Date of Award
MS in Computer Science
Applications of Non-negative Matrix Factorization (NMF) are ubiquitous, and several well-known algorithms are available. This paper is concerned with the preprocessing of documents and how that preprocessing affects document classification. Classification is performed using Non-negative Matrix Factorization and a Support Vector Machine, and it is run over a range of inner dimensions to see how the proposed initialization compares with random initialization. Several of the well-known algorithms call for a random initialization of the matrices before starting an iterative process that converges to a locally best solution. Not only is the initialization often random, but choosing the size of the inner dimension also remains a difficult and poorly understood task.

This paper explores the possible gains in categorization accuracy from a more intelligently chosen initialization, as opposed to a random one, using the Reuters-21578 document collection. It presents two new approaches for initializing the data matrix. The first uses the words that are most important to a given document and least important to all other documents. The second incorporates the non-stop words that appear in the title and header of each document. The motivation is that the title usually tells the reader what the document is about, so its words should be relevant to the document's category. This paper also presents a complete framework for testing and comparing different Non-negative Matrix Factorization initialization methods. A thorough overview of the implementation and results is presented to ease interfacing with future work.
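To make the comparison of initializations concrete, the following is a minimal, dependency-free sketch of NMF with the standard multiplicative updates, run once from a random start and once from a seeded start on a toy term-document matrix. It is not the implementation from this work: the toy matrix `V`, the helper names, and the seeding rule (copying two document columns into `W`) are all assumptions made for illustration, and the seeding rule is a hypothetical stand-in rather than either of the two approaches described above. The Support Vector Machine step is omitted.

```python
import random

EPS = 1e-9  # guards against division by zero in the updates


def matmul(A, B):
    """Plain-Python product of A (n x k) and B (k x m)."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def random_init(n_rows, n_cols, rng):
    """Strictly positive random matrix."""
    return [[rng.random() + EPS for _ in range(n_cols)] for _ in range(n_rows)]


def nmf(V, W, H, n_iter=200):
    """Multiplicative updates for min ||V - WH||_F^2 with W, H >= 0."""
    for _ in range(n_iter):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + EPS) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + EPS) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Toy term-document matrix (rows: terms, columns: documents); made-up counts.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 4, 3],
    [0, 1, 2, 3],
    [0, 0, 1, 1],
]
k = 2  # inner dimension
rng = random.Random(0)

# Random initialization of both factors.
W_rand0 = random_init(len(V), k, rng)
H_rand0 = random_init(k, len(V[0]), rng)

# Hypothetical seeded initialization: columns of W copy documents 0 and 2,
# offset by 0.1 so that no entry starts at zero (zeros never recover under
# multiplicative updates).
W_seed0 = [[row[0] + 0.1, row[2] + 0.1] for row in V]
H_seed0 = [[0.5] * len(V[0]) for _ in range(k)]

W_rand, H_rand = nmf(V, W_rand0, H_rand0)
W_seed, H_seed = nmf(V, W_seed0, H_seed0)

err_random = frob_error(V, W_rand, H_rand)
err_seeded = frob_error(V, W_seed, H_seed)
print(f"random init error: {err_random:.4f}")
print(f"seeded init error: {err_seeded:.4f}")
```

Because the updates only converge to a local optimum, the two starting points can end at different factorizations; on real data the factors would then be passed to a classifier to compare accuracy, and the experiment repeated for several values of `k`.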