The ability to sift through massive amounts of unstructured text data in a meaningful way can yield tremendous value for businesses across many domains.
One field that derives value from unstructured text data is text mining. Text mining is concerned with extracting quality information from unstructured text and processing it into a form that computers and statistical models can consume, with the goal of identifying patterns and knowledge that drive value. This high-quality information can in turn be applied to a variety of problems, including machine-learning-based classification.
This project explored state-of-the-art mechanisms, based on word vectors, for capturing information from unstructured text for the purpose of classification.
Previous research has applied Neural Network Language Models (NNLMs) to document classification, and word vector representations have been used to measure semantic relationships in text. The two had not previously been combined and shown to improve text classification performance. This project endeavored to show that the inference and clustering abilities of word vectors, coupled with the power of a neural network, can produce more accurate classification predictions. The first phase of work focused on word vector representations for classification purposes. This approach included analyzing two distinct text sources with pre-labeled binary outcomes, establishing a benchmark metric, and comparing that benchmark against a classifier that uses word vector representations as its feature space.
The results showed promise: the word vector approach obtained an area under the curve (AUC) of 0.95, compared with 0.93 for the benchmark case. The second phase of the project extended the neural network model used in phase one to represent each document in its entirety rather than word by word. Preliminary results indicated a slight improvement over the baseline model of approximately 2-3 percent.
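As a rough illustration of the phase-one idea, the sketch below averages the word vectors of each document to form its features and then scores a classifier by area under the curve. It is not the project's implementation. The two-dimensional vectors, the four toy documents, the helper names `average_vector` and `roc_auc`, and the stand-in scoring rule are all hypothetical, and the curve is assumed here to be the ROC curve, which the text does not state.

```python
from typing import Dict, List, Sequence


def average_vector(tokens: Sequence[str],
                   vectors: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical 2-d word vectors; real ones would be learned from a corpus.
TOY_VECTORS = {
    "great": [1.0, 0.2],
    "good": [0.8, 0.1],
    "bad": [-0.9, 0.3],
    "awful": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

# Hypothetical documents with pre-labeled binary outcomes.
docs = [
    (["great", "movie"], 1),
    (["good", "movie"], 1),
    (["bad", "movie"], 0),
    (["awful", "movie"], 0),
]

features = [average_vector(tokens, TOY_VECTORS, 2) for tokens, _ in docs]
labels = [label for _, label in docs]

# Stand-in for a trained classifier: use the first feature dimension as the score.
scores = [f[0] for f in features]
auc = roc_auc(labels, scores)
print(f"AUC on toy data: {auc:.2f}")
```

In practice the stand-in scoring rule would be replaced by a trained model, and the same AUC computation would be applied to both the benchmark and the word vector classifier on held-out data so that the two figures are comparable.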