I’m currently working on a group project for school in the domain of document categorization, but a large portion of my group knows very little about the area. So, I decided I would do a write-up that explains the domain, the basics of common approaches, and things of that nature. What follows is meant to be a high-level crash-course in document categorization, but it is written by someone who is not an expert in the domain. If you see anything wrong with anything in this post, please let me know.
Document categorization deals with the assignment of documents to one or more categories. For the purposes of this article, I’m going to deal only with document categorization using supervised machine learning techniques, but there are other approaches. Unsupervised learning techniques, such as clustering, and semi-supervised learning techniques are also used, but they’re beyond the scope of this write-up.
Supervised machine learning involves learning a function from input data that is labeled with the correct output. In the case of document categorization, the training data typically consists of documents that have been labeled with the correct category or categories. A supervised machine learner will analyze the training documents and attempt to learn some function f(x) (called a classifier) that maps new documents to the correct categories. If this sounds difficult, that’s because it is. First, you need training data. Next, you need to come up with some way to represent the documents to the machine learner. Finally, you have to choose a specific machine learner to use.
How many documents do you need to train the machine learner? The more, the better. In pure supervised learning approaches, that typically means manually identifying training documents and painstakingly labeling each of them with the appropriate categories. This can be a very expensive and difficult process. There has been a lot of work, especially in the area of semi-supervised learning, on reducing the burden of acquiring training data. Those techniques are beyond the scope of this article.
Assuming you have training data, what next? The documents need to be processed into some form that’s suitable for machine learning. Unsurprisingly, you can’t just feed in a massive blob of text and expect magic; you have to do a little more work than that. The most common representation is the document vector space model. Each document is tokenized and broken into its constituent terms. Often stop lists are employed that filter out common, useless terms such as ‘the’. Other preprocessing steps, such as stemming and normalization, are also common. The filtered and processed terms are then used to create a vector for the document. There are a lot of variants, but most are derived from simple term frequency vectors. In a term frequency vector, the dimensions of a vector correspond to terms, and the values of the dimensions correspond to the number of occurrences of the term in the document. Take this is an example:
Document: The brown fox jumped over the brown cow.
|brown ||fox ||jump ||over ||cow ||green ||red ||blue ||… |
|2 ||1 ||1 ||1 ||1 ||0 ||0 ||0 ||0 |
In this example, ‘the’ was filtered due to being a stop word, ‘jumped’ was normalized via stemming, ‘brown’ has a score of 2 because it appeared twice in the document, and the terms that didn’t appear in the document (green, red, blue, etc) all have a score of 0.
Vectors are N dimensional, where N is the number of distinct terms across all the training documents. As you can imagine, N is a really big number in most domains. Typically, vectors are represented using a sparse data structure since individual vectors will consist mostly of zeroes.
In addition to filtering stop words, a lot of work has been done on feature selection, which deals with deciding which terms (and how many) are actually useful for a given categorization task. Feature selection typically involves statistical analysis or domain knowledge to identify which terms are important. Good results have been achieved in many domains by using only a very small subset of the available terms (think tens instead of hundreds).
Choosing a Machine Learner
At this point, the training data has been converted to some form that is useful to a machine learner. Now its time to choose which learner to use. Classification is a huge area of machine learning, so you have a lot of options. Popular choices in document categorization are artificial neural networks, naive bayes, and support vector machines. Support vector machines have been heavily used in recent years due to their ability to work well with high dimensional, noisy data. A popular open-source support vector machine package is libSVM. Unlike artificial neural networks, which require extensive manual setup in the form of network topology, support vector machines have few parameters that require adjustment. Often the default settings for libSVM will achieve good results.
Evaluating the performance of a trained document classifier can be difficult. In document categorization, what is typically desired is a estimate of how well the classifier will perform on new documents that were not part of the training set. Evaluating performance on the same documents that were used to train the classifier will lead to an overly optimistic estimate. Instead, holdout, cross validation, and bootstrapping can be used. 10-fold cross validation has been shown to have low variance and bias, but it can be computationally expensive. This is especially true in document categorization due to the high dimensionality and number of documents involved.
Common measures of performance in document categorization are:
- Accuracy – The number of documents correctly classified divided by the total number of documents.
- Precision – The number of documents correctly assigned to the category divided by the total number of documents assigned to the category.
- Recall – The number of documents that were correctly assigned to a category divided by the total number of documents that actual belong to the category.
- F-Measure – A combination of precision and recall (typically it is the harmonic mean of the two).
Though accuracy is simple to calculate, it is not a very useful metric for document categorization. Assume a simple training set that consists of 100 documents, 10 of which belong to a category. A naive classifier can achieve 90% accuracy by simply classifying each document as ‘false’ for membership in the category. Precision and recall are better measures. The two are often inversely related. Improving the classifier’s precision will lower its recall. When a single number measurement is desired, F-Measure can be used.
Text categorization is a very broad (and very active) area of research. This summary should help point you in the right direction, but it should not be considered exhaustive. If there is interest in future posts on the topic, let me know, and I’ll dive deeper into some of the issues in text categorization.