Research


Classification

In [22], we adapted some of these techniques to implement a text classification scheme that compares favorably with current state-of-the-art classification schemes (see Table 1).

The Newsgroups collection is a list of 1000 articles collected from 20 newsgroups. Among them are groups of closely related newsgroups, such as comp.os.ms-windows.misc and comp.windows.x, or rec.autos and rec.motorcycles. The task is to assign each article to its newsgroup. Roughly of the documents were randomly chosen for training, while the rest were used for testing, and the results were confirmed by repeated trials over random splits. CADE is a Portuguese dataset with 12 classes. The improvements in classification accuracy are statistically significant at better than the 0.05 level.

        Newsgroups    CADE
SVM     73.09         52.84
New     73.46         59.24

Table 1: Percentage of documents accurately classified. (SVM: support vector machines)
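
The evaluation protocol described above (random train/test splits, repeated trials, and a significance check on the accuracy difference) can be illustrated with a minimal sketch. It is not the code used for Table 1. The training fraction is not stated here, so it is left as a parameter. The function names, the classifier interface (a trainer returns a predict function), and the choice of a paired t-statistic are all assumptions made for illustration; the text does not say which significance test was used.

```python
import random
from math import sqrt
from statistics import mean, stdev


def random_split(n_docs, train_fraction, rng):
    """Return (train_indices, test_indices) for one random split."""
    indices = list(range(n_docs))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_docs))
    return indices[:cut], indices[cut:]


def accuracy(predict, docs, labels, test_idx):
    """Percentage of test documents whose predicted label is correct."""
    correct = sum(1 for i in test_idx if predict(docs[i]) == labels[i])
    return 100.0 * correct / len(test_idx)


def compare_over_splits(train_a, train_b, docs, labels,
                        train_fraction, n_trials=10, seed=0):
    """Evaluate two trainers on the same random splits.

    train_a and train_b are hypothetical callables taking
    (train_docs, train_labels) and returning a predict(doc) function.
    Returns two lists of per-trial accuracies, paired by split.
    """
    rng = random.Random(seed)
    acc_a, acc_b = [], []
    for _ in range(n_trials):
        train_idx, test_idx = random_split(len(docs), train_fraction, rng)
        train_docs = [docs[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        predict_a = train_a(train_docs, train_labels)
        predict_b = train_b(train_docs, train_labels)
        acc_a.append(accuracy(predict_a, docs, labels, test_idx))
        acc_b.append(accuracy(predict_b, docs, labels, test_idx))
    return acc_a, acc_b


def paired_t_statistic(acc_a, acc_b):
    """t-statistic of the per-split accuracy differences (b minus a).

    Assumed test, for illustration only. Needs at least two trials.
    """
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    spread = stdev(diffs)
    if spread == 0.0:
        raise ValueError("accuracy differences are identical across trials")
    return mean(diffs) / (spread / sqrt(len(diffs)))
```

The resulting statistic would be compared against a t distribution with n_trials - 1 degrees of freedom. Because random splits of the same collection overlap, the trials are not independent, so a test of this kind should be read as a rough check only.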

We are currently combining the approach above with graphical model inference techniques to yield a richer class of classifiers. For details, please contact me at prasadsnateecsdotberkeleydotedu.




Prasad Santhanam 2007-12-28