SVM Text Classification

Support Vector Machines (SVMs) are learning systems that use a hypothesis space of separating functions in a high-dimensional feature space. With rigorous mathematical foundations in optimization theory and statistical learning theory, this approach, first introduced by Vapnik, has been shown to outperform many other systems in a variety of machine learning applications. One successful use of SVM algorithms is text categorization: assigning documents to a fixed number of predefined categories based on their content.

The document representations commonly used in information retrieval (IR) provide a natural mapping for constructing the Mercer kernels used in SVM algorithms. Hypertext and plaintext documents, however, have no natural vector representation, so explicit kernel structures have to be constructed. A number of construction algorithms have been proposed and investigated for this purpose: TFIDF (Bag of Words), String, Syllable and Composite kernels. In this paper we investigate the efficacy of these methods for multi-category classification of hypertext documents.
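As a rough illustration of the TFIDF (Bag of Words) kernel named above, the sketch below builds sparse TF-IDF vectors and takes their inner product, which is a valid Mercer kernel because it is a dot product in an explicit feature space. The function names (tfidf_vectors, bow_kernel, gram_matrix), the whitespace tokenization, the raw-count times log(N/df) weighting, and the toy documents are all assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse, L2-normalized TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting scheme: raw term count times log(N / df).
        vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def bow_kernel(u, v):
    """Inner product of two sparse vectors, iterating over the shorter one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def gram_matrix(vectors):
    """Kernel (Gram) matrix K[i][j] = <x_i, x_j> over all document pairs."""
    return [[bow_kernel(a, b) for b in vectors] for a in vectors]


# Toy corpus (hypothetical), tokenized on whitespace.
docs = [
    "support vector machines classify text".split(),
    "kernels map text into feature space".split(),
    "string kernels compare character subsequences".split(),
]
K = gram_matrix(tfidf_vectors(docs))
```

A matrix like K is the form that SVM implementations accepting a precomputed kernel typically expect. Documents that share no terms get a kernel value of zero, which is one motivation for the string and syllable kernels that compare documents below the word level.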

We discuss the theory and computational construction algorithms for each of the aforementioned kernel structures and the possibility of creating and utilizing composite kernel structures to simulate information boosting. We also use the SVMPython package on a set of 1000 categorized documents to empirically evaluate the performance of the discussed kernels.
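One way to picture a composite kernel is as a non-negative weighted sum of base kernels, which is again a Mercer kernel because sums and positive scalings of Mercer kernels preserve positive semi-definiteness. The sketch below is only an illustration under stated assumptions: word_kernel and spectrum_kernel are simplified stand-ins (the spectrum kernel counts shared contiguous character p-grams and is not necessarily the string kernel studied in the paper), and the helper names normalized and composite_kernel, along with the equal weights, are hypothetical.

```python
import math
from collections import Counter


def word_kernel(s, t):
    """Bag-of-words kernel: dot product of word-count vectors."""
    cs, ct = Counter(s.split()), Counter(t.split())
    return float(sum(c * ct[w] for w, c in cs.items()))


def spectrum_kernel(s, t, p=3):
    """Count shared contiguous character p-grams (a simple string kernel)."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return float(sum(c * ct[g] for g, c in cs.items()))


def normalized(kernel):
    """Wrap a kernel so that k(x, x) == 1, putting base kernels on one scale."""
    def k(s, t):
        denom = math.sqrt(kernel(s, s) * kernel(t, t))
        return kernel(s, t) / denom if denom > 0 else 0.0
    return k


def composite_kernel(kernels, weights):
    """Non-negative weighted sum of base kernels."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative to keep a valid kernel")

    def k(s, t):
        return sum(w * base(s, t) for base, w in zip(kernels, weights))
    return k


# Equal weights are an arbitrary choice made for this example.
k = composite_kernel(
    [normalized(word_kernel), normalized(spectrum_kernel)],
    [0.5, 0.5],
)
a = "support vector machines for text"
b = "support vector kernels for hypertext"
similarity = k(a, b)
```

Normalizing each base kernel before combining keeps one kernel from dominating the sum simply because its raw values are larger. In practice the weights would be chosen by validation rather than fixed by hand.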

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.