Wikipedia Document Classification

Introduction

In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. We applied several kinds of models and studied their behaviour to understand how they differ, along with their advantages and disadvantages, while also learning about advances in this field over the last couple of years.

Dataset

We extracted articles from Wiki10+ using xmltree, bleach and a couple of handcrafted regexes. We also extracted the top tag associated with each article, so that every document has a single tag, which serves as its topic. We reduced the number of topics from 470 to 24 to make the classification problem feasible.

Methods Used

TF-IDF

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take, and we first use the tf-idf formulation.
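To make the row and column layout concrete, here is a minimal sketch that builds a raw-count document-term matrix with the Python standard library. The function name `document_term_matrix`, the whitespace tokenisation and the two toy sentences are assumptions made for illustration, not the project's actual preprocessing.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenized = [doc.lower().split() for doc in docs]
    # Columns: one per distinct term, in sorted order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Rows: one per document; Counter returns 0 for absent terms.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat on the mat"]  # toy corpus
vocabulary, matrix = document_term_matrix(docs)
for row in matrix:
    print(row)
```

Each row is one document and each column one term; a weighting scheme such as tf-idf would replace the raw counts in the cells.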

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The number of times a term occurs in a document is called its term frequency. The inverse document frequency factor diminishes the weight of terms that occur very frequently across the document set and increases the weight of terms that occur rarely. tf–idf is the product of these two statistics.
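As a worked illustration of that product, the sketch below computes one common textbook variant: raw term count multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. The function name `tf_idf` and the toy corpus are assumptions. Libraries such as scikit-learn use smoothed and normalised variants, so their numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

docs = ["the cat sat", "the dog sat", "the dog barked"]  # toy corpus
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "cat" occurs in only one document and receives the largest weight in the first document.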

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved (latent) groups, which account for why some parts of the data are similar. If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
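The generative story can be sketched in a few lines of standard-library Python: draw a topic mixture for the document from a Dirichlet distribution, then for each word position draw a topic from that mixture and a word from that topic. The two topics, their word probabilities and the `alpha` values below are made up for illustration. This shows only the forward (generating) direction; fitting LDA to real documents means inferring the topics and mixtures from the words, which is not shown here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics = list(topic_word)
    words = []
    for _ in range(length):
        # Each word is attributed to exactly one topic drawn from the mixture.
        topic = rng.choices(topics, weights=theta)[0]
        vocab = list(topic_word[topic])
        probs = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics with made-up word probabilities.
topic_word = {
    "sports": {"match": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "party": 0.3, "team": 0.1},
}
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Small `alpha` values tend to produce mixtures dominated by one topic, which matches the idea that a document covers only a small number of topics.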

word2vec

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.
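To show what the network actually sees when it processes a corpus, here is a minimal sketch of how the skip-gram variant of word2vec turns a token sequence into (centre word, context word) training pairs. Word2vec also has a CBOW variant, and this page does not say which one the project used; the function name `skipgram_pairs`, the window size and the toy sentence are assumptions. The sketch covers only pair extraction, not the training of the vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for each word and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()  # toy sentence
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

During training, the two-layer network is adjusted so that each centre word predicts its context words, and the learned weights for each word become its feature vector.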

Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

doc2vec/paragraph2vec

The main purpose of Doc2Vec is associating arbitrary documents with labels, so labels are required. Doc2vec is an extension of word2vec that learns to correlate labels and words, rather than words with other words. The first step is coming up with a vector that represents the “meaning” of a document, which can then be used as input to a supervised machine learning algorithm to associate documents with labels.
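The second step, feeding document vectors to a supervised learner, can be sketched without any library. The example below uses a nearest-centroid classifier with cosine similarity as a stand-in; it is not the classifier used in this project, and the three-dimensional vectors and the labels "science" and "music" are invented placeholders rather than real doc2vec output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fit_centroids(vectors, labels):
    """Average the training vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Placeholder document vectors and labels (hypothetical, not doc2vec output).
train_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["science", "science", "music", "music"]
centroids = fit_centroids(train_vectors, train_labels)
predicted = predict(centroids, [0.7, 0.3, 0.0])
print(predicted)
```

In practice the vectors would come from a trained doc2vec model and the classifier could be any supervised algorithm that accepts fixed-length numeric input.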

Tech used

nltk, scikit-learn, gensim, bleach

Huge shoutout to the library developers! :)

Authors and Contributors

@anuragxel, @NarendraBabu-U have contributed to this project.

Tags

'Information Retrieval and Extraction Course', 'IIIT-H', 'Major Project', 'Wikipedia', 'Topic Modelling', 'Document Classification', 'Deep Learning', 'word2vec', 'doc2vec', 'tf-idf', 'lda', 'nltk', 'python', 'parsing', 'stemming', 'bleach'