September 23, 2013
Recently I have developed an interest in analyzing data to find trends and predict future events, and I have started working on a few POCs in data analytics, such as predictive analysis and text mining. This post is about data mining, more specifically document classification using the R programming language, one of the most powerful languages for statistical analysis.
Document classification, or document categorization, assigns documents to one or more classes/categories, either manually or algorithmically. Today we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we create a machine learning model using a number of text documents (called the corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated can assign a class to any new text supplied to it.
Let’s have a look at what happens inside the black box in the figure above. We can divide the work into these steps:
- Creation of Corpus
- Preprocessing of Corpus
- Creation of Term Document Matrix
- Preparing Features & Labels for Model
- Creating Train & test data
- Running the model
- Testing the model
We have speeches by the US presidential contestants Mr. Obama & Mr. Romney. We need to create a classifier that can tell whether a new speech belongs to Mr. Obama or Mr. Romney.
Implementation
We implement the document classification using the tm and plyr packages. As a preliminary step, we need to load the required libraries into the R environment:
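A minimal sketch of the loading step. The post names tm and plyr; the class package is an assumption here, included because it provides a knn() function for the KNN model used later.

```r
# Packages named in the post: tm (text mining) and plyr (data frame helpers).
# "class" is an assumption: it provides knn(), one common KNN implementation in R.
libs <- c("tm", "plyr", "class")

# Attach each package; library() stops with an error if one is not installed.
for (lib in libs) {
  library(lib, character.only = TRUE)
}
```

If a package is missing, install.packages() with the package name installs it from CRAN.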
- Step I: Corpus creation:
In our case, we create two corpora, one for each contestant.
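As a rough sketch of this step, the snippet below builds one corpus per contestant with tm. The short texts are made-up placeholders, not the real speeches, and the variable names are only illustrative.

```r
library(tm)

# Placeholder snippets standing in for the real speech files (hypothetical text).
obama_texts  <- c("We will invest in education and clean energy.",
                  "Health care reform helps working families.")
romney_texts <- c("We will cut taxes and reduce the deficit.",
                  "Small business creates jobs for America.")

# One corpus per contestant; VectorSource treats each element as one document.
corpora <- list(
  obama  = VCorpus(VectorSource(obama_texts)),
  romney = VCorpus(VectorSource(romney_texts))
)

# Number of documents in each corpus
sapply(corpora, length)
```

When the speeches are stored as text files, one folder per contestant, tm's DirSource() can take the place of VectorSource(); the folder layout is up to you.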
- Step II: Preprocessing of Corpus
Preprocessing involves removing punctuation, extra white space and stop words such as is, the, for, etc.
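The cleaning described above could look roughly like this with tm's built-in transformations. The helper name clean_corpus is hypothetical, and the lower-casing step is an addition of mine so that capitalised stop words such as "This" are matched too.

```r
library(tm)

# Hypothetical helper: apply the cleaning steps to every document in a corpus.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))       # assumption: lower-case first
  corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation marks
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop is, the, for, ...
  corpus <- tm_map(corpus, stripWhitespace)                    # collapse gaps left behind
  corpus
}

raw     <- VCorpus(VectorSource("This is the plan, for   the future!"))
cleaned <- clean_corpus(raw)

# Text of the first (only) document after cleaning
cleaned_text <- trimws(content(cleaned[[1]]))
cleaned_text
```

The white space is stripped last because removing stop words leaves extra spaces behind.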
- Step III: Term Document Matrix
This step creates the term document matrix, i.e. a matrix holding the frequency of each term that occurs in a collection of documents. For example:
D1 = “I love Data analysis”
D2 = “I love to create data models”
TDM (ignoring case):
Term       D1   D2
I           1    1
love        1    1
data        1    1
analysis    1    0
to          0    1
create      0    1
models      0    1
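To see this on the D1/D2 example, here is a small sketch using tm's TermDocumentMatrix(). The control settings are my choice for the illustration: by default tm ignores terms shorter than three characters, so the word-length limit is relaxed to keep "i" and "to".

```r
library(tm)

docs   <- c("I love Data analysis", "I love to create data models")
corpus <- VCorpus(VectorSource(docs))

# tolower = TRUE counts "Data" and "data" as the same term.
# wordLengths = c(1, Inf) keeps one- and two-letter words in the matrix.
tdm <- TermDocumentMatrix(
  corpus,
  control = list(tolower = TRUE, wordLengths = c(1, Inf))
)

# Rows are terms, columns are documents, cells are term frequencies.
m <- as.matrix(tdm)
m
```

On real speeches the default word-length limit is usually fine, since very short words rarely help to tell documents apart.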
- Step IV: Feature Extraction & Labels for the model:
In this step, we extract the input feature words that are useful in distinguishing the documents and attach the corresponding classes as labels.
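One possible way to do this, sketched with tm and plyr: turn each contestant's term document matrix into a table with one row per document, add a label column, and stack the tables. The texts, the helper labelled_features and the column name label are all hypothetical.

```r
library(tm)
library(plyr)

# Placeholder texts (hypothetical) standing in for each contestant's speeches.
speeches <- list(
  obama  = c("education and energy for families", "health care for families"),
  romney = c("cut taxes for business", "business creates jobs")
)

# Hypothetical helper: one row per document, one column per term, plus a label.
labelled_features <- function(name, texts) {
  tdm <- TermDocumentMatrix(VCorpus(VectorSource(texts)))
  df  <- as.data.frame(t(as.matrix(tdm)), stringsAsFactors = FALSE)
  df$label <- name
  df
}

per_candidate <- Map(labelled_features, names(speeches), speeches)

# rbind.fill stacks data frames whose columns differ, filling the gaps with NA.
# A term that never occurs in a contestant's speeches has frequency zero.
features <- rbind.fill(per_candidate)
features[is.na(features)] <- 0
features
```

To keep only the more informative words, tm's removeSparseTerms() can be applied to each matrix before stacking; the right sparsity threshold depends on the data.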
- Step V: Train & test data preparation
In this step, we split the features & labels into training (70%) & test data (30%) before we feed them into our model.
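A 70/30 split can be sketched in base R as follows. The small feature table is invented for the example, and the fixed seed is only there to make the random split repeatable.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Hypothetical stand-in for the stacked feature table: 10 documents, 3 terms.
features <- data.frame(
  tax    = c(0, 1, 0, 2, 0, 1, 3, 0, 2, 1),
  health = c(2, 0, 1, 0, 3, 0, 0, 1, 0, 0),
  jobs   = c(1, 1, 0, 2, 1, 0, 1, 0, 1, 2),
  label  = rep(c("obama", "romney"), 5)
)

n         <- nrow(features)
train_idx <- sample(n, ceiling(0.7 * n))       # about 70% of the rows, picked at random
test_idx  <- setdiff(seq_len(n), train_idx)    # the remaining rows

# Separate the term columns (inputs) from the label column (output).
term_cols <- setdiff(names(features), "label")
train_x   <- features[train_idx, term_cols]
test_x    <- features[test_idx, term_cols]
train_y   <- features$label[train_idx]
test_y    <- features$label[test_idx]
```

The test rows are never shown to the model while it is built, which is what makes the later accuracy figure meaningful.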
- Step VI: Running the model: We create our model using the training data separated in the previous step. We use a KNN model, whose description can be found here.
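As an illustration of KNN on term counts, the sketch below uses knn() from the class package, which is my assumption about the implementation. The tiny training set and the choice k = 3 are invented for the example.

```r
library(class)  # assumption: knn() from the class package

# Hypothetical term-frequency features: counts of two terms per document.
train_x <- data.frame(
  tax    = c(3, 2, 4, 0, 1, 0),
  health = c(0, 1, 0, 3, 2, 4)
)
train_y <- factor(c("romney", "romney", "romney", "obama", "obama", "obama"))

# Two unseen documents to classify.
test_x <- data.frame(tax = c(4, 0), health = c(0, 3))

# knn() has no separate training step: each test row takes the majority
# label among its k nearest training rows (Euclidean distance).
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
pred
```

Here the first new document sits next to the tax-heavy rows and the second next to the health-heavy rows, so they are labelled accordingly.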
- Step VII: Test Model: Now that the model is created, we test its accuracy using the test data created in Step V.
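Accuracy can be read off a confusion matrix, sketched here in base R. The predicted and actual labels are made up for the illustration; in practice they would come from the model and from the test data.

```r
# Hypothetical predictions and true labels for six test documents.
actual    <- factor(c("obama", "obama", "romney", "romney", "obama", "romney"))
predicted <- factor(c("obama", "romney", "romney", "romney", "obama", "romney"),
                    levels = levels(actual))

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

# Accuracy is the share of documents on the diagonal (correctly classified),
# expressed as a percentage.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100
accuracy
```

The off-diagonal cells show which contestant's speeches get mistaken for the other's, which a single accuracy number hides.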
Find the complete code here.