Curse of Dimensionality:
One of the most common problems in data analytics applications such as recommendation engines and text analytics is high-dimensional, sparse data. We often face situations where we have a large set of features but relatively few data points, or where the feature vectors are very long. In such scenarios, fitting a model to the dataset results in a model with low predictive power. This situation is often termed the curse of dimensionality. In general, adding more data points or shrinking the feature space, the latter being known as dimensionality reduction, reduces the effects of the curse of dimensionality.
In this blog, we will discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.
Consider the following scenario:
The data we want to work with is in the form of an m x n matrix A, shown below, where A(i, j) represents the value of the i-th observation of the j-th variable.
Thus each of the n variables (columns) of the matrix corresponds to an m-dimensional vector, with one entry per observation (row). If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.
Mathematically speaking, PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
When applied, the algorithm linearly transforms the n-dimensional input space into a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n - k) dimensions. In effect, PCA allows us to discard the dimensions that carry the least variance.
Technically speaking, PCA uses an orthogonal projection to map a set of possibly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, i.e. it accounts for as much of the variability in the data as possible. Each succeeding component in turn has the highest possible variance subject to being orthogonal to the preceding components.
In the above image, u1 and u2 are principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest variance and is orthogonal to u1.
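To make this concrete, here is a minimal by-hand sketch, not from the original post, of how the principal components can be computed: center the data, form the covariance matrix, and take its eigenvectors (prcomp(), used in the next section, does essentially this for us). The matrix X below is a small made-up example.

# Minimal by-hand PCA sketch on made-up data (illustration only)
set.seed(1)
X  <- matrix(rnorm(100 * 3), ncol = 3)        # 100 observations, 3 variables
Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column
S  <- cov(Xc)                                 # covariance matrix
e  <- eigen(S)                                # eigenvectors are the principal directions
scores <- Xc %*% e$vectors                    # project the data onto the components
e$values / sum(e$values)                      # share of variance captured by each component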
PCA implementation in R:
For today’s post we use the crimtab dataset available in R. It contains data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales.
The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.
head(crimtab)
     142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
 9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
 9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
 9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
 9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
 9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
 9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
     182.88 185.42 187.96 190.5 193.04 195.58
 9.4      0      0      0     0      0      0
 9.5      0      0      0     0      0      0
 9.6      0      0      0     0      0      0
 9.7      0      0      0     0      0      0
 9.8      0      0      0     0      0      0
 9.9      0      0      0     0      0      0

dim(crimtab)
[1] 42 22

str(crimtab)
 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
 [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
[17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"
Let us use apply() on the crimtab dataset column-wise to calculate the variance and see how each variable varies.
apply(crimtab,2,var)
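To pick out the column with the largest variance programmatically rather than by eye, one option (our addition, not in the original post) is which.max():

which.max(apply(crimtab, 2, var))  # name and index of the column with the largest variance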
We observe that the column “165.1” has the maximum variance in the data. Let us now apply PCA using prcomp().
pca = prcomp(crimtab)
pca
Note: the resulting pca object from the above code contains the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the remaining components. The rotation contains the principal component loadings matrix, which describes the weight of each variable along each principal component.
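To put numbers on this, we can look at summary(pca), or compute the proportion of variance explained directly from the standard deviations; the snippet below is our addition and assumes the pca object created above.

summary(pca)                              # standard deviation, proportion of variance, cumulative proportion
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # share of the total variance per component
round(prop_var, 3)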
Let’s plot all the principal components and see how much variance is accounted for by each component.
par(mar = rep(2, 4))
plot(pca)
Clearly, the first principal component accounts for most of the information.
Let us interpret the results of the PCA using a biplot. A biplot shows the contribution of each variable along the two principal components.
# The code below flips the directions of the biplot; if we do not include the
# next two lines, the plot will be a mirror image of the one shown below.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)
The output of the preceding code is as follows:
In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component vectors.
From the plot, we can see that the first principal component vector, PC1, places more or less equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than with the 160.02 and 162.56 features.
The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.
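These statements about the weights can be checked directly against the loadings; the following line (our addition) prints the PC1 and PC2 loadings for the columns discussed above.

# Loadings of PC1 and PC2 for the height columns discussed in the text
pca$rotation[c("160.02", "162.56", "165.1", "167.64", "170.18"), 1:2]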
Complete Code for PCA implementation in R:
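The recap below simply collects the snippets shown above into a single runnable block (the sign flip is only needed so the biplot matches the orientation shown earlier).

# Load and inspect the dataset
data(crimtab)
head(crimtab)
dim(crimtab)
str(crimtab)
sum(crimtab)
colnames(crimtab)

# Variance of each column
apply(crimtab, 2, var)

# Run PCA
pca = prcomp(crimtab)
pca
summary(pca)

# Variance explained by each component
par(mar = rep(2, 4))
plot(pca)

# Biplot (signs flipped so the plot matches the one shown earlier)
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)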
So now that we understand how to run PCA and how to interpret the principal components, where do we go from here? How do we use the reduced-variable dataset? In our next post we shall answer these questions.