Dimensionality Reduction

Principal component analysis (PCA)

19 Feb, 2015

Why PCA?

Principal component analysis is a way to reduce a data set of numeric vectors to a lower dimensionality. Once the data are reduced to three or fewer dimensions, they can be visualized directly. This is analogous to lowering the rank of the data matrix: the data are re-expressed in terms of new features, each a linear combination of the original ones, chosen so that no new feature is linearly correlated with the others, and only the most informative of them are kept.

The algorithm

  1. From every element of the data matrix P (one row per point, one column per feature) we subtract the mean of its column. We name this new matrix P’. This is called mean-correction.
  2. We calculate the covariance matrix C of P’.
  3. We calculate the eigenvectors v1, v2, v3 and the corresponding eigenvalues e1, e2, e3 of C. They are numbered such that e1 ≥ e2 ≥ e3.
  4. We multiply every point by every eigenvector or, which is effectively the same, we multiply P’ by every eigenvector. This gives the same point set, but represented in a new coordinate system.
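The four steps can be sketched in base R as follows. This is only a minimal illustration: the matrix `P` is filled with made-up random values, and the names `P_centered`, `V`, `e` and `scores` are assumptions chosen for this sketch. The last lines compare the result with R's built-in `prcomp`, which should agree up to the sign of each component.

```r
# Minimal sketch of the four steps using base R only.
# P holds made-up data: rows are points, columns are features.
set.seed(1)
P <- cbind(x = rnorm(50), y = rnorm(50), z = rnorm(50))
P[, "y"] <- P[, "x"] + 0.1 * P[, "y"]  # make two features strongly related

# Step 1: mean-correction (subtract each column's mean)
P_centered <- sweep(P, 2, colMeans(P))

# Step 2: covariance matrix of the mean-corrected data
C <- cov(P_centered)

# Step 3: eigenvalues and eigenvectors of C.
# For a symmetric matrix, eigen() returns the eigenvalues in decreasing
# order and the eigenvectors as unit-length columns.
eig <- eigen(C, symmetric = TRUE)
e <- eig$values   # e[1] >= e[2] >= e[3]
V <- eig$vectors  # column k is the k-th eigenvector

# Step 4: express every point in the new coordinate system
scores <- P_centered %*% V

# Cross-check against prcomp(): the variances match the eigenvalues,
# and the coordinates match up to the sign of each column.
pca <- prcomp(P)
print(all.equal(e, pca$sdev^2))
print(all.equal(abs(unname(scores)), abs(unname(pca$x))))
```

The sign of an eigenvector is arbitrary, which is why the comparison uses absolute values.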

So P’ × v1 is the new x-coordinate for every point, P’ × v2 the new y-coordinate, and so on. In the illustration the eigenvectors are normalized to length 1 and shown in red, blue and green. In this context the eigenvectors are called “principal components”: v1 is the first principal component, v2 the second and v3 the third. The ordering by associated eigenvalue is important because the larger the eigenvalue, the more of the data's variance the component captures and the more important it is. In this case it might seem odd to call all three components “principal”, but when PCA is applied to a case with originally 1000 dimensions and accordingly many components, the first N components with comparatively large eigenvalues are of principal importance.

Let's look at a simple piece of R code for principal component analysis:

    library(caret)
    library(plyr)

    # trips is the data frame being analysed (not shown here).
    # Drop the columns excluded from the analysis; cor() and prcomp() need numeric input.
    numeric_trips <- subset(trips, select = -c(source, pilot, Risk_involved))

    #### for principal component analysis
    # Absolute pairwise correlations, ignoring each feature's correlation with itself
    correlation <- abs(cor(numeric_trips))
    diag(correlation) <- 0
    which(correlation > 0.3, arr.ind = TRUE)  # feature pairs with |correlation| > 0.3

    # Keep rows with at least one event, then center and scale every column
    start <- as.matrix(numeric_trips[trips$evt_cnt > 0, ])
    start <- scale(start, center = TRUE, scale = TRUE)

    pca_events <- prcomp(start)
    summary(pca_events)  # standard deviation and proportion of variance per component
    write.csv(pca_events$rotation, "pca_events.csv")  # loadings (the eigenvectors)
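The remark about keeping only the first N components can be made concrete with the eigenvalues themselves. The sketch below is one common way to pick N, not the only one: it keeps the smallest number of components whose eigenvalues add up to a chosen share of the total variance. The built-in `iris` measurements stand in for real data, and the 95% threshold and the names `explained`, `cumulative` and `n_keep` are assumptions made for illustration.

```r
# Sketch: choose how many components to keep from the eigenvalues.
# iris (built into R) is stand-in data; the 0.95 threshold is illustrative.
X <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                    # variance along each component
explained <- eigenvalues / sum(eigenvalues)  # share of total variance
cumulative <- cumsum(explained)

threshold <- 0.95
n_keep <- which(cumulative >= threshold)[1]  # first N reaching the threshold

# Reduced data set: the coordinates along the first n_keep components
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

print(round(cumulative, 3))
print(n_keep)
```

Plotting the first two columns of `reduced` against each other gives the kind of low-dimensional view described at the start of this post.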
