Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort, PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.
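A major benefit of PCA is dimensionality reduction. We can rank the newly computed "principal component" vectors by importance, keep the most important leading ones as needed, and drop the remaining dimensions. This reduces the dimensionality, which simplifies the model or compresses the data, while preserving as much of the information in the original data as possible.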
PCA is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data. The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. by reducing the number of dimensions, without much loss of information. This technique is used in image compression.
Method:
Step 1: Get some data. Each observation is an N-dimensional column vector with components x1...xN.
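Step 2: Subtract the mean. For each dimension, compute the average over all observations and subtract it from every value in that dimension, so that the adjusted data set has zero mean.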
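A minimal sketch of the mean subtraction in Python with NumPy. The data values are made up for illustration, and storing one observation per row is an assumption of this sketch, not something the tutorial prescribes.

```python
import numpy as np

# Made-up data: 5 observations (rows) of 2 dimensions (columns).
data = np.array([[2.0, 1.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [6.0, 5.0],
                 [9.0, 9.0]])

# One mean per dimension, taken over all observations.
mean = data.mean(axis=0)

# Broadcasting subtracts the per-dimension mean from every row.
centered = data - mean

print(mean)
print(centered.mean(axis=0))  # each dimension now averages to (numerically) zero
```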
Step 3: Calculate the covariance matrix.
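As a rough illustration, the covariance matrix can be computed directly from the mean-adjusted data and cross-checked against NumPy's `np.cov`. The data is made up, and the row-per-observation layout and the n - 1 (sample) normalisation are assumptions of this sketch.

```python
import numpy as np

# Made-up data: 5 observations (rows) of 2 dimensions (columns).
data = np.array([[2.0, 1.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [6.0, 5.0],
                 [9.0, 9.0]])
centered = data - data.mean(axis=0)
n = data.shape[0]

# Sample covariance: products of mean-adjusted values, summed over
# observations and divided by n - 1. The result is dims x dims.
cov = centered.T @ centered / (n - 1)

# np.cov treats rows as variables by default, hence rowvar=False.
reference = np.cov(data, rowvar=False)

print(cov)
print(np.allclose(cov, reference))
```

The diagonal holds the variance of each dimension; the off-diagonal entries hold the covariance between pairs of dimensions.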
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix. Taking the eigenvectors of the covariance matrix extracts lines that characterise the data (in the 2-dimensional case). The remaining steps transform the data so that it is expressed in terms of these lines.
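One possible way to carry out this step is `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix. The data below is made up, and the choice of `eigh` over a general eigensolver is this sketch's assumption.

```python
import numpy as np

# Made-up data: 5 observations (rows) of 2 dimensions (columns).
data = np.array([[2.0, 1.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [6.0, 5.0],
                 [9.0, 9.0]])
cov = np.cov(data, rowvar=False)

# eigh returns real eigenvalues in ascending order and unit-length
# eigenvectors as the COLUMNS of eigvecs.
eigvals, eigvecs = np.linalg.eigh(cov)

for i in range(len(eigvals)):
    v = eigvecs[:, i]
    # Defining property of an eigenpair: cov @ v equals eigenvalue * v.
    print(np.allclose(cov @ v, eigvals[i] * v))

# Eigenvectors of a symmetric matrix are mutually perpendicular.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))
```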
Step 5: Choosing components and forming a feature vector.
Here is where the notion of data compression and reduced dimensionality comes in. It turns out that the eigenvector with the highest eigenvalue is the principal component of the data set.
In general, once the eigenvectors of the covariance matrix are found, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You lose some information by doing so, but if the discarded eigenvalues are small, you do not lose much.
Next you need to form a feature vector, which is just a fancy name for a matrix of vectors. It is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors and forming a matrix with these eigenvectors in the columns.
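The ordering and selection described above might look like the following sketch. The random data, the number of kept components `k`, and the names `feature_vector` and `explained` are all illustrative assumptions.

```python
import numpy as np

# Made-up data: 200 observations of 3 dimensions with very different spreads.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending order, eigenvectors in columns

# Reorder from highest to lowest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Fraction of the total variance carried by each component.
explained = eigvals / eigvals.sum()

# Hypothetical choice: keep the k most significant components.
k = 2
feature_vector = eigvecs[:, :k]  # kept eigenvectors as columns, shape (3, k)

print(explained)
print(feature_vector.shape)
```

Looking at `explained` before fixing `k` is one common way to judge how much information the discarded components carry.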
Step 6: Deriving the new data set
Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the feature vector and multiply it on the left of the mean-adjusted data set, transposed. After transposing, the eigenvectors are in the rows of the first matrix, and each column of the second matrix is one data item.
This gives us the original data solely in terms of the vectors we chose.
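A small sketch of this final multiplication, again on made-up data. The variable names (`feature_vector`, `final_data`, `approx`) are illustrative, and the approximate reconstruction at the end is an addition of this sketch to show what is lost when components are dropped.

```python
import numpy as np

# Made-up data: 5 observations (rows) of 2 dimensions (columns).
data = np.array([[2.0, 1.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [6.0, 5.0],
                 [9.0, 9.0]])
mean = data.mean(axis=0)
centered = data - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Keep only the most significant eigenvector (as a column).
feature_vector = eigvecs[:, :1]                 # shape (2, 1)

# Transposed feature vector times transposed mean-adjusted data:
# rows are kept components, columns are data items.
final_data = feature_vector.T @ centered.T      # shape (1, 5)

# Mapping back gives an approximation of the original data.
approx = (feature_vector @ final_data).T + mean

# Keeping every eigenvector instead reproduces the data exactly.
exact = (eigvecs @ (eigvecs.T @ centered.T)).T + mean

print(final_data)
print(np.abs(data - approx).max())
print(np.allclose(exact, data))
```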
Reference: Lindsay I Smith. A Tutorial on Principal Components Analysis. February 26, 2002
Mathematical explanation:
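In principal component analysis, the covariance matrix Cx contains the correlation measures between all pairs of observed variables. More importantly, these measures reflect the degree of noise and redundancy in the data.
- The larger an element on the diagonal, the stronger the signal and the more important the variable; a small element suggests noise or a secondary variable.
- The magnitude of an off-diagonal element corresponds to the degree of redundancy between the corresponding pair of observed variables.
In general, the covariance matrix of the raw data is far from ideal: the signal-to-noise ratio is low and the variables are strongly correlated. The goal of PCA is to optimise the covariance matrix through a change of basis and so find the relevant "principal components". So how should it be optimised, and which properties of the matrix matter?
The principles behind principal component analysis and the optimisation of the covariance matrix are: 1) minimise redundancy between variables, which means making the off-diagonal elements of the covariance matrix as small as possible (ideally 0); 2) maximise the signal, which means making the diagonal elements of the covariance matrix as large as possible. The larger a diagonal element of the optimised matrix Cy, the larger the signal component, in other words, the more important the corresponding "principal component".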
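The effect of the change of basis can be checked numerically. In the sketch below, `X` is made-up correlated data with one variable per row (an assumed layout), `P` is the change-of-basis matrix whose rows are the eigenvectors of Cx, and `Cy` is the covariance matrix of the transformed data.

```python
import numpy as np

# Made-up correlated data: 3 variables (rows), 500 observations (columns).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 0.5, 0.0],
                   [0.3, 0.2, 0.1]])
X = mixing @ rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)
n = X.shape[1]

# Covariance of the raw data: large off-diagonal entries mean redundancy.
Cx = X @ X.T / (n - 1)
eigvals, E = np.linalg.eigh(Cx)      # columns of E are eigenvectors of Cx

# Change of basis: each row of P is one principal component.
P = E.T
Y = P @ X
Cy = Y @ Y.T / (n - 1)

off_diagonal = Cy - np.diag(np.diag(Cy))
print(np.abs(off_diagonal).max())    # numerically zero: redundancy removed
print(np.diag(Cy))                   # variances along the principal components
print(eigvals)                       # the same values as the diagonal of Cy
```

In this basis Cy is diagonal, and its diagonal entries are the eigenvalues of Cx, so ranking components by eigenvalue is the same as ranking them by variance.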
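The assumptions (and limitations) of PCA include:
1. Linearity. PCA's internal model is linear, so the relationships it can capture between principal components are also linear. Kernel PCA methods extend the original PCA technique with nonlinear transformations: the data is first transformed nonlinearly, guided by prior knowledge, and PCA is then applied. This widens the range of problems PCA can handle and allows prior constraints to be built in, which is why these methods are popular.
2. Mean and variance are sufficient statistics. Models in which the mean and variance fully describe the probability distribution are limited to exponential-type distributions (for example, the Gaussian). That is, if the distribution of the data under study is not Gaussian or exponential-type, PCA breaks down: variance and covariance no longer describe noise and redundancy well, and the diagonalised covariance matrix does not give a suitable result. In fact, the most fundamental equation for removing redundancy is P(x,y) = P(x)*P(y), where P(x) denotes the probability density function. Methods that remove redundancy based on this equation are called Independent Component Analysis (ICA). Fortunately, by the central limit theorem, much of the sampled data encountered in practice is approximately Gaussian, so PCA remains a stable and effective algorithm across most fields.
3. Directions with large variance are more important. PCA implicitly assumes that the data has a high signal-to-noise ratio, so the one-dimensional direction with the highest variance can be regarded as a principal component, while low-variance variation is regarded as noise. This is a consequence of what is effectively a low-pass filtering choice.
4. The principal components are orthogonal. PCA assumes that the principal component vectors are mutually orthogonal, which allows a range of efficient linear algebra tools to be used for the solution and greatly improves efficiency and the range of application.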
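To illustrate why assumption 2 matters, here is a small made-up example in which two variables have essentially zero covariance yet are clearly not independent, so P(x,y) = P(x)*P(y) fails even though the covariance matrix is already diagonal. The specific choice of y = x^2 is only for illustration.

```python
import numpy as np

# x is spread symmetrically around zero.
x = np.linspace(-1.0, 1.0, 201)

# y is completely determined by x, so the two are strongly dependent.
y = x ** 2

# Off-diagonal entry of the 2 x 2 covariance matrix of (x, y).
cov_xy = np.cov(x, y)[0, 1]

print(cov_xy)  # close to zero: covariance does not see the dependence
```

Because PCA only looks at variances and covariances, it would treat x and y here as having no redundancy at all. Methods based on statistical independence, such as ICA, target this kind of structure.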
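A simple derivation shows that PCA is closely related to the singular value decomposition (A = USV'), in which the columns of U and V are constrained to be orthonormal. For a mean-adjusted data matrix A with one variable per row, the columns of U are the eigenvectors that make up E (the eigenvector matrix) in PCA's eigenvalue decomposition, and the squared singular values in S are proportional to the eigenvalues. (With one observation per row, the roles of U and V are swapped.) So PCA does not necessarily require computing the eigenvalues of the covariance matrix: applying the singular value decomposition directly to the data matrix A also yields the eigenvector matrix, that is, the principal component vectors.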
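The equivalence of the two routes can be checked numerically with a sketch like the one below. The random data, the variables-in-rows layout, and the n - 1 normalisation are assumptions of this example.

```python
import numpy as np

# Made-up data: 3 variables (rows), 50 observations (columns).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 50))
A[1] += 0.8 * A[0]                        # introduce some correlation
A = A - A.mean(axis=1, keepdims=True)     # mean-adjust each variable
n = A.shape[1]

# Route 1: eigendecomposition of the covariance matrix.
C = A @ A.T / (n - 1)
eigvals, E = np.linalg.eigh(C)
eigvals, E = eigvals[::-1], E[:, ::-1]    # reorder to descending

# Route 2: SVD of the data matrix itself (singular values come out descending).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Squared singular values, scaled by n - 1, match the eigenvalues.
print(np.allclose(s ** 2 / (n - 1), eigvals))

# Eigenvectors are only defined up to sign, so compare |u . e| with 1.
print(np.allclose(np.abs(np.sum(U * E, axis=0)), 1.0))
```

Working from the SVD avoids forming the covariance matrix explicitly, which is one reason it is often preferred in practice.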
Reference: http://www.cad.zju.edu.cn/home/chenlu/pca.htm