Key Moments
Lecture 15 - PCA and ICA | Stanford CS229: Machine Learning Andrew Ng - Autumn 2018
PCA and ICA: theory, derivation, and practical use.
Key Insights
PCA identifies the principal axis of variance by maximizing the variance of projected data, which is equivalent to finding the top eigenvectors of the data's covariance matrix.
Before applying PCA, data should be centered (mean-subtracted) and typically standardized (variance-scaled) to ensure fair axis selection and to avoid domination by features with larger scales.
PCA can be extended to reduce to K dimensions by using the top K eigenvectors, enabling reconstruction in the original space and enabling a compressed yet informative representation.
PCA is most valuable for visualization and compression, but it is not a universal fix for all problems; it can be misapplied for overfitting reduction or outlier detection, where other methods may be more reliable.
ICA (independent components analysis) goes beyond PCA by seeking statistically independent sources rather than uncorrelated directions, with the cocktail party problem as a motivating example where ICA can separate overlapping audio signals.
INTRODUCTION TO PCA AND ICA
The lecture opens with a broad survey of unsupervised learning methods, focusing on PCA (principal components analysis) and ICA (independent components analysis). Andrew Ng emphasizes that PCA is a non-probabilistic method that aims to discover a low-dimensional subspace in which high-dimensional data approximately lies, enabling dimensionality reduction without modeling P(X) directly. In contrast, factor analysis is probabilistic and models the density of X via latent structure. The intuition behind PCA is illustrated with simple 2D examples (height measured in centimeters and inches) to show that data often lies near a low-dimensional subspace with noise in orthogonal directions. The talk also introduces ICA as a follow-up topic, highlighting its goal of finding independent sources rather than merely uncorrelated directions and foreshadowing a practical cocktail party problem as a motivating example for ICA. The overall aim is to equip students with both the theoretical foundations and practical considerations when applying PCA (and eventually ICA) to real data.
PCA DERIVATION: FROM PROJECTION TO VARIANCE
The core idea is to find a unit vector u that defines a line onto which to project the data X. The projection length of a data point x_i is u^T x_i, and the goal is to maximize the sum of squared projections across all training examples. This leads to maximizing (1/m) ∑ (u^T x_i)^2 subject to ||u|| = 1. After centering the data (subtracting the mean) and standardizing its scale, the problem can be rewritten in terms of the covariance matrix Σ of the data: maximize u^T Σ u subject to ||u|| = 1. The solution is the principal eigenvector of Σ, i.e., the eigenvector corresponding to the largest eigenvalue. This eigenvector defines the first principal component. The same framework extends to multiple components by taking u_1, u_2, ..., u_k as the top-k eigenvectors, which form an orthogonal basis for a k-dimensional subspace.
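The derivation above can be checked in a few lines of NumPy (a hypothetical illustration on synthetic data, not code from the lecture): center the data, form the covariance matrix, and take the eigenvector with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with one dominant direction of variance (hypothetical example)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)           # center: subtract the mean
Sigma = Xc.T @ Xc / len(Xc)       # covariance matrix (1/m) * X^T X

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
u = eigvecs[:, -1]                # top eigenvector = first principal component

# The variance of the projection, u^T Sigma u, equals the largest eigenvalue
projected_variance = u @ Sigma @ u
```

Any other unit vector v gives v^T Σ v ≤ λ_max, which is exactly the constrained maximization the lecture describes.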
COVARIANCE, EIGENVALUES, AND THE K-DIMENSIONAL SUBSPACE
A concise mathematical story emerges: after centering (and often standardizing) the data, the covariance matrix Σ captures how features co-vary. For a rank-1 projection, the PCA objective leads to the leading eigenvector (and eigenvalue). For higher-dimensional reductions, the top-k eigenvectors form an orthogonal basis spanning the subspace that captures the most variance. The data are then projected onto this subspace to obtain reduced coordinates y_i = [u_1^T x_i, ..., u_k^T x_i]. If needed, one can reconstruct an approximation in the original space via x_i ≈ ∑_{j=1}^k y_{i,j} u_j. The lecture also emphasizes how the mean-centering and normalization steps interact with the covariance-based derivation and why the eigenvectors are orthogonal. In practice, the eigenvectors form an orthonormal basis for the subspace of interest, and their corresponding eigenvalues quantify how much variance each direction explains.
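The projection and reconstruction steps can be sketched as follows (a minimal NumPy example on synthetic data, assuming the data approximately lies in a 2-D subspace of a 5-D space):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 5-D data that mostly lies in a 2-D subspace, plus small noise
Z = rng.normal(size=(300, 2))
B = rng.normal(size=(2, 5))
X = Z @ B + 0.01 * rng.normal(size=(300, 5))

Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)
_, eigvecs = np.linalg.eigh(Sigma)

k = 2
U = eigvecs[:, ::-1][:, :k]   # top-k eigenvectors as columns (descending eigenvalue)
Y = Xc @ U                    # reduced coordinates: y_i = [u_1^T x_i, ..., u_k^T x_i]
X_hat = Y @ U.T               # reconstruction: x_i ≈ sum_j y_{i,j} u_j

rel_err = np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc)
```

Because the noise is small, two components reconstruct the centered data almost exactly; the relative error measures how much variance the discarded directions carried.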
PCA IN PRACTICE: CHOOSING K AND MAPPING BACK
Practical PCA involves selecting the number of components K to retain. A common heuristic is to keep the components with the largest eigenvalues such that the ratio (λ_1 + ... + λ_K) / (λ_1 + ... + λ_N) exceeds a chosen threshold (often 90–99% of the total variance). The practical workflow is: center and scale the training data, compute its covariance matrix Σ, extract eigenvectors and eigenvalues, choose K, form the projection matrix U = [u_1, ..., u_K], and transform X to Y = XU. To map back from Y to an approximation of X (for interpretation or reconstruction), use X̂ = YU^T. The lecture stresses that in real-world pipelines, the exact eigenvectors matter less than the subspace they span; small perturbations in the data may rotate individual eigenvectors but typically preserve the subspace spanned by the top K directions. This insight is crucial when applying PCA across training and test splits: the projection basis should be learned from training data and applied consistently to test data.
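The retained-variance heuristic for choosing K can be written directly from the eigenvalues (a hypothetical example; the feature scales below are made up to give a clear elbow):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: three large-variance features, seven small-variance ones
scales = np.array([5, 4, 3, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1])
X = rng.normal(size=(400, 10)) * scales

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]   # sort descending

threshold = 0.95
ratios = np.cumsum(eigvals) / eigvals.sum()               # (λ_1+...+λ_K) / (λ_1+...+λ_N)
K = int(np.searchsorted(ratios, threshold) + 1)           # smallest K retaining >= 95%
```

With these scales the top three components carry nearly all the variance, so the 95% criterion selects K = 3.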
PCA AS A PRACTICAL TOOL: VISUALIZATION, COMPRESSION, AND WARNINGS
Ng highlights several pragmatic uses of PCA alongside important caveats. Visualization is a primary application: high-dimensional data can be projected into 2D or 3D for human interpretation, such as observing how neural or behavioral data clusters or evolves over time. PCA can also serve as a compression tool to speed up downstream learning algorithms by reducing dimensionality before training, thereby saving memory bandwidth and computation without substantial loss of information. However, there are cautions: PCA is not a universal remedy for overfitting or outlier detection, and it should not be used blindly to replace regularization or robust methods. Some tasks—like certain face-recognition approaches—historically used PCA (eigenfaces), but Ng cautions against overinterpreting individual eigenvectors; what remains meaningful is often the subspace spanned by the top components rather than a single eigenvector. Finally, practitioners should compare performance with and without PCA, ensuring it provides tangible benefits for the specific problem at hand.
PCA PREPROCESSING, K SELECTION, AND IMPLEMENTATION NOTES
Several practical notes are offered. Preprocessing—subtracting the mean and scaling features to unit variance—ensures that PCA captures genuine structure rather than scale-related artifacts and prevents a feature with a large scale from dominating the principal direction. In supervised workflows, the standard approach is to fit the PCA on the training data and apply the learned projection to both training and test data, keeping the projection fixed to avoid data leakage. Choosing K is often guided by the retained variance criterion, but one should also consider domain-specific needs (e.g., visualization vs. accuracy) and the computational budget. The orthogonality of eigenvectors is a key property that ensures the projected axes are uncorrelated, though individual eigenvectors can be sensitive to small data changes; what is robust is the subspace they span. The speaker also notes that many PCA implementations return eigenvectors sorted by eigenvalue, making K selection straightforward, and that whitening is a related but separate step sometimes used to standardize the data further before downstream processing.
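The fit-on-train, apply-everywhere discipline described above can be sketched as a small pair of helper functions (a hypothetical illustration; the function names `fit_pca` and `apply_pca` are my own, not from the lecture):

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn centering, scaling, and a k-dim projection from training data only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    Xs = (X_train - mu) / sd                      # center and scale to unit variance
    _, eigvecs = np.linalg.eigh(Xs.T @ Xs / len(Xs))
    U = eigvecs[:, ::-1][:, :k]                   # top-k eigenvectors, descending order
    return mu, sd, U

def apply_pca(X, mu, sd, U):
    """Apply the fixed training-set transform to new data (no leakage)."""
    return ((X - mu) / sd) @ U

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(200, 6)), rng.normal(size=(50, 6))
mu, sd, U = fit_pca(X_train, k=2)
Y_train = apply_pca(X_train, mu, sd, U)
Y_test = apply_pca(X_test, mu, sd, U)   # same mean, scale, and basis as training
```

The key point is that `mu`, `sd`, and `U` are estimated once from the training split and then held fixed for test data, mirroring the no-leakage rule the lecture emphasizes.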
ICA: INDEPENDENCE, THE COCKTAIL PARTY PROBLEM, AND A PREVIEW
The talk then pivots to ICA, previewing a second unsupervised learning paradigm that seeks statistically independent sources rather than merely uncorrelated directions. The cocktail party problem serves as the motivating example: recordings from multiple microphones of two or more overlapping speakers can be separated, provided the sources are independent and the mixing is linear. The data model is x^(i) = A s^(i), where A is the mixing matrix and s^(i) contains the source signals at time i. The goal is to recover an unmixing matrix W such that ŝ = Wx recovers the original sources (up to permutation and sign). The lecture provides intuition: by examining the joint distribution of the observed mixtures, ICA transforms the data into statistically independent components, which are easier to interpret and separate. Ng also points out that ICA is more sensitive to its independence assumptions and can be demonstrated on real audio data, though the method has inherent ambiguities (ordering and sign). The session closes with an encouragement to explore ICA further in the upcoming problem set, where a multi-speaker cocktail party problem will be implemented.
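The mixing model itself is easy to sanity-check in code. The sketch below (a hypothetical illustration, not an ICA estimator) builds two independent sources, mixes them with a known matrix A, and verifies that the ideal unmixing W = A⁻¹ recovers them; a real ICA algorithm such as FastICA must estimate W from the mixtures X alone, and only up to permutation and sign:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 1000)
# Two hypothetical independent, non-Gaussian sources (stand-ins for two speakers)
S = np.vstack([np.sign(np.sin(7 * 2 * np.pi * t)),   # square-ish wave
               rng.laplace(size=1000)])              # spiky, heavy-tailed signal

A = np.array([[1.0, 0.6],    # mixing matrix: each "microphone" hears both sources
              [0.4, 1.0]])
X = A @ S                    # observed mixtures: x(i) = A s(i)

# Sanity check of the model: the ideal unmixing matrix inverts the mixing exactly.
# A true ICA algorithm would estimate W from X alone, without knowing A.
W = np.linalg.inv(A)
S_hat = W @ X
```

Note the non-Gaussianity of the sources: ICA relies on it, since Gaussian sources can only be identified up to rotation.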
Common Questions
How does PCA differ from Factor Analysis? PCA is a non-probabilistic method that seeks directions of maximum variance to reduce dimensionality. Factor Analysis is a probabilistic model that explains observed data as arising from latent factors plus noise. In practice, PCA is often used for visualization and compression, while Factor Analysis is used when you want a probabilistic interpretation of the latent structure.