Key Moments
Lecture 15 - PCA and ICA | Stanford CS229: Machine Learning Andrew Ng - Autumn 2018
PCA and ICA: theory, derivation, and practical use.
Key Insights
PCA identifies the principal axis of variance by maximizing the variance of projected data, which is equivalent to finding the top eigenvectors of the data's covariance matrix.
Before applying PCA, data should be centered (mean-subtracted) and typically standardized (variance-scaled) to ensure fair axis selection and to avoid domination by features with larger scales.
PCA can be extended to reduce to K dimensions by using the top K eigenvectors, enabling reconstruction in the original space and enabling a compressed yet informative representation.
PCA is most valuable for visualization and compression, but it is not a universal fix for all problems; it can be misapplied for overfitting reduction or outlier detection, where other methods may be more reliable.
ICA (independent components analysis) goes beyond PCA by seeking statistically independent sources rather than uncorrelated directions, with the cocktail party problem as a motivating example where ICA can separate overlapping audio signals.
INTRODUCTION TO PCA AND ICA
The lecture opens with a broad survey of unsupervised learning methods, focusing on PCA (principal components analysis) and ICA (independent components analysis). Andrew Ng emphasizes that PCA is a non-probabilistic method that aims to discover a low-dimensional subspace in which high-dimensional data approximately lies, enabling dimensionality reduction without modeling P(X) directly. In contrast, factor analysis is probabilistic and models the density of X via latent structure. The intuition behind PCA is illustrated with simple 2D examples (height measured in centimeters and inches) to show that data often lies near a low-dimensional subspace with noise in orthogonal directions. The talk also introduces ICA as a follow-up topic, highlighting its goal of finding independent sources rather than merely uncorrelated directions and foreshadowing a practical cocktail party problem as a motivating example for ICA. The overall aim is to equip students with both the theoretical foundations and practical considerations when applying PCA (and eventually ICA) to real data.
PCA DERIVATION: FROM PROJECTION TO VARIANCE
The core idea is to find a unit vector u that defines a line onto which to project the data X. The projection length of a data point x_i is u^T x_i, and the goal is to maximize the sum of squared projections across all training examples. This leads to maximizing (1/m) ∑ (u^T x_i)^2 subject to ||u|| = 1. After centering the data (subtracting the mean) and standardizing its scale, the problem can be rewritten in terms of the covariance matrix Σ of the data: maximize u^T Σ u subject to ||u|| = 1. The solution is the principal eigenvector of Σ, i.e., the eigenvector corresponding to the largest eigenvalue. This eigenvector defines the first principal component. The same framework extends to multiple components by taking u_1, u_2, ..., u_k as the top-k eigenvectors, which form an orthogonal basis for a k-dimensional subspace.
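The derivation above can be checked in a few lines of NumPy (a hypothetical illustration on synthetic data, not code from the lecture): center the data, form the covariance matrix, and take the eigenvector with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with one dominant direction of variance (hypothetical example)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)           # center: subtract the mean
Sigma = Xc.T @ Xc / len(Xc)       # covariance matrix (1/m) * X^T X

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
u = eigvecs[:, -1]                # top eigenvector = first principal component

# The variance of the projection, u^T Sigma u, equals the largest eigenvalue
projected_variance = u @ Sigma @ u
```

Any other unit vector v gives v^T Σ v ≤ λ_max, which is exactly the constrained maximization the lecture describes.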
COVARIANCE, EIGENVALUES, AND THE K-DIMENSIONAL SUBSPACE
A concise mathematical story emerges: after centering (and often standardizing) the data, the covariance matrix Σ captures how features co-vary. For a rank-1 projection, the PCA objective leads to the leading eigenvector (and eigenvalue). For higher-dimensional reductions, the top-k eigenvectors form an orthogonal basis spanning the subspace that captures the most variance. The data are then projected onto this subspace to obtain reduced coordinates y_i = [u_1^T x_i, ..., u_k^T x_i]. If needed, one can reconstruct an approximation in the original space via x_i ≈ ∑_{j=1}^k y_{i,j} u_j. The lecture also emphasizes how the mean-centering and normalization steps interact with the covariance-based derivation and why the eigenvectors are orthogonal. In practice, the eigenvectors form an orthonormal basis for the subspace of interest, and their corresponding eigenvalues quantify how much variance each direction explains.
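The projection and reconstruction steps can be sketched as follows (a minimal NumPy example on synthetic data, assuming the data approximately lies in a 2-D subspace of a 5-D space):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 5-D data that mostly lies in a 2-D subspace, plus small noise
Z = rng.normal(size=(300, 2))
B = rng.normal(size=(2, 5))
X = Z @ B + 0.01 * rng.normal(size=(300, 5))

Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)
_, eigvecs = np.linalg.eigh(Sigma)

k = 2
U = eigvecs[:, ::-1][:, :k]   # top-k eigenvectors as columns (descending eigenvalue)
Y = Xc @ U                    # reduced coordinates: y_i = [u_1^T x_i, ..., u_k^T x_i]
X_hat = Y @ U.T               # reconstruction: x_i ≈ sum_j y_{i,j} u_j

rel_err = np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc)
```

Because the noise is small, two components reconstruct the centered data almost exactly; the relative error measures how much variance the discarded directions carried.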
PCA IN PRACTICE: CHOOSING K AND MAPPING BACK
Practical PCA involves selecting the number of components K to retain. A common heuristic is to keep the components with the largest eigenvalues such that the ratio (λ_1 + ... + λ_K) / (λ_1 + ... + λ_N) exceeds a chosen threshold (often 90–99% of the total variance). The practical workflow is: center and scale the training data, compute its covariance matrix Σ, extract eigenvectors and eigenvalues, choose K, form the projection matrix U = [u_1, ..., u_K], and transform X to Y = XU. To map back from Y to an approximation of X (for interpretation or reconstruction), use X̂ = YU^T. The lecture stresses that in real-world pipelines, the exact eigenvectors matter less than the subspace they span; small perturbations in the data may rotate individual eigenvectors but typically preserve the subspace spanned by the top K directions. This insight is crucial when applying PCA across training and test splits: the projection basis should be learned from training data and applied consistently to test data.
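The retained-variance heuristic for choosing K can be written directly from the eigenvalues (a hypothetical example; the feature scales below are made up to give a clear elbow):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: three large-variance features, seven small-variance ones
scales = np.array([5, 4, 3, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1])
X = rng.normal(size=(400, 10)) * scales

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]   # sort descending

threshold = 0.95
ratios = np.cumsum(eigvals) / eigvals.sum()               # (λ_1+...+λ_K) / (λ_1+...+λ_N)
K = int(np.searchsorted(ratios, threshold) + 1)           # smallest K retaining >= 95%
```

With these scales the top three components carry nearly all the variance, so the 95% criterion selects K = 3.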
PCA AS A PRACTICAL TOOL: VISUALIZATION, COMPRESSION, AND WARNINGS
Ng highlights several pragmatic uses of PCA alongside important caveats. Visualization is a primary application: high-dimensional data can be projected into 2D or 3D for human interpretation, such as observing how neural or behavioral data clusters or evolves over time. PCA can also serve as a compression tool to speed up downstream learning algorithms by reducing dimensionality before training, thereby saving memory bandwidth and computation without substantial loss of information. However, there are cautions: PCA is not a universal remedy for overfitting or outlier detection, and it should not be used blindly to replace regularization or robust methods. Some tasks—like certain face-recognition approaches—historically used PCA (eigenfaces), but Ng cautions against overinterpreting individual eigenvectors; what remains meaningful is often the subspace spanned by the top components rather than a single eigenvector. Finally, practitioners should compare performance with and without PCA, ensuring it provides tangible benefits for the specific problem at hand.
PCA PREPROCESSING, K SELECTION, AND IMPLEMENTATION NOTES
Several practical notes are offered. Preprocessing—subtracting the mean and scaling features to unit variance—ensures that PCA captures genuine structure rather than scale-related artifacts and prevents a feature with a large scale from dominating the principal direction. In supervised workflows, the standard approach is to fit the PCA on the training data and apply the learned projection to both training and test data, keeping the projection fixed to avoid data leakage. Choosing K is often guided by the retained variance criterion, but one should also consider domain-specific needs (e.g., visualization vs. accuracy) and the computational budget. The orthogonality of eigenvectors is a key property that ensures the projected axes are uncorrelated, though individual eigenvectors can be sensitive to small data changes; what is robust is the subspace they span. The speaker also notes that many PCA implementations return eigenvectors sorted by eigenvalue, making K selection straightforward, and that whitening is a related but separate step sometimes used to standardize the data further before downstream processing.
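The fit-on-train, apply-everywhere discipline described above can be sketched as a small pair of helper functions (a hypothetical illustration; the function names `fit_pca` and `apply_pca` are my own, not from the lecture):

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn centering, scaling, and a k-dim projection from training data only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    Xs = (X_train - mu) / sd                      # center and scale to unit variance
    _, eigvecs = np.linalg.eigh(Xs.T @ Xs / len(Xs))
    U = eigvecs[:, ::-1][:, :k]                   # top-k eigenvectors, descending order
    return mu, sd, U

def apply_pca(X, mu, sd, U):
    """Apply the fixed training-set transform to new data (no leakage)."""
    return ((X - mu) / sd) @ U

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(200, 6)), rng.normal(size=(50, 6))
mu, sd, U = fit_pca(X_train, k=2)
Y_train = apply_pca(X_train, mu, sd, U)
Y_test = apply_pca(X_test, mu, sd, U)   # same mean, scale, and basis as training
```

The key point is that `mu`, `sd`, and `U` are estimated once from the training split and then held fixed for test data, mirroring the no-leakage rule the lecture emphasizes.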
ICA: INDEPENDENCE, THE COCKTAIL PARTY PROBLEM, AND A PREVIEW
The talk then pivots to ICA, previewing a second unsupervised learning paradigm that seeks statistically independent sources rather than merely uncorrelated directions. The cocktail party problem serves as the motivating example: recordings from multiple microphones of two or more overlapping speakers can be separated, provided the sources are independent and the mixing is linear. The data model is x^(i) = A s^(i), where A is the mixing matrix and s^(i) contains the source signals at time i. The goal is to recover an unmixing matrix W such that ŝ = Wx recovers the original sources (up to permutation and sign). The lecture provides intuition: by examining the joint distribution of the observed mixtures, ICA transforms the data into statistically independent components, which are easier to interpret and separate. Ng also points out that ICA is more sensitive to its independence assumptions and can be demonstrated on real audio data, though the method has inherent ambiguities (ordering and sign). The session closes with an encouragement to explore ICA further in the upcoming problem set, where a multi-speaker cocktail party problem will be implemented.
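The mixing model itself is easy to sanity-check in code. The sketch below (a hypothetical illustration, not an ICA estimator) builds two independent sources, mixes them with a known matrix A, and verifies that the ideal unmixing W = A⁻¹ recovers them; a real ICA algorithm such as FastICA must estimate W from the mixtures X alone, and only up to permutation and sign:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 1000)
# Two hypothetical independent, non-Gaussian sources (stand-ins for two speakers)
S = np.vstack([np.sign(np.sin(7 * 2 * np.pi * t)),   # square-ish wave
               rng.laplace(size=1000)])              # spiky, heavy-tailed signal

A = np.array([[1.0, 0.6],    # mixing matrix: each "microphone" hears both sources
              [0.4, 1.0]])
X = A @ S                    # observed mixtures: x(i) = A s(i)

# Sanity check of the model: the ideal unmixing matrix inverts the mixing exactly.
# A true ICA algorithm would estimate W from X alone, without knowing A.
W = np.linalg.inv(A)
S_hat = W @ X
```

Note the non-Gaussianity of the sources: ICA relies on it, since Gaussian sources can only be identified up to rotation.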
Common Questions
How does PCA differ from Factor Analysis? PCA is a non-probabilistic method that seeks directions of maximum variance to reduce dimensionality. Factor Analysis is a probabilistic model that explains observed data as arising from latent factors plus noise. In practice, PCA is often used for visualization and compression, while Factor Analysis is used when you want a probabilistic interpretation of the latent structure.