Case Study 2: Principal Component Analysis for Data Visualization

Overview

High-dimensional data is the norm in AI. Word embeddings commonly have hundreds of dimensions, image features have thousands, and genomic data can have tens of thousands. Yet we live in a 3D world and our screens are 2D. Principal Component Analysis (PCA) bridges this gap by finding the lower-dimensional subspace that captures the most variance in the data.

PCA is a direct application of the eigendecomposition we studied in Section 2.4. It works by computing the eigenvectors of the data's covariance matrix and projecting the data onto the top eigenvectors---the principal components. This case study builds PCA from scratch using NumPy, applies it to a real-world-style dataset, and visualizes the results.

The Mathematics of PCA

Step 1: Center the Data

Given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, first center it by subtracting the mean:

$$ \bar{\mathbf{X}} = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^T $$

where:

  • $\boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$ is the mean vector,
  • $\mathbf{1}$ is a column vector of ones,
  • centering ensures the covariance is measured about the mean, so the principal components describe directions of variation rather than the data's offset from the origin.

Step 2: Compute the Covariance Matrix

$$ \mathbf{C} = \frac{1}{n-1} \bar{\mathbf{X}}^T \bar{\mathbf{X}} $$

where:

  • $\mathbf{C} \in \mathbb{R}^{d \times d}$ is the covariance matrix,
  • $C_{ij}$ measures how features $i$ and $j$ co-vary,
  • $\mathbf{C}$ is symmetric and positive semidefinite.

Step 3: Eigendecomposition

$$ \mathbf{C} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T $$

where:

  • $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$,
  • each eigenvalue $\lambda_i$ represents the variance captured by the $i$-th principal component,
  • each column $\mathbf{q}_i$ of $\mathbf{Q}$ is the $i$-th principal component direction.

Step 4: Project

To reduce from $d$ dimensions to $k$, project onto the top $k$ eigenvectors:

$$ \mathbf{Z} = \bar{\mathbf{X}} \mathbf{Q}_k $$

where:

  • $\mathbf{Q}_k \in \mathbb{R}^{d \times k}$ contains the first $k$ columns of $\mathbf{Q}$,
  • $\mathbf{Z} \in \mathbb{R}^{n \times k}$ is the low-dimensional representation,
  • the fraction of variance retained is $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i}$.
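
The four steps above translate directly into a few lines of NumPy. The sketch below is a quick sanity check on a small toy matrix (X_demo is introduced here only for illustration): it confirms that the hand-rolled covariance matches np.cov and reports the fraction of variance retained.

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # 50 samples, 4 features (toy data)

# Step 1: center the data
X_bar = X_demo - X_demo.mean(axis=0)

# Step 2: covariance matrix (should match np.cov up to floating-point error)
C = X_bar.T @ X_bar / (X_demo.shape[0] - 1)
assert np.allclose(C, np.cov(X_demo, rowvar=False))

# Step 3: eigendecomposition of the symmetric matrix C (eigh returns ascending order)
eigenvalues, Q = np.linalg.eigh(C)
eigenvalues, Q = eigenvalues[::-1], Q[:, ::-1]  # reorder to descending

# Step 4: project onto the top k = 2 components and report retained variance
k = 2
Z_demo = X_bar @ Q[:, :k]
print(f"Variance retained: {eigenvalues[:k].sum() / eigenvalues.sum():.4f}")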

Step-by-Step Implementation

Step 1: Generate a Synthetic Dataset

We will create a dataset that mimics a realistic scenario: clusters of data points in high dimensions where the true structure lies in a lower-dimensional subspace.

import numpy as np
import matplotlib.pyplot as plt

def generate_clustered_data(
    n_samples: int = 300,
    n_features: int = 10,
    n_clusters: int = 3,
    n_informative: int = 3,
    random_state: int = 42
) -> tuple[np.ndarray, np.ndarray]:
    """Generate high-dimensional clustered data with low intrinsic dimension.

    The data consists of clusters that differ primarily along a few
    informative dimensions, with the remaining dimensions being noise.

    Args:
        n_samples: Total number of data points.
        n_features: Total number of features (dimensions).
        n_clusters: Number of clusters.
        n_informative: Number of dimensions that carry cluster information.
        random_state: Random seed for reproducibility.

    Returns:
        Tuple of (data matrix of shape (n_samples, n_features),
                  cluster labels of shape (n_samples,)).
    """
    rng = np.random.RandomState(random_state)
    samples_per_cluster = n_samples // n_clusters

    data_list = []
    labels_list = []

    # Generate cluster centers in the informative subspace
    centers = rng.randn(n_clusters, n_informative) * 5

    for i in range(n_clusters):
        # Samples around this cluster center
        cluster_data = rng.randn(samples_per_cluster, n_informative) * 0.8
        cluster_data += centers[i]

        # Add noise dimensions
        noise = rng.randn(samples_per_cluster, n_features - n_informative) * 0.3

        full_data = np.hstack([cluster_data, noise])
        data_list.append(full_data)
        labels_list.append(np.full(samples_per_cluster, i))

    data = np.vstack(data_list)
    labels = np.concatenate(labels_list)

    # Apply a random rotation to mix the informative and noise dimensions
    # This makes PCA necessary (the structure is not aligned with axes)
    rotation = np.linalg.qr(rng.randn(n_features, n_features))[0]
    data = data @ rotation

    return data, labels


X, labels = generate_clustered_data(
    n_samples=300,
    n_features=10,
    n_clusters=3,
    n_informative=3,
    random_state=42
)
print(f"Data shape: {X.shape}")
print(f"Labels: {np.unique(labels)}")
print(f"Samples per cluster: {np.bincount(labels)}")

Step 2: Implement PCA from Scratch

class PCAFromScratch:
    """Principal Component Analysis implemented from scratch using NumPy.

    This implementation uses the eigendecomposition of the covariance matrix.
    For very high-dimensional data where n < d, using the SVD of the centered
    data matrix would be more efficient.

    Attributes:
        n_components: Number of principal components to keep.
        components_: Principal component directions (eigenvectors).
        explained_variance_: Variance explained by each component (eigenvalues).
        explained_variance_ratio_: Fraction of total variance per component.
        mean_: Mean of the training data.
    """

    def __init__(self, n_components: int = 2) -> None:
        """Initialize PCA.

        Args:
            n_components: Number of components to retain.
        """
        self.n_components = n_components
        self.components_: np.ndarray | None = None
        self.explained_variance_: np.ndarray | None = None
        self.explained_variance_ratio_: np.ndarray | None = None
        self.mean_: np.ndarray | None = None

    def fit(self, X: np.ndarray) -> "PCAFromScratch":
        """Fit PCA on training data.

        Args:
            X: Data matrix of shape (n_samples, n_features).

        Returns:
            self, for method chaining.
        """
        n_samples, n_features = X.shape

        # Step 1: Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Compute covariance matrix
        cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)

        # Step 3: Eigendecomposition (use eigh for symmetric matrices)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # eigh returns eigenvalues in ascending order; reverse for descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Step 4: Select top k components
        self.components_ = eigenvectors[:, :self.n_components].T
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = (
            eigenvalues[:self.n_components] / eigenvalues.sum()
        )

        # Store all eigenvalues for analysis
        self._all_eigenvalues = eigenvalues

        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Project data onto the principal components.

        Args:
            X: Data matrix of shape (n_samples, n_features).

        Returns:
            Projected data of shape (n_samples, n_components).
        """
        X_centered = X - self.mean_
        return X_centered @ self.components_.T

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step.

        Args:
            X: Data matrix of shape (n_samples, n_features).

        Returns:
            Projected data of shape (n_samples, n_components).
        """
        self.fit(X)
        return self.transform(X)

    def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
        """Reconstruct data from the low-dimensional representation.

        Args:
            Z: Projected data of shape (n_samples, n_components).

        Returns:
            Reconstructed data of shape (n_samples, n_features).
        """
        return Z @ self.components_ + self.mean_
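
As an optional cross-check (assuming scikit-learn is installed, which the rest of this case study does not require), the scratch implementation can be compared against sklearn.decomposition.PCA. Eigenvector signs are arbitrary, so the projections are expected to agree only up to per-component sign flips.

# Optional: compare against scikit-learn's PCA (assumes scikit-learn is installed)
from sklearn.decomposition import PCA as SklearnPCA

Z_ours = PCAFromScratch(n_components=2).fit_transform(X)
Z_ref = SklearnPCA(n_components=2).fit_transform(X)

# Compare absolute values column-wise to ignore arbitrary sign flips
print(np.allclose(np.abs(Z_ours), np.abs(Z_ref)))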

Step 3: Apply PCA and Visualize

# Fit PCA with 2 components for visualization
pca = PCAFromScratch(n_components=2)
Z = pca.fit_transform(X)

print(f"Projected data shape: {Z.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")

# Visualize the 2D projection
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot of projected data
colors = ['#e41a1c', '#377eb8', '#4daf4a']
cluster_names = ['Cluster A', 'Cluster B', 'Cluster C']
for i in range(3):
    mask = labels == i
    axes[0].scatter(
        Z[mask, 0], Z[mask, 1],
        c=colors[i], label=cluster_names[i],
        alpha=0.6, edgecolors='white', linewidths=0.5, s=40
    )
axes[0].set_xlabel(f'PC 1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0].set_ylabel(f'PC 2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0].set_title('PCA: 10D Data Projected to 2D')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scree plot (explained variance by component)
all_ratios = pca._all_eigenvalues / pca._all_eigenvalues.sum()
cumulative = np.cumsum(all_ratios)
x_ticks = np.arange(1, len(all_ratios) + 1)

axes[1].bar(x_ticks, all_ratios, alpha=0.7, color='steelblue',
            label='Individual')
axes[1].plot(x_ticks, cumulative, 'ro-', markersize=6, label='Cumulative')
axes[1].axhline(y=0.95, color='gray', linestyle='--', alpha=0.5,
                label='95% threshold')
axes[1].set_xlabel('Principal Component')
axes[1].set_ylabel('Explained Variance Ratio')
axes[1].set_title('Scree Plot')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xticks(x_ticks)

plt.tight_layout()
plt.savefig('pca_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

Step 4: Choosing the Number of Components

def select_n_components(
    pca: PCAFromScratch,
    threshold: float = 0.95
) -> int:
    """Determine the number of components needed to explain a given
    fraction of the total variance.

    Args:
        pca: A fitted PCAFromScratch instance.
        threshold: Minimum fraction of variance to explain.

    Returns:
        Number of components needed.
    """
    all_ratios = pca._all_eigenvalues / pca._all_eigenvalues.sum()
    cumulative = np.cumsum(all_ratios)
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return n_components


for threshold in [0.80, 0.90, 0.95, 0.99]:
    k = select_n_components(pca, threshold)
    print(f"{threshold:.0%} variance: {k} components needed "
          f"(out of {X.shape[1]})")

Step 5: Reconstruction Error Analysis

def reconstruction_analysis(X: np.ndarray, max_k: int | None = None) -> None:
    """Analyze reconstruction error as a function of the number of components.

    Args:
        X: Data matrix of shape (n_samples, n_features).
        max_k: Maximum number of components to test.
    """
    if max_k is None:
        max_k = X.shape[1]

    errors = []
    for k in range(1, max_k + 1):
        pca_k = PCAFromScratch(n_components=k)
        Z_k = pca_k.fit_transform(X)
        X_reconstructed = pca_k.inverse_transform(Z_k)
        error = np.sqrt(np.mean((X - X_reconstructed) ** 2))
        errors.append(error)

    plt.figure(figsize=(8, 5))
    plt.plot(range(1, max_k + 1), errors, 'b-o', markersize=5)
    plt.xlabel('Number of Components (k)')
    plt.ylabel('RMSE Reconstruction Error')
    plt.title('PCA Reconstruction Error vs. Number of Components')
    plt.grid(True, alpha=0.3)
    plt.xticks(range(1, max_k + 1))
    plt.tight_layout()
    plt.savefig('pca_reconstruction_error.png', dpi=150, bbox_inches='tight')
    plt.show()

    for k, err in zip(range(1, max_k + 1), errors):
        print(f"k={k:2d}: RMSE = {err:.4f}")


reconstruction_analysis(X)

Step 6: Comparison with SVD-Based PCA

PCA can also be computed via the SVD of the centered data matrix: if $\bar{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, then $\mathbf{C} = \frac{1}{n-1}\mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T$, so the right singular vectors are the principal component directions and $\lambda_i = \sigma_i^2/(n-1)$. The SVD route is often more numerically stable, and more efficient when $n < d$.

def pca_via_svd(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Compute PCA using SVD of the centered data matrix.

    Args:
        X: Data matrix of shape (n_samples, n_features).
        n_components: Number of components to retain.

    Returns:
        Projected data of shape (n_samples, n_components).
    """
    # Center
    X_centered = X - X.mean(axis=0)

    # SVD of centered data
    U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # The principal components are the rows of Vt
    # The projected data is U[:, :n_components] scaled by the corresponding singular values
    Z = U[:, :n_components] * sigma[:n_components]

    return Z


# Compare eigendecomposition-based PCA with SVD-based PCA
Z_eigen = pca.fit_transform(X)
Z_svd = pca_via_svd(X, n_components=2)

# The signs of eigenvectors are arbitrary; align them
for i in range(2):
    if np.dot(Z_eigen[:, i], Z_svd[:, i]) < 0:
        Z_svd[:, i] *= -1

print(f"Max difference between methods: "
      f"{np.max(np.abs(Z_eigen - Z_svd)):.2e}")
print("The two methods produce equivalent results (up to sign flips).")

Step 7: Visualizing Principal Component Directions

def visualize_pc_directions(pca: PCAFromScratch, feature_names: list[str]
                            ) -> None:
    """Visualize the loadings of the top principal components.

    Args:
        pca: A fitted PCAFromScratch instance.
        feature_names: Names of the original features.
    """
    n_components = pca.components_.shape[0]
    n_features = pca.components_.shape[1]

    fig, axes = plt.subplots(1, n_components, figsize=(6 * n_components, 4))
    if n_components == 1:
        axes = [axes]

    for i in range(n_components):
        loadings = pca.components_[i]
        sorted_idx = np.argsort(np.abs(loadings))[::-1]

        colors_bar = ['steelblue' if l >= 0 else 'salmon'
                      for l in loadings[sorted_idx]]
        axes[i].barh(
            range(n_features),
            loadings[sorted_idx],
            color=colors_bar,
            alpha=0.8
        )
        axes[i].set_yticks(range(n_features))
        axes[i].set_yticklabels([feature_names[j] for j in sorted_idx])
        axes[i].set_xlabel('Loading')
        axes[i].set_title(
            f'PC {i+1} ({pca.explained_variance_ratio_[i]:.1%} variance)'
        )
        axes[i].grid(True, alpha=0.3, axis='x')
        axes[i].invert_yaxis()

    plt.tight_layout()
    plt.savefig('pca_loadings.png', dpi=150, bbox_inches='tight')
    plt.show()


feature_names = [f'Feature {i+1}' for i in range(X.shape[1])]
visualize_pc_directions(pca, feature_names)

Results and Discussion

Our analysis reveals several important patterns:

  1. Effective dimensionality. Although the data lives in 10 dimensions, the scree plot shows that 3 components explain over 90% of the variance, matching the 3 informative dimensions we used to generate the data. PCA successfully recovers the intrinsic dimensionality.

  2. Cluster separation. The 2D projection clearly separates the three clusters, demonstrating that PCA preserves the most important structure when reducing dimensions. This is because the directions of maximum variance align with the directions that separate the clusters.

  3. Eigendecomposition vs. SVD equivalence. Both methods produce the same result (up to sign flips of the components), confirming the mathematical relationship between PCA and SVD.

  4. Interpretability. The loading plots show which original features contribute most to each principal component, providing interpretable axes in the reduced space.

Limitations of PCA

While PCA is powerful, it has important limitations:

  • Linearity: PCA can only capture linear relationships. Non-linear dimensionality reduction methods like t-SNE, UMAP, or autoencoders can handle curved manifolds. We will explore these in later chapters.
  • Variance assumption: PCA assumes that the directions of maximum variance are the most informative. This is not always true---high variance might be due to noise.
  • Sensitivity to scaling: Features with large magnitudes dominate the principal components. Always standardize your data (zero mean, unit variance) before applying PCA unless the scale differences are meaningful; see the sketch after this list.
  • Orthogonality constraint: Principal components are forced to be orthogonal, which may not reflect the true structure of the data.
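
As a minimal sketch of the scaling point above, standardizing each feature is a one-line preprocessing step; running PCA on standardized data is equivalent to running it on the correlation matrix rather than the covariance matrix.

# Standardize features to zero mean and unit variance before PCA
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Z_standardized = PCAFromScratch(n_components=2).fit_transform(X_standardized)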

Connection to AI

PCA is used extensively in the AI pipeline:

  • Feature preprocessing: Reducing dimensionality before feeding data to classifiers reduces overfitting and speeds up training.
  • Visualization: Projecting high-dimensional embeddings (word2vec, BERT, etc.) to 2D or 3D for inspection.
  • Data compression: Storing lower-dimensional representations saves memory and computation.
  • Noise reduction: Discarding components with small eigenvalues removes noise.
  • Whitening: PCA followed by scaling by $\boldsymbol{\Lambda}^{-1/2}$ decorrelates features and normalizes their variance, which can accelerate convergence of gradient-based optimization; a minimal sketch follows this list.
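
A minimal whitening sketch, reusing the fitted pca object from above (the small epsilon is a common numerical-stability convention; its exact value here is an illustrative choice):

# PCA whitening: project, then rescale each component to unit variance
eps = 1e-8  # small constant for numerical stability (illustrative value)
Z_white = pca.transform(X) / np.sqrt(pca.explained_variance_ + eps)

# The whitened components are decorrelated with (approximately) unit variance
print(np.round(np.cov(Z_white, rowvar=False), 4))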

Key Takeaways

  1. PCA finds the orthogonal directions of maximum variance in the data.
  2. It is computed via the eigendecomposition of the covariance matrix (or equivalently, via the SVD of the centered data matrix).
  3. The eigenvalues tell you how much variance each component captures; the scree plot guides the choice of dimensionality.
  4. PCA is a linear method---for non-linear structure, you need more advanced techniques.
  5. Always center (and usually standardize) your data before applying PCA.