Principal Component Analysis

by: Brent Austgen, PhD
May 25, 2019

Recently, I worked on a project that necessitated data exploration and preprocessing. Principal component analysis (PCA) is technique I had heard about and seen used many times before; however, I personally had no experience with it in either theory or practice. So I got to digging.

Now the point of this post is not to give a full-fledged lesson on PCA. Many others have done that in better ways than I could manage. But to provide context, I’ll give a brief overview.

PCA performs an affine transformation on the standard Cartesian basis. In the transformed basis, the origin is centered in the data and most of the variance in the data is in the direction of the new axes. In two dimensions, imagine there are a few data points scattered on a plot. Make a plus sign with your fingers and let it overlap the x- and y-axis of that plot, then shift the intersection of your fingers to the middle of the data and rotate them to best align with the linear trend of the data. In higher dimensions, an affine transformation is still just a shift (i.e. translation) and a rotation, but the concept is difficult to visualize.

Suppose the original data contains has features \(x\), \(y\), and \(z\). Using PCA, those features are transformed to \(a, b, c\), each a linear combination of \(x, y, z\). An example of such a transformation is \(a = 3x - 2y + z\). Throughout this article, I will refer to “feature contributions”. In this example, the \(x\) feature makes a contribution of 3 to \(a\), \(y\) makes a contribution of -2, and \(z\) just 1. Though \(y\)’s contribution is -2, it is still a larger contribution in magnitude than the \(z\)’s 1.

Now let’s proceed to the motivation for this post. In my research, I came across this excellent article that describes PCA. I had just one critique about this article – near the end, there is a heatmap illustration that falls a bit short on illustrating the contributions of the original data features to the principal components. In a few lines of Python code, we can create a near reproduction.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import datasets

X = datasets.load_breast_cancer()['data']
X_features = datasets.load_breast_cancer()['feature_names']

pca = PCA().fit(X)
data = pca.inverse_transform(np.eye(X.shape[-1]))
plt.imshow(np.log(data), cmap='hot')
plt.ylabel('Principal Components')
plt.xlabel('Original Features')
plt.colorbar()
plt.show()

original

At first glance, nothing seems wrong with this image. However, several of the white squares in the heatmap are misleading – they do not actually indicate strong contributions as the colorbar suggests. Rather, those squares indicate NaN (not a number) placeholder values resulting from the attempt to take the log of a negative value, a result that is formally undefined. (Note the exponential scale of the colorbar, i.e. 6 means 1e6 and -8 means 1e-8 which is a very small, but positive number). The offending line is highlighted in the code snippet above.

This creates a conflict. Log scale is appropriate for this context, but the PCA inverse transform is likely to have at least one negative value. And it is virtually a certainty for the inverse transform to contain negative values when the data has been normalized (depending on the normalization method).

So are we out of luck? If you’ve made it this far, you’ll be happy to know the answer is a resounding no! In fact, matplotlib has features that specifically address this issue – matplotlib.colors.LogNorm and matplotlib.colors.SymLogNorm. Using these mechanisms, I developed the following function for properly displaying such a heatmap visualization.

Note: The code blocks in the remainder of this post all depend on imports, functions, etc. that have been defined in previous code blocks.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50


import matplotlib.colors as colors

def pca_viz(pca, X, X_features, sort=False, signed=True):

    pca.fit(X)
    n_rows, n_cols = len(X_features), len(pca.components_)

    inv = pca.inverse_transform(np.eye(n_cols))
    data = pd.DataFrame(inv, columns=X_features).transpose()

    # make zero-values just very, very small
    epsilon = 1e-16
    num_zeros = (data == 0).sum().sum()
    if num_zeros:
        data.loc[:, :] = np.where(data == 0, 1e-16, data)

    # sort by principal component contribution
    if sort:
        data['sum'] = data.abs().sum(axis=1)
        data.sort_values(by='sum', ascending=False, inplace=True)
        data.drop('sum', axis=1, inplace=True)

    # configure heatmap and log-scale colorbar
    scale_min = max(data.abs().min().min(), 1e-2)
    scale_max = data.abs().max().max()
    if signed:
        cmap = 'RdBu_r'
        scaler = colors.SymLogNorm(vmin=-scale_max, vmax=scale_max,
                                   linscale=1.0, linthresh=scale_min)
    else:
        data = data.abs()
        cmap='Reds'
        scaler = colors.LogNorm(vmin=scale_min, vmax=scale_max)

    fig, ax = plt.subplots(1, 1)
    im = ax.imshow(data, cmap=cmap)
    im.set_norm(scaler)
    fig.colorbar(im)

    # configure ticks and labels
    ylabels = data.index
    ax.set_yticklabels(ylabels, fontsize=6, rotation=0, va='center')
    ax.set_yticks(np.arange(n_rows))
    ax.set_ylabel('Original Features')
    xlabels = np.arange(n_cols)
    ax.set_xticklabels(xlabels, fontsize=6, rotation=90, ha='center')
    ax.set_xticks(np.arange(n_cols))
    ax.set_xlabel('Principal Components')

    return fig, ax

This method is robust in the case of negative values. Moreover, the option to toggle signed contributions (positive vs. negative), as opposed to absolute magnitude of contribution, is available through the signed argument. There is also a sort argument that will cause the original features to be ordered top-to-bottom by total magnitude of contribution. Let’s see it in action using first the original data as-is, then with a scaled version of that data (via sklearn.preprocessing.StandardScaler).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


from sklearn.preprocessing import StandardScaler

# visualize the original data
pca = PCA()
fig, ax = pca_viz(pca, X, X_features, signed=True)
plt.subplots_adjust(left=0.25, right=0.90, bottom=0.20, top=0.80)
plt.show()

# visualize the scaled data
X_ss = StandardScaler().fit_transform(X)
fig, ax = pca_viz(pca, X_ss, X_features, signed=True)
plt.subplots_adjust(left=0.25, right=0.90, bottom=0.20, top=0.80)
plt.show()

Note that my function produces a transposed version of the original illustration now with original features labeled on the y-axis and principal components enumerated on the x-axis. With the original data, you can see clearly in the following image that some features do in fact contribute negatively, though those contributions are marginal.

fixed_original

With the scaled (i.e. zero-mean feature) data, postive and negative contributions have similar frequency. It many data model applications, data normalization is standard practice. In such cases, the flawed method of illustration would have resulted in ~50% of the data being misrepresented.

fixed_scaled

This implementation is also suited to cases where PCA is instantiated with arguments, e.g. PCA(n_components=5), but that exercise is left up to you, as is the exercise of trying the other combinations of signed and sort arguments.