A Visual Introduction to Dimensionality Reduction with Isomap

"To deal with hyper-planes in a 14-dimensional space, visualize a 3D space and say 'fourteen' to yourself very loudly. Everyone does it." - Geoffrey Hinton

In many domains, ranging from computational imaging to finance, data comes in the form of high-dimensional signals that are challenging for humans to reason about directly, as our intuition is generally confined to two or three dimensions. Thankfully, while data may lie in very high-dimensional spaces, its intrinsic dimensionality is often much lower. This idea is known as the manifold hypothesis, and it is a core assumption of many machine learning techniques. The goal of dimensionality reduction is to compress high-dimensional data into lower-dimensional forms that preserve the relevant structure while being much easier for people to interpret.

A simple dataset to start our investigation of dimensionality reduction is a one-dimensional spiral embedded in a two-dimensional space (see Figure 1). How can we capture the intrinsic structure of our low-dimensional data (1D) embedded in a higher-dimensional (2D) space? Though simple, this setting lets us build an intuition for how dimensionality reduction works in much higher-dimensional settings.

Figure 1: A noisy spiral dataset illustrating a 1D manifold embedded in 2D. The color of each point represents its position along this intrinsic dimension.
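To make the setup concrete, here is a minimal sketch of how a dataset like this might be generated (the actual data behind the figures may differ); `t` plays the role of the intrinsic 1D coordinate, and the noise level is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intrinsic 1D coordinate (also used to color the points in the figures).
t = np.sort(rng.uniform(0, 4 * np.pi, size=400))

# Embed the 1D coordinate as a spiral in 2D, then add a little observation noise.
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
X += rng.normal(scale=0.3, size=X.shape)
```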

In this article, I'll be exploring Isomap, a classic non-linear dimensionality reduction technique that seeks to create a low-dimensional embedding of data that preserves its local similarity structure. Isomap builds upon the manifold hypothesis, forming a graph that captures the local structure of the data and projecting points in a way that preserves this structure. This pattern is central to many modern dimensionality reduction techniques like t-SNE and UMAP.

The Isomap algorithm has the following high level steps:

  1. First, you construct a graph between points that captures the local structure in the data.
  2. Next, you measure the "geodesic" distances between all pairs of points in this graph. These distances approximate the true manifold distances between points.
  3. Then you apply multidimensional scaling to project the high-dimensional data into a lower-dimensional embedding that preserves the distances from the graph.

Embedding Data with Multidimensional Scaling

A critical question in dimensionality reduction is: how can we project data into a lower-dimensional space while preserving its intrinsic structure? The purpose of dimensionality reduction is to make data easier to visualize and interpret, while ensuring that insights derived from the reduced data generalize to the original high-dimensional data. A key design consideration then becomes how to effectively capture a property of the data that we wish to preserve in the low-dimensional representation. One popular choice is to preserve the pairwise distances between points. Isomap leverages a technique called Multidimensional Scaling (MDS), which achieves this goal by taking a matrix of pairwise distances between points and finding a low-dimensional embedding that best preserves these distances.

Concretely, MDS takes some high-dimensional data $x_1, \dots, x_n \in \mathbb{R}^{d}$ and produces a lower-dimensional representation $y_1, \dots, y_n \in \mathbb{R}^{p}$ where the pairwise distances between points are preserved as well as possible. These pairwise distances are represented by a distance matrix $D \in \mathbb{R}^{n \times n}$ where $D_{ij}$ corresponds to some measure of distance between points $x_i$ and $x_j$. In its simplest form, called classical MDS, we use the Euclidean distance metric $d_{ij} = ||x_i - x_j||_2$. This choice of distance metric has serious consequences that we will discuss in later sections. MDS then seeks to find points $y_1, \dots, y_n \in \mathbb{R}^{p}$ such that the distances between these new points approximate the original distances: $||y_i - y_j||_2 \approx d_{ij}$.

Instead of directly operating on our pairwise distances $D$, classical MDS converts these distances into a set of similarity scores between points. This is done by double centering the squared distance matrix to produce a Gram matrix $B = -\frac{1}{2} H D^2 H$, where $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$ is the centering matrix. Here $\mathbf{1}$ is a vector of all ones of length $n$. The centering matrix $H$ centers the data by subtracting the mean from each point.

Given this Gram matrix, we can then perform an eigendecomposition $B = V \Lambda V^T$ to find the top $p$ eigenvectors corresponding to the largest eigenvalues. Here $V$ is the matrix of all eigenvectors and $\Lambda$ is the diagonal matrix of all eigenvalues. The low-dimensional coordinates $Y \in \mathbb{R}^{n \times p}$ of our points are exactly given by these eigenvectors scaled by the square root of their corresponding eigenvalues: $Y = V_p \Lambda_p^{1/2}$, where $V_p$ contains the top $p$ eigenvectors and $\Lambda_p$ contains the top $p$ eigenvalues. These coordinates are shown in Figure 2.
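Putting the last two steps together, a minimal NumPy sketch of classical MDS might look like the following (the function name and the handling of small negative eigenvalues are my own choices):

```python
import numpy as np

def classical_mds(D, p=2):
    """Embed points into p dimensions given an n x n matrix of pairwise distances D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - (1/n) 1 1^T
    B = -0.5 * H @ (D ** 2) @ H               # Gram matrix B = -1/2 H D^2 H

    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:p]       # indices of the top-p eigenvalues
    top_vals = np.maximum(eigvals[idx], 0)    # clip tiny negatives from numerical error

    return eigvecs[:, idx] * np.sqrt(top_vals)  # Y = V_p Lambda_p^{1/2}
```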

An interesting fact about classical MDS is that when using the Euclidean distance metric, the resulting low-dimensional coordinates are equivalent to those obtained from Principal Component Analysis (PCA). Foreshadowing future sections, this equivalence highlights some of the limitations of applying MDS with a Euclidean distance metric.
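As a quick sanity check of this equivalence, one could compare the output of the `classical_mds` sketch above (run on Euclidean distances) against PCA scores computed via an SVD; the columns should agree up to sign flips. The random data here is purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                 # arbitrary illustrative data

Y_mds = classical_mds(squareform(pdist(X)))   # classical MDS on Euclidean distances

Xc = X - X.mean(axis=0)                       # PCA scores via SVD of the centered data
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
Y_pca = U[:, :2] * S[:2]

print(np.allclose(np.abs(Y_mds), np.abs(Y_pca), atol=1e-6))  # True, up to sign flips
```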

Figure 2: Classical MDS with Euclidean distances projects data along the axes of maximum variance (i.e., the principal components).

Limitations of Classical MDS with Euclidean Distance

Unfortunately, naively applying classical MDS with Euclidean distance as the measure of distance between our points only works for data that has linear structure. We can see in Figure 2 that when using Euclidean distance as our metric for $d_{ij}$, MDS amounts to projecting the data onto the first $p$ principal components. While this is a powerful technique for many types of data, for our spiral dataset it fails to capture the intrinsic structure because it is restricted to linear axes in the original space.

A central assumption that Isomap and many other non-linear dimensionality reduction techniques take advantage of is that while Euclidean distance may not be a good global measure of similarity for data with manifold structure, it works fairly well locally. That is, while Euclidean distance may not be a good measure of similarity between points that are far apart, it works reasonably well for points that are close together (Figure 3)! The closest points to a given point in Euclidean distance are indeed close along the intrinsic data axis (represented by color); however, points that are a medium Euclidean distance away may be very far apart along the intrinsic axis.

Figure 3: Pairwise Euclidean distances drawn from a selected point to all others.

Capturing Local Similarity Structure with Epsilon Neighborhoods

One way to formalize the idea of locality is through neighborhoods. You can draw a circle (hypersphere) around a point, which we can call an epsilon neighborhood. This is the region about a point $x_i$ defined as $N_\epsilon(x_i) = \{ x_j : ||x_i - x_j||_2 < \epsilon \}$ for some small value of $\epsilon$. See a depiction of this in Figure 4. Notice that when $\epsilon$ is small, the points that fall within this circle are also close to the point along the data's intrinsic dimension. However, if we increase $\epsilon$, we start to include points that may be nearby in Euclidean distance but far apart along the intrinsic axis (denoted by color) of the data.
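As a small sketch (assuming the data points are stored as the rows of a NumPy array `X`), an epsilon neighborhood query is just a thresholded distance computation:

```python
import numpy as np

def epsilon_neighborhood(X, i, eps):
    """Indices of all points within Euclidean distance eps of point i (excluding i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)   # distances from point i to every point
    mask = dists < eps
    mask[i] = False                            # a point is not its own neighbor
    return np.flatnonzero(mask)
```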


Figure 4: The highlighted circle shows the epsilon neighborhood around a point. You can hover over or select another point to see its epsilon ball. Manipulate the slider for epsilon to see its impact on the ball.

Capturing the Structure of Data With a Graph

How can we embed data that is locally Euclidean but globally non-Euclidean?

Now that we have established that our data has local Euclidean structure, how can we take advantage of this to build a better dimensionality reduction technique? A powerful perspective in many dimensionality reduction techniques is to represent data as a graph, where the connections between points in our dataset capture their local similarity structure. This is central not just to Isomap, but to many powerful dimensionality reduction techniques like t-SNE and UMAP.

One interesting choice for constructing a graph that captures local similarity is to use these $\epsilon$ neighborhoods. If two points lie within each other's $\epsilon$ neighborhoods, we can connect them with an edge in the graph (Figure 5). We can represent this graph with an adjacency matrix $A$ where $A_{ij} = 1$ if points $i$ and $j$ are connected and $A_{ij} = 0$ otherwise. This way of constructing a graph relates to a deep concept in topology called persistent homology, which studies how the connectivity structure of data changes as we vary $\epsilon$; however, that is beyond the scope of this article.

Figure 5: Graph formed by connecting the points in each epsilon neighborhood.
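A minimal sketch of this construction, again assuming the data points are the rows of `X` (with Euclidean distance the "within each other's neighborhoods" condition is symmetric, so a single threshold suffices):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def epsilon_graph(X, eps):
    """Adjacency matrix A with A[i, j] = 1 when points i and j lie within eps of each other."""
    D = squareform(pdist(X))        # dense matrix of pairwise Euclidean distances
    A = (D < eps).astype(int)
    np.fill_diagonal(A, 0)          # no self-loops
    return A
```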

Another simple choice is to connect each point to its k nearest neighbors (Figure 6). This construction is known as a k-nearest neighbor graph.


This is the construction that Isomap uses to build its graph, and what we'll be using in this article.

Figure 6: k-Nearest Neighbor graph connecting each point to its local neighbors.
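In practice this graph can be built directly with scikit-learn's kneighbors_graph; a sketch (with an arbitrary choice of k = 10, and `X` assumed to hold the data points as rows) might look like:

```python
from sklearn.neighbors import kneighbors_graph

# Sparse adjacency matrix of the k-nearest neighbor graph (1 where an edge exists).
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")

# Symmetrize: connect i and j if either point is among the other's nearest neighbors.
A = A.maximum(A.T)
```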

Now that we have a graph that captures the local similarity structure of our data, how can we quantify a notion of distance between points? In the case of Isomap, the answer is exceptionally simple: we can use conventional algorithms like Dijkstra's to compute the shortest path between points in the graph and take these path lengths as our distance measure (Figure 7). We use the k-nearest neighbor graph constructed earlier, with edges weighted by the Euclidean distance between connected points.

Figure 7: Shortest-path traversal on the KNN graph illustrates geodesic distance.
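A sketch of this step using SciPy's shortest-path routine (the graph construction mirrors the kNN sketch above, but with edges weighted by Euclidean distance; the default k is an arbitrary choice):

```python
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_distances(X, k=10):
    """Approximate manifold distances as shortest-path lengths on a weighted kNN graph."""
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")  # edges weighted by Euclidean distance
    G = G.maximum(G.T)                                       # treat the graph as undirected

    # All-pairs shortest paths via Dijkstra; disconnected pairs come back as np.inf.
    return shortest_path(G, method="D", directed=False)
```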

The Final Isomap Algorithm!

Equipped with a principled method for computing the distances between points that respects the manifold structure of our data, we can now plug these distances into classical MDS to obtain our low-dimensional embedding! The full Isomap algorithm can be summarized in the following steps:

  1. Construct a k-nearest neighbor graph from the data, with edges weighted by Euclidean distance.
  2. Compute the shortest path distances between all pairs of points in the graph using Dijkstra's algorithm. This gives us a new distance matrix $D_{graph}$.
  3. Apply classical MDS to the graph distance matrix $D_{graph}$ to obtain low-dimensional coordinates that preserve these distances.

This procedure gives us a low-dimensional embedding that respects the local similarity structure of our data, allowing us to visualize the relationships between points in a way that is faithful to their intrinsic geometry.
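Putting the pieces together, a bare-bones sketch of the whole pipeline might look like this (it reuses the ideas from the earlier snippets; the function name and parameter defaults are my own choices, not a reference implementation):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, k=10, p=2):
    """Bare-bones Isomap: kNN graph -> geodesic distances -> classical MDS."""
    # 1. k-nearest neighbor graph with edges weighted by Euclidean distance.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = G.maximum(G.T)

    # 2. Geodesic (shortest-path) distances between all pairs of points.
    #    (Assumes the graph is connected; otherwise D contains np.inf.)
    D = shortest_path(G, method="D", directed=False)

    # 3. Classical MDS on the geodesic distance matrix.
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:p]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```

For real use, scikit-learn ships a mature implementation as sklearn.manifold.Isomap, which follows the same three steps.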

Limitations

It is worth noting that this algorithm has numerous limitations, which have motivated many of the more modern dimensionality reduction techniques like t-SNE and UMAP. First, the choice of k in the k-nearest neighbor graph can have a significant impact on the resulting embedding. Second, MDS can be computationally expensive for large datasets. Third, Isomap assumes that the data lies on a single connected manifold, which may not hold in practice.

Acknowledgements

I would like to thank Benjamin Hoover, Hamza Elhamdadi, Polo Chau, and Vivek Anand for their helpful feedback on this article.

Figure 8: Isomap embedding preserves manifold structure using graph geodesic distances.