Image Blog Methods for Multidimensional Scaling Part 1

November 26, 2018

Methods for Multidimensional Scaling Part 1: Overview

Algorithms & Functions

Multidimensional scaling can be used to uncover the so-called "perceived similarities" between data sets. And, for quantitative data analysts, multidimensional scaling can be applied to data across diverse industries for unique insights.

In part one of our multidimensional scaling blog series, we give an overview of multidimensional scaling, and the methods used for solving multidimensional scaling problems. In part two, we look at a generalized criterion function and end our series with a look at computational optimization in part three. But before we too far ahead of ourselves, what is multidimensional scaling?

What Is Multidimensional Scaling?
Methods for Multidimensional Scaling
Multidimensional Scaling Example
Further Reading

What Is Multidimensional Scaling?

Multidimensional scaling is concerned with models and techniques for locating objects in a multidimensional space based upon distance measurements between the objects.

Data in Multidimensional Scaling

The data in a multidimensional scaling (MDS) problem consists of one or more dissimilarity matrices, where a dissimilarity is a measure of distance between stimuli, and each matrix gives the dissimilarities between the stimuli considered.

Methods for Multidimensional Scaling

In the simplest problems, one matrix is used and the dissimilarities are symmetric. An example of such a matrix is the matrix of intercity mileages often found on road maps, where the distance between two cities is the dissimilarity measure. Often, more than one matrix is observed. For example, several different mapmakers' estimations of the distances between cities may be observed. Since each mapmaker may use different methods for measuring distance, different distance matrices will result. Regardless of the number of matrices observed, the MDS problem is to locate the stimuli (cities) in a multidimensional space of known dimension based upon the observed dissimilarities.

1. The Distance Measures

A dissimilarity matrix need not be symmetric. An example of multidimensional scaling data with asymmetric dissimilarity matrices is given in Table 1. The stimuli are seven stores and the observed dissimilarity is the rank of the distance of the column store from the row store. In other words, for each row, the column store with rank 1 is closest to the row store, the column store with rank 2 is second closest, etc. This matrix is clearly not symmetric since the dissimilarity measures $ d_{ij} \ne d_{ji} $. Moreover, because of the method used for data collection, comparison of ranks in row $ i $ with ranks in row $ j $ should not be made.

Historically, four types of dissimilarity data have most often been considered as input in multidimensional scaling problems:

Nominal data using categories for distance. (Distances corresponding to dissimilarities within the same category are assumed identical).
Ordinal data, as in the example above, using the rank of the distance.
Interval data, using distance plus a constant.
Ratio data, using distance.

Models involving ratio or interval data are called metric scaling models, while models with nominal or ordinal data are called non-metric scaling models. Distance need not be Euclidean. For example, $r^2_{pq} $ could be used as the distance measure between stimuli $p$ and $q$, where $ r^2_{pq} $ is the correlation coefficient between variables $p$ and $q$.

2. The Sampling/Measurement Method

Another common consideration in multidimensional scaling is the sampling/measurement scheme used in collecting the data. In the above example, because the rankings were made within each row, comparisons of ranks between rows are not meaningful. If instead a dissimilarity matrix is provided by each of two judges, then the dissimilarities between the two matrices cannot be compared unless it can be verified that the two judges used the same scale and judging criteria. In general, the sampling/measurement scheme used to obtain the data determines strata (or conditionality groups) within which dissimilarities can be compared, while comparison of dissimilarities between strata does not make sense.

Three sampling/measurement schemes are defined as follows:

If sampling/measurement is such that all dissimilarity measures can be compared, then the data come from a single stratum and are said to be unconditional data.
If only dissimilarity measures within a matrix can be compared, then each matrix is a stratum, and the data are said to be matrix conditional.
If sampling is such that only dissimilarity measures within a row of each dissimilarity matrix can be compared, then each row of each matrix is a stratum, and the data are said to be row conditional.

3. A Distance Model

Generally, the stimuli are located in an $\tau $-dimensional Euclidean space, $ \tau \ge 1$, in such a manner that the agreement between the observed dissimilarities (whether ordinal, ratio, etc.) and the predicted distances is in some sense optimal. In order to locate the stimuli, some model for the distance between stimuli must be specified.

In this example, the model is Euclidean and is given by:

$$ \delta_{ij} = \sqrt{\sum_{k=1}^{\tau} (X_{ik} - X_{jk})^2} $$

where $X_{ik}$ is the coordinate of the $i^{th}$ stimulus in the $k^{th}$ of $ \tau $ dimensions in the Euclidean space and the matrix $X$ is called the configuration matrix. For a given $X$, the model gives the distances between all stimuli.

Since the distance between stimuli is translation invariant, the location of the origin is arbitrary. In the following, the origin is assumed to be zero, so that $\sum_i X_{ik} = 0 $. Also, the Euclidean model, unlike other distance models, is rotation invariant (i.e., multiplying $ X $ by an orthogonal matrix $T, X = XT$ yields the same distance measures). This means that the configuration obtained is not unique rotationally. Usually, no attempt is made for the Euclidean model to obtain a unique solution with respect to rotation, although any of the orthogonal rotations in factor analysis could be used for this purpose.

4. A Criterion Function

In order to estimate parameters, a criterion function, usually called the stress function, may be minimized. In this example, the stress function is given as:

$$ q = \frac{\sum_{i=1}^n \sum_{j=1}^n (\tilde{\delta}_{ij} - \delta_{ij})^2}{ \sum_{i=1}^n \sum_{j=1}^n (\tilde{\delta}_{ij})^2} = \omega \sum\limits_{i=1}^n \sum\limits_{j=1}^n (\tilde{\delta}_{ij} - \delta_{ij})^2 $$

where $ n $ is the number of stimuli and $ \tilde{\delta} $ denotes the optimal dissimilarities, called the disparities.

In metric data, $q$ is optimized with respect to $ \delta $ only, whereas in non-metric data, q is optimized with respect to both $ \delta $ and the disparities $ \tilde{\delta}_{ij}$. Let $\hat{\delta}$ denote predicted values of $ \delta $. Disparities in non-metric data are found from the predicted distances $\hat{\delta}$, such that the rank requirements of ordinal data or the equivalence requirements of nominal data are satisfied and an optimal criterion function is obtained. With the stress $q$ above, ordinal data disparities are optimal when they are taken as the monotonic regression of $ \hat{\delta}$ on the observed ranks within each stratum. In categorical data, the disparities $\tilde{\delta}$ are optimal when they are estimated as the average of all $\hat{\delta}$ within the same category and stratum.

The numerator in the above criterion function is a least squares criterion, whereas the denominator (or $ \omega $ ) is a scaling factor. Scaling is required in non-metric data to prevent the solution from becoming degenerate. If the denominator were not present, $ q$ could be made as small as desired by simultaneously scaling both $\hat{\delta}$ and $\tilde{\delta}$ downward. Different criterion functions are often used in metric data, which do not have this scaling requirement.

5. Monotonic Regression

As an example of the monotonic regression used in optimizing $\tilde{\delta}$ in ordinal data, consider the following table in which the data corresponds to the store example discussed above.

In this table, the rank of the distance between each store and store 7 is given in the second row. Using the estimated configuration matrix $ \hat{X}$, the predicted distances $ \hat{\delta}$ are computed in the third row of the table, while the disparities $\tilde{\delta}$ are given in the fourth row. Note in the third row that the predicted distances (.69, .65, .44) for stores 1, 3, and 6 are not in the order required by the observed ranks. Because the disparities must preserve the observed ranks, the order in the estimated disparities must be reversed from the order obtained from $ \hat{\delta}$. In order to accomplish this, the monotonic regression averages adjacent elements of $ \hat{\delta}$ as required to obtain $\tilde{\delta}$. (See Barlow, Bartholomew, Brunk, and Bremmer (1972) for a discussion of how to determine when averaging is “required .” ) This results in the disparities given in the fourth row, in which the first three predicted distances are averaged, as are predicted distances 4 and 5. The resulting $ \tilde{\delta}$ preserves the rank requirements of the observed data and is as close as possible to $\hat{\delta}$, in the least squares sense.

Multidimensional Scaling Example

An optimal configuration estimated for the multidimensional scaling example above is given in Table 3 and plotted as the symbol ‘+’ in Figure 1. The store locations which gave rise to the rankings in Table 1 are plotted as symbol ‘o’ in Figure 1. In comparing the actual with the optimized store locations, note that scale is unimportant since ranked distances were used. (Scale differences between the axis are important, however.) For these data, the optimal criterion is 0.0, indicating a perfect fit.

Even though the fit is perfect, differences in store locations between the observed and estimated data occur because of the lack of uniqueness of the estimated configuration in non-metric scaling. In the figure, scale changes and rotation with possibly a translation bring the two figures close together.

Lack of uniqueness in the estimated configuration in ordinal data (in addition to translation, scale, and rotation problems) can be seen as follows. Assume a perfect fit so that the numerator of the stress is 0.0 while the denominator has no effect on the stress value. Now change the configuration in such a manner that the ranks of the resulting predicted distances are unchanged. For ordinal data, this is always possible. Then the monotonic regression yields disparity estimates which equal the altered distance estimates (since the stress is zero and the rankings have not changed), and the new stress is again zero. Thus the new configuration fits as well as the old configuration, verifying that the configuration estimates are not unique.

The same uniqueness problems occur when the optimal stress is not zero, but the argument is complicated in this case by the changing denominator. In general, if all other parameters are assumed fixed, then an interval of estimates for each estimated parameter, all yielding the same stress value, is available. The interval will change, of course, when other model parameters are allowed to vary.

Getting Proactive With Predictive Analysis