# Methods for Multidimensional Scaling Part 3: Computational Optimization and Parameter Estimation

In this article, we pick up from part 2 of our multidimensional scaling series with a discussion of the computational optimization procedures and parameter estimation for the generalized criterion functions.

## Computational Optimization Procedures

*This blog is part 3 of a series featuring excerpts from the original technical paper, “Methods for Multidimensional Scaling” by Douglas B. Clarkson, IMSL, Inc., 1987.*

### Initial Parameter Estimates

A procedure for obtaining initial estimates of all parameters in \(\delta\) is given as follows:

- In ordinal or nominal data, initial estimates for the disparities are taken as the appropriately standardized input data. Thus, when ranks or categories are input, initial estimates for other parameters may be far from the optimal estimates.
- Estimates of the configuration matrix in the Euclidean model are obtained as the eigenvectors corresponding to the \(\tau\) largest eigenvalues of the summed product-moment matrices, normalized to a length equal to the square root of the eigenvalue. When more general models are used, the same eigenvectors, normalized to a length of the inverse of the square root of the eigenvalue, are used as the matrix \(Y\) in steps 3 to 5.
- If required, subject weights are estimated via the SUMSCALE method of DeLeeuw and Pruzansky (1978). Let \(C_i = Y^{\prime}P_iY\), where \(P_i\) is the \(i^{th}\) product moment matrix and \(Y\) is the matrix of eigenvectors discussed in step 2. Let \(T\) be the orthogonal matrix which minimizes the sum of the squared off-diagonal elements of the matrices \(D_i = T^{\prime}_iC_iT_i\). Then the initial configuration is taken as \(X = Y\Delta T\), where \(\Delta\) is the diagonal matrix of eigenvalues in step 2. The subject weights for the \(i^{th}\) subject are taken as the diagonal elements of \(D_i\). If required, a constant is added to all subject weights so that all subject weights are (just) positive.
- If required, stimulus weights are computed as least squares estimates obtained from the disparities, with the configuration and subject weights computed above. The least squares estimates for \(S_{ik}\) are defined by the following equations. All combinations of the indices \(j\) and \(l\) are used in estimating each set of parameters \(S_{ik}\). If required, a constant is added to all stimulus weights so that all stimulus weights are (just) positive. $$ \delta_{ijl} = \sum_{k=1}^{\tau} S_{ik}W_{mk}(X_{ik} - X_{jk})^2$$
- Once initial estimates for \(S, W\), and \(X\) have been obtained, predicted distances are computed. Initial estimates for the parameters \(a\) are taken as \(0\), while initial estimates for \(b\) are obtained by regressing the observed transformed dissimilarities on the transformed disparities within each stratum.

### Computational Optimization of Parameter Estimates

After initial estimates have been obtained, a modified Gauss-Newton algorithm can be used to optimize estimates for all parameters but the disparities. The initial iterations are performed on subsets of the parameters. In the final iterations all parameters (except for the disparities) may be optimized simultaneously.

The initial iterations proceed as in the following. "Optimal" estimates in this discussion are computed with all other parameters fixed.

- In nonmetric scaling, optimal disparities estimates \(\tilde{\delta}\) are computed within each stratum.
- Optimal estimates for \(a_h\) and \(b_h\) are computed within each stratum.
- Optimal configuration estimates \((X)\) are computed.
- Optimal subject weight estimates \((W)\) are computed (one subject at a time).
- Optimal stimulus weight estimates \((S)\) are computed.

When the maximum change in any parameter is less than a user-specified constant, the iterative method changes by combining steps 3, 4, and 5 so that iteration on \(X, W\), and \(S\) is performed simultaneously. Convergence is assumed when the change in any parameter from one iteration to the next is less than a user-specified constant.

The initial estimation and optimization procedures allow for any set of parameters to be fixed or eliminated. The presence of the subject weights \(W\), the stimulus weights \(S\), the scaling factors \(b\), and the additive constants \(a\) is optional, except where their absence is required because of overparameterization. Moreover, any of these parameter matrices, as well as the configuration matrix \(X\), can be fixed at a predetermined value.

### Parameter Estimation With the *L*_{p} Gauss-Newton Algorithm

_{p}

A modified Gauss-Newton algorithm may be used in the estimation of all parameters except the disparities. This algorithm, discussed by Merle and Spath (1974), uses iteratively reweighted least squares. The criterion function is rewritten as

$$ q = \sum_{i,j,h} \frac{\omega_h |f(\tilde{\delta}_{ijm}) - a_h - b_hf(\delta_{ijm})|^2}{|f(\tilde{\delta}_{ijm}) - a_h - b_hf(\delta_{ijm})|^{2-p}}$$

Iteratively reweighted least squares uses weighted least squares on a linearization of the parameters in \(\delta_{ijm}\) during each iteration. In the least squares, \(w_h\) is divided by the denominator of \(q\) to yield an observation weight which is assumed to be fixed during the iteration. After each iteration both \(w_h\) and the denominator of the summand of \(q\) are recomputed for each subject. Division by zero can occur when \(p < 2\) and the denominator of \(q\) is zero. In this situation, the denominator is set to \(0.001\), after which the calculations proceed as usual.

### Estimating the Disparities

#### Ordinal data

As discussed earlier, when \(p = 2\) monotonic regression is used to compute the disparities in ordinal data. When \(p \ne 2\), the usual monotonic regression methods cannot be used. A modified monotonic regression must be used instead.

Within each stratum the first Kruskal criterion function is given by:

$$q^* = \frac{\sum_k |y_{(k)} - f(\hat{\delta}_{(k)})|^p}{\sum_k |y_{(k)}|^p}$$

where \(y_{(k)} = f(\tilde{\delta}_{(k)})\), and \(k\) is the rank of the observed dissimilarity in its stratum (\(k\) is enclosed in parentheses to emphasize this ranking). In this equation, the \(y_{(k)}\) are parameters. (A similar equation is obtained when the second Kruskal criterion is used. Substitute \(\sum|y_{(k)} - \bar{y}_.)|^p\) for \(\sum |y_{(k)}|^p\) here and in the equations below.) When estimating the disparities, it is assumed that the predicted distances (\(\hat{\delta}_{(k)}\)) are fixed. The monotonic assumption in ordinal data requires that the \(y_{(1)}\le y_{(2)} \le y_{(3)}\ldots \le y_{(n_h)}\), where \(n_h\) is the number of observations in the \(h^{th}\) stratum. Using Lagrange multipliers, the criterion function within each stratum is transformed to

$$\tilde{q}^* = \sum_k |y_{(k)} - \beta f(\hat{\delta}_{(k)})|^p - \lambda ( \sum_k |y_{(k)}|^p - c )$$

where \(\beta\) is similar to, but not the same as, the scaling parameter \(b_h\) in metric scaling, and \(\lambda\) is the Lagrange multiplier.

Within each stratum the criterion function involves parameters \(y_{(k)},\beta\), and \(\lambda\). Because of the monotonic restrictions on the \(y_{(k)}\), it is not possible to easily use the usual Newton-Raphson techniques on all parameters simultaneously. Because of the Lagrange multiplier \(\lambda\), simple modification of the usual monotone regression techniques cannot be used. The following algorithm, while sometimes slow to converge, yields optimal estimates if convergence is reached.

- Set \(\lambda\) to \(0.0\) and \(\beta\) to \(1.0\).
- Estimate \(y_{(k)}\) for the criterion \(q\) and power \(p\), holding \(\lambda\) and \(\beta\) fixed by using a modified monotonic regression technique discussed below.
- Estimate \(\beta\), holding all other parameters fixed.
- Estimate \(\lambda\), holding all other parameters fixed.
- If the change in any parameter from one iteration to the next is greater than \(\epsilon\), go to step 2.

In step 2 a secant algorithm is used to estimate each monotonic parameter \(y_{(k)}\) based upon the pooled observations \(f(\delta_{(k)})\) associated with the parameter. The usual monotonic regression algorithm (the "pool adjacent violators" algorithm) is used to determine when pooling is required. A secant algorithm is then used to obtain an optimal parameter estimate for the single parameter \(y\) from the criterion obtained from the observations to be pooled. For example, the monotonicity restrictions may require the equality of the ranked transformed disparities \(y_{(2)}\) through \(y_{(7)}\). The disparities are set equal to the estimate \(y\) obtained from the predicted transformed distances corresponding to \(y_{(2)}\) through \(y_{(7)}\) in the criterion subset

$$\tilde{q}_y = \sum_{k=2}^7 |y - \beta f(\hat{\delta}_{(k)})|^p - 6\lambda|y|^p$$

All parameters in the subset are then set to the value \(y\).

In step 3, a secant algorithm for fixed \(y_{(k)}\) and \(\lambda\) is used. The computation of \(\lambda\) in step 4 is direct, and is obtained by setting the derivative of \(q\) with respect to each \(y_{(k)}\) equal to zero, and then summing over all \(y_{(k)}\).

The algorithm seems to converge for all values of \(p\) in the interval \([1 ,2 ]\). Convergence is slowest for \(p\) near \(1\) and is fastest at \(p = 2\).

#### Categorical data

In categorical data the \(p^{th}\) power estimate of location is used on the transformed distances as the disparity estimate for all observations with the same observed dissimilarity within each stratum. A secant algorithm is used to compute the \(p^{th}\) power location estimate.

### Some Statistics

Because the entire Hessian is computed when metric scaling is used, a case analysis can be performed. Clearly of interest is the residual for each observed dissimilarity. Some measure of influence may also be of interest, as well as the observation weight and the standardized residual. These statistics are computed as follows:

- Compute the observation weight and residual in the usual manner.
- Compute the influences. Let \(g_{ijm}\) denote the row vector of weighted partial derivatives of \(a_h+ b_hf(\delta_{ijm})\) with respect to the parameters \(a_h, b_h\), and all parameters in \(\delta_{ijm}\). Let \(G\) denote the matrix of these partial derivatives. Compute the leverages (or influences) as the diagonal elements of the matrix \(G(G^{\prime}G)^{-1}G^{\prime}\) evaluated at the optimal estimates.
- The studentized residual is given as $$ r = \frac{e}{\sqrt{\text{MSE}(1-h)}}$$ where \(e\) is the residual, \(h\) is the leverage, and \(\text{MSE}\) is the (weighted) mean square error estimate computed via the criterion function and adjusted for the number of parameters estimated minus the number of restrictions applied.

In metric data, the inverse of the Hessian computed for all parameters can also be used to obtain large sample standard errors for the estimated parameters. Standard errors can be computed via standard nonlinear regression methods.

## Ordinal Row-Conditional Dissimilarity Data Example

As a second example of ordinal row-conditional dissimilarity data, consider the matrix in Table 4 in which nine wines are judged with respect to their dissimilarity. The data consist of nine such matrices, one for each of nine judges. Each judge was asked to rank the similarity of the eight column wines to the row wines. Thus, in row 1, wine 2 is judged most similar to wine 1, while wine 8 is judged least similar.

A multidimensional scaling analysis of the data is performed using the individual-differences model for the distance model; Table 5 gives the optimal criterion function value for each of 1, 2, 3, and 4 dimensions.

Because there is a large decrease (relative to the criterion at \(\tau = 1\)) in the criterion function from 1 to 2 dimensions and then a leveling off at 2, 3, and 4 dimensions, two dimensions are retained.

A plot of the configuration for the two-dimensional solution is given in Figure 2.

A plot of the subject weights is given in Figure 3. Note in this figure that subject 3 gives almost no weight to dimension 1 and gives comparatively little weight to the second dimension. This outcome is not unreasonable because subject 3 had a head cold. It is encouraging that the multidimensional scaling detects this fact.

Interpretation of the stimulus configuration is difficult. The close location on the plot of the Gallo Hearty Burgundy and the Gallo Burgundy is encouraging because the two wines are related. A similar relationship holds for the two wines made from zinfandel grapes. As with factor analysis, interpretation of results is often the most difficult part of a multidimensional scaling analysis.

## Final Thoughts

The asymptotic theory in multidimensional scaling is complicated by the fact that the number of parameters increases with the number of subjects in most models (see Ramsey, 1978); consequently, the validity of all asymptotic results in samples of even moderate size is questionable. Also questionable is the validity of the estimated variances and covariances and the residual analysis. Still, in most cases, some estimate of a variance is better than no estimate, and the residual analysis will yield information on outliers in metric data, if nothing else.

In nonmetric data, the meaning of the residual analysis must change because of the estimated disparities. Since the leverages also depend upon the residuals (and do not include information about the disparity derivatives) their interpretation is also difficult. Indeed, except for the occasional outlier, the case analysis in nonmetric data seems to defy interpretation.

The residual analysis in \(L_p\) estimation also needs further investigation. The validity of estimated parameter variances needs to be established when \(p \ne 2 \), as does the validity of the residual and leverage estimates.

It is disconcerting that the estimates are not unique in the nonmetric scaling models. This lack of uniqueness comes about because of the disparity estimation. In classical nonmetric scaling, ordinal data become pseudo-continuous via the monotonic regression in the sense that, after the monotonic regression, the disparities are analyzed as if they are continuous. A better method might start from the premise that the data are ordinal, and then compute configuration and other estimates directly from the premise. In this regard, the MAXSCAL algorithm of Takane and Carroll (1981) shows promise. Their algorithm uses the information in the ranks in the same way that the Cox proportional hazards model can be thought to use it, i.e., through the marginal likelihood.

### Additional Resources and References

If you're looking for additional resources on multidimensional scaling, be sure to check out parts one and two of our multidimensional scaling series by clicking the links below.

### Try IMSL for Free

If you want to try the IMSL libraries on your project, you can request a free trial by clicking the button below.