# Using Support Vector Machines in JMSL

## What Are Support Vector Machines?

Support Vector Machines (SVM) are a class of learning algorithms for classification, regression, and distribution estimation motivated by results of statistical learning theory (Vapnik, 1995). Classification problems are characterized by separating data into training and testing sets. Each pattern, or instance in the training set, contains one “target classification value” (i.e. one of the class values) and several “attributes” (i.e. the features or observed variables). The goal of SVM is to produce a model based on the training data that predicts the target values of the test data.

There are numerous resources available online and in books that cover SVM concepts in significant depth, well beyond what any article could accomplish. Some basic familiarity with the fundamentals of SVM will be helpful to understand the context discussed in this article.

## Using Support Vector Machines for Classification Use Cases

The JMSL implementation features algorithms for regressions and the “one class” problem. The example problems discussed here will all use the SVClassification class as the primary tool, but readers are encouraged to investigate the SVRegression and SVOneClass classes as well.

Nearly every discussion about SVM classification starts with two rather basic examples: Linearly separable and non-linearly separable. Variations of the same two data sets appear repeatedly across different material from different authors. These two textbook examples will be presented as a means to get started with the JMSL SVM implementation using familiar data.

**>> Download a complimentary full code sample highlighting the examples at github.com/perforce/jmsl-svm-demo.**

## Linearly Separable Data

Consider the following graph consisting of two sets of points, where any given x,y location is associated with a value of 0 or 1 in a prime example of a classification problem.

The key idea here in terms of classification is that given this set of known coordinates and 0 or 1 values, can one accurately predict whether any new x,y point should be a 0 or a 1?

In this case, the data are linearly separable meaning that one can easily draw a line between the two populations and define a border between them. Any new point to be classified would simply require identification of which side of the boundary line the point is located. Where exactly to place the dividing line requires some effort and this is what the SVM algorithm accomplishes.

The method essentially involves drawing two parallel lines (or hyperplanes as problems extend to higher dimensions) to define the margin between the two sets such that it is as large as possible and then defining the line (or maximum-margin hyperplane) that lies midway between the first lines. Note that mathematically the placement of these hyperplanes in the linearly-separable case depends only on the points closest to the margin. It does not matter if there are three points behind the margin or three million, thus those points nearest to the boundary are called support vectors.

Mapping this problem to be solved using the JMSL SVM classes can be done with a handful of lines of Java code. Including validation results and classifying new data requires a few more lines. As this is the first, and most straightforward example, the required source will be traversed in sections, with succinct concepts for each section.

To begin, the data set is defined by eight points in each group set for a total of sixteen x,y pairs, each with a classification of either 0 or 1:

```
int N = 16;
double[][] x = {
{0.0, 0.3, 3.1},
{0.0, 1.2, 1.2},
{0.0, 1.4, 4.9},
{0.0, 2.0, 2.6},
{0.0, 3.4, 0.9},
{0.0, 4.1, 2.0},
{0.0, 6.0, 1.3},
{0.0, 6.8, 1.7},
{1.0, 3.5, 8.0},
{1.0, 3.6, 6.3},
{1.0, 4.2, 9.3},
{1.0, 5.9, 5.9},
{1.0, 6.2, 7.3},
{1.0, 7.5, 5.7},
{1.0, 8.8, 7.8},
{1.0, 9.8, 3.6}};
```

The JMSL class SVClassification will be used for the analysis and has a constructor signature of:

```
SVClassification(double[][] xy, int responseColumnIndex,
PredictiveModel.VariableType[] varType)
```

The first argument is the input data that is defined as x above. The algorithm needs to know which column contains the response variable, which is column 0 for this example. Additionally, for each column, the variable type is required. The types can be one of Categorical, One Class, Ordered Discrete, Quantitative Continuous, or Ignore. (More details can be found in the PredictiveModel.VariableType enumeration documentation.) For this problem the response column is categorical while the x,y locations are continuous, leading to the following:

```
SVClassification.VariableType[] varType = {
SVClassification.VariableType.CATEGORICAL,
SVClassification.VariableType.QUANTITATIVE_CONTINUOUS,
SVClassification.VariableType.QUANTITATIVE_CONTINUOUS
};
SVClassification svm = new SVClassification(x, 0, varType);
```

With the SVClassification object created, the next step is to set the kernel. A kernel helps map higher-dimensional problems by transforming the data. As such, the default kernel for SVM in the JMSL Library is a radial basis function with one parameter. However, for this linear case, a linear kernel is the best choice. This is set easily using:

` svm. setKernel(new LinearKernel());`

At this stage the model needs to be trained with the known data classes provided in the constructor, which is done with just one more line of code:

` svm.fitModel();`

Although not a requirement, it’s a good idea to validate the model by confirming that the training data is properly classified. This is more important as the data becomes more complex. To accomplish this create a set of known values (below, `knownClass`

) and compare those with the values predicted by fitting the SVM model to the training data (below, `fittedClass`

). The `predict()`

method fits the training data while the `getClassErrors()`

method helps in the comparison:

```
// Extract the known classification
double[] knownClass = new double[N];
for (int i=0; i
```

From here, one can use the JMSL PrintMatrix class to output results. For a small data set simply printing the `fittedClass`

array is sufficient, but printing the `fittedClassErrors`

output array provides a summary table containing the number of errors and non-missing observations for each class and a sum for all the classes. The print methods and their respective output follows:

```
new com.imsl.math.PrintMatrix("fittedClass").print(fittedClass);
new com.imsl.math.PrintMatrix("fittedClassErrors").print(fittedClassErrors);
fittedClass
0
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
fittedClassErrors
0 1
0 0 8
1 0 8
2 0 16
```

This output reveals that the model fits the training data perfectly, with zero errors across all sixteen observations.

The next test for this introductory example is to classify some additional data to confirm that the model works against unknown values not present in the training data. For this test sixteen more points are defined along the main diagonal of the domain and each are classified using the trained model. Plotting the results provides an indication of where the hyperplane lies that separates the two classes of data. Above, to classify the training data `svm.predict()`

was called with no arguments. To classify new data simply pass an array of x,y values to this method as shown below:

```
final int M = 16;
final double step = 10.0/M;
double[][] testLine = new double[M][3];
for (int i=0; i<M; i++) {
testLine[i][1] = 0.3125 + i*step;
testLine[i][2] = 0.3125 + i*step;
}
double[] predictLine = svm.predict(testLine);
```

Plotting these predicted values as solid squares, colored with blue = 0 and red = 1, provides confidence that classification of additional points will be accurate. The boundary between the domains, which is linear due to the selected kernel, can be inferred easily from this test.

In this first example, the new support vector machines algorithm in the JMSL Library is presented along with a straightforward linear example. The methodology described is applicable to more advanced data sets, which will be covered in example two.

## Non-Linearly Separable Data

The example one provides a starting point in writing applications using the JMSL SVM classification algorithm. However, it is admittedly trivial. Most clustering and classification algorithms should have no problem with easily grouped data. The next example that appears in SVM introductions moves onto a less trivial example where the data classes are not linearly separable. In these cases, traditional algorithms like K-Means clustering or K-Nearest Neighbors do not do a very good job. Consider the data set pictured below:

As with the first example, there are sixteen total points, eight classified as 0 and eight as 1, but this time there is no single simple line to be drawn to separate the domains. The source code and analysis steps are very similar to the first example, but with different input data and a different kernel function. To begin, this data set is defined as:

```
int N = 16;
double[][] x = {
{0.0, 3.4, 4.1},
{0.0, 3.6, 5.9},
{0.0, 4.2, 5.0},
{0.0, 4.8, 7.0},
{0.0, 5.1, 3.1},
{0.0, 5.9, 5.8},
{0.0, 6.4, 4.0},
{0.0, 7.0, 6.1},
{1.0, 1.6, 6.2},
{1.0, 2.1, 3.1},
{1.0, 2.6, 1.8},
{1.0, 3.6, 8.2},
{1.0, 6.2, 1.0},
{1.0, 6.1, 8.4},
{1.0, 8.1, 3.4},
{1.0, 8.5, 7.1}};
```

The same array of variable types is used and the analysis steps are exactly the same. However, because prior knowledge lead to the choice of a linear kernel in the first example, this needs to change here. The default radial basis kernel actually does a fine job with this data in properly classifying the training set, but there is additional knowledge that is advantageous here. One could easily draw a circle that encompasses all of the blue points in the graph, and so a second-order polynomial kernel is an appropriate choice. As it is easy to try different kernels, confirming that the radial basis default works well enough is left to the reader as an exercise. To set the desired circular kernel, the PolynomialKernel class is used, but the default arguments are insufficient as they will general a first-order polynomial (i.e., a line). Therefore the constructor that allows setting the parameters is used:

` svm. setKernel(PolynomialKernel(1.0, 1.0, 2));`

Following the rest of the steps discussed in part one to validate classification of the training data again results in no misclassifications. As further validation and a test of the prediction capabilities of the trained model, a set of sixteen points along the main diagonal can be classified with the following result:

The boundaries of the domains becomes evident, with points turning blue as they cross into the circular region in the center of the chart.

## Using Real-World Data Sets

In the previous sections, two text-book examples were presented to help understand how the JMSL SVM classes can be used to classify data sets. In those examples, the example data were simple x,y pairs classified as either 0 or 1 and grouped in a way that a line or a circle easily delineates the two classes. In the real world data are rarely so clean or so easily understood. In this section, the JMSL SVM algorithm is applied to two of the data sets from the UCI Machine Learning UCI Machine Learning Repository with mixed results.

There are numerous data sets available, most from real-world academic studies. The selection criteria for this analysis was to find sets for classification (many are for regression analysis, which is also possible with JMSL and the SVRegression class, but not the focus here) small enough to be hard-coded without being unwieldy, and with more than three attributes in the data set. The methodology is straightforward: Load in the data, select about 10 percent of the data randomly (using JMSL RandomSamples class) to be used as a test with the remainder used for training, evaluate errors in the training set, and then evaluate errors in the test set. The JMSL CrossValidation class could be used to automate this type of random sampling and testing, repeating the work multiple times, resampling with replacement and obtaining average prediction errors.

### Qualitative Bankruptcy Data Set

The Qualitative Bankruptcy data contains 250 observations with six categorical inputs and one classification result. The idea is very much a real-world problem. If there is historical data of how a company is rated on various risk factors and the known bankruptcy status of each individual, is it possible to predict the bankruptcy status accurately given a new set of inputs? For this data, each observation includes a ranking (positive, average, or negative) across six qualities: industrial risk, management risk, financial flexibility, credibility, competitiveness, and operating risk; along with a flag indicating bankruptcy or not. The raw data are text values (P, A, and N for ranks, B and NB for bankruptcy or not), which were converted to categorical integer values. Of the 250 data points, 25 randomly selected points were set aside for testing, leaving 225 for training the model. The output of validating the training set and the test data are:

```
fittedClassErrors
0 1
0 0 126
1 0 99
2 0 225
classErrors
0 1
0 0 17
1 0 8
2 0 25
```

The JMSL Support Vector Machines algorithm classifies this data perfectly. This result isn’t unique, having tested several other small data sets from the same repository. While one should regard such results with suspicion, there are several reasons why very good output may be observed. First, the data categories may fundamentally capture the dynamics of the classes (in this case, the measures of risk are likely very good indicators of bankruptcy). Second, much of the data available in such repositories are not raw data but already cleaned and pre-processed, reducing the noise is true raw data. While data scrubbing is an important part of any analysis, data scientists must be mindful of confirmation bias in what they include or exclude. Finally, even though this data can be classified without error, using support vector machine algorithms provides no additional insight such as which characteristics are most important.

### Haberman's Survival Data Set

Consider the Haberman’s Survival data set, which includes data on five-year survival rates after breast cancer surgery. Given the patient age, year of operation, and number of positive axillary nodes detected, can whether or not the patient survived at least five years be predicted? The output of the validation steps is as follows:

```
fittedClassErrors
0 1
0 0 203
1 24 73
2 24 276
classErrors
0 1
0 0 22
1 8 8
2 8 30
```

There were some errors in classification of the training data set. While this is expected for real-world data, an error rate of nearly 33 percent (24 of 73) for one class is worrisome. When the results of training against 276 input values is applied to the randomly selected 30 test data points, all eight of the test cases where the patient died within five years were misclassified. The primary conclusion from such results is that without additional tweaking to the model or using different methodologies, the input data is not sufficient to predict the outcome. Using cross-validation techniques can help select the best performing model, but if important predictor variables are absent from the input data, there simply isn’t much a model can do to make up for it.

## Summary and Next Steps

Support vector machines can be a robust tool for classification problems complementing other data mining techniques. The JMSL Library implements a flexible programming interface for users to explore this modern technique as shown in the examples in this article.

Want to try JMSL on in your application? Request a free trial.