March 27, 2019

Fitting Polynomials to Data Using IMSL C

Algorithms & Functions

A customer has been migrating a workload from an existing application with a point-and-click interface to a more automated program written in C++. One of their key requirements is fitting polynomials to data. Curve fitting is a popular subject, but if you search IMSL documentation for the topic, you are likely to find spline approximations (both cubic- and B-splines) but not necessarily general curve fitting, and probably not much specific to fitting a polynomial.

This article looks at different options for fitting polynomials to data using the IMSL C Numerical Library. Most of the functions examined are available in the other IMSL languages Java and Fortran. A short discussion and cross-reference table is included at the end to map the functions for readers interested in the other language options.

Example Input Data

Each of the following examples use the same sample data set. The data set contains eleven observations, with x = 0, 1, …, 10 and values computed from a third-order polynomial with artificial noise added to mimic real-world scatter in measured data. These input data are presented in the table below. Even though the input data have limited accuracy, the double precision version of each IMSL Library function is used. If your input data are low precision or speed is critical, you may opt to use the float precision functions instead, but in general you should choose the higher accuracy of double precision. The precision of algorithms are differentiated in IMSL using the imsl_f_<name> and imsl_d_<name> nomenclature for the API.

Example Input Data
x	y
0	29.73
1	30.52
2	22.33
3	19.23
4	23.32
5	30.74
6	47.29
7	53.02
8	107.94
9	152.41
10	262.67

Polynomial Least-Squares Regression

The most specific function for this task is poly_regression in the IMSL C Library. The sole purpose of this algorithm is to compute estimates for the coefficients of a polynomial regression model. Summary statistics, well beyond the common \(R^{2}\) value, are available as optional arguments to quantify the goodness of fit of the calculated coefficients.

Use of this function is the most straightforward of all the routines discussed below. It requires only the observations, the number of observations and the degree of the polynomial to fit. The more general functions require the user to implement a user-defined function or build a coefficient matrix, but in this case, the function is always an N-degree polynomial, so nothing beyond the degree is required.

The entire program is:

#include  <imsl.h>
#define DEGREE 3
#define NOBS 11

int main()
{
      double *result;
      double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                        9.0, 10.0};

      double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                        47.29, 53.02, 107.94, 152.41, 262.67};
      
      result = imsls_d_poly_regression(NOBS, x, y, DEGREE, 0);
      
      imsls_d_write_matrix("The PR coefficients are:", 1, DEGREE + 1,
                           result, IMSLS_WRITE_FORMAT, "%10.6f", 0);
 
      if (result) imsls_free(result);

      return 0;
}

Compiling and executing this function gives the following output:

   The PR coefficients are:
         1            2             3             4
 26.622727     8.066476     -4.865740      0.633390

The order of the coefficients in the output array are of increasing order, starting with the constant (or intercept) and following with the higher-order terms. Thus the solution to this problem in mathematical terms (rounding values to two places for clarity) is:

y = 0.63\(x^{3}\) - 4.87\(x^{2}\) + 8.07\(x\) + 26.62

To retrieve the \(R^{2}\) value, use the IMSLS_ANOVA_TABLE optional argument and examine the 11th item in the 15-element array. This value is returned as a percentage, so to compare it to tools like Excel, divide by 100. For this example, \(R^{2}\) = 98.96% or 0.986, which is a very good fit.

This function is not available in IMSL for Java, so one of the other routines discussed below must be applied for users with this version.

Nonlinear Regression

Staying within the Chapter 2 Regression functions in the IMSL Library documentation, the next routine that can be used to fit a polynomial is the more general nonlinear_regression function. As this is a general-purpose nonlinear regression routine, a user-defined function is required. For a third-order polynomial, the function will accept four parameters and a single value of x to compute the dependent value. For consistency with the poly_regression function above, the coefficients are ordered with c[0] being the intercept and c[3] being the coefficient for \(x^{3}\).

The nonlinear_regression function is parallelized using OpenMP directives internally. While performance is not a vital concern for this trivially short example, the imsls_omp_options function is called nonetheless to indicate that the user function is thread safe and can be called in parallel with no side effects. Further, this algorithm implementation is general enough to be used for multiple independent vectors, so the input values could be higher dimensional arrays defined as x[NOBS][k], where k is the number of independent vectors.

The full example and its output are shown below:

#include  <imsl.h>
#define NOBS 11
double fcn(int, double[], int, double[]);

int main()
{
     int n_independent = 1;
     int n_parameters = 4;
     double *result;
     double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                       9.0, 10.0};
     double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                       47.29, 53.02, 107.94, 152.41, 262.67}; 

     imsls_omp_options(IMSLS_SET_FUNCTIONS_THREAD_SAFE, 1, 0);
 
     result = imsls_d_nonlinear_regression(fcn, n_parameters, NOBS,
         n_independent, x, y, 0);
 
     imsls_d_write_matrix("The NLR coefficients are:", 1, n_parameters,
         result, IMSLS_WRITE_FORMAT, "%10.6f", 0);
 
     if (result) imsls_free(result);
 
     return 0;
}

double fcn(int n_independent, double x[], int n_parameters, double c[])
{
      return c[3] * x[0] * x[0] * x[0] + c[2] * x[0] * x[0] +
          c[1] * x[0] + c[0];
}
 
             The NLR coefficients are:
        1             2              3             4
26.622727      8.066476      -4.865740      0.633390

The output values match those computed by poly_regression exactly. Note that a value for \(R^{2}\) is not explicitly available in the API for nonlinear_regression. However, it can be computed using the IMSLS_PREDICTED and IMSLS_SSE (or IMSLS_RESIDUAL) optional arguments. IMSLS_SSE returns the sum of squares of residuals (here, 600.9867) and IMSLS_PREDICTED conveniently evaluates the function using the computed coefficients at each value of x. Some math is then required to compute the mean of the original observations (70.8364), and the total sum of squares (57,298.6991). Then, \(R^{2}\) is defined as one minus the ratio of the residual sum of squares to the total sum of squares. Walking through this math manually in this case for the nonlinear regression algorithm results in \(R^{2}\) = 0.9895.

Nonlinear Least Squares

Moving over to the optimization chapter of functions of IMSL, the next function to consider is nonlin_least_squares. This is a general optimization solver that uses the well-known Levenberg-Marquardt algorithm. A user-defined function to evaluate is again required, although it programmatically differs from the nonlinear regression routine in that it is called by the algorithm to retrieve an entire vector of evaluations at once instead of passing individual values for x each time. With this programming interface, the raw x,y observations are not passed into the nonlin_least_squares function at all; instead the data observations are contained in the user-defined function where they are used in the evaluation. The structure of the program and data are therefore slightly different than statistical regression algorithms as shown above, but consistent with other general optimization functions.

As with nonlinear_regression, the nonlin_least_squares function will use SMP parallelization if the user-defined function is declared to be thread safe. The complete example using this numerical optimization solver is:

#include <imsl.h>
#define NOBS 11
void fcn(int, int, double[], double[]);

int main()
{
     int n = 4;
     double *result;
 
     imsl_omp_options(IMSL_SET_FUNCTIONS_THREAD_SAFE, 1, 0);
 
     result = imsl_d_nonlin_least_squares(fcn, NOBS, n, 0);
 
     imsl_d_write_matrix("The NLS solution is:", 1, n, result,
                 IMSL_WRITE_FORMAT, "%10.6f", 0);
 
     if (result) imsl_free(result);
 
     return 0;
}

void fcn(int m, int n, double c[], double f[])
{
     int i;
     double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                       9.0, 10.0};
     double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                       47.29, 53.02, 107.94, 152.41, 262.67};
 
     for (i = 0; i < m; i++)
         f[i] = c[3] * x[i] * x[i] * x[i] + c[2] * x[i] * x[i] +
                c[1] * x[i] + c[0] - y[i];
}

                The NLS solution is:
        1             2              3             4
26.622728      8.066475      -4.865740      0.633390

Essentially the same solution is found, differing only in the sixth decimal place for two of the coefficients. Any statistics regarding the goodness of fit must be computed manually by the user. The optimization functions in IMSL are general solvers, and so computing statistical fit information is not ordinarily appropriate. Thus, while the use cases for a nonlinear regression function and a nonlinear least squares solver can be similar as seen in this use case, their typical purposes are varied leading to different functional interfaces.

Linear Least Squares

Since a polynomial is a nonlinear curve, the initial impulse is to look for nonlinear solvers. However, this problem can be formulated to use a linear least squares approach as well. The linear problem to solve in this case is:

where \(A\) is the matrix of x values, c is the vector of 4 coefficients we’re looking for, and y is the vector of observed data values. Expanding \(A\) for the general case gives:

where \(m\) = 10 for this test case. The code requires a loop to build this matrix, then it’s passed along with y to lin_least_squares_gen, which computes a solution using column pivoting. Note that this version requires no user function to be evaluated. The full source and output is:

#include  <imsl.h> 
#define NOBS 11

int main()
{
     int i, j, n = 4;
     double val, a[44], result[4];
     double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                       9.0, 10.0};
     double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                       47.29, 53.02, 107.94, 152.41, 262.67};
 
     for (i = 0; i < NOBS; i++)
     {
         val = 1.0; 
         for (j = 0; j < n; j++)
         {
             a[i * n + j] = val;
             val *= x[i];
         }
     }

     imsl_d_lin_least_squares_gen(NOBS, n, a, y,
         IMSL_RETURN_USER, result, 0);
 
     imsl_d_write_matrix("The LLS solution is:", 1, n, result,
         IMSL_WRITE_FORMAT, "%10.6f", 0);
 
     return 0;
}

              The LLS solution is:
        1             2              3             4
26.622727      8.066476      -4.865740      0.633390

Bounded Nonlinear Least Squares

The sections above cover the primary options for a user looking to fit a polynomial to observed data. The customer that started this investigation included a specific additional use case, however, where they would like to set bounds on a specific coefficient. None of the functions above can accommodate such a use case, and statistical regression analysis rarely needs to worry about variable constraints. Therefore, this section and the next expands slightly on the base example to set bounds on a coefficient.

Consider the same starting dataset, except this time assume there exists some prior knowledge of the target curve to fit based on the physical system. For this data, many experienced users would look to fit a 3rd degree polynomial, but given some extra knowledge, they may know that the most useful equations are found when the leading coefficient is not greater than 0.5.

One appropriate IMSL function for a constrained problem like this is bounded_least_squares which follows a very similar API to nonlinear_least_squares except that it allows setting bounds for each unknown parameter. According to the documentation, when no bounds are required, the limits of [\(-10^{6}\) , \(10^{6}\) ] should be used, which are set for three of the coefficients. However, for the bounded one [\(-10^{6}\) , \(0.5\)] is used.

The complete example is provided below. Note the thread safety option and the additional variables used to set the bounds on the coefficients.

#include  <imsl.h> 
#define NOBS 11
void fcn(int, int, double[], double[]);

int main()
{
     int n = 4;
     double *result;
     double xlb[4] = {-1.0e6, -1.0e6, -1.0e6, -1.0e6};
     double xub[4] = { 1.0e6, 1.0e6, 1.0e6, 0.5};
 
     imsl_omp_options(IMSL_SET_FUNCTIONS_THREAD_SAFE, 1, 0);
 
     result = imsl_d_bounded_least_squares(fcn, NOBS, n, 0,
         xlb, xub, 0);
 
     imsl_d_write_matrix("The BLS solution is:", 1, n, result,
         IMSL_WRITE_FORMAT, "%10.6f", 0);
 
     if (result) imsl_free(result);
 
     return 0;
}

void fcn(int m, int n, double c[], double f[])
{
     int i;
     double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                       9.0, 10.0};
     double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                       47.29, 53.02, 107.94, 152.41, 262.67};
 
     for (i = 0; i < m; i++)
         f[i] = c[3] * x[i] * x[i] * x[i] + c[2] * x[i] * x[i] +
         c[1] * x[i] + c[0] - y[i];
}

         The BLS solution is:
        1             2               3             4
31.424755      0.436587       -2.864895      0.500000

As expected, different results are calculated in this case. The leading coefficient for the polynomial is set to the upper bound of 0.5 where the best fit value is 0.6334. The other coefficients adjust to retain the best fit possible given this constraint. As with other optimization functions, an estimate of \(R^{2}\) must be computed manually. Working through the math separately, this solution results in \(R^{2}\) = 0.9877, which is still a pretty good fit to the data.

Constrained Linear Least Squares

The bounded (or constrained) case can also be solved as a linear system using the function lin_lsq_lin_constraints. The methodology is analogous to the unconstrained linear example above, but combined with the variable bounds in the previous example. Multiple NULL values are passed into the function since the API allows additional constraints not needed for this example. The bounds on the variables are also constructed differently than for the bounded_least_ squares algorithm. Per the documentation, when there is no lower bound, 1.0e30 should be used; -1.0e30 is used when there is no upper bound. The results for both constrained cases are the same using either method.

#include <imsl.h> 
#define NOBS 11

int main()
{
     int i, j, n = 4;
     double val, a[44], result[4];
     double x[NOBS] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                       9.0, 10.0};
     double y[NOBS] = {29.73, 30.52, 22.33, 19.23, 23.32, 30.74,
                       47.29, 53.02, 107.94, 152.41, 262.67};
     double xlb[4] = { 1.0e30, 1.0e30, 1.0e30, 1.0e30};
     double xub[4] = {-1.0e30, -1.0e30, -1.0e30, 0.5};
 
     for (i = 0; i < NOBS; i++)
     {
         val = 1.0;
         for (j = 0; j < n; j++)
         {
             a[i * n + j] = val;
             val *= x[i];
         }
     }

     imsl_d_lin_lsq_lin_constraints(NOBS, n, 0, a, y, NULL, NULL,
         NULL, NULL, xlb, xub, IMSL_RETURN_USER, result, 0);
 
     imsl_d_write_matrix("The LLLC solution is:", 1, n, result,
         IMSL_WRITE_FORMAT, "%10.6f", 0);
 
     return 0;
}

              The LLLC solution is:
        1             2              3              4
31.424755      0.436587      -2.864895       0.500000

Several Functions Available to Help

There are several functions available in the IMSL Library to fit an n-order polynomial to observed data. This article examines several robust methods including polynomial regression, nonlinear regression, nonlinear least squares, and linear least squares. As an extra use case, one coefficient is constrained and the results are computed with bounded nonlinear least squares and constrained linear least squares algorithms.

While statistical fit information is available for the regression functions in IMSL, using a general numerical optimization solver puts the onus on the user to compute statistics of interest. However, each of the methods compute the same coefficients for an example data set derived from a third order polynomial. The constrained case necessarily results in a slightly different solution. The raw data, the best fit and the constrained fit are displayed in the following chart.

With programming language and historical differences, not every function is the same across all varieties of the IMSL Libraries. The table below indicates the function names discussed relative to each of the libraries. While optional arguments and user functions in the Fortran library are like the C examples presented, the Java interfaces use object-oriented get/set methods with interfaces to implement user-defined functions.

Algorithm	IMSL C	IMSL Fortran	IMSL Java
Polynomial regression	poly_regression	RCURV, RPOLY	n/a
Nonlinear regression	nonlinear_regression	RNLIN	NonlinearRegression
Nonlinear least squares	nonlin_least_squares	UNLSF	NonlinLeastSquares
Linear least squares	lin_least_squares_gen	LSQRRR	QR.solve(double[])
Bounded least squares	bounded_least_squares	BCLSF	BoundedLeastSquares
Constrained linear least squares	lin_lsq_lin_constraints	LCLSQ	n/a

Additional Resources

Want to see how IMSL can add functionality to your project? Try any of our IMSL libraries free with an IMSL trial.

Try IMSL Free