Embedding Analytics Using SourcePro IMSL C
July 20, 2016

Embedding Analytics into a Database Using SourcePro and IMSL C

Embedded Analytics

This article describes how to implement embedded analytics within a database using SourcePro and the IMSL C Numerical Library, a native C library from Perforce.

Benefits of Using Embedded Analytics

The benefits of using embedded analytics include:

  • Real time analysis – reports results without data synchronization delay.
  • Faster results – eliminates the need to move the data across the network.
  • Accuracy – co-locates the data and the analytics to avoid potential user errors.
  • Accessibility – allows invoking the data and analytics from any programming language or application that can connect to the database.
  • Better quality of data – allows preprocessing and cleaning of data before it’s stored.
  • Higher security – data used as input to the analytics never leaves the database and, if necessary, user access is limited to running the stored routines without access to the underlying data.

We describe in detail how to implement a server-side native stored procedure leveraging IMSL C using a particular relational database management system (RDBMS). We then execute the procedure in RDBMS using SourcePro DB.

Joining the Algorithms With the Data

While variety is one of the characteristics of big data, it is also a necessary characteristic of the solutions designed to respond to its challenges. That is, the overall strategy when working with big data should consist of various, specific tactics and tools that address specific challenges. For example, the Hadoop and its MapReduce framework is commonly considered an important technology to address big data challenges, but it is not a solution to all the challenges.

Although Hadoop’s MapReduce is not a one-size-fits-all solution to all big data problems, it has introduced a fundamental and important change when working with big data: to join the algorithms with the data. Previously, for advanced analysis and processing, data was moved from the database to a client machine where some computation was performed, and then both the data and the results were uploaded back to the database. The work described in this article brings the IMSL C algorithms to the database itself, orchestrating the processing of the data using SourcePro DB.

The IMSL C Numerical Library is a comprehensive set of pre-built mathematical and statistical analysis functions that developers can embed directly into their applications. Available for a wide range of computing platforms, the robust, scalable, portable, and high performing IMSL analytics allow developers to focus on their domain of expertise and reduce development time. IMSL C includes a comprehensive set of functions for data mining, modeling, prediction, and optimization. There are time series models such as ARIMA, GARCH, and vector auto-regression, plus decision trees, Apriori analysis, SVM, neural networks, linear and logistic regression, K-means clustering, Bayes classification, and much more. IMSL C also includes functions for analyzing streaming data and working with big or distributed data.

SourcePro is the most complete enterprise C++ development platform. It provides fundamental C++ components as well as robust, reliable cross-platform C++ tools in the areas of analysis, networking, and databases. SourcePro DB, a part of the SourcePro product suite, is a library of C++ classes that provide a high-level, object-oriented, cross-database abstraction over the native C APIs of the various database vendors. It provides a high-performance, consistent API encapsulating the database-specific differences in behavior and syntax.

About the Example

Many databases support procedures authored in native languages, allowing the usage of external libraries in the RDBMS and the seamless embedding of C routines. These routines run in the database, thus eliminating the unnecessary movement of data.

To demonstrate how to create and execute a native stored procedure using IMSL C, we describe an explicit case using an Oracle 12c database on the Windows operating system. To demonstrate an example of embedded predictive analytics, we show a complete implementation of the Naïve Bayes classifier algorithm, and create tables in a database containing Fisher iris data (a common data set containing samples for three species of the iris flower) to use as a test case. We then use SourcePro DB to create and execute stored procedures that invoke the embedded Naïve Bayes implementation. 

Create an Oracle Stored Procedure Using IMSL C

Generally, implementing a native stored procedure on an Oracle database includes two steps:

  1. Write the native stored procedure.
  2. Upload and register the stored procedure with the database.

Write the Native Stored Procedure

This example invokes the IMSL C Naïve Bayes Trainer on Fisher’s iris data set fetched from a database using the SourcePro DB API. Using this trained data, Naïve Bayes Classification predicts classification of iris data (sepal length, sepal width, petal length, and petal width) passed to the example. For now, we show just the entry point method to our function:

int iris(OCIExtProcContext* ctx, OCIArray* pIN, short pIN_i, double
sepalLength, double sepalWidth, double petalLength, double petalWidth)


The function iris returns the previously unknown data’s class as an integer. The inputs to the function are:

  • ctx – OCI external procedure calling context
  • pIN – An array of input data to be analyzed
  • pIN_i – A null indicator to signal if there is data in the array pIN
  • sepalLength, sepalWidth, petalLength, petalWidth – classifications of iris data

Create a shared library

Now, compile the iris C function (and helpers) described above into a shared library. Using the following compile and link commands, the C functions are built into a DLL to upload to the Oracle server.

Compile:

cl -IC:\PROGRA~2\VNI\imsl\cnl850\winms120x64\include -nologo -EHsc -MD -W3 -O2 -GR -
D_SCL_SECURE_NO_DEPRECATE -c IrisExample.c 


Link:

cl /LD /Feiris.dll user32.lib imslcmath_imsl_dll.lib imslcstat_imsl_dll.lib IrisExample.
obj /link /LIBPATH:C:\PROGRA~2\VNI\imsl\cnl850\winms120x64\lib -nologo /IMPLIB:iris.lib /
PDB:iris.pdb 


Upload and Register the Stored Procedure With the Database

After the DLL has been built, we’ll copy the new DLL and dependent IMSL C libraries to the Oracle server (iris.dll, imslcmath_imsl_dll.dll, imslcstat_imsl_dll.dll). Please refer to your Oracle documentation to determine any additional requirements for the location where DLLs should be hosted on the Oracle server.

Now we’ll register the new DLL with the database so it can be called.

CREATE OR REPLACE LIBRARY iris_lib AS 


Define aliases

This example depends on some declared type aliases to move the queryset data from a result set to the analysis function:

CREATE OR REPLACE TYPE imsl_vector IS VARRAY(10000) OF NUMBER; -- max size is 10k

CREATE OR REPLACE TYPE imsl_array IS VARRAY(10000) of imsl_vector;


Declare the function to invoke the C procedure

Now we’re ready to declare a stored procedure in the database to provide a SQL entry point for applications to use. Note the function irisfunc uses an Oracle VARRAY to communicate the input data to the external procedure rather than a cursor variable. This design is due to constraints in the Oracle system that prevent using PL/SQL cursor variables with external procedures.

CREATE OR REPLACE FUNCTION irisfunc(varr_in IN imsl_array, sepalLength binary_double,
sepalWidth binary_double, petalLength binary_double, petalWidth binary_double)
RETURN PLS_INTEGER AS
LANGUAGE C
LIBRARY iris_lib
NAME “iris”
WITH CONTEXT
PARAMETERS (CONTEXT, varr_in OCIColl, varr_in INDICATOR, sepalLength double, sepalWidth
double, petalLength double, petalWidth double, RETURN int);
/


Create a wrapper function to build the data set for analysis

The procedure irisfunc acts on an array of data from the irisdata table. To provide a clear division of responsibility, this example demonstrates a wrapping function that gathers the dataset from querying the irisdata table, and then invokes the external procedure for final analysis. In this function, a SELECT query is made, and the result set is processed to build up an array of data to pass to the external procedure irisfunc .

CREATE OR REPLACE FUNCTION RW_NAIVE_BAYES(sepal_len binary_double, sepal_width binary_
double, petal_len binary_double, petal_width binary_double) return int
is
arr_var_in imsl_array := imsl_array();
classification PLS_INTEGER := 6;
speciescount PLS_INTEGER := 1;
BEGIN
 arr_var_in.extend(5);
 arr_var_in(5) := imsl_vector();
 arr_var_in(5).extend(150);
 SELECT sepallength, sepalwidth, petallength, petalwidth
        BULK COLLECT INTO
                   arr_var_in(1), arr_var_in(2), arr_var_in(3), arr_var_in(4)
 FROM irisdata;
 BEGIN
   FOR species IN (select speciesofiris from irisdata)
 LOOP
   if species.speciesofiris = ‘setosa’
   then
        arr_var_in(5)(speciescount) := 0;
   elsif species.speciesofiris = ‘versicolor’
   then
        arr_var_in(5)(speciescount) := 1;
   elsif species.speciesofiris = ‘virginica’
   then
        arr_var_in(5)(speciescount) := 2;
   else
        DBMS_OUTPUT.put_line(speciescount);
   end if;
   speciescount := speciescount + 1;
 end loop;
 end;
 classification := irisfunc(arr_var_in, sepal_len, sepal_width, petal_len, petal_width);
 return classification;
END;
/


Our point with this example is to show an embedded analytics call using just basic SQL syntax. It is quite general and can be used with training data that has many or few parameters (columns), as long as they are a numeric type. The limit on the size of the training set (rows) is essentially the amount of memory available. Many other algorithms in the IMSL C library can be implemented in the same manner.

Create the SourcePro DB Code to Invoke Procedure

First, we need to connect to the Oracle database. To establish a database connection, we first request an RWDBDatabase instance from the RWDBManager and then instantiate an RWDBConnection object from the database instance.

   RWDBDatabase db = RWDBManager::database(“ORACLE_OCI”, “”, “”,
“”, “”);
    RWDBConnection conn = db.connection();
    if(conn.isValid()) {
        std::cout << “Connected!” << std::endl;
    }
    else {
        std::cout << conn.status().message() << std::endl;
    }


To invoke the procedure, an RWDBStoredProc object is created to the RW_NAIVE_BAYES procedure on the server. Values for each of the input parameters are declared and bound to the procedure. Finally, the procedure is invoked and the return value for the classification is retrieved.

 1.     double sepalLength = 6.2;
 2.     double sepalWidth = 2.2;
 3.     double petalLength = 4.5;
 4.     double petalWidth = 1.5;
 5.     RWDBStoredProc cnlProc = db.storedProc(“RW_NAIVE_BAYES”, conn);
 6.     cnlProc << RWDBBoundExpr(&sepalLength) << RWDBBoundExpr(&sepalWidth)
 7.             << RWDBBoundExpr(&petalLength) << RWDBBoundExpr(&petalWidth);
 8.     cnlProc.execute(conn);
 9.     int retVal = cnlProc.returnValue().asInt();
10.     std::cout << “Classification for “ << sepalLength << “, “
11.               << sepalWidth << “, “ << petalLength << “, “
12.               << petalWidth << “ is: “ << retVal << std::endl;


Lines one to four declare local variables to specify the characteristics of the iris to be classified. Line five queries the database to create a RWDBStoredProc object to interact with. Lines six to seven bind the local values as input parameters to the stored procedure. The local values are wrapped in RWDBBoundExpr decorators so that the procedure allows subsequent executions of the procedure to be more efficient. Line eight executes the procedure on the server. Line nine fetches the classification return value from the procedure.

Summary

In this article, we have shown explicit steps for embedding IMSL C in a database to implement a particular algorithm. The Naïve Bayes classifier is just one of many algorithms available in the IMSL C Library. The SourcePro DB intuitive API makes it easy to connect to RDBMS systems and query IMSL C-based algorithms. There is essentially no risk, and the only startup cost is writing simple wrapper code to the robust and proven IMSL C algorithms. Eliminating the data transfer between separate analysis machines offers increased security and huge throughput improvements and also removes the possibility of data corruption due to movements from and to the database.

Want to try the IMSL C Library on your application? Try it free for with an IMSL C trial.

Try IMSL C