Submit support requests and browse self-service resources.
This article describes how to implement embedded analytics within a database using SourcePro and the IMSL C Numerical Library, a native C library from Perforce.
The benefits of using embedded analytics include:
We describe in detail how to implement a server-side native stored procedure leveraging IMSL C using a particular relational database management system (RDBMS). We then execute the procedure in RDBMS using SourcePro DB.
While variety is one of the characteristics of big data, it is also a necessary characteristic of the solutions designed to respond to its challenges. That is, the overall strategy when working with big data should consist of various, specific tactics and tools that address specific challenges. For example, the Hadoop and its MapReduce framework is commonly considered an important technology to address big data challenges, but it is not a solution to all the challenges.
Although Hadoop’s MapReduce is not a one-size-fits-all solution to all big data problems, it has introduced a fundamental and important change when working with big data: to join the algorithms with the data. Previously, for advanced analysis and processing, data was moved from the database to a client machine where some computation was performed, and then both the data and the results were uploaded back to the database. The work described in this article brings the IMSL C algorithms to the database itself, orchestrating the processing of the data using SourcePro DB.
The IMSL C Numerical Library is a comprehensive set of pre-built mathematical and statistical analysis functions that developers can embed directly into their applications. Available for a wide range of computing platforms, the robust, scalable, portable, and high performing IMSL analytics allow developers to focus on their domain of expertise and reduce development time. IMSL C includes a comprehensive set of functions for data mining, modeling, prediction, and optimization. There are time series models such as ARIMA, GARCH, and vector auto-regression, plus decision trees, Apriori analysis, SVM, neural networks, linear and logistic regression, K-means clustering, Bayes classification, and much more. IMSL C also includes functions for analyzing streaming data and working with big or distributed data.
SourcePro is the most complete enterprise C++ development platform. It provides fundamental C++ components as well as robust, reliable cross-platform C++ tools in the areas of analysis, networking, and databases. SourcePro DB, a part of the SourcePro product suite, is a library of C++ classes that provide a high-level, object-oriented, cross-database abstraction over the native C APIs of the various database vendors. It provides a high-performance, consistent API encapsulating the database-specific differences in behavior and syntax.
Many databases support procedures authored in native languages, allowing the usage of external libraries in the RDBMS and the seamless embedding of C routines. These routines run in the database, thus eliminating the unnecessary movement of data.
To demonstrate how to create and execute a native stored procedure using IMSL C, we describe an explicit case using an Oracle 12c database on the Windows operating system. To demonstrate an example of embedded predictive analytics, we show a complete implementation of the Naïve Bayes classifier algorithm, and create tables in a database containing Fisher iris data (a common data set containing samples for three species of the iris flower) to use as a test case. We then use SourcePro DB to create and execute stored procedures that invoke the embedded Naïve Bayes implementation.
Generally, implementing a native stored procedure on an Oracle database includes two steps:
This example invokes the IMSL C Naïve Bayes Trainer on Fisher’s iris data set fetched from a database using the SourcePro DB API. Using this trained data, Naïve Bayes Classification predicts classification of iris data (sepal length, sepal width, petal length, and petal width) passed to the example. For now, we show just the entry point method to our function:
int iris(OCIExtProcContext* ctx, OCIArray* pIN, short pIN_i, double
sepalLength, double sepalWidth, double petalLength, double petalWidth)
The function iris returns the previously unknown data’s class as an integer. The inputs to the function are:
sepalLength, sepalWidth, petalLength, petalWidth
Now, compile the iris C function (and helpers) described above into a shared library. Using the following compile and link commands, the C functions are built into a DLL to upload to the Oracle server.
cl -IC:\PROGRA~2\VNI\imsl\cnl850\winms120x64\include -nologo -EHsc -MD -W3 -O2 -GR -
D_SCL_SECURE_NO_DEPRECATE -c IrisExample.c
cl /LD /Feiris.dll user32.lib imslcmath_imsl_dll.lib imslcstat_imsl_dll.lib IrisExample.
obj /link /LIBPATH:C:\PROGRA~2\VNI\imsl\cnl850\winms120x64\lib -nologo /IMPLIB:iris.lib /
After the DLL has been built, we’ll copy the new DLL and dependent IMSL C libraries to the Oracle server (iris.dll, imslcmath_imsl_dll.dll, imslcstat_imsl_dll.dll). Please refer to your Oracle documentation to determine any additional requirements for the location where DLLs should be hosted on the Oracle server.
Now we’ll register the new DLL with the database so it can be called.
CREATE OR REPLACE LIBRARY iris_lib AS
This example depends on some declared type aliases to move the queryset data from a result set to the analysis function:
CREATE OR REPLACE TYPE imsl_vector IS VARRAY(10000) OF NUMBER; -- max size is 10k
CREATE OR REPLACE TYPE imsl_array IS VARRAY(10000) of imsl_vector;
Now we’re ready to declare a stored procedure in the database to provide a SQL entry point for applications to use. Note the function irisfunc uses an Oracle VARRAY to communicate the input data to the external procedure rather than a cursor variable. This design is due to constraints in the Oracle system that prevent using PL/SQL cursor variables with external procedures.
CREATE OR REPLACE FUNCTION irisfunc(varr_in IN imsl_array, sepalLength binary_double,
sepalWidth binary_double, petalLength binary_double, petalWidth binary_double)
RETURN PLS_INTEGER AS
PARAMETERS (CONTEXT, varr_in OCIColl, varr_in INDICATOR, sepalLength double, sepalWidth
double, petalLength double, petalWidth double, RETURN int);
The procedure irisfunc acts on an array of data from the irisdata table. To provide a clear division of responsibility, this example demonstrates a wrapping function that gathers the dataset from querying the irisdata table, and then invokes the external procedure for final analysis. In this function, a SELECT query is made, and the result set is processed to build up an array of data to pass to the external procedure irisfunc .
CREATE OR REPLACE FUNCTION RW_NAIVE_BAYES(sepal_len binary_double, sepal_width binary_
double, petal_len binary_double, petal_width binary_double) return int
arr_var_in imsl_array := imsl_array();
classification PLS_INTEGER := 6;
speciescount PLS_INTEGER := 1;
arr_var_in(5) := imsl_vector();
SELECT sepallength, sepalwidth, petallength, petalwidth
BULK COLLECT INTO
arr_var_in(1), arr_var_in(2), arr_var_in(3), arr_var_in(4)
FOR species IN (select speciesofiris from irisdata)
if species.speciesofiris = ‘setosa’
arr_var_in(5)(speciescount) := 0;
elsif species.speciesofiris = ‘versicolor’
arr_var_in(5)(speciescount) := 1;
elsif species.speciesofiris = ‘virginica’
arr_var_in(5)(speciescount) := 2;
speciescount := speciescount + 1;
classification := irisfunc(arr_var_in, sepal_len, sepal_width, petal_len, petal_width);
Our point with this example is to show an embedded analytics call using just basic SQL syntax. It is quite general and can be used with training data that has many or few parameters (columns), as long as they are a numeric type. The limit on the size of the training set (rows) is essentially the amount of memory available. Many other algorithms in the IMSL C library can be implemented in the same manner.
First, we need to connect to the Oracle database. To establish a database connection, we first request an RWDBDatabase instance from the RWDBManager and then instantiate an RWDBConnection object from the database instance.
RWDBDatabase db = RWDBManager::database(“ORACLE_OCI”, “”, “”,
RWDBConnection conn = db.connection();
std::cout << “Connected!” << std::endl;
std::cout << conn.status().message() << std::endl;
To invoke the procedure, an RWDBStoredProc object is created to the RW_NAIVE_BAYES procedure on the server. Values for each of the input parameters are declared and bound to the procedure. Finally, the procedure is invoked and the return value for the classification is retrieved.
1. double sepalLength = 6.2;
2. double sepalWidth = 2.2;
3. double petalLength = 4.5;
4. double petalWidth = 1.5;
5. RWDBStoredProc cnlProc = db.storedProc(“RW_NAIVE_BAYES”, conn);
6. cnlProc << RWDBBoundExpr(&sepalLength) << RWDBBoundExpr(&sepalWidth)
7. << RWDBBoundExpr(&petalLength) << RWDBBoundExpr(&petalWidth);
9. int retVal = cnlProc.returnValue().asInt();
10. std::cout << “Classification for “ << sepalLength << “, “
11. << sepalWidth << “, “ << petalLength << “, “
12. << petalWidth << “ is: “ << retVal << std::endl;
Lines one to four declare local variables to specify the characteristics of the iris to be classified. Line five queries the database to create a RWDBStoredProc object to interact with. Lines six to seven bind the local values as input parameters to the stored procedure. The local values are wrapped in RWDBBoundExpr decorators so that the procedure allows subsequent executions of the procedure to be more efficient. Line eight executes the procedure on the server. Line nine fetches the classification return value from the procedure.
In this article, we have shown explicit steps for embedding IMSL C in a database to implement a particular algorithm. The Naïve Bayes classifier is just one of many algorithms available in the IMSL C Library. The SourcePro DB intuitive API makes it easy to connect to RDBMS systems and query IMSL C-based algorithms. There is essentially no risk, and the only startup cost is writing simple wrapper code to the robust and proven IMSL C algorithms. Eliminating the data transfer between separate analysis machines offers increased security and huge throughput improvements and also removes the possibility of data corruption due to movements from and to the database.
Want to try the IMSL C Library on your application? Try it free for with an IMSL C trial.
Try IMSL C