Image Blog Embedding Analytics Using JMSL

September 3, 2015

Embedding Analytics into a Database Using JMSL

Embedded Analytics

This article describes how to implement embedded analytics within a database using JMSL Numerical Library, a native Java library.

Benefits of Using Embedded Analytics

The benefits of using embedded analytics are compelling and include:

Real time analysis – reporting without data synchronization delay.
Faster results – no need to move data across the network to do the analysis.
Accuracy – data and formula in one place avoids potential user errors.
Accessibility – invoke data and analytics from any programming language or application that can connect to the database.
Better quality of data – preprocessing and cleaning of data before it’s stored.
Higher security – data used as input to the analytics never leaves the database and, if necessary, user access is limited to running the stored routines without access to the underlying data.

Using embedded JMSL offers:

Trusted technology – JMSL is a known and proven product, built on over 40 years of experience.
Minimal risk – JMSL works with many existing databases as is, there is no change of platform required.

We describe in detail how to implement embedded JMSL using a particular relational database management system (RDBMS). However, this method works with any RDBMS that implements a Java Virtual Machine (JVM) within the database.

Tactics and Tools for Big Data Analysis

While variety is one of the characteristics of Big Data, it’s also a characteristic of how one responds to the challenges to Big Data. This means the overall strategy when working with Big Data needs to be composed of specific tactics and tools for specific challenges. Use of Hadoop and the MapReduce framework is one such tactic. Hadoop is ideal for storing and processing very large data sets stored on computer clusters built from inexpensive hardware. But it’s not a solution to all the challenges. Hadoop’s MapReduce embodies one of the fundamental changes when working with Big Data. Previously, for advanced analysis and processing, data moved from the database to a client machine and results uploaded back to the database. Now the paradigm is to join the algorithms with the data. The work described in this article brings the JMSL algorithms to the data.

Structured Query Language (SQL) is the language of relational databases. For using code written in another language, Java is the natural choice. The JDBC database connectivity technology has been around since Java 1.1. The growth of Java paralleled the expansion of relational databases and the two technologies work well together. In this article, the focus is on combining SQL calls with JMSL’s sophisticated Java algorithms.

Embedding JMSL

Relational databases have long had the ability to store and call routines in the native SQL. These routines are usually referred to as stored procedures. In addition, many RDBMSs now implement a JVM within the database. Java, in turn, implements data types that are one-to-one mappings of fundamental SQL data types. This allows seamless embedding of Java routines. These routines run in the database and eliminate the unnecessary movement of data. Finally, besides single Java classes, one can load Java archives (jar files) directly into the database, allowing any user with sufficient privileges to write Java stored procedures that use classes in the archive.

To demonstrate how to embed JMSL, we show an explicit case using an Oracle 12c database. Then, to demonstrate an example of embedded predictive analytics, we show a complete implementation of the Naïve Bayes classifier algorithm. Lastly, we create tables in a database containing Fisher iris data (a common data set containing samples for three species of the iris flower) to use as a test case.

Loading JMSL

The first step is to load the JMSL library to the database. Oracle has a command-line utility called loadjava for adding Java source files, classes, and libraries (jar files) to the database.

> loadjava −u sys@ORCL −resolve jmsl.jar

Here, the −u flag specifies the user and database, and the −resolve flag checks for any dependencies of the loaded object.

Implementation of a JMSL stored procedure

In general, when implementing Java stored procedures on an Oracle database there are three steps:

Write the Java.
Load the class to the database with loadjava.
Publish the Java class.

Write the Java class

The Naïve Bayes classifier example has two primary input parameters. The first is the training set that contains data categorized into known classes. The second is a data set of unknown type that needs classification. For now, we show just the entry point method to our class:

public static int Classify ( ResultSet rsT, ResultSet rsC,
                             ResultSet rsRows, ResultSet rsClasses )

Classify returns the previously unknown data’s class as an integer. The input parameters are all of type java.SQL. ResultSet. This data type maps directly to the SQL CURSOR variable type, SYS_REFCURSOR. The particular inputs are, in order: training data, unknown data, size of the training set, and the number of possible types. This method is part of the RWNaiveBayes.java class that has a complete listing in the appendix.

Load the class

We use the same command-line utility used for loading the JMSL library:

> loadjava −verbose −u sys@ORCL −resolve RWNaiveBayes.class

Here, we’re loading the compiled class rather than the source as our experience shows that loading source files and compiling them in the database can add complications. In addition to checking for dependencies with the −resolve flag, we use verbose output to get full information on the upload.

Publish the class methods

For each Java method that’s called from SQL, you must write a call specification to expose the top-level entry point of the method to Oracle.

CREATE or REPLACE
FUNCTION RWNaiveBayes(
    curs_train IN SYS_REFCURSOR,
    curs_class IN SYS_REFCURSOR,
    curs_numrows IN SYS_REFCURSOR,
    curs_numclass IN SYS_REFCURSOR)
RETURN NUMBER
AS LANGUAGE JAVA
   NAME ‘RWNaiveBayes.Classify(
          java.sql.ResultSet,
          java.sql.ResultSet,
          java.sql.ResultSet,
          java.sql.ResultSet ) return int’;

Names for the SQL alias variables (in this case, the cursor variables) are required but not used.

Calling the Embedded JMSL Routine

The JMSL routine is, in strict SQL terms, a function and not a procedure. A key difference is that procedures have no return type and cannot be invoked within a query. For demonstration purposes, we’ll show a single query to classify unknown data and return its type. In practice, one would most likely use stored views rather than the subqueries as in this example.

1. SELECT CT FROM ( SELECT RWNaiveBayes(
2.     CURSOR(SELECT classification, sepallength, sepalwidth, petallength, petalwidth
       FROM iris_trn ),
3.     CURSOR( SELECT * FROM faux_iris OFFSET 8 ROWS FETCH FIRST 1 ROWS ONLY),
4.     CURSOR( SELECT COUNT(*) FROM iris_trn ),
5.     CURSOR( SELECT MAX(classification) FROM iris_trn ) )
6. AS CT FROM dual);

The subquery in line 2 selects values from the Fisher iris data that’s been stored in table iris_trn. The four selected parameters from the training set are the defining characteristics of the iris data. Line 3 selects the eighth row from the table of iris test data stored in faux_iris. Lines 4 and 5 select number of rows and number of classes from the table iris_trn. The class value of the new data is returned and saved in the alias CT, line 6.

There are three levels of SELECT being used. The first gets the classification type from the alias CT. The second is the function call, SELECT RWNaiveBayes. The third level selects information from the database to pass to the Bayes classifier. Lines 2-5 are the four inputs to the Bayes classifier. Formally, when passing queries to any stored procedure, they must be in the form of CURSOR expressions. This is the reason for the syntax: CURSOR( SELECT …).

Our point with this example is to show an embedded analytics call solely using basic SQL syntax. It’s also quite general and can be used with training data that has very many or very few parameters (columns), as long as they are a numeric type. The limit on the size of the training set (rows) is essentially only the amount of memory available. Many other algorithms in the JMSL library can be implemented in the same manner.

Use Robust and Proven Algorithms

We’ve shown explicit steps to embed JMSL in a database to implement one particular algorithm. The Naïve Bayes classifier is just one of many algorithms available in the JMSL library. Using embedded JMSL doesn’t require any kind of platform change, it will work with many existing RDBMS systems. There’s essentially no risk and the only startup cost is writing simple wrapper code to the robust and proven JMSL algorithms.

For this minimal cost, the benefits are quite significant. Eliminating the data transfer to separate analysis machines offers increased security and huge throughput improvements. This also removes the possibility of corruption of data due to movements from and to the database. Finally, there is great flexibility in how users can access the analytics. Any database user granted permission could run the analytics using a variety of connectivity options.

Try JMSL for Free