Submit support requests and browse self-service resources.
This article describes how to implement embedded analytics within a database using JMSL Numerical Library, a native Java library.
The benefits of using embedded analytics are compelling and include:
Using embedded JMSL offers:
We describe in detail how to implement embedded JMSL using a particular relational database management system (RDBMS). However, this method works with any RDBMS that implements a Java Virtual Machine (JVM) within the database.
While variety is one of the characteristics of Big Data, it’s also a characteristic of how one responds to the challenges to Big Data. This means the overall strategy when working with Big Data needs to be composed of specific tactics and tools for specific challenges. Use of Hadoop and the MapReduce framework is one such tactic. Hadoop is ideal for storing and processing very large data sets stored on computer clusters built from inexpensive hardware. But it’s not a solution to all the challenges. Hadoop’s MapReduce embodies one of the fundamental changes when working with Big Data. Previously, for advanced analysis and processing, data moved from the database to a client machine and results uploaded back to the database. Now the paradigm is to join the algorithms with the data. The work described in this article brings the JMSL algorithms to the data.
Structured Query Language (SQL) is the language of relational databases. For using code written in another language, Java is the natural choice. The JDBC database connectivity technology has been around since Java 1.1. The growth of Java paralleled the expansion of relational databases and the two technologies work well together. In this article, the focus is on combining SQL calls with JMSL’s sophisticated Java algorithms.
Relational databases have long had the ability to store and call routines in the native SQL. These routines are usually referred to as stored procedures. In addition, many RDBMSs now implement a JVM within the database. Java, in turn, implements data types that are one-to-one mappings of fundamental SQL data types. This allows seamless embedding of Java routines. These routines run in the database and eliminate the unnecessary movement of data. Finally, besides single Java classes, one can load Java archives (jar files) directly into the database, allowing any user with sufficient privileges to write Java stored procedures that use classes in the archive.
To demonstrate how to embed JMSL, we show an explicit case using an Oracle 12c database. Then, to demonstrate an example of embedded predictive analytics, we show a complete implementation of the Naïve Bayes classifier algorithm. Lastly, we create tables in a database containing Fisher iris data (a common data set containing samples for three species of the iris flower) to use as a test case.
The first step is to load the JMSL library to the database. Oracle has a command-line utility called loadjava for adding Java source files, classes, and libraries (jar files) to the database.
> loadjava −u [email protected] −resolve jmsl.jar
Here, the −u flag specifies the user and database, and the −resolve flag checks for any dependencies of the loaded object.
In general, when implementing Java stored procedures on an Oracle database there are three steps:
The Naïve Bayes classifier example has two primary input parameters. The first is the training set that contains data categorized into known classes. The second is a data set of unknown type that needs classification. For now, we show just the entry point method to our class:
public static int Classify ( ResultSet rsT, ResultSet rsC,
ResultSet rsRows, ResultSet rsClasses )
Classify returns the previously unknown data’s class as an integer. The input parameters are all of type java.SQL. ResultSet. This data type maps directly to the SQL CURSOR variable type, SYS_REFCURSOR. The particular inputs are, in order: training data, unknown data, size of the training set, and the number of possible types. This method is part of the RWNaiveBayes.java class that has a complete listing in the appendix.
We use the same command-line utility used for loading the JMSL library:
> loadjava −verbose −u [email protected] −resolve RWNaiveBayes.class
Here, we’re loading the compiled class rather than the source as our experience shows that loading source files and compiling them in the database can add complications. In addition to checking for dependencies with the −resolve flag, we use verbose output to get full information on the upload.
For each Java method that’s called from SQL, you must write a call specification to expose the top-level entry point of the method to Oracle.
CREATE or REPLACE
curs_train IN SYS_REFCURSOR,
curs_class IN SYS_REFCURSOR,
curs_numrows IN SYS_REFCURSOR,
curs_numclass IN SYS_REFCURSOR)
AS LANGUAGE JAVA
java.sql.ResultSet ) return int’;
Names for the SQL alias variables (in this case, the cursor variables) are required but not used.
The JMSL routine is, in strict SQL terms, a function and not a procedure. A key difference is that procedures have no return type and cannot be invoked within a query. For demonstration purposes, we’ll show a single query to classify unknown data and return its type. In practice, one would most likely use stored views rather than the subqueries as in this example.
1. SELECT CT FROM ( SELECT RWNaiveBayes(
2. CURSOR(SELECT classification, sepallength, sepalwidth, petallength, petalwidth
FROM iris_trn ),
3. CURSOR( SELECT * FROM faux_iris OFFSET 8 ROWS FETCH FIRST 1 ROWS ONLY),
4. CURSOR( SELECT COUNT(*) FROM iris_trn ),
5. CURSOR( SELECT MAX(classification) FROM iris_trn ) )
6. AS CT FROM dual);
The subquery in line 2 selects values from the Fisher iris data that’s been stored in table iris_trn. The four selected parameters from the training set are the defining characteristics of the iris data. Line 3 selects the eighth row from the table of iris test data stored in faux_iris. Lines 4 and 5 select number of rows and number of classes from the table iris_trn. The class value of the new data is returned and saved in the alias CT, line 6.
There are three levels of SELECT being used. The first gets the classification type from the alias CT. The second is the function call, SELECT RWNaiveBayes. The third level selects information from the database to pass to the Bayes classifier. Lines 2-5 are the four inputs to the Bayes classifier. Formally, when passing queries to any stored procedure, they must be in the form of CURSOR expressions. This is the reason for the syntax: CURSOR( SELECT …).
CURSOR( SELECT …).
Our point with this example is to show an embedded analytics call solely using basic SQL syntax. It’s also quite general and can be used with training data that has very many or very few parameters (columns), as long as they are a numeric type. The limit on the size of the training set (rows) is essentially only the amount of memory available. Many other algorithms in the JMSL library can be implemented in the same manner.
We’ve shown explicit steps to embed JMSL in a database to implement one particular algorithm. The Naïve Bayes classifier is just one of many algorithms available in the JMSL library. Using embedded JMSL doesn’t require any kind of platform change, it will work with many existing RDBMS systems. There’s essentially no risk and the only startup cost is writing simple wrapper code to the robust and proven JMSL algorithms.
For this minimal cost, the benefits are quite significant. Eliminating the data transfer to separate analysis machines offers increased security and huge throughput improvements. This also removes the possibility of corruption of data due to movements from and to the database. Finally, there is great flexibility in how users can access the analytics. Any database user granted permission could run the analytics using a variety of connectivity options.
Try JMSL for Free