October 19, 2022

Data Ingestion for Loan Scoring using IMSL for C

Algorithms & Functions

When considering approving a loan, banks and other lending institutions want to know the chance that a borrower will default on the loan. Assessing this risk is known as loan scoring and variously by other terms, such as credit scoring and loan grading. The quality score of the loan application helps the bank decide to approve or not approve the loan application.

Loan scoring plays a central role in a bank’s efforts to manage risk. With respect to writing loans, the greatest risk is default risk. A default occurs when a borrower fails to make payments to the point the company has to write-off the loan amount as a loss. Banks also want to assess the risk of early payoff (or refinance), since interest revenue for the remaining term of the loan is lost.

By analyzing historical loans, both successful and unsuccessful, banks build custom loan scoring models.

Long before any loan scoring model is built, the data must be transformed into a format that machine learning algorithms will accept. In this blog, we discuss a specific data set for loan scoring and illustrate initial steps in getting the data machine learning ready.

Description of the Data
Reading the Data With imsls_data_read
Summary

Description of the Data

The data set is sourced from The Lending Club, a peer-to-peer lending site (Note that The Lending Club has since changed its business model). The data has 1 million rows of individual loans with 157 attributes and takes up a little over half a gigabyte on disk.

We received the data as a comma-delimited (.csv) spreadsheet. The data set is a mixture of strings, dates, numerical data, and contains numerous blank and missing values. Even though it is already organized in a flat file, as opposed to residing in different database tables, it presents a few common challenges to most data ingestion tools.

Here is a snapshot of the data file opened in Notepad++, showing the first several columns for the initial rows in the file.

Here is another view of those same rows showing many missing values in the last several columns, indicated by consecutive comma delimiters:

Reading the Data With imsls_data_read

We use the new CNL function imsls_data_read (available soon in Q1 2023) to properly parse the different data types, transform date fields and string fields to numeric values, skip unimportant fields, and replace missing values.

First define the data file name and the dictionary file name.

The dictionary file tells the data reader how to interpret each column in the data file. Here are a few lines of the data dictionary, “LoanDictionary.csv”.

In the dictionary file, each column is listed by name and by data type. A data type of “skip” tells the data reader to skip that column. Note that date columns must be followed by the date format, e.g., MM/DD/YYYY. If the data were purely numeric, it is not necessary to provide a data dictionary.

Handling Missing Values

The following few lines define the missing values to search for in the data and the corresponding replacement strings. In this example, the effect is to replace occurrences of “null” with “MISSING”, “#N/A” with “MISSING”, and spaces (“” and “ “) with “BLANK”. For numeric fields, missing values are replaced with the value -9999.00. Many other choices can be made for the replacement values. The important thing to know is how missing values might appear in the data set.

The next few lines define values to ignore and to filter out.

The ignore string “Total” is needed to find rows containing subtotals “Total loan amount…” that show up intermittently throughout the file. The entire line is ignored or skipped. The filters are single characters that the data reader will detect and ignore within a field. We discover the strings to ignore and filter out through inspection of the data file or a little trial and error.

Now we can call the data reader.

IMSLS_DATA_HEADER is a flag indicating that the first line of the data file contains the column names. The return from imsls_data_read is a double data matrix structure containing the number of rows and columns in addition to the data. An optional output structure, column_info, provides information on each of the columns: name, data type, and number of classes and counts for string variables.

Next we print out a few rows of the result:

Figure 1 Loan Scoring Data Matrix: the first 10 rows of the first 4 columns

Note that all columns are returned as doubles.

Date and date/time variables are converted to Julian dates using a base date of 9/14/1752. This means that if a returned value is 96137 and we want to convert this back to MM/DD/YYYY format for whatever reason, then we must add the Julian date corresponding to 9/14/1752 (=2361221). That is, 96137 + 2361221 = 2457358 converts to 12/1/2015 using most other conversion tools. Missing dates are returned as 0 from imsls_data_read and therefore correspond to a default date of 9/14/1752, which still indicates the date is #N/A or missing.

Data Summaries

We can (to our heart’s content) run statistics and tabular summaries for different variables.

For example, the loan status field has 7 levels with the following distribution in the data set:

Figure 2 Loan status categories

The good loans are “Fully Paid”, “Current”, and “In Grace Period”.

The “Late (16-30) days” and “Late (31-120) days”, are trending bad, while “Default” and “Charged Off” loans are the bad loans. Loans that are in “Default” are loans for which borrowers have failed to make payments for greater than 120 days, in this case, while “Charged Off” loans are those which the bank has written off as losses. These loans may be sold off to debt consolidators or collection agencies.

As part of the loan application, the bank will want to know about the borrower’s employment and income, home ownership, credit history, and more. All of these would point to a borrower’s ability and reliability to make payments on the loan.

For example, “Home Ownership”:

Figure 3 Home ownership categories

In our example, within the result returned by imsls_data_read the values 1- 5 will replace “Home Ownership” strings according to the legend shown above. The assignment of the integers to categories occurs in the order in which the different categories occur in the data.

The data reader imsls_data_read does quite a bit of work parsing the raw data, but there is almost always more work and summarization needed to get the data ready for a machine learning algorithm. We pick up with these steps in our next blog (Loan Scoring with IMSL for C).

Summary

Regardless of the specific business or data science problem, data ingestion is a crucial and necessary step that involves many repetitive tasks such as interpreting dates, handling missing and erroneous values, encoding for string variables, and more.

We demonstrated how CNL’s new data reader, imsls_data_read streamlines the data ingestion process. imsls_data_read is available in our next release of CNL (IMSL for C).

Want to try IMSL for C on in your application? Request a free trial.

Free Trial

Getting Proactive With Predictive Analysis