When deciding whether to approve a loan, banks and other lending institutions want to know the chance that the borrower will default. Assessing this risk is known as loan scoring, and also by other terms such as credit scoring and loan grading. The score assigned to a loan application helps the bank decide whether to approve it.
Loan scoring plays a central role in a bank’s efforts to manage risk. With respect to writing loans, the greatest risk is default risk. A default occurs when a borrower fails to make payments to the point that the lender has to charge off the remaining loan amount as a loss. Banks also want to assess the risk of early payoff (or refinance), since the interest revenue for the remaining term of the loan is lost.
By analyzing historical loans, both successful and unsuccessful, banks build custom loan scoring models.
In the blog, Data Ingestion with CNL, we show how to ingest a data set containing historical loans using the function imsls_data_read in IMSL for C (CNL). We continue here using the loan data to build and evaluate a loan scoring model.
The model needs to predict the “loan_status” variable in our data set for “new” applications. The loan status field has 7 levels with the following distribution in the data set:
In Grace Period
Late (31-120 days)
Late (16-30 days)
While it is possible to try to model and predict each of the seven categories (a multinomial logistic regression), usually it suffices for loan-scoring to categorize loans into two categories, “good” and “bad” (or low risk, high risk). For this reason, we will classify loans into two categories as follows:
In Grace Period
For a binomial response variable (good loan, bad loan) logistic regression is a natural choice of model, although there are others we could consider. For a new modeling endeavor, it’s generally best to try out several appropriate modeling techniques and compare them with one another. While in this example we only use logistic regression (imsls_d_logistic_regression), any other appropriate model would undergo a similar process to the one we outline below.
Once a modeling technique is chosen, the next step is to build the model and fit it to the data. Model building involves creating and selecting predictor variables in order to achieve a successful model. A successful model captures relationships inherent in the data and predicts new data with a high level of accuracy.
With a combination of the following steps along with a good set of data, it’s usually possible to develop a successful model.
We performed each of these steps to come up with a loan scoring model. We highlight some of this work in the next few sections.
Though tempting, we cannot just input all the data and expect the machine learning algorithm to churn through all the variables and magically come up with the best configured model. That exercise, while on rare occasions producing a seemingly good model, is rife with traps and gotchas.
First, we must think about the business problem and the data we have in hand to address it. What is the data measuring? What are the special cases? What do missing values really mean? While this step can seem ad hoc, it is critical and applies to any new problem. We illustrate with a few examples from our Lending Club data.
Of the 157 attributes available in the data file, some can be eliminated immediately as irrelevant to a predictive model. Examples include “id” = customer id, “zip_code”, “URL”.
Some variables should be eliminated because they are only known after a loan has been in effect. For example, “total_pymnt” (payments received to date for the total amount funded) is not known at the time of a loan application. The variables “last_fico_range_low” and “last_fico_range_high” had to be eliminated for the same reason.
There are two types of loan applications, “joint” and “individual”. If the application type is “individual”, the income variable is in “annual_income”. If the application type is “joint”, the income variable is in “annual_income_joint”. To use annual income in a model, we must first combine the income variable (for individual loans) with the joint income variable into one income field. This is easily done with a few lines of C code, keying off of “application_type”. Similarly, “dti” = debt to income ratio, has the field “dti_joint” for joint applications. To use debt-to-income in a model, the two fields must be combined just like “annual_income” and “annual_income_joint”.
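The merge described above might look like the following sketch. It assumes each field is already in its own array and that “application_type” has been integer encoded; the codes APP_INDIVIDUAL and APP_JOINT and the function name are hypothetical, and the actual codes assigned by imsls_data_read should be checked against the data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical integer codes for application_type. The real codes depend
 * on how the string levels were encoded when the data was read. */
enum { APP_INDIVIDUAL = 0, APP_JOINT = 1 };

/*
 * Merge an individual-application field with its joint counterpart.
 * For joint applications the joint value is used; otherwise the
 * individual value is used. All arrays have length n.
 */
void combine_by_application_type(size_t n,
                                 const int *application_type,
                                 const double *individual,
                                 const double *joint,
                                 double *combined)
{
    assert(application_type != NULL && individual != NULL);
    assert(joint != NULL && combined != NULL);

    for (size_t i = 0; i < n; i++) {
        combined[i] = (application_type[i] == APP_JOINT)
                          ? joint[i]
                          : individual[i];
    }
}
```

The same function can be called once with the “annual_income” and “annual_income_joint” columns and once with “dti” and “dti_joint”, which keeps the two merges consistent.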
As mentioned in Data Ingestion with CNL, string data is encoded by imsls_data_read into integers. Integer encoding assigns a unique integer value to each level of the variable. Integer encoded data may need further transformation, depending on the proper interpretation of the field. For example, employment length has the following encoding:
Figure 1 ROC Curve with Area Under Curve (AUC) = 72%
We repeated the process of randomly selecting a training set and a test set, fitting the model, and calculating the AUC. Across these repetitions, the AUC for the model with the predictor variables in Table 1 ranged from 0.69 to 0.73, with an average of 0.71. For assessing the AUC
(Hosmer & Lemeshow, Applied Logistic Regression)
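For readers who want to see what the AUC measures, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. This is an illustrative stand-in, not the routine used for the results above; the function name is hypothetical, and treating label 1 as the positive class (for example, a "bad" loan) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * AUC from the pairwise definition. score[i] is the model's predicted
 * score and label[i] is 1 for the positive class, 0 for the negative.
 * Returns -1.0 when either class is absent, since AUC is undefined then.
 * Runs in O(n^2) time, which is fine for a small test set.
 */
double auc_pairwise(size_t n, const double *score, const int *label)
{
    assert(score != NULL && label != NULL);

    double wins = 0.0;
    size_t n_pos = 0, n_neg = 0;

    for (size_t i = 0; i < n; i++) {
        if (label[i] == 0) {
            n_neg++;
            continue;
        }
        if (label[i] != 1)
            continue;                   /* ignore unexpected labels */
        n_pos++;
        for (size_t j = 0; j < n; j++) {
            if (label[j] != 0)
                continue;               /* compare only against negatives */
            if (score[i] > score[j])
                wins += 1.0;            /* positive ranked above negative */
            else if (score[i] == score[j])
                wins += 0.5;            /* tie counts as half a win */
        }
    }

    if (n_pos == 0 || n_neg == 0)
        return -1.0;
    return wins / ((double)n_pos * (double)n_neg);
}
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.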
To improve on “acceptable discrimination,” we can continue variable selection and analysis within our dataset, try different classification methods (e.g., decision trees), and look for additional data to augment the Lending Club data, if possible. Banks that build their own loan scoring models have access to proprietary data and generally work with models that achieve AUCs closer to 0.90.
Together with Data Ingestion with CNL, this blog outlines the steps from data procurement to model building. The next step in the real world would be model deployment. This is the stage where loan officers apply the model to new loan applications and use the loan score to help decide whether a loan should be approved, and possibly to help set interest rates and other terms.
To produce the numerical results in this blog we used exclusively the numerical library CNL (IMSL for C). Mainly we used the functions imsls_data_read and imsls_d_logistic_regression but also leveraged several other functions in CNL to get summary statistics, correlations, sorting, random selection, and more. All functionality is available in our next release of CNL (IMSL for C).
Want to try IMSL for C in your application? Request a free trial.
Software Architect, Perforce
Roxy Cramer is a PhD statistician for the IMSL Numerical Libraries at Perforce. Roxy specializes in applying and developing software for data science applications. Her current focus is on statistical and machine learning algorithms in C and Java.