When considering a loan application, banks and other lending institutions want to know the chance that the borrower will default. Assessing this risk is known as loan scoring, and variously by other terms such as credit scoring and loan grading. The resulting quality score helps the bank decide whether to approve the application.
Loan scoring plays a central role in a bank’s efforts to manage risk. With respect to writing loans, the greatest risk is default risk. A default occurs when a borrower fails to make payments to the point that the lender has to charge off the remaining loan amount as a loss. Banks also want to assess the risk of early payoff (or refinance), since interest revenue for the remaining term of the loan is lost.
By analyzing historical loans, both successful and unsuccessful, banks build custom loan scoring models.
In the blog, Data Ingestion with CNL, we show how to ingest a data set containing historical loans using the function imsls_data_read in IMSL for C (CNL). We continue here using the loan data to build and evaluate a loan scoring model.
The model needs to predict the “loan_status” variable in our data set for “new” applications. The loan status field has 7 levels, among them “In Grace Period”, “Late (31-120 days)”, and “Late (16-30 days)”.

[Table: distribution of the seven loan_status levels in the data set]
While it is possible to try to model and predict each of the seven categories (a multinomial logistic regression), for loan scoring it usually suffices to categorize loans into two categories, “good” and “bad” (or low risk, high risk). For this reason, we will classify loans into two categories as follows:

[Table: mapping of the seven loan_status levels, e.g. “In Grace Period”, to the good and bad categories]
For a binomial response variable (good loan, bad loan), logistic regression is a natural choice of model, although there are others we could consider. For a new modeling endeavor, it’s generally best to try out several appropriate modeling techniques and compare them with one another. While in this example we only use logistic regression (imsls_d_logistic_regression), any other appropriate model would undergo a similar process to the one we outline below.
Once a modeling technique is chosen, the next step is to build the model and start fitting it with the data. Model building involves creating and selecting predictor variables in order to achieve a successful model. A successful model captures relationships inherent in the data and predicts new data with a high level of precision.
With a combination of steps such as understanding and cleaning the data, transforming and selecting predictor variables, and evaluating predictive performance, along with a good set of data, it’s usually possible to develop a successful model.
We performed each of these steps to come up with a loan scoring model. We highlight some of this work in the next few sections.
Though tempting, we cannot just input all the data and expect the machine learning algorithm to churn through all the variables and magically come up with the best configured model. That exercise, while on rare occasions producing a seemingly good model, is rife with traps and gotchas.
First, we must think about the business problem and the data we have in hand to address it. What is the data measuring? What are the special cases? What do missing values really mean? While it seems ad-hoc, it is a critical step and applies to any new problem. We illustrate with a few examples from our Lending Club Data.
Of the 157 attributes available in the data file, some can be eliminated immediately as irrelevant to a predictive model. Examples include “id” = customer id, “zip_code”, “URL”.
Some variables should be eliminated because they are only known after a loan has been in effect. For example, “total_pymnt” = Payments received to date for total amount funded is not known at the time of a loan application. The variables “last_fico_range_low” and “last_fico_range_high” had to be eliminated for the same reason.
There are two types of loan applications, “joint” and “individual”. If the application type is “individual”, the income variable is in “annual_income”. If the application type is “joint”, the income variable is in “annual_income_joint”. To use annual income in a model, we must first combine the income variable (for individual loans) with the joint income variable into one income field. This is easily done with a few lines of C code, keying off of “application_type”. Similarly, “dti” = debt to income ratio, has the field “dti_joint” for joint applications. To use debt-to-income in a model, the two fields must be combined just like “annual_income” and “annual_income_joint”.
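The merge can be sketched in a few lines of plain C. The record layout and the “Joint App” label below are assumptions for illustration only; the actual data-reader output is a numeric matrix, not a struct.

```c
#include <string.h>

/* Hypothetical record layout for illustration. */
typedef struct {
    char   application_type[16]; /* e.g. "Individual" or "Joint App" (assumed labels) */
    double annual_inc;           /* income reported on individual applications */
    double annual_inc_joint;     /* income reported on joint applications */
    double dti;                  /* debt-to-income ratio, individual */
    double dti_joint;            /* debt-to-income ratio, joint */
} LoanRecord;

/* Fold the joint fields into the individual fields so that downstream
   code sees a single income column and a single dti column. */
void merge_joint_fields(LoanRecord *rec)
{
    if (strcmp(rec->application_type, "Joint App") == 0) {
        rec->annual_inc = rec->annual_inc_joint;
        rec->dti        = rec->dti_joint;
    }
}
```

After this pass over the data, models can refer to a single income and a single debt-to-income field regardless of application type.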
As mentioned in Data Ingestion with CNL, string data is encoded by imsls_data_read into integers. Integer encoding assigns a unique integer value to each level of the variable. Integer-encoded data may need further transformation, depending on the proper interpretation of the field. For example, employment length has the following encoding:

[Table: integer codes assigned to the emp_length levels, “< 1 year” through “10+ years”, plus BLANK for missing values]
Because the data reader assigns the encoded values in the order in which they occur in the data, they need to be reordered (e.g., so that “< 1 year” has the smallest value and “10+ years” the largest). The BLANK values present a bit of a challenge: there is no way to know the true values for applicants who did not fill in their length of employment. Our options are to exclude the 69,889 BLANK applications from any model that includes emp_length as a predictor, or to choose a replacement value for BLANK and accept any inaccuracy that results.
For instance, replacing BLANK with 0 treats applicants who omitted employment length the same as those responding with ‘< 1 year’. Of course, this will include some who omitted the field inadvertently, and so will not be entirely accurate. Another option is to replace missing values with the average (or perhaps a weighted average) of employment length. This, similarly, will introduce some inaccuracy.
We ended up trying all three ways: removing (censoring) the observations, replacing (imputing) with 0, or replacing (imputing) with the average. We concluded that imputing the missing values with ‘0’ resulted in the most sensible results.
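The imputation we settled on can be sketched as a small helper. The sentinel code for BLANK below is an assumption for illustration; the actual code depends on how the reader encoded the field.

```c
/* Assumed sentinel for a BLANK (missing) emp_length after decoding;
   valid values run 0 ("< 1 year") through 10 ("10+ years"). */
#define EMP_LENGTH_BLANK (-1)

/* Replace a missing employment length with a fallback value:
   0.0 to treat the applicant like "< 1 year" (the option we chose),
   or the sample average if mean imputation is preferred. */
double impute_emp_length(int code, double fallback)
{
    return (code == EMP_LENGTH_BLANK) ? fallback : (double) code;
}
```

Passing a different fallback makes it easy to compare the zero-imputation and mean-imputation variants on the same data.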
A second way to encode string variables is to use one-hot encoding. One-hot encoding involves creating dummy variables that are assigned 1 or 0 according to the level of the variable. For “Home Ownership”, there are 5 dummy variables, one for each level of the variable. “HO1” will be 1 if “Home Ownership” is MORTGAGE, and 0 otherwise. “HO2” will be 1 if “Home Ownership” is RENT, and 0 otherwise, and so on.
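A minimal sketch of this encoding, with the level ordering assumed for illustration:

```c
/* Assumed ordering of the home_ownership levels. */
enum home_ownership { HO_MORTGAGE, HO_RENT, HO_OWN, HO_ANY, HO_NONE, HO_LEVELS };

/* Set one dummy variable per level: dummies[i] is 1.0 when the
   observation's level is i, and 0.0 otherwise. */
void one_hot_encode(int level, double dummies[HO_LEVELS])
{
    for (int i = 0; i < HO_LEVELS; i++)
        dummies[i] = (i == level) ? 1.0 : 0.0;
}
```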
We tried this encoding for models with “Home Ownership” as a predictor. Our first attempts at fitting the logistic regression models produced errors and warnings from imsls_d_logistic_regression indicating an ill-conditioned problem.
This type of error stems from nearly collinear variables, leading to numerical instability in the algorithm. Upon further inspection of “Home Ownership”, we can see that the levels “ANY” and “NONE” are very sparse in the data (0.02% of the observations). This means that randomly sampled training sets are likely to contain “HO4” and “HO5” variables with all values equal to 0, causing the collinearity. Since there are so few such observations, it is best to remove them from the training data population. We can do that easily with the data reader, by asking it to ignore any observation with “ANY” or “NONE” appearing anywhere in the record. (Fortunately, no other field takes those specific values; otherwise, we would have had to delete the records manually using a text editor or spreadsheet.)
Which encoding method is best depends on the algorithm as well as the nature of the variable. Many algorithms interpret integer encoded values as having an inherent order: 1 < 2 < …< 5. While certainly true for ordinal variables, the ordering is meaningless for “Home Ownership”. Sending “Home Ownership” in as an integer encoded variable may lead to misleading results in the estimated model.
After some pre-processing as described above, we ended up with 41 potential predictor variables from the 157 variables in the data file. While there is no exactly right number of variables, we prefer successful models with the fewest predictor variables. Such models are easier to explain, manage, and deploy.
We eliminated some by fitting one model with all the variables and obtaining individual tests of significance on the resulting coefficients, removing those variables whose regression coefficients were not significant. We were able to eliminate a few more because of collinearity between certain variables.
Next we randomly selected subsets of different sizes of the remaining variables and evaluated their goodness of fit and predictive performance. The following set of 15 predictor variables gave consistently good performance.
- The listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, it will be reflected in this value.
- Interest rate on the loan.
- The self-reported annual income provided by the borrower during registration.
- A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
- The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years.
- The lower boundary of the range the borrower’s FICO score at loan origination belongs to.
- Number of derogatory public records.
- Number of credit inquiries in the past 12 months.
- Number of accounts ever 120 or more days past due.
- Number of currently active revolving trades.
- Number of satisfactory bankcard accounts.
- Number of accounts opened in the past 12 months.
- Total bankcard high credit/credit limit.
- Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
- Term length (36 or 60 months, one-hot encoded as a 36-month dummy).
- LC-assigned loan grade.
- Home ownership (one-hot encoded).
- Verification status (one-hot encoded).

Table 1 Subset of Predictor Variables Yielding Good Performance
To evaluate the predictive performance of a model, we performed the usual machine learning process. This involves randomly splitting the data into a training data set and a test data set. The training data set is used to fit the model. The fitted model is then used to predict the outcomes on the test data set. Because we know the outcomes of the test data set, we can compare them with the predictions and see how well the model performs.
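The random split can be sketched with a Fisher-Yates shuffle of the row indices (CNL has its own random selection routines; this plain-C stand-in is for illustration only):

```c
#include <stdlib.h>

/* Shuffle the row indices 0..n-1; callers then take the first n_test
   entries as the test set and the remaining rows as the training set. */
void shuffle_indices(int *idx, int n, unsigned int seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        idx[i] = i;
    for (int i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
}
```

Fixing the seed makes a particular train/test split reproducible; varying it supports the repeated splits used later to average the AUC.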
Logistic regression predictions are in the form of probabilities. In our case, the prediction is the probability that a new application with given attributes will be a good loan P(good) or a bad loan P(bad). Note that P(bad) + P(good) = 1. To evaluate prediction accuracy against the known loan status in a test set, we choose a cutoff value or threshold po such that if P(bad) ≥ po, then the model predicts that the loan will be a bad loan. Otherwise, if P(bad) < po, the model predicts that the loan will be a good loan.
In our data, bad loans are assigned the value 1 and good loans the value 0. For a given cutoff value, we can form a truth table (also sometimes called a confusion matrix):

                  Actual good                       Actual bad
Predicted good    # of good loans predicted good    # of bad loans predicted good
Predicted bad     # of good loans predicted bad     # of bad loans predicted bad

Table 2 Truth Table for Loan Scoring
Now, since we’d like to view a “bad” loan as the signal we’re most interested in, we label the 4 categories of actual vs. predicted in the following way:

                  Actual good              Actual bad
Predicted good    # True Negatives (TN)    # False Negatives (FN)
Predicted bad     # False Positives (FP)   # True Positives (TP)

Table 3 Defining True and False Positives/Negatives
A measure known as specificity is the true negative rate = TN / (TN+FP). This is the proportion of good loans predicted to be good. On the other hand, sensitivity is the true positive rate = TP / (TP+FN). This is the proportion of bad loans predicted to be bad. It measures how sensitive the model is to the signal we are interested in. For example, think of a medical test for the presence or absence of a disease. We prefer a test that is sensitive enough to detect the disease when it is actually present, even if sometimes the test is overly sensitive. In other words, we are more willing to accept a false positive than a false negative, because in the case of a false negative the patient’s disease will go untreated. We can also look at (TN+TP) / (FP+ TP+FN+TN), the proportion of diagonal values to the total number, as a measure of overall accuracy.
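These quantities are straightforward to compute from labeled test data. A sketch in plain C (not IMSL code), using the cutoff rule and definitions above:

```c
/* Truth-table counts for a given cutoff p0, with bad loans labeled 1. */
typedef struct { int tp, tn, fp, fn; } truth_table;

truth_table tally(const int *y, const double *p_bad, int n, double p0)
{
    truth_table t = {0, 0, 0, 0};
    for (int i = 0; i < n; i++) {
        int pred_bad = (p_bad[i] >= p0);   /* the cutoff rule */
        if (y[i] == 1) {
            if (pred_bad) t.tp++; else t.fn++;
        } else {
            if (pred_bad) t.fp++; else t.tn++;
        }
    }
    return t;
}

/* Proportion of bad loans predicted bad (true positive rate). */
double sensitivity(truth_table t) { return (double) t.tp / (t.tp + t.fn); }

/* Proportion of good loans predicted good (true negative rate). */
double specificity(truth_table t) { return (double) t.tn / (t.tn + t.fp); }

/* Proportion of all loans classified correctly. */
double accuracy(truth_table t)
{
    return (double)(t.tp + t.tn) / (t.tp + t.tn + t.fp + t.fn);
}
```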
Using the 15 variables listed in Table 1, we randomly selected 5,000 observations as the test data set and trained a logistic regression model on the remaining 1,043,404 observations. Then, using the trained model, we obtained the predicted probabilities for the 5,000 test observations. First, with a cutoff value of po = 0, every loan is predicted to be bad:
Table 4 Sensitivity = 100%, Specificity = 0%
While it does detect all the bad loans, it misclassifies all the good loans, so this is not the best threshold. (The bank would not make any loans!). More reasonable might be a threshold of po = 0.5:
Table 5 Sensitivity =0.4%, Specificity = 99.7%
Here we see a dramatic flip in the other direction, with a low true positive rate and a high true negative rate. Fortunately, we don’t have to keep creating these tables in order to evaluate the predictive performance of a given model. Instead, the AUC = “Area Under the Curve” of the Receiver Operating Characteristic (ROC) curve is a single measure indicating the ability of the model to discriminate between observations having the characteristic (bad loan) and those that do not.
The ROC curve is a plot of the true positive rate (sensitivity) against the false positive rate (1 – specificity) for a range of cutoff values 0 ≤ po ≤ 1. The AUC is a number between 0 and 1 and can be interpreted as the probability that a randomly selected “bad” loan will have a higher predicted probability than a randomly selected good loan.
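That probabilistic interpretation gives a direct way to compute the AUC without tracing the curve: count the proportion of (bad, good) pairs in which the bad loan receives the higher predicted probability, with ties counting one half. A sketch (O(n²), which is fine for a few thousand test observations):

```c
/* AUC from its pairwise interpretation: the fraction of (bad, good)
   pairs where the bad loan (y == 1) has the higher predicted
   probability of being bad; ties contribute 0.5. */
double auc_pairs(const int *y, const double *p_bad, int n)
{
    double wins = 0.0;
    long pairs = 0;
    for (int i = 0; i < n; i++) {
        if (y[i] != 1) continue;               /* i ranges over bad loans  */
        for (int j = 0; j < n; j++) {
            if (y[j] != 0) continue;           /* j ranges over good loans */
            if (p_bad[i] > p_bad[j])        wins += 1.0;
            else if (p_bad[i] == p_bad[j])  wins += 0.5;
            pairs++;
        }
    }
    return pairs ? wins / pairs : 0.0;
}
```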
Figure 1 ROC Curve with Area Under Curve (AUC) = 72%
We repeated the process of randomly selecting a training set and test set, fitting the model, and calculating the AUC. The model with the predictor variables in Table 1 ranged from .69 to .73, with an average AUC of .71. For assessing the AUC, Hosmer & Lemeshow (Applied Logistic Regression) offer a rule of thumb: 0.5 indicates no discrimination, 0.7 to 0.8 acceptable discrimination, 0.8 to 0.9 excellent discrimination, and above 0.9 outstanding discrimination. By this guideline, our model achieves acceptable discrimination.
To improve on “acceptable discrimination,” we can continue variable selection and analysis within our data set, try different classification methods (e.g., decision trees), and look for additional data to augment the Lending Club data, if possible. Banks that build their own loan scoring models have access to proprietary data and generally work with models that achieve AUCs closer to 0.90.
Together with Data Ingestion with CNL, this blog outlines the steps from data procurement to model building. The next step in the real world would be model deployment. This is the stage where loan officers apply the model to new loan applications and use the loan score to help decide whether a loan should be approved, and possibly to help set interest rates and other terms.
To produce the numerical results in this blog, we used the numerical library CNL (IMSL for C) exclusively. We mainly used the functions imsls_data_read and imsls_d_logistic_regression, but also leveraged several other CNL functions for summary statistics, correlations, sorting, random selection, and more. All of this functionality is available in the next release of CNL.
Want to try IMSL for C in your application? Request a free trial.
Software Architect, Perforce
Roxy Cramer is a PhD statistician for the IMSL Numerical Libraries at Perforce. Roxy specializes in applying and developing software for data science applications. Her current focus is on statistical and machine learning algorithms in C and Java.