Decision Trees: The Best Answer Based on the Available Data
February 28, 2022


Imagine this scenario:

You decided to take that last flight from San Diego to Washington, DC. It’s the redeye. You can’t remember if you repacked your gear correctly, or whether you’ll land anywhere close to your car. You can, however, be reasonably sure that you’ll get closer to your goal. Meanwhile, orders placed online get to their destinations with incredible speed and accuracy.

We can entrust the details to someone or something else because there is a history of success or failure to learn from. And the models keep getting better, ultimately making everything more efficient and predictable.

What I’d like you to do now is think of ALL the computational systems that made your trip possible. Limit your thoughts to the ones that you personally interacted with. We’ll talk about this at the end of this post.

As Machine Learning and Artificial Intelligence evolve and aspire to answer (and, increasingly, act on) the issues that keep us awake at night, it can be easy to forget that we have come a long way in a short amount of time. (Just for fun: can you name 10 tools we used to navigate the seas before ECDIS?)

Decision trees, and the technology that makes prediction possible, are the focus of this post.


What Is a Decision Tree?

One definition of a decision tree is:

“… a tree structured outline representing decision points that branch into more and more specific alternative outcomes until they reach an endpoint (leaf) that is part of all the decisions previously made” - Microsoft Dictionary

Another is:

“… a widely used data structure that resembles an inverted biological tree…” - Linfo

The somewhat telling part is how much each definition leaves to assumption (entropy, if you like) depending on its audience.

However you arrange the message in order to comprehend it, whether linguistically, graphically, or some other way, the point is the same:

Decision trees can advise us on what has been done, how it will impact us now, and what it means for our future path. The patterns in the data can be genuinely interesting.

In computational terms, decision trees take on a much larger focus. And – except in the movies – the data is mostly anonymous. There are also very standard (and non-scary) use cases that we all benefit from.

In many cases, the databases are so large that special infrastructure is needed to even approach utility.
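
If it helps to see that structure in code, here is a minimal sketch of an inverted decision tree as a data structure. The Node class, the questions, and the outcomes below are invented for illustration; they do not come from Microsoft, Linfo, or IMSL.

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class Node:
    """One decision point in the inverted tree; a node with an outcome is a leaf."""
    question: Optional[str] = None                              # None on leaves
    outcome: Optional[str] = None                               # set only on leaves
    branches: Dict[str, "Node"] = field(default_factory=dict)   # answer -> child node

# Hypothetical example: deciding whether to board that redeye flight.
tree = Node(
    question="Is it the last flight of the day?",
    branches={
        "yes": Node(question="Is the gear packed correctly?",
                    branches={"yes": Node(outcome="board the redeye"),
                              "no": Node(outcome="repack, then board")}),
        "no": Node(outcome="wait for a better flight"),
    },
)

def decide(node: Node, answers: Dict[str, str]) -> str:
    """Walk from the root to a leaf, taking one branch per decision point."""
    while node.outcome is None:
        node = node.branches[answers[node.question]]
    return node.outcome

print(decide(tree, {"Is it the last flight of the day?": "yes",
                    "Is the gear packed correctly?": "no"}))   # -> "repack, then board"
```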


Decision Tree Algorithms

When Data Engineers encounter data, parsing, evaluation, re-training, and re-testing occur on small sets to test hypotheses, though not necessarily in that order.

Then, after applying some of the following tools, formulas, and methods, they can present a Decision Support System, which in some form becomes Data Mining, a much broader application that is still relevant to this discussion. Some of the algorithms in use will likely be:

C4.5 Algorithm:

The C4.5 algorithm is used in Data Mining as a Decision Tree Classifier that can generate a decision based on a sample of data (univariate or multivariate predictors). The method C4.5 (Quinlan, 1995) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general tree-growing steps, using information gain or the gain ratio as the splitting criterion.
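
As a minimal sketch of that splitting criterion (an illustration only, not IMSL's or Quinlan's implementation; the toy arrays are made up), entropy and information gain can be computed like this:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Reduction in entropy obtained by splitting `labels` with a boolean mask."""
    left, right = labels[split_mask], labels[~split_mask]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Toy example: does splitting on a candidate predictor reduce uncertainty about y?
y = np.array(["yes", "yes", "yes", "no", "no", "no"])
split = np.array([True, True, False, False, False, True])
print(information_gain(y, split))
```

The gain ratio used by C4.5 divides this gain by the entropy of the split itself, which penalizes splits that scatter the data across many small branches.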

CART Algorithm:

The Classification And Regression Trees (CART™) algorithm is a classification algorithm that builds a decision tree using the Gini impurity index as its splitting criterion. A CART tree is a binary tree built by repeatedly splitting each node into two child nodes.
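
For comparison with the information-gain criterion above, here is a minimal sketch of the Gini impurity criterion (illustration only, not IMSL's implementation):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: the chance of mislabeling a sample drawn at random
    from the node if it were labeled according to the node's class mix."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(labels, split_mask):
    """Reduction in Gini impurity from a candidate binary split."""
    left, right = labels[split_mask], labels[~split_mask]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children
```

A CART-style grower evaluates this decrease for every candidate split, keeps the best one, and recurses on the two child nodes, which is what produces the binary tree described above.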

ALACART Algorithm:

ALACART implements the method of Breiman, Friedman, Olshen, and Stone (1984), the original authors and developers of CART. CART™ is the trademarked name for Classification and Regression Trees.

In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, only splits into two subsets are considered: for example, {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B}, and {C, D} is not.

For classification problems, ALACART uses a criterion similar to information gain, called impurity. The method searches for the split that reduces the node impurity the most.
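
To make the combinatorics of those binary splits concrete, here is a small sketch (the helper name and the example levels are mine, purely for illustration) that enumerates every allowed binary partition of a categorical variable's levels:

```python
from itertools import combinations

def binary_partitions(levels):
    """Enumerate every binary split of a set of category levels exactly once,
    e.g. ['A', 'B', 'C', 'D'] -> ({A}, {B, C, D}), ({A, B}, {C, D}), ...
    Fixing the first level on the left side avoids mirror-image duplicates."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    for r in range(len(levels) - 1):           # size of the left side is r + 1
        for combo in combinations(rest, r):
            left = {first, *combo}
            yield left, set(levels) - left

for left, right in binary_partitions(["A", "B", "C", "D"]):
    print(left, "|", right)                    # 2**(4 - 1) - 1 = 7 allowed splits
```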

CHAID Algorithm:

CHAID, an acronym for chi-square automatic interaction detection, is due to Kass (1980) and is appropriate only for categorical or discrete ordered predictor variables.

At each node, as with the other methods, CHAID looks for the best splitting variable. The approach is as follows:

  • Given a predictor variable X
  • For each possible pair of categories of X, perform a 2-way chi-squared test of association against the categories of Y
  • The least significant result is noted and, if a threshold is met, the two categories of X are merged. 
  • Treating this merged category as a single category, repeat the series of tests and determine if there is further merging possible. 
  • If a merged category consists of three or more of the original categories of X, CHAID calls for a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. 
  • If the most significant result meets a threshold, then the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth.

Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories defines the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), then the node is not split. This growing procedure continues until one or more stopping conditions are met.
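
As a rough sketch of the merge step only (not IMSL's implementation; the function name, the pandas/SciPy usage, and the 0.05 threshold are assumptions for illustration), one CHAID-style merge pass might look like this:

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def merge_least_significant_pair(x, y, merge_alpha=0.05):
    """One CHAID-style merge pass over a categorical predictor x and response y.
    Finds the pair of categories of x whose 2-way chi-squared test against y is
    least significant and merges them if that test does not reach merge_alpha.
    Assumes every cell of the tested sub-tables has some counts."""
    least_p, least_pair = -1.0, None
    for a, b in combinations(pd.unique(x), 2):
        mask = np.isin(x, [a, b])
        table = pd.crosstab(x[mask], y[mask])          # 2 x (categories of y) table
        _, p, _, _ = chi2_contingency(table)
        if p > least_p:
            least_p, least_pair = p, (a, b)
    if least_pair is not None and least_p > merge_alpha:
        a, b = least_pair
        x = np.where(x == b, a, x)                     # relabel b as a (merge)
    return x, least_pair, least_p
```

Repeating this pass until no pair qualifies, and then testing whether any merged group of three or more original categories should be re-split, gives the per-predictor procedure described in the list above.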

QUEST Algorithm:

QUEST (Loh and Shih, 1997) is appropriate for a categorical response variable and predictors of either categorical or quantitative type.

For each categorical predictor, QUEST performs a multi-way chi-square test of association between the predictor and Y. For every continuous predictor, QUEST performs an ANOVA test to see if the means of the predictor vary among the groups of Y.

Among these tests, the variable with the most significant result is selected as a potential splitting variable, say Xj. If the p-value (adjusted for multiple tests) is less than the specified splitting threshold, then Xj is the splitting variable for the current node. If not, QUEST performs, for each continuous variable X, a Levene’s test of homogeneity to see if the variance of X varies within the different groups of Y. Among these tests, we again find the predictor with the most significant result, say Xi. If its p-value (adjusted for multiple tests) is less than the splitting threshold, Xi is the splitting variable. Otherwise, the node is not split.
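
A rough sketch of that selection logic for continuous predictors follows, using SciPy's f_oneway and levene tests. The dict-of-arrays layout for X, the alpha threshold, and the omission of the multiple-test adjustment are simplifications for illustration; this is not IMSL's implementation.

```python
import numpy as np
from scipy.stats import f_oneway, levene

def quest_select_continuous(X, y, alpha=0.05):
    """QUEST-style selection among continuous predictors: compare group means
    first (ANOVA); if nothing qualifies, compare group variances (Levene)."""
    groups = np.unique(y)

    def p_values(test):
        # X is assumed to be a dict of predictor name -> 1-D NumPy array
        return {name: test(*[col[y == g] for g in groups]).pvalue
                for name, col in X.items()}

    for test in (f_oneway, levene):          # note: real QUEST adjusts p-values for multiple tests
        pvals = p_values(test)
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:
            return best, pvals[best]
    return None, None                        # no qualifying splitting variable; do not split
```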

Assuming a splitting variable is found, the next step is to determine how the variable should be split. If the selected variable Xj is continuous, a split point d is determined by quadratic discriminant analysis (QDA) of Xj into two populations determined by a binary partition of the response Y. The goal of this step is to group the classes of Y into two subsets or super classes, A and B. If there are only two classes in the response Y, the super classes are obvious. Otherwise, calculate the means and variances of Xj in each of the classes of Y. If the means are all equal, put the largest-sized class into group A and combine the rest to form group B. If they are not all equal, use a k-means clustering method (k = 2) on the class means to determine A and B.
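
The super-class grouping at the end of that step can be sketched with a 2-means clustering of the class means (illustration only; the function name is mine, and it skips the equal-means special case described above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def super_classes(xj, y):
    """Group the classes of y into two super classes A and B by clustering
    the per-class means of the chosen predictor xj with k-means, k = 2."""
    classes = np.unique(y)
    class_means = np.array([[xj[y == c].mean()] for c in classes])  # shape (n_classes, 1)
    _, labels = kmeans2(class_means, 2, minit="++")
    return classes[labels == 0], classes[labels == 1]               # super classes A and B
```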

Additional details on the above Data Mining functions can be found in the IMSL Documentation.


Decision Tree Tools

As you can imagine, the constant development and refinement of methodologies is part of the process as well. IMSL by Perforce can save the development and research time spent maintaining, updating, and validating new methods.

IMSL allows you to easily add machine learning and advanced data science functionality to your application. IMSL is available in C, Java, Fortran, and Python and contains a collection of data mining functions for Decision Trees, Market Analysis, Naive Bayes, Neural Networks, and more.

So back to the beginning, when you were trying to think of all the systems you personally interacted with on your last trip. Now – in addition to those systems – think of the ones you didn’t know about.

I wish you great fortune on your journey. 

Get started with IMSL
