June 16, 2021

What Is a Regression Model?

A regression model provides a function that describes the relationship between one or more independent variables and a response, dependent, or target variable. 

For example, the relationship between height and weight may be described by a linear regression model. A regression analysis is the basis for many types of prediction and for determining the effects on target variables. When you hear about studies on the news that talk about fuel efficiency, or the cause of pollution, or the effects of screen time on learning, there is often a regression model being used to support their claims.

Types of Regression

1. Linear

A linear regression is a model where the relationship between inputs and outputs is a straight line. This is the easiest to conceptualize and even observe in the real world. Even when a relationship isn’t very linear, our brains try to see the pattern and attach a rudimentary linear model to that relationship.

One example is the number of responses to a marketing campaign. If we send 1,000 emails, we may get five responses. If this relationship can be modeled using a linear regression, we would expect to get ten responses when we send 2,000 emails. Your numbers may vary, but the general idea is that we associate a predictor and a target, and we assume a relationship between the two.

Linear regression model example using number of clicks vs emails sent as axes

Using a linear regression model, we want to estimate the relationship between the number of emails sent and the number of responses. In other words, if the linear model fits our observations well enough, then we can estimate how many more responses we will get for each additional email we send.
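To make the email example concrete, here is a minimal sketch in C. The slope of five responses per 1,000 emails comes from the example above; the zero intercept and the function name are assumptions made for illustration, not fitted values.

```c
#include <assert.h>

/* Hypothetical linear model: expected responses = slope * emails_sent.
   The slope (5 responses per 1,000 emails) is taken from the example;
   the zero intercept is an assumption for illustration. */
static const double RESPONSES_PER_EMAIL = 5.0 / 1000.0;

double expected_responses(double emails_sent)
{
    assert(emails_sent >= 0.0);
    return RESPONSES_PER_EMAIL * emails_sent;
}
```

Calling expected_responses(2000.0) gives roughly ten responses, matching the expectation stated in the example.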

When a claim like this is made, whether it relates to exercise, happiness, health, or any number of other topics, there is usually a regression model behind the scenes to support it.

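In addition, the model fit can be described using a mean squared error: the average of the squared differences between the observed values and the values the model predicts. This gives us a single number that summarizes how well the linear model fits, with smaller values indicating a closer fit.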
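As a rough sketch of how that number is computed, the C function below averages the squared residuals of a straight line over a set of observations. The function name and parameter layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error of the line y = intercept + slope * x
   over n observations (x[i], y[i]). */
double mean_squared_error(const double *x, const double *y, size_t n,
                          double intercept, double slope)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double residual = y[i] - (intercept + slope * x[i]);
        sum += residual * residual;
    }
    return sum / (double)n;
}
```

Because the error is in squared units of the target, it is most useful for comparing candidate models on the same data rather than as an absolute score.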

More serious examples of a linear regression would include predicting a patient’s length of stay at a hospital, or modeling the relationship between income and crime, education and birth rate, or sales and temperature.

2. Multiple

Multiple regression means that there is more than one input variable that may affect the outcome, or target variable. For our email campaign example, you may include an additional variable with the number of emails sent in the last month.

By looking at both input variables, a clearer picture starts to emerge about what drives users to respond to a campaign and how to optimize email timing and frequency. While conceptualizing the model becomes more complex with more inputs, the relationship may continue to be linear.

For these models, it is important to understand exactly what effect each input has and how they combine to produce the final target variable results.
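A minimal sketch of a two-input linear model in C shows how each input contributes its own term to the prediction. The struct, field names, and any coefficient values used with it are hypothetical; they are not fitted to real campaign data.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients of a hypothetical two-input linear model.
   Any values stored here are illustrative, not fitted. */
typedef struct {
    double intercept;
    double per_email_sent;        /* change in responses per email in this campaign */
    double per_email_last_month;  /* change in responses per email sent last month */
} CampaignModel;

/* Each input contributes coefficient * value; the contributions add up. */
double predict_responses(const CampaignModel *m,
                         double emails_sent, double emails_last_month)
{
    assert(m != NULL);
    return m->intercept
         + m->per_email_sent * emails_sent
         + m->per_email_last_month * emails_last_month;
}
```

With illustrative coefficients of 0.005 and -0.001, sending 2,000 emails after 3,000 emails last month would predict about seven responses: the second input lowers the estimate that the first input alone would give.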

3. Non-Linear

A linear model may work for some parts of the marketing example above. However, we know that as we keep increasing the number of emails in a particular campaign, the number of responses gained per additional email starts to decline. To model this, we need a non-linear regression model.

This model will initially show a positive relationship between number of emails and the response, but as the number of emails increases, the model will flatten out and become almost constant. The key takeaway is to recognize when a relationship could be non-linear. If you make assumptions based on a linear model, you could get results that are very different from what you expect.
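One way to picture this flattening is a saturating curve of the form max * x / (half + x). The article does not say which non-linear form applies, so the curve, the function name, and both parameters below are assumptions chosen only to illustrate diminishing returns.

```c
#include <assert.h>

/* Hypothetical saturating model:
   responses = max_responses * emails / (half_saturation + emails).
   The curve rises almost linearly at first, then flattens toward
   max_responses. half_saturation is the number of emails at which
   half of max_responses is reached. */
double saturating_responses(double emails_sent,
                            double max_responses,
                            double half_saturation)
{
    assert(emails_sent >= 0.0);
    assert(max_responses > 0.0 && half_saturation > 0.0);
    return max_responses * emails_sent / (half_saturation + emails_sent);
}
```

Under this sketch, the second thousand emails always add fewer responses than the first thousand, which is exactly the behavior a straight line cannot represent.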

4. Stepwise Regression Modeling

While the other items we have talked about until now are specific types of models, stepwise regression is more of a technique. If a model involves many potential inputs, the analyst may start with the most directly correlated input variable to build a model. Once that is accomplished, the next step is to make the model more accurate.  

To do that, additional input variables can be added to the model one at a time, in order of their statistical significance. Using our email marketing example, the initial model may be based on just the number of emails sent. Then we would add something like the average age of the email recipient. After that, we would add the average number of emails each recipient has received from us. Each additional variable would add a small amount of additional accuracy to the model.

This process is a stepwise modeling approach to getting the most accurate model. Alternatively, the analyst may start with a larger set of input variables and then incrementally remove the least significant in order to get to a desired model.
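The first step of the forward approach, picking the input most strongly correlated with the target, can be sketched in C as follows. This covers only that first selection; later steps would require refitting the model with each remaining candidate, which is not shown. The function names and the column-by-column data layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Pearson correlation between x and y over n observations.
   Returns 0 when either series is constant. */
static double squared_correlation(const double *x, const double *y, size_t n)
{
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double dx = x[i] - mean_x;
        double dy = y[i] - mean_y;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if (sxx == 0.0 || syy == 0.0) {
        return 0.0;
    }
    return (sxy * sxy) / (sxx * syy);
}

/* candidates holds n_candidates columns of n_obs values each, stored one
   column after another. Returns the index of the column with the largest
   squared correlation with target. */
size_t most_correlated_input(const double *candidates, size_t n_candidates,
                             size_t n_obs, const double *target)
{
    assert(n_candidates > 0 && n_obs > 1);
    size_t best = 0;
    double best_r2 = -1.0;
    for (size_t j = 0; j < n_candidates; ++j) {
        double r2 = squared_correlation(candidates + j * n_obs, target, n_obs);
        if (r2 > best_r2) {
            best_r2 = r2;
            best = j;
        }
    }
    return best;
}
```

Ranking by the squared correlation treats strong negative and strong positive relationships alike, which is usually what an analyst wants when choosing a starting variable.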

When to Use Regression Analysis

There are many benefits to being able to establish a statistically significant correlation between important business outcomes and the factors that may drive them. By doing so, you may be able to make important business decisions based on industry indicators.

One example is any correlation you can establish between GDP, consumer confidence, or industry benchmarks and your own business, which may help with investing or strategizing.

Another valuable example is the relationship between business decisions and outcomes. If such a relationship can be established, these decisions can be better tuned for optimal results. For example, the price of a product and the number of sales are often correlated and can be modeled using regression models.

How to Create a Regression Model

Here we show a simple regression using the IMSL Numerical Library for C.

Consider a production line producing widgets. The shop manager would like a good estimate of the required number of worker hours given that a certain number of units must be produced. The manager collects a small sample of the number of worker hours for each lot size.

Image of linear regression mapped out on bar graph, upward trending with x axis as units and y axis as hours

The plot indicates that lot size is a strong predictor for number of hours worked, as expected. There is some small variation in the hours worked at the same lot sizes (see lot sizes 30 and 60), due to other random factors. Presumably if more observations were taken at other lot sizes, a similar variation in hours would be seen.

***Inference on Coefficients***

             coef     std err   t-stat    p-value
intercept    10.0     2.50      4.00      0.00
lot_size     2.00     0.05      42.58     0.00

The regression fit suggests a baseline of 10 hours of work plus two extra hours for each additional unit that is produced, that is, hours = 10 + 2 × lot size. The line is shown in red in the figure above.

With the regression formula in hand, the shop supervisor can plan staffing needs, costs, and production schedules. Any prediction from a regression line that is outside the observed range of the data should be met with some skepticism, however.

For example, the regression model predicts 2,010 worker hours for producing 1,000 units, but since 1,000 is far outside the observed range of lot sizes, this prediction is not reliable. There are likely supply or capacity constraints that would make 1,000 an infeasible lot size. The shop manager of course will be aware of those constraints.
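A small C sketch can make this caution explicit by returning the prediction together with a flag for extrapolation. The coefficients 10 and 2 come from the fit above; the struct, the function name, and the idea of passing the observed range as arguments are assumptions for illustration, since the sampled range is not listed here.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    double hours;       /* predicted worker hours */
    bool extrapolated;  /* true when lot_size is outside the observed range */
} HoursPrediction;

/* Applies hours = 10 + 2 * lot_size and flags predictions made
   outside [min_observed, max_observed]. */
HoursPrediction predict_hours(double lot_size,
                              double min_observed, double max_observed)
{
    assert(min_observed <= max_observed);
    HoursPrediction p;
    p.hours = 10.0 + 2.0 * lot_size;
    p.extrapolated = lot_size < min_observed || lot_size > max_observed;
    return p;
}
```

For a lot size of 1,000 the function still returns 2,010 hours, but the flag tells the caller that the number rests on extrapolation rather than on observed data.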

The code below can be used to perform the regression fit and produce the coefficient summary using IMSL for C.

Code example of a regression model in IMSL C

IMSL and Regression Models

Now that we’ve established what a regression model is, what the different types are, and when to use them, it’s time to create your own regression model.

To start, try IMSL free. IMSL libraries have been trusted for decades for accurate and reliable numerical functions.