Linear regression is a statistical modeling method and technique conceived in the 19th century by Francis Galton, who described the concept of regression towards the mean. Statisticians Udny Yule and Karl Pearson later described regression analysis, of which linear regression is the most common form.
Linear regression (or linear regression analysis) is used to predict a variable's value from another variable's value. The variable linear regression attempts to predict is called the dependent variable, as it’s dependent on the values of variables before it, which are called the independent variables.
There are two broad forms of linear regression; simple linear regression and multiple linear regression.
- When there is a single input variable, the analysis is described as simple linear regression.
- When there are multiple input variables, the method is called multiple linear regression.
Linear regression is a simple and robust way to predict future variables and is used in a huge range of fields, such as the physical and biological sciences, economics, psychology, business and marketing.
Linear regression in marketing
Linear regression is one of the most-used techniques in marketing, as it enables marketers to model the relationship between inputs and outputs. Since linear regression is clean and simple to calculate, it’s pretty straightforward to mobilize in one’s marketing strategy, even without a background in statistics or math.
For example ,we know that Facebook ad spend is correlated with organic installs for an app, but do we know how much a move in one affects the other?
For that, we need to use linear regression, which lets us model the relationship between spend and installs. The output will be an equation that we can plug numbers into to predict how many organic installs we'll get for our budget.
In a real marketing situation, multiple independent variables would probably impact sales values, hence why multiple linear regression would be the better fit. However, in this situation, it’s also essential to be aware of multicollinearity, which occurs when different campaigns are triggered at the same time/around the same time.
Linear regression in machine learning
Linear regression is also one of the most basic machine learning algorithms. While linear regression is fundamentally a statistical technique or method, machine learning ‘borrows’ it as an algorithm.
Linear regression is fundamental in many predictive machine learning models and forms a benchmark for more advanced models.
In data-driven marketing, developing a predictive model for output variables using input variables helps marketers ‘predict the future’ and make data-based decisions. However, linear regression has its fair share of limitations too, which we’ll investigate shortly.
Breakdown of linear regression
In the above example, suppose the y-axis shows sales and the x-axis shows marketing budget.
The standard equation of the linear regression line is Y= mx+c.
- Y is the dependent variable (usually measured on an interval or ratio scale).
- X is the independent variable (usually measured on an interval or ratio scale).
- M is the slope or gradient, which can be positive or negative depending on whether we go up as x increases (a positive slope), or down as x increases (a negative slope).
- C is the intercept (the point where the line crosses the y axis).
Here, finding the best fit line involves the ordinary least squares (OLS) method, which produces a regression line that minimizes the vertical distance from the data points to the regression line. Performing linear regression tasks is simple in Python, thanks to libraries such as NumPy.
Assumptions of linear regression
For the straight-line path of linear regression to be appropriate, there are certain assumptions to apply to the data, such as:
- Error terms are independent of each other.
- Error terms are normally distributed or form a bell-shaped curve.
- Error terms exhibit constant variance (homoscedasticity).
- There is a probable linear relation between X (independent) and Y (dependent) variables.
- No or low multicollinerity.
Very very basics of these assumptions as follows:
- Variables in the dataset should be measured at a continuous level, e.g., time, sales, weight, and test scores.
- You can use a scatter plot to discover if there is a linear relationship between those two variables.
- Observations should be independent of each other with no dependency.
- No significant outliers.
- Check for homoscedasticity, meaning variance along the best-fit linear-regression line remains similar through the line.
- The residuals or errors of the best-fit regression line follow a normal distribution.
Statistical tests for linear regression assumptions
There are several tests to validate linear regression's core assumptions (typically listed as five core assumptions).
Here is an overview of statistical tests you can perform for linear regression's core five assumptions:
- Linearity: There needs to be a linear relationship between between X (independent) and Y (dependent) variables. The Harvey Collier and the Rainbow Test can discern this.
- Multicollinearity: Features must not be highly correlated with each other (e.g., including height and weight as independent variables might predict each other). If this is not satisfied, the model will suffer from high variance. Test with correlation matrix and Variance Inflation Factor (VIF).
- Gaussian errors: Errors are normally distributed with mean = 0. This is relaxed in large samples due to the central limit theorem. Testable with Jarque-Bera tests for skewness and kurtosis.
- Homoskedasticity: Errors have equal variance, and there is no pattern in the residuals (error). Testable with Breusch-Pagan and Goldfeld-Quandt.
- Errors are independent: Meaning there is no relationship between the residuals of our model (the errors) and the response variables (observations). For example, each day can be forecast without data from the previous day. Testable with Durbin–Watson and Ljungbox.
The benefits of linear regression
Linear regression is a statistical technique that allows for the study of the relationship between one or more independent variables and a dependent variable. The independent variables are also known as "predictors," while the dependent variable is also known as "the outcome."
Linear regression is reliable, but requires the data to fit certain assumptions. The benefits of linear regression include:
- It can be used to make predictions about future values of the outcome.
- It can be used to identify which predictor(s) significantly affect the outcome.
- It can be used to quantify how much each predictor affects the outcome.
- It provides an estimate for how much change in one variable causes changes in another variable, which is often called "regression coefficient."
The disadvantages of linear regression
Linear regression is a linear equation that can be used to predict the relationship between two variables. It has many advantages, such as it is easy to understand and use. However, it also has some disadvantages.
- One disadvantage is that it doesn't account for non-linear relationships between two variables. In this situation, non-linear regression is appropriate.
- Another disadvantage of linear regression is that the slope and intercept depend on the units of measurement and can be difficult to interpret without scaling these variables first.
- Multicollinearity occurs when two or more independent variables are strongly correlated in a regression model, making it tough or impossible to discern cause and effect in the model. This is typical in marketing because campaigns usually trigger at the same time or in close sequence, making it hard to discern what inputs are producing what outputs.
The importance of linear regression
Linear regression is a statistical technique that allows the analyst to conclude the relationship between two variables. Understanding the relationships between variables provides statistical insight into many real-world problems that involve measurements and observations.
The importance of linear regression is that it can be used to predict future values of one variable based on past observations of the other variable. This is fundamental in statistics and machine learning, where linear regression is often used to benchmark and develop more complex models.
Anyone studying science, marketing, advertising, some business-related fields and obviously maths should understand linear regression. So long as the data meets the five core assumptions of linear regression, performing regression analysis is pretty straightforward and reliable.
Three non-marketing examples of linear-regression success
Price elasticity
Prices affect consumer behavior. For example, if you change the price of a particular product, regression shows you what happens to sales as you drop or increase the price.
Of course, you’d expect sales to drop as you increase the price, but regression analysis can tell you how steeply sales drop, and what the maximum price is that maintains a specific revenue. Regression is often used in retail to analyze price changes and other inputs alongside consumer outputs, e.g., sales.
Risk analysis
Linear regression is frequently deployed in insurance and other risk-based businesses and business models. For example, insurers use regression to investigate insurance claims and build models estimating claim costs. This enables the insurers to make data-backed decisions about what risks to take on by predicting the cost of potential claims.
Sports analysis
Linear regression is used to analyze sports. For example, you could find out if the number of games won by a sports team is related to the average number of points or goals they score, discovering whether there’s a linear relationship or a threshold where winning is more likely or not. Regression is used by sports betting companies to set their odds and analyze risk.
Summary: Linear Regression
Linear regression (or linear regression analysis) is used to predict the value of a variable from the value of another variable.
This simple yet robust statistical technique is excellent at predicting future values, providing the dataset adheres to assumptions that are determinable when plotting the data as a scatter plot.
Performing regression analysis in Python is simple with NumPy and Scikit-learn, especially when handling lots of different variables in a multilinear regression analysis.
Linear regression is employed in many scientific and business fields that work with measurements and observations with a view of predicting future variables based on past variables. Not all data and problems are suited to linear regression, which is why it's essential to test linear regression's five core assumptions prior to spending substantial time and effort on constructing a regression model.