Notes on Linear Regression – A Deeper Exploration

These are notes on linear regression concepts I created for reference in stock price and energy cost prediction research.

Source: An Introduction to Statistical Learning with Applications in R.

Abstract

Chapter 3 focused on the use cases and interpretation of linear regression, a simple technique with high interpretability but low flexibility in model fitting (pg. 25 ISL). The chapter covers simple regression, multiple linear regression, key considerations when fitting a model, and a comparison to KNN regression.


The basic model is essentially just the line $y = mx + b$ from Algebra I, but written as

$$Y \approx \hat{\beta_0} + \hat{\beta_1}X + \epsilon$$

We are trying to estimate good values for $\hat{\beta_0}$ and $\hat{\beta_1}$. This is done with least squares via linear algebra, which the book does not derive. $\epsilon$ is the error term, since no model can fit the data 100% perfectly in practice.
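As a quick illustration, here is a minimal Python sketch (my own made-up data, not from the book, whose labs use R) of estimating $\hat{\beta_0}$ and $\hat{\beta_1}$ with the closed-form least squares formulas:

```python
import numpy as np

# made-up data: y = 2 + 3x plus noise (the epsilon term)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# closed-form least squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)   # should come out close to 2 and 3
```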

Assessing Accuracy of Coefficients

We can see how far the data falls from our prediction line, which tells us how well our model is fitting. One way to quantify accuracy starts with the variance of a sample mean, $Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$. A good thing to note here is that as $n$ increases, $SE$ gets smaller, making the estimate more accurate. This is because more data points give a better idea of the overall trend.

If we want to find out how close $\hat{\beta_0}$ and $\hat{\beta_1}$ are to the actual values, we can use the following equations:

$$SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right], \quad SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

$(x_i-\bar{x})^2$ is the squared difference between the actual value of each point and the mean.

  • If actual and mean differ more, then the dataset is more spread out.
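Continuing the sketch above (same hypothetical `x`, `y`, `beta0_hat`, `beta1_hat`), the standard errors can be computed directly from these formulas, with $\sigma^2$ estimated from the residuals:

```python
# estimate sigma^2 from the residuals (this is RSE^2, defined later)
n = len(x)
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# plug into the SE formulas above
sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))
```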

Is There Even a Relationship?

To determine if there is a relationship between our predictor variable $X$ and output variable $Y$, we can use a t-test. This test is similar to looking at a normal distribution but is used when dealing with samples rather than entire populations. For example, if we wanted to measure income, we might get a sample of 1,000 people, not the entire world.

We will get two results from testing:

  • A t-statistic, which indicates how many standard deviations away we are from "normal".
  • A p-value, which helps us determine if there is any relationship between the variables. We can set our null hypothesis as $H_0: \beta_1 = 0$ (no relationship) and our alternative hypothesis as $H_1: \beta_1 \neq 0$ (some relationship).

We can calculate our t-statistic using the equation:

$$t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}$$

Notice the $-\,0$ in the numerator: that slot normally holds the true value of $\beta_1$, but our null hypothesis says there is no relationship, so we set it to 0 and see what happens.

  • If $t$ is large, maybe greater than 3, we can reject $H_0$.
  • Similarly, we can reject $H_0$ if our p-value is less than 0.05 (our $\alpha$ level).
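Still continuing the same sketch, a t-test for $\beta_1$ looks like this (assumes scipy is available for the t distribution):

```python
from scipy import stats

t_stat = (beta1_hat - 0) / se_beta1              # test H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value, p_value < 0.05)           # reject H0 if p-value < alpha
```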

Assessing Accuracy of Model

Another important metric is the Residual Sum of Squares (RSS), which measures the distance between the line and the actual points, squared, and summed up:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2$$

While RSS can tell us how much error we have, it is not easily interpretable. We have two other tools to help with this, starting with Residual Standard Error (RSE):

$$RSE = \sqrt{\frac{1}{n-2}RSS}$$

RSE normalizes RSS, making it more interpretable. However, it is still hard to know what a good RSE is. That's why we use $R^2$, the proportion of variance explained - a number between 0 (no fit) and 1 (a perfect model):

$$R^2 = 1 - \frac{RSS}{TSS}$$

Total Sum of Squares (TSS) is the sum of the squared differences between the actual points and the mean: $TSS = \sum_i (y_i - \bar{y})^2$.

In this model, TSS measures how much $Y$ varies, while RSS is the amount of variability the model is not able to explain. The difference $TSS - RSS$ is the amount of variability in the response that the model does explain.

  • We can also use correlation, given by another equation (pg. 70). In simple linear regression, $Cor(X, Y) = r$ and $r = R$, so the squared correlation equals $R^2$.
  • Even with this easier interpretation, a good $R^2$ still depends on the context. 0.2 might seem really bad, but in some contexts it can be good.
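All three quantities fall straight out of the residuals; continuing the same sketch:

```python
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                     # total sum of squares
rse = np.sqrt(rss / (n - 2))                          # residual standard error
r_squared = 1 - rss / tss                             # proportion of variance explained
```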

Multiple Linear Regression

We can now add more variables. Instead of just $\beta_1$, we can have many:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon$$

We interpret $\beta_j$ as the average effect on $Y$ of a 1-unit increase in $X_j$, with all other predictors fixed.

Similarly, RSS is given by:

$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \cdots - \hat{\beta_p}x_{ip}\right)^2$$

With simple regression, two different factors (like radio and news ads) might each appear to increase sales as more spending is put into that form of advertisement.

  • However, in the example, multiple regression fails to reject $H_0$ for news ads (indicating there isn't actually an increase in sales from increasing news ad spending).
  • When looking at a correlation matrix, you can see that radio and news ads are correlated around 0.35.
  • This is like saying that an increase in ice cream sales increases the number of shark attacks. Here, news ads don't increase sales; it's just that more radio ad spending tends to mean more news ad spending.
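Here is a minimal sketch of fitting a multiple regression by least squares on made-up data (two hypothetical predictors, where only the first one truly affects the response, loosely mimicking the radio/news situation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))                    # two hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(0, 1, size=100)

X_design = np.column_stack([np.ones(len(X)), X])         # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # [beta0, beta1, beta2]
print(beta_hat)
```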

Questions to Consider

Relationship Between Response and Predictors?

Similar to our simple linear regression model, we can set up a hypothesis test to see if there is any sort of relationship between our variables and the outcome. We use an F-test here, as there are multiple variables. First, we define the null and alternative hypotheses as:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad H_a: \text{at least one } \beta_j \neq 0$$

Next, we find our F-statistic:

$$F = \frac{(TSS - RSS) / p}{RSS / (n-p-1)}$$
  • The F-stat adjusts for the number of predictors $p$, so with many predictors, it is good to look at the F-stat, not just the individual p-values.
  • Under $H_0$, F-stat values are typically around 1. If the value is large, we can reject $H_0$, meaning there is a relationship.
  • This also means that $E\left[\frac{TSS - RSS}{p}\right] > \sigma^2$, or the expected value of the average amount of variability is greater than the noise/error.
  • This is good because it suggests that the model is explaining the variability in our data.
  • Remember: You can't fit a linear regression model if the number of parameters is greater than the number of samples $(p > n)$.
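Continuing the multiple-regression sketch, the F-statistic follows directly from TSS and RSS:

```python
n, p = X.shape
y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))  # values well above 1 -> reject H0
print(f_stat)
```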

Deciding Important Variables

After rejecting $H_0$, we need to figure out which variables are affecting the outcome so we can build a new model with only the important variables (this is called variable selection). Since there are $2^p$ possible models (way too many to try them all), there are three common approaches:

  1. Forward Selection: Begin with no predictors. At each step, add the variable that gives the lowest RSS. Continue until some stopping rule is met (see the sketch after this list).
  2. Backward Selection: Similar, but start with all variables and delete the one with the largest p-value, one at a time.
  3. Mixed: Combine the first two methods - do forward selection, but monitor the p-values as variables are added and remove any variable whose p-value grows too large (a backward step).

We will learn more about model fit and selection in Chapter 5.
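As a rough sketch of forward selection (my own simplified version: it greedily adds the RSS-minimizing predictor and stops after `max_vars` variables, whereas a real stopping rule would use something like adjusted $R^2$ or cross-validation):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that lowers RSS the most at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            design = np.column_stack([np.ones(len(y)), X[:, cols]])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```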

Model Fit

$R^2$ will always increase with more added variables. If we see that adding a parameter only increases $R^2$ slightly, we can drop it from the model. Keep in mind that overfitting will lead to bad results on independent test samples.

Predictions

There are three types of uncertainty:

  1. The least squares plane $\hat{Y}$ is only an estimate of the true $f(X)$. We can quantify this uncertainty by computing a confidence interval (CI).
  2. This is a linear approximation, so model bias exists.
  3. Random error $\epsilon$ will always exist as irreducible error. We can use prediction intervals (rather than CIs), which give a wider range of uncertainty.

Qualitative Predictors

For non-numeric predictors (like marital status, employment status, etc.) with two possible values, we can create an indicator/dummy variable:

$$x_i = \begin{cases} 1 & \text{if true/something} \\ 0 & \text{if false/not something} \end{cases}$$

*Personal note: Recall from linear algebra that the least squares setup for this dummy variable looks like:*

$$y_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if true/something} \\ \beta_0 + \epsilon_i & \text{if false/not something} \end{cases}$$
  • You can also use -1 instead of 0. It means that the coefficient will be interpreted differently, but the model will still work the same.
  • Think of it as adding a gray area instead of black and white (yes or no). Here, the value of $\beta_0$ would be between the values for yes and no.
  • For more than two levels/possible values, you need more dummy variables. Here is an example:

We are looking at four regions of the US - North, South, West, East. We first define our baseline level, East, whose effect is captured by $\beta_0$. This means that we will be comparing all the other regions to East:

$$\begin{cases} \beta_0 + \beta_1 & \text{if North} \\ \beta_0 + \beta_2 & \text{if South} \\ \beta_0 + \beta_3 & \text{if West} \\ \beta_0 & \text{if East (base)} \end{cases}$$
  • Example: South will have \$18 less debt than East if $\beta_2 = -18$.
  • We use an F-test to test these relationships and their importance. By using different codings (0/1 vs. -1/1), we can measure different contrasts.
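One way to build the dummy columns for a region variable like this (pandas is assumed here; dropping the East column makes East the baseline captured by $\beta_0$):

```python
import pandas as pd

regions = pd.Series(["North", "South", "West", "East", "South", "North"])
dummies = pd.get_dummies(regions)          # one 0/1 column per level
dummies = dummies.drop(columns=["East"])   # drop East so it becomes the baseline
print(dummies)
```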

Extending The Linear Model

So far, we have assumed that the relationship between the predictors and the response is additive and linear - that is, the effect of predictor $X_j$ on the response $Y$ does not depend on the values of the other predictors, and the change in $Y$ from a 1-unit change in $X_j$ is constant.

However, this often isn't the case. For example, splitting a marketing budget across different channels can work better than putting 100% of the budget into just YouTube ads. This is known as synergy or interaction.

We can extend the model by adding an interaction term:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon$$

Rewritten:

$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon$$

Now, notice that there is an $X_2$ term inside the coefficient on $X_1$, meaning that a change in $X_2$ changes the effect of $X_1$ on $Y$.

  • $\beta_3$ is the increase in the effectiveness of $X_1$ associated with a 1-unit increase in $X_2$.
  • If $H_0$ is rejected for $\beta_3$, then we have evidence that the true relationship is not additive (the effect of a predictor on the response does depend on the other predictors).

Hierarchical principle: if we include the interaction term $X_1 \times X_2$ (coefficient $\beta_3$), we should also include the main effects $\beta_1$ and $\beta_2$ in our model, even if their p-values are large.

Another thing you can try is polynomial regression, which is still fit like a linear regression, just with powers of a predictor included as extra terms:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2 + \epsilon$$

Be careful, though: adding more powers may result in overfitting. More on this in Chapter 7.
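In practice, both interaction and polynomial terms are just extra columns in the design matrix. A small sketch reusing the hypothetical `X` and `y` from the multiple-regression example:

```python
import numpy as np

x1, x2 = X[:, 0], X[:, 1]
design = np.column_stack([
    np.ones(len(y)),  # intercept (beta_0)
    x1, x2,           # main effects (beta_1, beta_2)
    x1 * x2,          # interaction term X_1 * X_2 (beta_3)
    x1 ** 2,          # polynomial term X_1^2
])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```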

Potential Issues with Regression Model

Non-Linearity

To see if the relationship is non-linear, we can plot the residuals $e_i = y_i - \hat{y_i}$ vs. the predictor $x_i$. Ideally, there is no discernible pattern. However, if we see something like a U-shape, the relationship might be non-linear. A simple way to fix this is to add transformed versions of the predictors (like polynomial regression), e.g., $\log(X)$, $\sqrt{X}$, $X^2$.

In the case of multiple regression, plot the residuals vs. the predicted (fitted) values $\hat{y_i}$.
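A quick residual plot for that check might look like this (matplotlib assumed; continues the multiple-regression sketch):

```python
import matplotlib.pyplot as plt

fitted = X_design @ beta_hat
residuals = y - fitted
plt.scatter(fitted, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()  # a U-shape or funnel here signals trouble
```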

Correlation of Error Terms

We assume that error terms are uncorrelated (if $\epsilon_i$ is positive, that doesn't mean $\epsilon_{i+1}$ is positive as well).

  • If there is a correlation, the model will underestimate true standard errors (the CI will be too narrow).
  • Plot the residuals as a function of time. If it is positively correlated, we may see tracking, or adjacent residuals having similar values.

Non-Constant Variance of Error Terms

We also assume that $Var(\epsilon_i) = \sigma^2$ is constant. However, sometimes the variance of $\epsilon$ increases with the value of the response (giving a funnel shape in the residual plot). This is called heteroscedasticity.

  • We can reduce this by transforming the response $Y$ with a concave function like $\log(Y)$ or $\sqrt{Y}$.
  • This results in a greater amount of shrinkage of the larger responses.
  • Since this changes the data, we have to interpret the model differently. If we use $\log(Y)$, the model now explains changes in the logarithm of the response rather than the response itself. This gives more reliable inference, since the parameters are fit to the transformed relationship without heteroscedasticity. To estimate values on the original scale, we run the model and then convert back, in this case with $e^{\hat{Y}}$ (see the sketch after this list).
  • If each response is an average of $n_i$ raw observations, its variance is $\sigma_i^2 = \sigma^2 / n_i$, and we can fit the model by weighted least squares with weights $w_i = n_i$.
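A sketch of the log-transform fix on made-up data with funnel-shaped (multiplicative) noise:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, size=200))  # positive, heteroscedastic response

design = np.column_stack([np.ones(len(x)), x])
beta_log, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)  # fit on log(Y)
y_pred = np.exp(design @ beta_log)                             # convert back with e^Y
```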

Outliers

To find outliers, plot the studentized residuals (each residual $e_i$ divided by its estimated standard error).

  • If a studentized residual is greater than 3 in absolute value, it is a possible outlier.
  • Usually, just remove that data point. Make sure to be careful, as it might indicate a deficiency with the model (like a missing predictor).

High Leverage Points

These are unusual values of the predictor $x_i$ (our data might be modeling family size against income, and we have one family size of 20). These points become harder to spot in multiple linear regression, where an observation can sit away from the rest of the data without having an extreme value in any single predictor. To find these points, we use the leverage statistic, where high values are suspicious.

For simple linear regression,

$$h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}$$
  • $h_i$ increases with the distance between $x_i$ and $\bar{x}$.
  • $h_i$ is always between $\frac{1}{n}$ and 1.
  • The average leverage over all observations is always $\frac{p+1}{n}$ (if an observation's leverage is much higher, it is suspicious; see the sketch after this list).
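A small sketch implementing the simple-regression leverage formula above, flagging points well above the average $(p+1)/n$ (the 2x cutoff is just a common rule of thumb, not from the book):

```python
import numpy as np

def leverage_simple(x):
    """Leverage h_i for simple linear regression."""
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    return 1.0 / len(x) + (x - x_bar) ** 2 / sxx

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 20.0])  # one unusual predictor value
h = leverage_simple(x)
p = 1                                          # one predictor in simple regression
print(h > 2 * (p + 1) / len(x))                # flag high-leverage points
```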

Collinearity

When we have two or more predictor variables that are closely related to each other, it is hard to separate their individual effects. This causes $SE(\hat{\beta_j})$ to increase, which causes the t-statistic to decrease. This means the power of the hypothesis test (the chance of correctly detecting a non-zero coefficient) is reduced. A simple way to test for this is to look at a correlation matrix and look for highly correlated pairs.

However, collinearity can still exist between three or more variables even when no single pair is highly correlated. To detect this, compute the Variance Inflation Factor (VIF), the ratio of:

$$\frac{\text{variance of } \hat{\beta_j} \text{ when fitting the full model}}{\text{variance of } \hat{\beta_j} \text{ if fit on its own}}$$

The smallest number possible is 1 (no collinearity at all), and a VIF greater than 5 or 10 indicates issues.

$$VIF(\hat{\beta_j}) = \frac{1}{1 - R_{X_j|X_{-j}}^2}$$
  • $\hat{\beta_j}$: the estimated coefficient for the jth predictor.
  • $R_{X_j|X_{-j}}^2$: the $R^2$ from regressing the jth predictor $X_j$ on all the other predictors $X_{-j}$ (all predictors except $X_j$).
  • Example: with collinearity between $X_1$ and $X_2$, this $R^2$ will be high, so $1-R^2$ is a small number, and $\frac{1}{\text{small number}} = \text{big number}$ (see the sketch below).
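A sketch of computing VIF directly, by regressing each predictor on the others and plugging that $R^2$ into the formula (a hypothetical predictor matrix `X` like the one in the multiple-regression sketch is assumed):

```python
import numpy as np

def vif(X, j):
    """VIF for predictor j: regress X_j on the other columns and use that R^2."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)
```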

To deal with it,

  1. Drop one of the variables that is redundant.
  2. Combine variables (could make a new variable by taking the average of both).

Linear Regression vs KNN

Linear regression is parametric - it assumes a linear form. Meanwhile, K-Nearest Neighbors (KNN) regression is a simple non-parametric model (doesn't take a particular form). How it works:

  • Pick a $K$ (small for a rough fit, high for a smooth one).
  • Pick a prediction point $x_0$.
  • Given a set of data points, KNN looks for the $K$ points closest to $x_0$ (the neighborhood $\mathcal{N}_0$). Then, it estimates $f(x_0)$ with the average response over $\mathcal{N}_0$ (see the sketch after this list): $$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i\in\mathcal{N}_0}y_i$$ The optimal value of $K$ depends on the bias-variance trade-off.
  • Small $K$ = most flexible: low bias, high variance.
  • Large $K$ = changing just one observation won't affect the model too much, but smoothing may cause bias by masking some structure in $f(x)$.
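A minimal KNN regression sketch implementing the averaging formula above, with plain numpy and one-dimensional made-up data:

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Estimate f(x0) as the average response of the k nearest training points."""
    dist = np.abs(x_train - x0)        # distances to the prediction point
    nearest = np.argsort(dist)[:k]     # indices of the k closest points (N_0)
    return y_train[nearest].mean()

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
print(knn_regress(x_train, y_train, x0=2.4, k=3))
```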

Some other points:

  • If the data matches a parametric form ($x$, $x^2$, etc.), a parametric model will perform better.
  • In higher dimensions, KNN is worse than linear regression.
    • Spreading the same number of observations over more dimensions is effectively a reduction in sample size near any given point.
    • This is known as the curse of dimensionality (where an observation has no nearby neighbors).