Notes on Linear Regression – A Deeper Exploration

These are notes on linear regression concepts I created for reference in stock price and energy cost prediction research.

Source: An Introduction to Statistical Learning with Applications in R.

Abstract

Chapter 3 focused on the use cases and interpretation of linear regression, a simple technique with high interpretability but low flexibility in model fitting (pg. 25 ISL). The chapter covers simple regression, multiple linear regression, key considerations when fitting a model, and a comparison to KNN regression.


The basic model is essentially just the line $y = mx + b$ from Algebra I, but written as

$$Y \approx \hat{\beta_0} + \hat{\beta_1}X + \epsilon$$

We are trying to estimate good values for $\hat{\beta_0}$ and $\hat{\beta_1}$. This is done with least squares via linear algebra, which the book does not derive. $\epsilon$ is the error term, since no model can fit the data 100% perfectly in practice.
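As a quick illustration, here is a minimal Python sketch (my own made-up data, not from the book, whose labs use R) of estimating $\hat{\beta_0}$ and $\hat{\beta_1}$ with the closed-form least squares formulas:

```python
import numpy as np

# made-up data: y = 2 + 3x plus noise (the epsilon term)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# closed-form least squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)   # should come out close to 2 and 3
```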

Assessing Accuracy of Coefficients

We can see how far the data falls from our prediction line, which tells us how well our model is fitting. One way to quantify accuracy starts with the variance of a sample mean, $Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$. A good thing to note here is that as $n$ increases, $SE$ gets smaller, making the estimate more accurate. This is because more data points give a better idea of the overall trend.

If we want to find out how close $\hat{\beta_0}$ and $\hat{\beta_1}$ are to the actual values, we can use the following equations:

$$SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right], \quad SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

$(x_i-\bar{x})^2$ is the squared difference between the actual value of each point and the mean.

  • If actual and mean differ more, then the dataset is more spread out.
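Continuing the sketch above (same hypothetical `x`, `y`, `beta0_hat`, `beta1_hat`), the standard errors can be computed directly from these formulas, with $\sigma^2$ estimated from the residuals:

```python
# estimate sigma^2 from the residuals (this is RSE^2, defined later)
n = len(x)
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# plug into the SE formulas above
sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))
```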

Is There Even a Relationship?

To determine if there is a relationship between our predictor variable $X$ and output variable $Y$, we can use a t-test. This test is similar to looking at a normal distribution but is used when dealing with samples rather than entire populations. For example, if we wanted to measure income, we might get a sample of 1,000 people, not the entire world.

We will get two results from testing:

  • A t-statistic, which indicates how many standard deviations away we are from "normal".
  • A p-value, which helps us determine if there is any relationship between the variables. We can set our null hypothesis as $H_0: \beta_1 = 0$ (no relationship) and our alternative hypothesis as $H_1: \beta_1 \neq 0$ (some relationship).

We can calculate our t-statistic using the equation:

$$t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}$$

Notice the $-\,0$ in the numerator: that slot normally holds the true value of $\beta_1$, but our null hypothesis says there is no relationship, so we set it to 0 and see what happens.

  • If $t$ is large, maybe greater than 3, we can reject $H_0$.
  • Similarly, we can reject $H_0$ if our p-value is less than 0.05 (our $\alpha$ level).
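Still continuing the same sketch, a t-test for $\beta_1$ looks like this (assumes scipy is available for the t distribution):

```python
from scipy import stats

t_stat = (beta1_hat - 0) / se_beta1              # test H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value, p_value < 0.05)           # reject H0 if p-value < alpha
```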

Assessing Accuracy of Model

Another important metric is the Residual Sum of Squares (RSS), which measures the distance between the line and the actual points, squared, and summed up:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2$$

While RSS can tell us how much error we have, it is not easily interpretable. We have two other tools to help with this, starting with Residual Standard Error (RSE):

$$RSE = \sqrt{\frac{1}{n-2}RSS}$$

RSE normalizes RSS, making it more interpretable. However, it is still hard to know what a good RSE is. That's why we use $R^2$, the proportion of variance explained - a number between 0 (no fit) and 1 (a perfect model):

$$R^2 = 1 - \frac{RSS}{TSS}$$

Total Sum of Squares (TSS) is the sum of the squared differences between the actual points and the mean: $TSS = \sum_i (y_i - \bar{y})^2$.

In this model, TSS measures how much $Y$ varies, while RSS is the amount of variability the model is not able to explain. The difference $TSS - RSS$ is the amount of variability in the response that the model does explain.

  • We can also use correlation, given by another equation (pg. 70). In simple linear regression, $Cor(X, Y) = r$ and $r = R$, so the squared correlation equals $R^2$.
  • Even with this easier interpretation, a good $R^2$ still depends on the context. 0.2 might seem really bad, but in some contexts it can be good.
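All three quantities fall straight out of the residuals; continuing the same sketch:

```python
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                     # total sum of squares
rse = np.sqrt(rss / (n - 2))                          # residual standard error
r_squared = 1 - rss / tss                             # proportion of variance explained
```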

Multiple Linear Regression

We can now add more variables. Instead of just $\beta_1$, we can have many:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon$$

We interpret $\beta_j$ as the average effect on $Y$ of a 1-unit increase in $X_j$, with all other predictors fixed.

Similarly, RSS is given by:

$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \cdots - \hat{\beta_p}x_{ip}\right)^2$$

With simple regression, two different factors (like radio and news ads) might each appear to increase sales as more spending is put into that form of advertisement.

  • However, in the example, multiple regression fails to reject $H_0$ for news ads (indicating there isn't actually an increase in sales from increasing news ad spending).
  • When looking at a correlation matrix, you can see that radio and news ads are correlated around 0.35.
  • This is like saying that an increase in ice cream sales increases the number of shark attacks. Here, news ads don't increase sales; it's just that more radio ad spending tends to mean more news ad spending.
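Here is a minimal sketch of fitting a multiple regression by least squares on made-up data (two hypothetical predictors, where only the first one truly affects the response, loosely mimicking the radio/news situation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))                    # two hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(0, 1, size=100)

X_design = np.column_stack([np.ones(len(X)), X])         # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # [beta0, beta1, beta2]
print(beta_hat)
```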

Questions to Consider

Relationship Between Response and Predictors?

Similar to our simple linear regression model, we can set up a hypothesis test to see if there is any sort of relationship between our variables and the outcome. We use an F-test here, as there are multiple variables. First, we define the null and alternative hypotheses as:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad H_a: \text{at least one } \beta_j \neq 0$$

Next, we find our F-statistic:

$$F = \frac{(TSS - RSS) / p}{RSS / (n-p-1)}$$
  • The F-stat adjusts for the number of predictors $p$, so with many predictors, it is good to look at the F-stat, not just the individual p-values.
  • Under $H_0$, F-stat values are typically around 1. If the value is large, we can reject $H_0$, meaning there is a relationship.
  • This also means that $E\left[\frac{TSS - RSS}{p}\right] > \sigma^2$, or the expected value of the average amount of variability is greater than the noise/error.
  • This is good because it suggests that the model is explaining the variability in our data.
  • Remember: You can't fit a linear regression model if the number of parameters is greater than the number of samples $(p > n)$.
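Continuing the multiple-regression sketch, the F-statistic follows directly from TSS and RSS:

```python
n, p = X.shape
y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))  # values well above 1 -> reject H0
print(f_stat)
```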

Deciding Important Variables

After rejecting $H_0$, we need to figure out which variables are affecting the outcome so we can build a new model with only the important variables (this is called variable selection). Since there are $2^p$ possible models (way too many to try them all), there are three common approaches:

  1. Forward Selection: Begin with no predictors. At each step, add the variable that gives the lowest RSS. Continue until some stopping rule is met (see the sketch after this list).
  2. Backward Selection: Similar, but start with all variables and delete the one with the largest p-value, one at a time.
  3. Mixed: Combine the first two methods - do forward selection, but monitor the p-values as variables are added and remove any variable whose p-value grows too large (a backward step).

We will learn more about model fit and selection in Chapter 5.
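As a rough sketch of forward selection (my own simplified version: it greedily adds the RSS-minimizing predictor and stops after `max_vars` variables, whereas a real stopping rule would use something like adjusted $R^2$ or cross-validation):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that lowers RSS the most at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            design = np.column_stack([np.ones(len(y)), X[:, cols]])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```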

Model Fit

$R^2$ will always increase with more added variables. If we see that adding a parameter only increases $R^2$ slightly, we can drop it from the model. Keep in mind that overfitting will lead to bad results on independent test samples.

Predictions

There are three types of uncertainty:

  1. The least squares plane $\hat{Y}$ is only an estimate of the true $f(X)$. We can quantify this uncertainty by computing a confidence interval (CI).
  2. This is a linear approximation, so model bias exists.
  3. Random error $\epsilon$ will always exist as irreducible error. We can use prediction intervals (rather than CIs), which give a wider range of uncertainty.

Qualitative Predictors

For non-numeric predictors (like marital status, employment status, etc.) with two possible values, we can create an indicator/dummy variable:

$$x_i = \begin{cases} 1 & \text{if true/something} \\ 0 & \text{if false/not something} \end{cases}$$

*Personal note: Recall from linear algebra that the least squares setup for this dummy variable looks like:*

$$y_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if true/something} \\ \beta_0 + \epsilon_i & \text{if false/not something} \end{cases}$$
  • You can also use -1 instead of 0. It means that the coefficient will be interpreted differently, but the model will still work the same.
  • Think of it as adding a gray area instead of black and white (yes or no). Here, the value of $\beta_0$ would be between the values for yes and no.
  • For more than two levels/possible values, you need more dummy variables. Here is an example:

We are looking at four regions of the US - North, South, West, East. We first define our baseline level, East, whose effect is captured by $\beta_0$. This means that we will be comparing all the other regions to East:

$$\begin{cases} \beta_0 + \beta_1 & \text{if North} \\ \beta_0 + \beta_2 & \text{if South} \\ \beta_0 + \beta_3 & \text{if West} \\ \beta_0 & \text{if East (base)} \end{cases}$$
  • Example: South will have \$18 less debt than East if $\beta_2 = -18$.
  • We use an F-test to test these relationships and their importance. By using different codings (0/1 vs. -1/1), we can measure different contrasts.
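One way to build the dummy columns for a region variable like this (pandas is assumed here; dropping the East column makes East the baseline captured by $\beta_0$):

```python
import pandas as pd

regions = pd.Series(["North", "South", "West", "East", "South", "North"])
dummies = pd.get_dummies(regions)          # one 0/1 column per level
dummies = dummies.drop(columns=["East"])   # drop East so it becomes the baseline
print(dummies)
```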

Extending The Linear Model

So far, we have assumed that the relationship between the predictors and the response is additive and linear - that is, the effect of predictor $X_j$ on the response $Y$ does not depend on the values of the other predictors, and the change in $Y$ from a 1-unit change in $X_j$ is constant.

However, this often isn't the case. For example, splitting a marketing budget across different channels can work better than putting 100% of the budget into just YouTube ads. This is known as synergy or interaction.

We can extend the model by adding an interaction term:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon$$

Rewritten:

$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon$$

Now, notice that there is an $X_2$ term inside the coefficient on $X_1$, meaning that a change in $X_2$ changes the effect of $X_1$ on $Y$.

  • $\beta_3$ is the increase in the effectiveness of $X_1$ associated with a 1-unit increase in $X_2$.
  • If $H_0$ is rejected for $\beta_3$, then we have evidence that the true relationship is not additive (the effect of a predictor on the response does depend on the other predictors).

Hierarchical principle: if we include the interaction term $X_1 \times X_2$ (coefficient $\beta_3$), we should also include the main effects $\beta_1$ and $\beta_2$ in our model, even if their p-values are large.

Another thing you can try is polynomial regression, which is still fit like a linear regression, just with powers of a predictor included as extra terms:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2 + \epsilon$$

Be careful, though: adding more powers may result in overfitting. More on this in Chapter 7.
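In practice, both interaction and polynomial terms are just extra columns in the design matrix. A small sketch reusing the hypothetical `X` and `y` from the multiple-regression example:

```python
import numpy as np

x1, x2 = X[:, 0], X[:, 1]
design = np.column_stack([
    np.ones(len(y)),  # intercept (beta_0)
    x1, x2,           # main effects (beta_1, beta_2)
    x1 * x2,          # interaction term X_1 * X_2 (beta_3)
    x1 ** 2,          # polynomial term X_1^2
])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```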

Potential Issues with Regression Model

Non-Linearity

To see if the relationship is non-linear, we can plot the residuals $e_i = y_i - \hat{y_i}$ vs. the predictor $x_i$. Ideally, there is no discernible pattern. However, if we see something like a U-shape, the relationship might be non-linear. A simple way to fix this is to add transformed versions of the predictors (like polynomial regression), e.g., $\log(X)$, $\sqrt{X}$, $X^2$.

In the case of multiple regression, plot the residuals vs. the predicted (fitted) values $\hat{y_i}$.
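A quick residual plot for that check might look like this (matplotlib assumed; continues the multiple-regression sketch):

```python
import matplotlib.pyplot as plt

fitted = X_design @ beta_hat
residuals = y - fitted
plt.scatter(fitted, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()  # a U-shape or funnel here signals trouble
```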

Correlation of Error Terms

We assume that error terms are uncorrelated (if $\epsilon_i$ is positive, that doesn't mean $\epsilon_{i+1}$ is positive as well).

  • If there is a correlation, the model will underestimate true standard errors (the CI will be too narrow).
  • Plot the residuals as a function of time. If it is positively correlated, we may see tracking, or adjacent residuals having similar values.

Non-Constant Variance of Error Terms

We also assume that $Var(\epsilon_i) = \sigma^2$ is constant. However, sometimes the variance of $\epsilon$ increases with the value of the response (giving a funnel shape in the residual plot). This is called heteroscedasticity.

  • We can reduce this by transforming the response $Y$ with a concave function like $\log(Y)$ or $\sqrt{Y}$.
  • This results in a greater amount of shrinkage of the larger responses.
  • Since this changes the data, we have to interpret the model differently. If we use $\log(Y)$, the model now explains changes in the logarithm of the response rather than the response itself. This gives more reliable inference, since the parameters are fit to the transformed relationship without heteroscedasticity. To estimate values on the original scale, we run the model and then convert back, in this case with $e^{\hat{Y}}$ (see the sketch after this list).
  • If each response is an average of $n_i$ raw observations, its variance is $\sigma_i^2 = \sigma^2 / n_i$, and we can fit the model by weighted least squares with weights $w_i = n_i$.
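A sketch of the log-transform fix on made-up data with funnel-shaped (multiplicative) noise:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, size=200))  # positive, heteroscedastic response

design = np.column_stack([np.ones(len(x)), x])
beta_log, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)  # fit on log(Y)
y_pred = np.exp(design @ beta_log)                             # convert back with e^Y
```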

Outliers

To find outliers, plot the studentized residuals (each residual $e_i$ divided by its estimated standard error).

  • If a studentized residual is greater than 3 in absolute value, it is a possible outlier.
  • Usually, just remove that data point. Make sure to be careful, as it might indicate a deficiency with the model (like a missing predictor).

High Leverage Points

These are unusual values of the predictor $x_i$ (our data might be modeling family size against income, and we have one family size of 20). These points become harder to spot in multiple linear regression, where an observation can sit away from the rest of the data without having an extreme value in any single predictor. To find these points, we use the leverage statistic, where high values are suspicious.

For simple linear regression,

$$h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}$$
  • $h_i$ increases with the distance between $x_i$ and $\bar{x}$.
  • $h_i$ is always between $\frac{1}{n}$ and 1.
  • The average leverage over all observations is always $\frac{p+1}{n}$ (if an observation's leverage is much higher, it is suspicious; see the sketch after this list).
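A small sketch implementing the simple-regression leverage formula above, flagging points well above the average $(p+1)/n$ (the 2x cutoff is just a common rule of thumb, not from the book):

```python
import numpy as np

def leverage_simple(x):
    """Leverage h_i for simple linear regression."""
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    return 1.0 / len(x) + (x - x_bar) ** 2 / sxx

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 20.0])  # one unusual predictor value
h = leverage_simple(x)
p = 1                                          # one predictor in simple regression
print(h > 2 * (p + 1) / len(x))                # flag high-leverage points
```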

Collinearity

When we have two or more predictor variables that are closely related to each other, it is hard to separate their individual effects. This causes $SE(\hat{\beta_j})$ to increase, which causes the t-statistic to decrease. This means the power of the hypothesis test (the chance of correctly detecting a non-zero coefficient) is reduced. A simple way to test for this is to look at a correlation matrix and look for highly correlated pairs.

However, collinearity can still exist between three or more variables even when no single pair is highly correlated. To detect this, compute the Variance Inflation Factor (VIF), the ratio of:

$$\frac{\text{variance of } \hat{\beta_j} \text{ when fitting the full model}}{\text{variance of } \hat{\beta_j} \text{ if fit on its own}}$$

The smallest number possible is 1 (no collinearity at all), and a VIF greater than 5 or 10 indicates issues.

$$VIF(\hat{\beta_j}) = \frac{1}{1 - R_{X_j|X_{-j}}^2}$$
  • $\hat{\beta_j}$: the estimated coefficient for the jth predictor.
  • $R_{X_j|X_{-j}}^2$: the $R^2$ from regressing the jth predictor $X_j$ on all the other predictors $X_{-j}$ (all predictors except $X_j$).
  • Example: with collinearity between $X_1$ and $X_2$, this $R^2$ will be high, so $1-R^2$ is a small number, and $\frac{1}{\text{small number}} = \text{big number}$ (see the sketch below).
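A sketch of computing VIF directly, by regressing each predictor on the others and plugging that $R^2$ into the formula (a hypothetical predictor matrix `X` like the one in the multiple-regression sketch is assumed):

```python
import numpy as np

def vif(X, j):
    """VIF for predictor j: regress X_j on the other columns and use that R^2."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)
```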

To deal with it,

  1. Drop one of the variables that is redundant.
  2. Combine variables (could make a new variable by taking the average of both).

Linear Regression vs KNN

Linear regression is parametric - it assumes a linear form. Meanwhile, K-Nearest Neighbors (KNN) regression is a simple non-parametric model (doesn't take a particular form). How it works:

  • Pick a $K$ (small for a rough fit, high for a smooth one).
  • Pick a prediction point $x_0$.
  • Given a set of data points, KNN looks for the $K$ points closest to $x_0$ (the neighborhood $\mathcal{N}_0$). Then, it estimates $f(x_0)$ with the average response over $\mathcal{N}_0$ (see the sketch after this list): $$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i\in\mathcal{N}_0}y_i$$ The optimal value of $K$ depends on the bias-variance trade-off.
  • Small $K$ = most flexible: low bias, high variance.
  • Large $K$ = changing just one observation won't affect the model too much, but smoothing may cause bias by masking some structure in $f(x)$.
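A minimal KNN regression sketch implementing the averaging formula above, with plain numpy and one-dimensional made-up data:

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Estimate f(x0) as the average response of the k nearest training points."""
    dist = np.abs(x_train - x0)        # distances to the prediction point
    nearest = np.argsort(dist)[:k]     # indices of the k closest points (N_0)
    return y_train[nearest].mean()

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
print(knn_regress(x_train, y_train, x0=2.4, k=3))
```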

Some other points:

  • If the data matches a parametric form ($x$, $x^2$, etc.), a parametric model will perform better.
  • In higher dimensions, KNN is worse than linear regression.
    • Spreading the same number of observations over more dimensions is effectively a reduction in sample size near any given point.
    • This is known as the curse of dimensionality (where an observation has no nearby neighbors).