STAT 1103 Week 11 Notes: Simple Linear Regression

Summary

Difficulty: ★★★★☆

Covers: Simple linear regression, Regression vs correlation, Regression line and prediction equation, Residuals and prediction error, Explained variance and R-squared, Hypothesis testing of the slope, Regression assumptions, Running and interpreting regression in Stata

What is Regression?

Simple linear regression predicts a numeric outcome (Y / DV) from one numeric predictor (X / IV).

  • Correlation: describes the relationship between X and Y
  • Regression: uses X to predict Y (and explains variation in Y)

Note: Regression is an extension of correlation.

Regression ≠ Causation

Regression is usually used in non-experimental (correlational) designs, so:

  • If X predicts Y, that does not prove X causes Y
  • Causation needs appropriate design (often experimental + converging evidence)

Regression vs Correlation
  • Correlation (r): describes the strength and direction of a linear relationship
  • Regression: uses that relationship to predict Y from X

In simple linear regression: R² = r²

When Do We Use Regression?

Common designs:

  • Cross-sectional surveys (measure many variables once)
  • Longitudinal studies (use earlier measures to predict later outcomes)

The Regression Line + Equation

Regression finds the line of best fit on a scatterplot.

Regression equation

ŷ = a + bX

Symbol  Meaning            Plain English
ŷ       predicted Y        predicted score on the outcome
a       intercept (alpha)  predicted Y when X = 0
b       slope (beta)       change in Y for a 1-unit increase in X

Slope interpretation example:
If b = 0.5, then every +1 in X predicts +0.5 in Y.

Important: What the Slope (b) Tells You

The slope (b) is the key result.

  • Positive b → higher X predicts higher Y
  • Negative b → higher X predicts lower Y

Example:

  • b = 0.26 → every extra 1 unit of X predicts +0.26 units in Y
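The fit and the slope's meaning can be sketched in plain Python (this is not the course's Stata workflow; the data below are invented so the points lie exactly on y = 1.6 + 0.5x):

```python
# Minimal sketch of least-squares fitting, y-hat = a + bX.
# Pure Python; data are invented to sit exactly on y = 1.6 + 0.5x.

def fit_line(x, y):
    """Return (a, b): intercept and slope of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # slope: change in Y per 1-unit increase in X
    a = my - b * mx      # intercept: predicted Y when X = 0
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 2.6, 3.1, 3.6, 4.1]   # exactly y = 1.6 + 0.5x
a, b = fit_line(x, y)
print(a, b)                      # b = 0.5: every +1 in X predicts +0.5 in Y
```

Stata's `regress y x` reports these same two numbers in its coefficient table.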

What are Residuals? (Errors)

A residual is the difference between:

  • observed Y
  • predicted Y

Residual = Y − ŷ

Regression uses least squares to minimise the total squared error.

Small residuals → good prediction

Large residuals → poor prediction

Residual sign  Means
Positive       point is above the regression line
Negative       point is below the regression line
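A quick Python illustration of the sign rule (the line ŷ = 1 + 0.5X and the data are hypothetical, chosen just to show residuals on both sides):

```python
# Sketch: residual = observed Y minus predicted Y for a given line.
# The intercept, slope, and data are all invented for illustration.

a, b = 1.0, 0.5
x = [1, 2, 3, 4]
y = [1.7, 1.8, 2.6, 2.9]

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
for xi, r in zip(x, residuals):
    side = "above" if r > 0 else "below"
    print(f"x={xi}: residual {r:+.2f} ({side} the line)")
```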

Variance + R² (Why Regression Works)

Regression explains variance in Y.

R-squared (R²)

R² = (variance explained by the model) / (total variance in Y)

  • R² ranges from 0 to 1
  • Often expressed as a percentage

Examples:

  • R² = .03 → 3% of variance explained (small)
  • R² = .25 → 25% of variance explained (large)
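The R² definition, and the fact that it equals r² in simple regression, can be checked directly in Python (invented data, not from the course):

```python
import math

# Sketch: R^2 = explained variance / total variance, plus a check that in
# simple regression R^2 equals the squared Pearson correlation r.
# Data are invented for illustration.

x = [1, 2, 3, 4, 5]
y = [1.0, 2.2, 2.8, 4.1, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx                    # slope
a = my - b * mx                  # intercept
yhat = [a + b * xi for xi in x]  # predicted Y values

ss_model = sum((yh - my) ** 2 for yh in yhat)  # variance explained by the model
r_squared = ss_model / syy                     # explained / total
r = sxy / math.sqrt(sxx * syy)                 # Pearson correlation

print(round(r_squared, 4), round(r ** 2, 4))   # the two numbers match
```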

Hypothesis Testing in Regression

In regression, we test only the slope.

Hypotheses

  • H₀: b = 0 (X does not predict Y)
  • H₁: b ≠ 0 (X predicts Y)

Decision rule:

  • p < .05 → significant predictor
  • p ≥ .05 → not significant

The test statistic is t = b / SE(b), with df = n − 2.
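Computing t by hand shows where the statistic comes from (Python sketch on invented data; Stata's regress output reports t and the p-value for you):

```python
import math

# Sketch of the slope test: t = b / SE(b), where
# SE(b) = sqrt(SSE / (n - 2)) / sqrt(Sxx) and df = n - 2.
# Data are invented for illustration.

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 3.1, 4.2, 4.8, 5.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx
a = my - b * mx
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t = b / se_b
print(round(t, 2))   # large |t| -> small p -> significant predictor
```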

Assumption Checks

You only need to check three things:

  1. Relationship between X and Y looks roughly linear
  2. Residuals are approximately normal
  3. Residuals show constant spread (no clear pattern)

If these look reasonable → interpret results.
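One crude, non-graphical way to sense-check constant spread (check 3), alongside the visual plots the course uses: compare residual spread in the lower and upper halves of X. The residuals here are invented purely for illustration:

```python
import statistics

# Rough sketch: residual spread for small X vs large X.
# Similar spreads suggest roughly constant variance; residuals are invented.

residuals_low_x = [0.2, -0.1, 0.15, -0.2]    # residuals at the smaller X values
residuals_high_x = [0.1, -0.15, 0.05, -0.1]  # residuals at the larger X values

low = statistics.stdev(residuals_low_x)
high = statistics.stdev(residuals_high_x)
print(round(low, 3), round(high, 3))  # similar values -> no obvious fan pattern
```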

Running Regression in Stata (Commands)

regress y x

Visual check

graph twoway (scatter y x) (lfit y x)

Residual checks

predict r, residual
histogram r
swilk r
rvfplot, yline(0)


Reading Stata Output

Output piece     What it tells you
b (coefficient)  direction and size of prediction
p-value          is X a significant predictor?
R²               how much variance in Y is explained

Using the Regression Equation to Predict

Once you have a and b, you can predict Y for any X:

ŷ = a + bX

This is how regression is used for:

  • prediction
  • policy decisions
  • real-world forecasting
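For example, prediction is just substitution into the fitted equation (Python sketch; the intercept and slope are hypothetical numbers):

```python
# Sketch: plugging a new X into a fitted equation y-hat = a + bX.
# The intercept and slope values are invented for illustration.

a, b = 1.6, 0.5

def predict(x_new):
    """Predicted Y for a given X."""
    return a + b * x_new

print(round(predict(10), 2))   # 1.6 + 0.5 * 10 = 6.6
```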

How to Write the Result For Reports

Significant

X significantly predicted Y, b = __, p = __, explaining __% of the variance (R²).

Not significant

X did not significantly predict Y, b = __, p = __.
