Summary
Difficulty: ★★★★☆
Covers: Simple linear regression, Regression vs correlation, Regression line and prediction equation, Residuals and prediction error, Explained variance and R-squared, Hypothesis testing of the slope, Regression assumptions, Running and interpreting regression in Stata
What is Regression?
Simple linear regression is a statistical method that predicts a numeric outcome (Y) from a numeric predictor (X).
It answers the question:
How much does Y change when X changes?
Regression is used when we want to predict, not just describe relationships.
Regression vs Correlation
Correlation and regression are closely related but serve different purposes.
Correlation (r)
Describes the strength and direction of the relationship between X and Y.
Regression
Uses that relationship to predict Y from X and explain variation in Y.
Regression can be thought of as an extension of correlation.
In simple linear regression:
R² = r²
This means the proportion of variance explained by the regression equals the squared correlation between X and Y.
Regression does not prove causation
Regression is usually used in correlational (non-experimental) research.
Even if X predicts Y:
This does not mean X causes Y.
Causation requires experimental design and converging evidence.
Regression shows prediction, not proof of cause.
When do we use regression?
Common research designs include:
- cross-sectional surveys (measure variables once)
- longitudinal studies (earlier measures predict later outcomes)
The regression line
Regression finds the line of best fit through a scatterplot.
This line summarises the relationship between X and Y.
The regression equation is:
ŷ = a + bX
Where:
| Symbol | Meaning | Plain English |
|---|---|---|
| ŷ | predicted Y | predicted outcome value |
| a | intercept | predicted Y when X = 0 |
| b | slope | change in Y for a 1-unit increase in X |
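To make the equation concrete, here is a short Python sketch that fits a and b by least squares on a small made-up dataset (the numbers are invented for illustration, not taken from these notes):

```python
# Least-squares fit of ŷ = a + b*X on a small illustrative dataset.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
# intercept: the fitted line always passes through (x̄, ȳ)
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))  # → 2.2 0.6
```

Here b = 0.6 would be read as: each 1-unit increase in X predicts a 0.6-unit increase in Y.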
Interpreting the slope (b)
The slope is the key result.
It tells you how much Y changes when X increases by one unit.
- Positive b → higher X predicts higher Y
- Negative b → higher X predicts lower Y
Example:
b = 0.26
Every 1-unit increase in X predicts a 0.26-unit increase in Y.
Residuals (prediction errors)
A residual is the difference between:
observed Y − predicted Y
Residual = Y − ŷ
Residuals measure prediction error.
- small residuals → good prediction
- large residuals → poor prediction
| Residual sign | Meaning |
|---|---|
| Positive | point lies above the regression line |
| Negative | point lies below the regression line |
Regression finds this line using least squares: it chooses the intercept and slope that minimise the sum of squared residuals.
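The residual calculation Y − ŷ can be sketched in Python (the coefficients and data here are invented for illustration):

```python
# Residuals for a fitted line ŷ = a + b*X (illustrative values, not from the notes).
a, b = 2.2, 0.6
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# residual = observed Y − predicted Y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# positive residual → point above the line; negative → below
print([round(r, 2) for r in residuals])

# for a least-squares fit, the residuals sum to (essentially) zero
print(abs(round(sum(residuals), 10)))
```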
Explained variance (R²)
Regression explains how much variation in Y is accounted for by X.
R² represents:
explained variance ÷ total variance in Y
R² ranges from 0 to 1 and is often reported as a percentage.
Examples:
- R² = .03 → 3% explained (small)
- R² = .25 → 25% explained (large)
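The identity R² = r² can be checked numerically. A Python sketch on made-up data, computing R² as explained ÷ total variance and comparing it to the squared correlation:

```python
import math

# Illustrative data (invented for demonstration).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

# fit the least-squares line, then get residual sum of squares
b = sxy / sxx
a = my - b * mx
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r2 = 1 - ss_res / syy           # explained ÷ total variance in Y
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation

print(round(r2, 3), round(r ** 2, 3))  # → 0.6 0.6 (they agree)
```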
Hypothesis testing in regression
We test whether the slope differs from zero.
Hypotheses:
H₀: b = 0 (X does not predict Y)
H₁: b ≠ 0 (X predicts Y)
Decision rule:
- p < .05 → significant predictor
- p ≥ .05 → not significant
The test statistic is:
t = b / SE(b)
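The test statistic can be computed by hand. In this Python sketch (data invented for illustration), SE(b) uses the standard simple-regression formula SE(b) = √((SS_res / (n − 2)) / Sxx):

```python
import math

# Illustrative data (invented for demonstration).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# standard error of the slope, then t = b / SE(b)
se_b = math.sqrt((ss_res / (n - 2)) / sxx)
t = b / se_b

print(round(t, 2))  # → 2.12, compared against the t distribution with n − 2 df
```

With n − 2 = 3 degrees of freedom, this t falls short of the two-tailed .05 critical value, so here the slope would not be significant.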
Assumptions of regression
Check three key assumptions:
- Relationship between X and Y is roughly linear
- Residuals are approximately normal
- Residual spread is constant (no pattern)
If these assumptions hold, the slope, p-value, and R² can be interpreted with confidence.
Running regression in Stata
Regression command:
regress y x
Visual check:
graph twoway (scatter y x) (lfit y x)
Residual checks:
predict r, residual
histogram r
swilk r
rvfplot, yline(0)
Reading regression output
| Output | Meaning |
|---|---|
| b (coefficient) | direction and size of prediction |
| p-value | significance of predictor |
| R² | proportion of variance explained |
Using regression to predict
Once you know a and b:
ŷ = a + bX
You can predict Y for any value of X.
This is used for:
- forecasting
- policy decisions
- real-world prediction
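A minimal Python sketch of prediction from fitted coefficients (the values of a and b are invented for illustration):

```python
# Prediction from a fitted equation ŷ = a + bX (illustrative coefficients).
a, b = 2.2, 0.6

def predict(x):
    """Return the predicted Y for a given X."""
    return a + b * x

print(round(predict(10), 1))  # → 8.2, the predicted Y when X = 10
```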
Reporting regression results
Significant result:
X significantly predicted Y, b = __, p = __, explaining __% of variance (R²).
Not significant:
X did not significantly predict Y, b = __, p = __.