Summary
Difficulty: ★★★★☆
Covers: Simple linear regression, Regression vs correlation, Regression line and prediction equation, Residuals and prediction error, Explained variance and R-squared, Hypothesis testing of the slope, Regression assumptions, Running and interpreting regression in Stata
What is Regression?
Simple linear regression predicts a numeric outcome (Y / DV) from one numeric predictor (X / IV).
- Correlation: describes the relationship between X and Y
- Regression: uses X to predict Y (and explains variation in Y)
Note: Regression is an extension of correlation.
Regression ≠ Causation
Regression is usually used in non-experimental (correlational) designs, so:
- If X predicts Y, that does not prove X causes Y
- Causation needs appropriate design (often experimental + converging evidence)
Regression vs Correlation
- Correlation (r): describes the strength and direction of a linear relationship
- Regression: uses that relationship to predict Y from X
In simple linear regression there is exactly one predictor (X) and one outcome (Y).
When Do We Use Regression?
Common designs:
- Cross-sectional surveys (measure many variables once)
- Longitudinal studies (use earlier measures to predict later outcomes)
The Regression Line + Equation
Regression finds the line of best fit on a scatterplot.
Regression equation
ŷ = a + bX
| Symbol | Meaning | Plain English |
|---|---|---|
| ŷ | predicted Y | predicted score on the outcome |
| a | intercept (alpha) | predicted Y when X = 0 |
| b | slope (beta) | change in predicted Y for a 1-unit increase in X |
Slope interpretation example:
If b=0.5, then every +1 in X predicts +0.5 in Y.
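As a quick numeric sketch of the prediction equation (the intercept and slope values here are hypothetical, chosen only for illustration):

```python
# Hypothetical intercept and slope (illustrative values, not from real data)
a = 2.0   # intercept: predicted Y when X = 0
b = 0.5   # slope: change in predicted Y per 1-unit increase in X

def predict(x):
    """Prediction equation: y-hat = a + b*X."""
    return a + b * x

print(predict(0))   # intercept alone -> 2.0
print(predict(1))   # one unit of X adds b -> 2.5
print(predict(10))  # 2.0 + 0.5*10 -> 7.0
```

Each extra unit of X shifts the prediction by exactly b, which is why the slope is the key quantity to interpret.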
Important: What the Slope (b) Tells You
The slope (b) is the key result.
- Positive b → higher X predicts higher Y
- Negative b → higher X predicts lower Y
Example:
- b = 0.26 → every extra 1 unit of X predicts +0.26 units in Y
What are Residuals? (Errors)
A residual is the difference between:
- observed Y
- predicted Y
Regression uses least squares to fit the line: it minimises the sum of the squared residuals.
Small residuals → good prediction
Large residuals → poor prediction
| Residual sign | Means |
|---|---|
| Positive | point is above the regression line |
| Negative | point is below the regression line |
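A minimal sketch of how residuals fall out of a least-squares fit, using a tiny made-up dataset (all numbers hypothetical):

```python
# Tiny made-up dataset (hypothetical values)
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx            # 0.8 for this data
a = y_bar - b * x_bar    # 1.5 for this data

# Residual = observed Y - predicted Y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)                 # positive -> above the line, negative -> below
print(round(sum(residuals), 9))  # least squares forces residuals to sum to ~0
```

Note that the residuals always sum to (approximately) zero for a least-squares line with an intercept; what varies is how large they are individually.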
Variance + R² (Why Regression Works)
Regression explains variance in Y.
R-squared (R²)
R² = (variance explained by the model) / (total variance in Y)
- R² ranges from 0 to 1
- Often expressed as a percentage
Examples:
- R² = .03 → 3% of variance explained (small)
- R² = .25 → 25% of variance explained (large)
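Continuing the same kind of tiny made-up dataset, R² can be computed as explained variance over total variance, or equivalently 1 minus unexplained over total (a sketch, not Stata output):

```python
# Tiny made-up dataset (hypothetical values) and its least-squares fit
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

ss_total = sum((y - y_bar) ** 2 for y in ys)                    # total variation in Y
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
r_squared = 1 - ss_resid / ss_total

print(round(r_squared, 4))  # ~0.64: the model explains ~64% of the variance in Y
```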
Hypothesis Testing in Regression
In regression, we test only the slope.
Hypotheses
- H₀: b=0 (X does not predict Y)
- H₁: b ≠ 0 (X does predict Y)
Decision rule:
- p < .05 → significant predictor
- p ≥ .05 → not significant
The test statistic is: t = b / SE(b)
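A sketch of the slope test on the same tiny made-up dataset: SE(b) is estimated from the residual variance (with n − 2 degrees of freedom), then t = b / SE(b). The p-value lookup needs a t-distribution table or a stats library, so it is omitted here:

```python
import math

# Tiny made-up dataset (hypothetical values) and its least-squares fit
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Residual variance uses n - 2 degrees of freedom (two estimated parameters: a and b)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se_b = math.sqrt((ss_resid / (n - 2)) / sxx)

t = b / se_b
print(round(t, 3))  # the t statistic Stata reports in the coefficient table
```

With n − 2 degrees of freedom, this t statistic is compared against the t-distribution to get the p-value shown in the regression output.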
Assumption Checks
You only need to check three things:
- Relationship between X and Y looks roughly linear
- Residuals are approximately normal
- Residuals show constant spread (no clear pattern)
If these look reasonable → interpret results.
Running Regression in Stata (Commands)
regress y x
Visual check
graph twoway (scatter y x) (lfit y x)
Residual checks
predict r, residual
histogram r
swilk r
rvfplot, yline(0)
Reading Stata Output
| Output piece | What it tells you |
|---|---|
| b (coefficient) | direction and size of prediction |
| p-value | is X a significant predictor? |
| R² | how much variance in Y is explained |
Using the Regression Equation to Predict
Once you have a and b, you can predict Y for any X: ŷ = a + bX
This is how regression is used for:
- prediction
- policy decisions
- real-world forecasting
How to Write the Result For Reports
Significant
X significantly predicted Y, b = __, p = __, explaining __% of the variance (R²).
Not significant
X did not significantly predict Y, b = __, p = __.