Lecture 14 - From Correlation to Prediction: Understanding Linear Relationships
2025-07-29
Linear Regression: From Correlation to Prediction
“The best-fitting line through a cloud of points tells a story about relationships”
Linear regression helps us understand and predict relationships between two quantitative variables.
Key Components:
Response variable (y): What we want to predict
Explanatory variable (x): What we use to predict
Regression line: Best-fitting straight line through data
The Model: \hat{y} = a + bx
Where:
\hat{y} = predicted value
a = y-intercept
b = slope
x = explanatory variable value
Goal: Find the line that minimizes the sum of squared residuals:
\text{Minimize: } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Key Insight: The regression line minimizes the sum of the squared vertical distances (residuals) between the data points and the line.
For the regression line \hat{y} = a + bx:
Slope: b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}
Alternative slope formula: b = r \cdot \frac{s_y}{s_x}
Y-intercept: a = \bar{y} - b\bar{x}
Where:
r = correlation coefficient
s_x, s_y = standard deviations
s_{xy} = covariance
\bar{x}, \bar{y} = sample means
Important: The regression line always passes through (\bar{x}, \bar{y})!
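The slope and intercept formulas translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `fit_line` and the five data pairs are made up for illustration and are not part of the lecture.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Illustrative data (made up for this sketch)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f}x")   # y_hat = 2.20 + 0.60x

# The fitted line passes through the point of means (x_bar, y_bar)
print(abs((a + b * fmean(xs)) - fmean(ys)) < 1e-12)   # True
```

Because the intercept is defined as a = \bar{y} - b\bar{x}, the final check holds for any data set, not just this one.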
Correlation coefficient: r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}
Coefficient of determination: R^2 = r^2
Residual: e_i = y_i - \hat{y}_i
Residual standard error: s_e = \sqrt{\frac{\sum e_i^2}{n-2}}
What R² tells us:
R^2 = 0.25 → 25% of variation in y is explained by x
R^2 = 0.81 → 81% of variation in y is explained by x
Higher R^2 = better fit (closer to 1.0)
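To see how r, R² and the residual standard error fit together, here is a small sketch in plain Python. The function name `fit_summary` and the data are illustrative assumptions, not taken from the lecture.

```python
from math import sqrt
from statistics import fmean

def fit_summary(xs, ys):
    """Return (r, R^2, s_e) for a simple linear regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("x and y must both vary")
    r = s_xy / sqrt(s_xx * s_yy)           # correlation coefficient
    b = s_xy / s_xx                        # slope
    a = y_bar - b * x_bar                  # intercept
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return r, r ** 2, s_e

# Illustrative data (made up for this sketch)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, r_squared, s_e = fit_summary(xs, ys)
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, s_e = {s_e:.3f}")
# r = 0.775, R^2 = 0.600, s_e = 0.894
```

For this toy data set, 60% of the variation in y is explained by x, and a typical prediction misses by roughly 0.9 units.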
Research Question: Can we predict height from shoe size?
Data: 8 college students
Student | Shoe Size (x) | Height (y) |
---|---|---|
1 | 7 | 65 |
2 | 8 | 67 |
3 | 9 | 69 |
4 | 10 | 71 |
5 | 11 | 73 |
6 | 12 | 75 |
7 | 9.5 | 70 |
8 | 8.5 | 68 |
Find: Regression equation and interpret the slope.
Step 1: Calculate means
\bar{x} = \frac{7+8+9+10+11+12+9.5+8.5}{8} = \frac{75}{8} = 9.375
\bar{y} = \frac{65+67+69+71+73+75+70+68}{8} = \frac{558}{8} = 69.75
Step 2: Calculate slope components
x | y | (x-\bar{x}) | (y-\bar{y}) | (x-\bar{x})(y-\bar{y}) | (x-\bar{x})^2 |
---|---|---|---|---|---|
7 | 65 | -2.375 | -4.75 | 11.281 | 5.641 |
8 | 67 | -1.375 | -2.75 | 3.781 | 1.891 |
9 | 69 | -0.375 | -0.75 | 0.281 | 0.141 |
10 | 71 | 0.625 | 1.25 | 0.781 | 0.391 |
11 | 73 | 1.625 | 3.25 | 5.281 | 2.641 |
12 | 75 | 2.625 | 5.25 | 13.781 | 6.891 |
9.5 | 70 | 0.125 | 0.25 | 0.031 | 0.016 |
8.5 | 68 | -0.875 | -1.75 | 1.531 | 0.766 |
Sums: \sum(x-\bar{x})(y-\bar{y}) = 36.75, \sum(x-\bar{x})^2 = 18.375
Step 3: Calculate slope and intercept
b = \frac{36.75}{18.375} = 2.0
a = 69.75 - 2.0(9.375) = 69.75 - 18.75 = 51.0
Step 4: Write equation
\hat{y} = 51.0 + 2.0x
Interpretation: Each one-unit increase in shoe size is associated with a 2.0-inch increase in predicted height.
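The hand calculation can be checked directly from the eight (shoe size, height) pairs in the data table. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
from statistics import fmean

# Data from the shoe size / height table
shoe_size = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]

x_bar, y_bar = fmean(shoe_size), fmean(height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(shoe_size, height))
s_xx = sum((x - x_bar) ** 2 for x in shoe_size)
b = s_xy / s_xx
a = y_bar - b * x_bar

print(x_bar, y_bar)   # 9.375 69.75
print(s_xy, s_xx)     # 36.75 18.375
print(a, b)           # 51.0 2.0

# Every point lies exactly on the fitted line, so all residuals are zero
print(all(y == a + b * x for x, y in zip(shoe_size, height)))   # True
```

All residuals are zero because these eight points happen to fall exactly on the line, which means r = 1 for this data set. Real measurements would scatter around the line.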
Research Question: Does study time predict test performance?
Data: 10 students’ study hours and test scores
Hours (x) | Score (y) |
---|---|
2 | 70 |
3 | 75 |
4 | 80 |
5 | 82 |
6 | 85 |
7 | 88 |
8 | 90 |
1 | 65 |
9 | 92 |
10 | 95 |
Tasks:
1. Find the regression equation
2. Calculate R²
3. Predict score for 6.5 hours of study
4. Interpret the y-intercept
Step 1: Calculate basic statistics
- n = 10
- \bar{x} = 5.5 hours
- \bar{y} = 82.2 points
- s_x = 3.03, s_y = 9.80
Step 2: Calculate correlation
Using the formula with sums:
- \sum(x-\bar{x})(y-\bar{y}) = 263.0
- \sum(x-\bar{x})^2 = 82.5
- \sum(y-\bar{y})^2 = 863.6
r = \frac{263.0}{\sqrt{82.5 \times 863.6}} = \frac{263.0}{266.92} = 0.9853
Step 3: Calculate slope and intercept
b = r \cdot \frac{s_y}{s_x} = 0.9853 \times \frac{9.80}{3.03} = 3.19
a = \bar{y} - b\bar{x} = 82.2 - 3.19(5.5) = 64.7
Step 4: Write equation
\hat{y} = 64.7 + 3.19x
Step 5: Calculate R²
R^2 = r^2 = (0.9853)^2 = 0.971
Step 6: Make prediction
For x = 6.5: \hat{y} = 64.7 + 3.19(6.5) = 85.4 points
Step 7: Interpret y-intercept
When study time = 0 hours, the predicted score is 64.7 points. Because 0 hours lies outside the observed range (1 to 10 hours), treat this as an extrapolated baseline rather than a measured value.
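The same quantities can be recomputed directly from the ten (hours, score) pairs in the data table. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

# Data from the study hours / test score table
hours = [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]
score = [70, 75, 80, 82, 85, 88, 90, 65, 92, 95]

x_bar, y_bar = fmean(hours), fmean(score)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in score)

b = s_xy / s_xx                 # slope
a = y_bar - b * x_bar           # intercept
r = s_xy / sqrt(s_xx * s_yy)    # correlation
pred = a + b * 6.5              # predicted score for 6.5 hours

print(f"slope = {b:.3f}, intercept = {a:.2f}")    # slope = 3.188, intercept = 64.67
print(f"r = {r:.3f}, R^2 = {r * r:.3f}")          # r = 0.985, R^2 = 0.971
print(f"prediction at 6.5 hours = {pred:.1f}")    # prediction at 6.5 hours = 85.4
```

Computing the slope as s_xy / s_xx avoids the small rounding drift that comes from multiplying a rounded r by rounded standard deviations.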
Coefficient of Determination (R²) measures the proportion of variation in the response variable explained by the explanatory variable.
Formula:
R^2 = r^2 (square of correlation)
Interpretation:
R^2 = 0: No linear relationship
R^2 = 0.25: Weak relationship (25% explained)
R^2 = 0.64: Moderate relationship (64% explained)
R^2 = 0.90: Strong relationship (90% explained)
R^2 = 1: Perfect linear relationship
Key Points:
Always between 0 and 1
Higher = better fit
Context matters for “good” values
Can’t determine causation
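"Proportion of variation explained" can be made concrete by splitting the total variation in y into an explained part and a residual part. The sketch below, in plain Python with made-up data, checks that the two parts add up and that 1 - SSE/SST agrees with r². The names `sst`, `ssr` and `sse` are conventional labels assumed here, not notation from the lecture.

```python
from math import isclose, sqrt
from statistics import fmean

# Illustrative data (made up for this sketch)
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 8, 9, 10]

x_bar, y_bar = fmean(xs), fmean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total variation in y
ssr = sum((f - y_bar) ** 2 for f in fitted)            # variation explained by the line
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # residual (unexplained) variation

r = s_xy / sqrt(s_xx * sst)
r_squared = 1 - sse / sst

print(isclose(sst, ssr + sse))       # True
print(isclose(r_squared, r ** 2))    # True
print(f"R^2 = {r_squared:.3f}")      # R^2 = 0.895
```

The identity R² = r² holds for simple linear regression with an intercept, which is the only case covered here.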
Problem: A real estate agent wants to predict house prices based on square footage.
Data: 6 recent sales
Sq Ft (x) | Price ($000s) (y) |
---|---|
1200 | 180 |
1500 | 220 |
1800 | 260 |
2000 | 290 |
2200 | 320 |
1600 | 235 |
Tasks:
1. Calculate the regression equation
2. Find R² and interpret it
3. Predict price for 1900 sq ft house
4. What’s the price per square foot?
Step 1: Calculate means
\bar{x} = \frac{1200+1500+1800+2000+2200+1600}{6} = \frac{10,300}{6} = 1716.667 sq ft
\bar{y} = \frac{180+220+260+290+320+235}{6} = \frac{1505}{6} = 250.833 ($000s)
Step 2: Calculate slope components
x | y | (x-\bar{x}) | (y-\bar{y}) | (x-\bar{x})(y-\bar{y}) | (x-\bar{x})^2 |
---|---|---|---|---|---|
1200 | 180 | -516.67 | -70.83 | 36,597.2 | 266,944.4 |
1500 | 220 | -216.67 | -30.83 | 6,680.6 | 46,944.4 |
1800 | 260 | 83.33 | 9.17 | 763.9 | 6,944.4 |
2000 | 290 | 283.33 | 39.17 | 11,097.2 | 80,277.8 |
2200 | 320 | 483.33 | 69.17 | 33,430.6 | 233,611.1 |
1600 | 235 | -116.67 | -15.83 | 1,847.2 | 13,611.1 |
Products and squares are computed from unrounded deviations.
Sums: \sum(x-\bar{x})(y-\bar{y}) = 90,416.7, \sum(x-\bar{x})^2 = 648,333.3
Step 3: Calculate regression
b = \frac{90,416.7}{648,333.3} = 0.13946
a = 250.833 - 0.13946(1716.667) = 11.43
Equation: \hat{y} = 11.43 + 0.13946x
Step 4: Calculate R²
Calculate correlation r first, using \sum(y-\bar{y})^2 = 12,620.8:
r = \frac{90,416.7}{\sqrt{648,333.3 \times 12,620.8}} = 0.9996
R^2 = r^2 = 0.999
Step 5: Interpretation
R² = 0.999: 99.9% of price variation explained by square footage
Very strong relationship
Step 6: Prediction for 1900 sq ft
\hat{y} = 11.43 + 0.13946(1900) = 276.4 ($276,400)
Step 7: Price per square foot
The slope of 0.13946 is in $000s per square foot, so each additional square foot adds about $139 to the predicted price.
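With values this large, rounding the deviations early can noticeably shift the intercept, so it is worth recomputing from the six (square footage, price) pairs in the data table at full precision. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

# Data from the house sales table
sqft = [1200, 1500, 1800, 2000, 2200, 1600]
price = [180, 220, 260, 290, 320, 235]   # in $000s

x_bar, y_bar = fmean(sqft), fmean(price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_yy = sum((y - y_bar) ** 2 for y in price)

b = s_xy / s_xx                      # $000s per additional square foot
a = y_bar - b * x_bar                # intercept in $000s
r_squared = s_xy ** 2 / (s_xx * s_yy)
pred = a + b * 1900                  # predicted price for 1900 sq ft, in $000s

print(f"slope = {b:.4f}, intercept = {a:.2f}")          # slope = 0.1395, intercept = 11.43
print(f"R^2 = {r_squared:.3f}")                         # R^2 = 0.999
print(f"predicted price at 1900 sq ft = {pred:.1f}")    # predicted price at 1900 sq ft = 276.4
print(f"dollars per additional sq ft = {b * 1000:.0f}") # dollars per additional sq ft = 139
```

Note that the slope is a marginal rate (dollars per additional square foot), which differs from the average price per square foot of any single house.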
Assumptions:
1. Linearity
- Relationship is actually linear
- Check with scatterplot
2. Independence
- Observations are independent
- No time trends or clustering
3. Normality of Residuals
- Residuals follow normal distribution
- Check with a histogram or normal Q-Q plot of the residuals
4. Constant Variance
- Spread of residuals is consistent
- No “fan” patterns in the residual plot
5. No Extreme Outliers
- Outliers can heavily influence line
- Check for unusual points
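Most of these checks start from the residuals themselves. The sketch below, in plain Python with made-up data, computes the residuals and lists them in order of x so that curvature or a widening spread is easier to spot. The function name `least_squares_residuals` is illustrative, and a printed list is only a rough stand-in for a proper residual plot.

```python
from statistics import fmean

def least_squares_residuals(xs, ys):
    """Fit y_hat = a + b*x and return a list of (x, residual) pairs."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [(x, y - (a + b * x)) for x, y in zip(xs, ys)]

# Illustrative data (made up for this sketch)
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 8, 9, 10]
pairs = sorted(least_squares_residuals(xs, ys))

# List residuals in order of x to look for curvature or a widening spread
for x, e in pairs:
    print(f"x = {x}: residual = {e:+.3f}")

# Residuals from a least-squares fit with an intercept sum to zero (up to rounding)
total = sum(e for _, e in pairs)
print(abs(total) < 1e-9)   # True
```

A run of same-sign residuals at one end of the x range would suggest curvature, and residuals that grow in magnitude with x would suggest non-constant variance.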
Common Pitfalls:
1. Correlation ≠ Causation
- Strong correlation doesn’t prove cause
- Could be confounding variables
2. Extrapolation Dangers
- Don’t predict far outside data range
- Relationships may change
3. Outlier Sensitivity
- One extreme point can change everything
- Always examine unusual observations
4. Non-linear Relationships
- Linear regression assumes straight line
- Curved relationships need different methods
5. Ecological Fallacy
- Group-level patterns ≠ individual patterns
- Be careful with aggregated data
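Outlier sensitivity is easy to demonstrate numerically. The sketch below refits the shoe size and height data after adding one extra point. That extra point (size 12, height 60) is hypothetical and invented purely for this illustration, as is the helper name `slope_intercept`.

```python
from statistics import fmean

def slope_intercept(xs, ys):
    """Return (b, a) for the least-squares line y_hat = a + b*x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return b, y_bar - b * x_bar

# Shoe size / height data from the lecture example
shoe_size = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]
b_clean, a_clean = slope_intercept(shoe_size, height)

# Hypothetical outlier, invented for this sketch: size 12 but only 60 inches tall
b_outlier, a_outlier = slope_intercept(shoe_size + [12], height + [60])

print(f"without outlier: slope = {b_clean:.2f}")     # without outlier: slope = 2.00
print(f"with outlier:    slope = {b_outlier:.2f}")   # with outlier:    slope = 0.57
```

A single unusual observation out of nine cuts the estimated slope by more than two thirds, which is why unusual points deserve a close look before the line is interpreted.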
Regression Line: \hat{y} = a + bx
Slope: b = r \cdot \frac{s_y}{s_x}
Intercept: a = \bar{y} - b\bar{x}
R-squared: R^2 = r^2
Residual: e_i = y_i - \hat{y}_i
Slope: Predicted change in y per one-unit increase in x
Intercept: Predicted value of y when x = 0 (meaningful only if x = 0 is within or near the data range)
R²: % of y-variation explained by x
Always include units and context!