PSTAT 5A: Linear Regression Basics

Lecture 14 - From Correlation to Prediction: Understanding Linear Relationships

Narjes Mathlouthi

2025-07-29

Welcome to Lecture 14

Linear Regression: From Correlation to Prediction

“The best-fitting line through a cloud of points tells a story about relationships”

📢 Today’s Roadmap

🎯 Learning Objectives

  • Understand the linear regression model
  • Calculate slope and intercept by hand
  • Interpret regression coefficients correctly
  • Assess model quality with R^2
  • Make predictions using the regression line
  • Identify assumptions and when they’re violated

📋 What We’ll Cover

  1. What is Regression? The big picture
  2. The Math Behind the Line: Least squares method
  3. Key Statistics: R², correlation, residuals
  4. Step-by-Step Examples: Real calculations
  5. Interpretation Skills: What does it all mean?
  6. Common Pitfalls: Correlation ≠ causation

What is Linear Regression? 🔍

🎯 The Big Idea

Linear regression helps us understand and predict relationships between two quantitative variables.

Key Components:

  • Response variable (y): What we want to predict

  • Explanatory variable (x): What we use to predict

  • Regression line: Best-fitting straight line through data

The Model: \hat{y} = a + bx

Where:

  • \hat{y} = predicted value

  • a = y-intercept

  • b = slope

  • x = explanatory variable value

Goal: Find the line that minimizes the sum of squared residuals: \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Key Insight: The regression line minimizes the sum of the squared vertical distances (residuals) between the data points and the line.
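As a rough illustration of the least-squares goal, the sketch below computes the sum of squared residuals for any candidate line. The function name `sse` and the four data points are made up for illustration and are not part of the lecture data.

```python
def sse(xs, ys, a, b):
    """Sum of squared residuals for the candidate line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen only to illustrate the idea
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# For these points the least-squares line works out to y_hat = 0 + 1.9x
best = sse(xs, ys, a=0.0, b=1.9)

# A different candidate line, picked arbitrarily for comparison
other = sse(xs, ys, a=0.5, b=1.5)

print(best, other)  # the least-squares line has the smaller value
```

Trying other values of `a` and `b` should never produce a sum smaller than `best`, which is what "best-fitting" means here.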

The Math: Least Squares Formulas 📐

🧮 Essential Formulas

For the regression line \hat{y} = a + bx:

Slope: b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}

Alternative slope formula: b = r \cdot \frac{s_y}{s_x}

Y-intercept: a = \bar{y} - b\bar{x}

Where:

  • r = correlation coefficient

  • s_x, s_y = sample standard deviations of x and y

  • s_{xy} = sample covariance of x and y (same n - 1 denominator as s_x^2)

  • \bar{x}, \bar{y} = sample means

Important: The regression line always passes through (\bar{x}, \bar{y})!
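One way to check these formulas is to code them directly. The sketch below is a minimal version under stated assumptions: the function names `fit_line` and `slope_via_r` and the five data points are hypothetical, and the standard deviations use the n - 1 denominator.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx_sum = sum((x - x_bar) ** 2 for x in xs)
    b = sxy_sum / sxx_sum      # slope from the deviation sums
    a = y_bar - b * x_bar      # intercept
    return a, b

def slope_via_r(xs, ys):
    """Alternative slope formula b = r * s_y / s_x (sample SDs, n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx_sum = sum((x - x_bar) ** 2 for x in xs)
    syy_sum = sum((y - y_bar) ** 2 for y in ys)
    r = sxy_sum / sqrt(sxx_sum * syy_sum)
    s_x = sqrt(sxx_sum / (n - 1))
    s_y = sqrt(syy_sum / (n - 1))
    return r * s_y / s_x

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
b_alt = slope_via_r(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print(a, b, b_alt)              # both slope formulas should agree
print(a + b * x_bar, y_bar)     # the line passes through (x_bar, y_bar)
```

The two slope formulas agree only when s_x and s_y use the same denominator; mixing n and n - 1 gives a slightly wrong slope.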

🔢 Key Statistics

Correlation coefficient: r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}

Coefficient of determination: R^2 = r^2

Residual: e_i = y_i - \hat{y}_i

Residual standard error: s_e = \sqrt{\frac{\sum e_i^2}{n-2}}

What R² tells us:

  • R^2 = 0.25 → 25% of variation in y is explained by x

  • R^2 = 0.81 → 81% of variation in y is explained by x

  • Higher R^2 = better fit (closer to 1.0)
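The key statistics above can be computed together in a few lines. This is a sketch rather than a definitive implementation; the function name `regression_stats` and the data are hypothetical.

```python
from math import sqrt

def regression_stats(xs, ys):
    """Return r, R^2, the residuals, and the residual standard error s_e."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx_sum = sum((x - x_bar) ** 2 for x in xs)
    syy_sum = sum((y - y_bar) ** 2 for y in ys)

    r = sxy_sum / sqrt(sxx_sum * syy_sum)   # correlation coefficient
    b = sxy_sum / sxx_sum                   # slope
    a = y_bar - b * x_bar                   # intercept

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return r, r ** 2, residuals, s_e

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, r2, residuals, s_e = regression_stats(xs, ys)
print(round(r, 4), round(r2, 4), round(s_e, 4))
```

The residual standard error is in the same units as y, so it reads as a typical size of a prediction error.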

Example 1: Height vs. Shoe Size 👟

🎯 Problem Setup

Research Question: Can we predict height from shoe size?

Data: 8 college students

Student Shoe Size (x) Height (y)
1 7 65
2 8 67
3 9 69
4 10 71
5 11 73
6 12 75
7 9.5 70
8 8.5 68

Find: Regression equation and interpret the slope.

Step 1: Calculate means

  • \bar{x} = \frac{7+8+9+10+11+12+9.5+8.5}{8} = \frac{75}{8} = 9.375

  • \bar{y} = \frac{65+67+69+71+73+75+70+68}{8} = \frac{558}{8} = 69.75

Step 2: Calculate slope components

x y (x-\bar{x}) (y-\bar{y}) (x-\bar{x})(y-\bar{y}) (x-\bar{x})^2
7 65 -2.375 -4.75 11.281 5.641
8 67 -1.375 -2.75 3.781 1.891
9 69 -0.375 -0.75 0.281 0.141
10 71 0.625 1.25 0.781 0.391
11 73 1.625 3.25 5.281 2.641
12 75 2.625 5.25 13.781 6.891
9.5 70 0.125 0.25 0.031 0.016
8.5 68 -0.875 -1.75 1.531 0.766

Sums: \sum(x-\bar{x})(y-\bar{y}) = 36.75, \sum(x-\bar{x})^2 = 18.375

Step 3: Calculate slope and intercept

b = \frac{36.75}{18.375} = 2.0

a = 69.75 - 2.0(9.375) = 69.75 - 18.75 = 51.0

Step 4: Write equation

\hat{y} = 51.0 + 2.0x

Interpretation: For each one-size increase in shoe size, predicted height increases by 2.0 inches.

🔍 Key Insights:

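  • Slope = 2.0: Each one-size increase in shoe size → predicted height 2 inches greater
  • R² = 1.000: Every point lies exactly on the line (65 = 51 + 2(7), 67 = 51 + 2(8), and so on), so 100% of height variation is explained
  • Perfect positive linear relationship (r = 1), which real data rarely show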
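The hand calculation in Example 1 can be cross-checked with a short script. The variable names are illustrative; the data are the eight (shoe size, height) pairs from the table. Printing the residuals makes it easy to see how closely the points follow the fitted line.

```python
shoe = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]

n = len(shoe)
x_bar = sum(shoe) / n
y_bar = sum(height) / n

# Deviation sums from Step 2
sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(shoe, height))
sxx_sum = sum((x - x_bar) ** 2 for x in shoe)

b = sxy_sum / sxx_sum    # slope (Step 3)
a = y_bar - b * x_bar    # intercept (Step 3)

# Residual for each student: observed height minus predicted height
residuals = [y - (a + b * x) for x, y in zip(shoe, height)]

print(f"y_hat = {a:.1f} + {b:.1f}x")
print([round(e, 6) for e in residuals])
```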

Example 2: Study Time vs. Test Score 📚

🎯 Problem Setup

Research Question: Does study time predict test performance?

Data: 10 students’ study hours and test scores

Hours (x) Score (y)
2 70
3 75
4 80
5 82
6 85
7 88
8 90
1 65
9 92
10 95

Tasks:

  1. Find the regression equation
  2. Calculate R²
  3. Predict score for 6.5 hours of study
  4. Interpret the y-intercept

Step 1: Calculate basic statistics

  • n = 10
  • \bar{x} = 5.5 hours
  • \bar{y} = 82.2 points
  • s_x = \sqrt{82.5/9} = 3.03, s_y = \sqrt{863.6/9} = 9.80

Step 2: Calculate correlation using the formula with sums

  • \sum(x-\bar{x})(y-\bar{y}) = 263.0
  • \sum(x-\bar{x})^2 = 82.5
  • \sum(y-\bar{y})^2 = 863.6

r = \frac{263.0}{\sqrt{82.5 \times 863.6}} = \frac{263.0}{266.92} = 0.9853

Step 3: Calculate slope and intercept

b = r \cdot \frac{s_y}{s_x} = 0.9853 \times \frac{9.80}{3.03} = 3.19

a = \bar{y} - b\bar{x} = 82.2 - 3.19(5.5) = 64.7

Step 4: Write equation

\hat{y} = 64.7 + 3.19x

Step 5: Calculate R²

R^2 = r^2 = (0.9853)^2 = 0.971

Step 6: Make prediction

For x = 6.5: \hat{y} = 64.7 + 3.19(6.5) = 85.4 points

Step 7: Interpret y-intercept

When study time = 0 hours, predicted score = 64.7 points (baseline knowledge). Because x = 0 lies just outside the observed range of 1 to 10 hours, treat this as a mild extrapolation.

📊 Analysis Summary:

  • Slope = 3.19: Each additional hour of study → predicted score 3.19 points higher
  • R² = 0.971: 97.1% of score variation explained
  • Strong positive relationship (r = 0.985)
  • Practical significance: Study time matters!
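As a cross-check on the hand calculation, the script below recomputes each step of Example 2 from the raw data. The variable names are illustrative, and the standard deviations are assumed to use the n - 1 denominator.

```python
from math import sqrt

hours = [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]
scores = [70, 75, 80, 82, 85, 88, 90, 65, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Deviation sums used in the correlation formula
sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx_sum = sum((x - x_bar) ** 2 for x in hours)
syy_sum = sum((y - y_bar) ** 2 for y in scores)

r = sxy_sum / sqrt(sxx_sum * syy_sum)
s_x = sqrt(sxx_sum / (n - 1))
s_y = sqrt(syy_sum / (n - 1))

b = r * s_y / s_x        # slope via the correlation formula
a = y_bar - b * x_bar    # intercept
r2 = r ** 2              # coefficient of determination
y_hat = a + b * 6.5      # predicted score for 6.5 hours

print(f"sums: {sxy_sum:.1f}, {sxx_sum:.1f}, {syy_sum:.1f}")
print(f"r = {r:.3f}, b = {b:.2f}, a = {a:.1f}, R^2 = {r2:.3f}")
print(f"prediction at 6.5 hours: {y_hat:.1f}")
```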

Understanding R² and Model Quality 📈

🎯 What is R²?

Coefficient of Determination (R²) measures the proportion of variation in the response variable explained by the explanatory variable.

Formula:

R^2 = r^2 (square of correlation)

Interpretation:

  • R^2 = 0: No linear relationship

  • R^2 = 0.25: Weak relationship (25% explained)

  • R^2 = 0.64: Moderate relationship (64% explained)

  • R^2 = 0.90: Strong relationship (90% explained)

  • R^2 = 1: Perfect linear relationship

Key Points:

  • Always between 0 and 1

  • Higher = better fit

  • Context matters for “good” values

  • Can’t determine causation
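The "proportion of variation explained" reading can be checked numerically: for a least-squares line, r² equals 1 minus the ratio of residual variation to total variation. In this sketch the names `SSE` (sum of squared residuals) and `SST` (total sum of squares of y around its mean) are labels introduced here, and the data are hypothetical.

```python
from math import sqrt

def r_squared_two_ways(xs, ys):
    """Return (r^2, 1 - SSE/SST); the two agree for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx_sum = sum((x - x_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)          # total variation in y

    b = sxy_sum / sxx_sum
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained

    r = sxy_sum / sqrt(sxx_sum * sst)
    return r ** 2, 1 - sse / sst

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

from_r, from_sums = r_squared_two_ways(xs, ys)
print(round(from_r, 4), round(from_sums, 4))
```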

Practice Problem: Housing Prices 🏠

🎯 Your Turn!

Problem: A real estate agent wants to predict house prices based on square footage.

Data: 6 recent sales

Sq Ft (x) Price ($000s) (y)
1200 180
1500 220
1800 260
2000 290
2200 320
1600 235

Tasks:

  1. Calculate the regression equation
  2. Find R² and interpret it
  3. Predict price for 1900 sq ft house
  4. What’s the price per square foot?

Step 1: Calculate means

  • \bar{x} = \frac{1200+1500+1800+2000+2200+1600}{6} = 1716.7 sq ft

  • \bar{y} = \frac{180+220+260+290+320+235}{6} = 250.8 ($000s)

Step 2: Calculate slope components

x y (x-\bar{x}) (y-\bar{y}) (x-\bar{x})(y-\bar{y}) (x-\bar{x})^2
1200 180 -516.67 -70.83 36,597 266,944
1500 220 -216.67 -30.83 6,681 46,944
1800 260 83.33 9.17 764 6,944
2000 290 283.33 39.17 11,097 80,278
2200 320 483.33 69.17 33,431 233,611
1600 235 -116.67 -15.83 1,847 13,611

Products and squares are computed from the unrounded deviations (\bar{x} = 1716.67, \bar{y} = 250.833).

Sums: \sum(x-\bar{x})(y-\bar{y}) = 90,417, \sum(x-\bar{x})^2 = 648,333

Step 3: Calculate regression

b = \frac{90,417}{648,333} = 0.13946

a = 250.833 - 0.13946(1716.67) = 11.43

Keep extra digits in b: because \bar{x} is large, rounding b to 0.139 shifts the intercept by about 0.8.

Equation: \hat{y} = 11.43 + 0.13946x

Step 4: Calculate R²

  • \sum(y-\bar{y})^2 = 12,620.83

  • r = \frac{90,417}{\sqrt{648,333 \times 12,620.83}} = \frac{90,417}{90,457} = 0.9996

  • R^2 = r^2 = 0.999

Step 5: Interpretation

  • R² = 0.999: 99.9% of price variation explained by square footage

  • Very strong relationship

Step 6: Prediction for 1900 sq ft

\hat{y} = 11.43 + 0.13946(1900) = 276.4 ($276,400)

Step 7: Price per square foot

The slope of 0.13946 ($000s per sq ft) means each additional square foot adds about $139 to the predicted price.

🏠 Real Estate Insights:

  • Very strong fit: R² = 0.999
  • Predicted price increases about $139 per additional sq ft
  • 1900 sq ft house ≈ $276,400
  • Useful for pricing decisions
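Because the square-footage values are large, small rounding choices in the hand calculation noticeably move the intercept. A short script that carries full precision is a useful check; the variable names below are illustrative, and prices are in $000s as in the table.

```python
from math import sqrt

sqft = [1200, 1500, 1800, 2000, 2200, 1600]
price = [180, 220, 260, 290, 320, 235]   # in $000s

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
sxx_sum = sum((x - x_bar) ** 2 for x in sqft)
syy_sum = sum((y - y_bar) ** 2 for y in price)

b = sxy_sum / sxx_sum                   # slope, $000s per sq ft
a = y_bar - b * x_bar                   # intercept, $000s
r2 = sxy_sum ** 2 / (sxx_sum * syy_sum) # R^2 = r^2

pred_1900 = a + b * 1900                # predicted price, $000s
dollars_per_sqft = b * 1000             # convert slope to dollars

print(f"y_hat = {a:.2f} + {b:.5f}x, R^2 = {r2:.3f}")
print(f"1900 sq ft: {pred_1900:.1f} ($000s)")
print(f"about ${dollars_per_sqft:.0f} per additional sq ft")
```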

Assumptions and Limitations ⚠️

🔍 Key Assumptions

1. Linearity: The relationship is actually linear. Check with a scatterplot.

2. Independence: Observations are independent. No time trends or clustering.

3. Normality of Residuals: Residuals follow a normal distribution. Check with residual plots.

4. Constant Variance: The spread of residuals is consistent. No “fan” patterns.

5. No Extreme Outliers: Outliers can heavily influence the line. Check for unusual points.
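A residual check does not require plotting software. The sketch below fits a line to deliberately curved, hypothetical data (y = x²) and lists the residuals in order of x; a run of positive, then negative, then positive residuals is the kind of pattern that signals a linearity violation.

```python
# Hypothetical curved data: y = x squared
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residuals listed in order of increasing x
residuals = [y - (a + b * x) for x, y in sorted(zip(xs, ys))]

for x, e in zip(sorted(xs), residuals):
    sign = "+" if e > 0 else "-"
    print(f"x = {x}: residual = {e:+.1f} {sign}")
```

With a well-behaved linear relationship the signs would be scattered with no systematic run.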

⚠️ Common Pitfalls

1. Correlation ≠ Causation: Strong correlation doesn’t prove cause. There could be confounding variables.

2. Extrapolation Dangers: Don’t predict far outside the data range. Relationships may change.

3. Outlier Sensitivity: One extreme point can change everything. Always examine unusual observations.

4. Non-linear Relationships: Linear regression assumes a straight line. Curved relationships need different methods.

5. Ecological Fallacy: Group-level patterns ≠ individual patterns. Be careful with aggregated data.
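Outlier sensitivity is easy to demonstrate. The sketch below refits the shoe-size data from Example 1 after appending one hypothetical extreme point (shoe size 13, height 60), the kind of value a data-entry error might produce. The added point and the function name `slope` are assumptions made for illustration.

```python
def slope(xs, ys):
    """Least-squares slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx_sum = sum((x - x_bar) ** 2 for x in xs)
    return sxy_sum / sxx_sum

shoe = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]

b_clean = slope(shoe, height)

# Hypothetical outlier: large shoe size paired with a short height
b_outlier = slope(shoe + [13], height + [60])

print(f"slope without outlier: {b_clean:.2f}")
print(f"slope with outlier:    {b_outlier:.2f}")
```

One point out of nine is enough to pull the slope from 2.0 to well under 1, which is why unusual observations deserve a close look before the line is trusted.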

🎯 Best Practices

  • Always plot your data first
  • Check residuals
  • Consider context and causation
  • Report limitations honestly

Summary: Linear Regression Essentials 🎯

🧠 Key Formulas

Regression Line: \hat{y} = a + bx

Slope: b = r \cdot \frac{s_y}{s_x}

Intercept: a = \bar{y} - b\bar{x}

R-squared: R^2 = r^2

Residual: e_i = y_i - \hat{y}_i

🔍 Interpretation Skills

Slope: Predicted change in y for a one-unit increase in x

Intercept: Predicted value of y when x = 0 (meaningful only if x = 0 is within or near the data range)

R²: % of y-variation explained by x

Always include units and context!

🛠️ Problem-Solving Steps

  1. Plot the data (always start here!)
  2. Calculate means \bar{x}, \bar{y}
  3. Find correlation r
  4. Calculate slope b = r \cdot s_y/s_x
  5. Calculate intercept a = \bar{y} - b\bar{x}
  6. Write equation \hat{y} = a + bx
  7. Find R² = r²
  8. Make predictions (within reason)
  9. Interpret in context
  10. Check assumptions
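For step 8, one simple safeguard is to record the range of x values used to fit the line and flag any prediction outside it. The sketch below uses the Example 1 equation (\hat{y} = 51.0 + 2.0x, shoe sizes 7 to 12); the function name `predict` and the two test inputs are hypothetical.

```python
def predict(a, b, x, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = a + b*x.

    is_extrapolation is True when x falls outside the observed range
    [x_min, x_max] of the data used to fit the line.
    """
    y_hat = a + b * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation

# Example 1: y_hat = 51.0 + 2.0x, fitted on shoe sizes 7 through 12
a, b = 51.0, 2.0
x_min, x_max = 7, 12

inside = predict(a, b, 10.5, x_min, x_max)   # within the data range
outside = predict(a, b, 20, x_min, x_max)    # far beyond the data range

print(inside)
print(outside)
```

The flag does not say the prediction is wrong, only that the data give no evidence the linear pattern continues that far.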

🚨 Red Flags to Watch

  • Perfect correlation (r = ±1.0)
  • Extrapolating too far
  • Ignoring outliers
  • Assuming causation
  • Non-linear patterns
