PSTAT 5A: Linear Regression Basics

Lecture 14 - From Correlation to Prediction: Understanding Linear Relationships

Narjes Mathlouthi

2025-07-29

Welcome to Lecture 14

Linear Regression: From Correlation to Prediction

“The best-fitting line through a cloud of points tells a story about relationships”

📢 Today’s Roadmap

🎯 Learning Objectives

Understand the linear regression model
Calculate slope and intercept by hand
Interpret regression coefficients correctly
Assess model quality with R^2
Make predictions using the regression line
Identify assumptions and when they’re violated

📋 What We’ll Cover

What is Regression? The big picture
The Math Behind the Line: Least squares method
Key Statistics: R², correlation, residuals
Step-by-Step Examples: Real calculations
Interpretation Skills: What does it all mean?
Common Pitfalls: Correlation ≠ causation

What is Linear Regression? 🔍

🎯 The Big Idea

Linear regression helps us understand and predict relationships between two quantitative variables.

Key Components:

Response variable (y): What we want to predict
Explanatory variable (x): What we use to predict
Regression line: Best-fitting straight line through data

The Model: \hat{y} = a + bx

Where:

\hat{y} = predicted value
a = y-intercept
b = slope
x = explanatory variable value

Goal: Find the line that minimizes the sum of squared residuals:

\text{Minimize: } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Key Insight: Among all straight lines, the regression line has the smallest sum of squared vertical distances (residuals) between the data points and the line.
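
To make the least-squares criterion concrete, here is a minimal Python sketch that computes the sum of squared residuals for any candidate line and compares the least-squares line with a few nearby alternatives. It borrows the study-time data from Example 2 later in this lecture; the helper name `sse` and the particular alternative lines are illustrative assumptions, not part of the course materials.

```python
# Study-time data from Example 2 (hours studied, test score).
hours = [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]
scores = [70, 75, 80, 82, 85, 88, 90, 65, 92, 95]


def sse(a, b, x, y):
    """Sum of squared residuals for the candidate line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Least-squares slope and intercept from the deviation formulas.
n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
b_ls = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(hours, scores))
        / sum((xi - x_bar) ** 2 for xi in hours))
a_ls = y_bar - b_ls * x_bar

best = sse(a_ls, b_ls, hours, scores)

# Hypothetical alternative lines: nudge the intercept or the slope a little.
candidates = [
    (a_ls + 1, b_ls),
    (a_ls - 1, b_ls),
    (a_ls, b_ls + 0.2),
    (a_ls, b_ls - 0.2),
]
for a, b in candidates:
    print(f"a={a:.2f}, b={b:.2f}: SSE={sse(a, b, hours, scores):.2f} "
          f"(least squares: {best:.2f})")
```

Every alternative line should report a larger SSE than the least-squares line, which is exactly what "best-fitting" means here.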

The Math: Least Squares Formulas 📐

🧮 Essential Formulas

For the regression line \hat{y} = a + bx:

Slope: b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}

Alternative slope formula: b = r \cdot \frac{s_y}{s_x}

Y-intercept: a = \bar{y} - b\bar{x}

Where:

r = correlation coefficient
s_x, s_y = sample standard deviations of x and y
s_{xy} = sample covariance of x and y
\bar{x}, \bar{y} = sample means

Important: The regression line always passes through (\bar{x}, \bar{y})!
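
The formulas above can be checked numerically. The sketch below uses a small hypothetical data set (not from the lecture) and confirms three things: the two slope formulas agree, the intercept follows from the means, and the fitted line passes through (\bar{x}, \bar{y}). Variable names such as `b_from_sums` are chosen only for this illustration.

```python
import math

# Hypothetical toy data, chosen only to keep the arithmetic small.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and of cross-products.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Slope, version 1: ratio of the deviation sums.
b_from_sums = sxy / sxx

# Slope, version 2: b = r * s_y / s_x, using sample standard deviations.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / math.sqrt(sxx * syy)
b_from_r = r * s_y / s_x

# Intercept, then the prediction at x_bar (which should equal y_bar).
a = y_bar - b_from_sums * x_bar
y_hat_at_mean = a + b_from_sums * x_bar

print(f"slope (sums)  = {b_from_sums:.4f}")
print(f"slope (r form) = {b_from_r:.4f}")
print(f"intercept      = {a:.4f}")
print(f"prediction at x_bar = {y_hat_at_mean:.4f}, y_bar = {y_bar:.4f}")
```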

🔢 Key Statistics

Correlation coefficient: r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}

Coefficient of determination: R^2 = r^2

Residual: e_i = y_i - \hat{y}_i

Residual standard error: s_e = \sqrt{\frac{\sum e_i^2}{n-2}}

What R² tells us:

R^2 = 0.25 → 25% of variation in y is explained by x
R^2 = 0.81 → 81% of variation in y is explained by x
Higher R^2 = better fit (closer to 1.0)
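
As a rough illustration of these statistics, the following sketch computes r, R², the residuals, and the residual standard error for a small hypothetical data set. The data and variable names are assumptions made for this example only.

```python
import math

# Hypothetical toy data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line.
b = sxy / sxx
a = y_bar - b * x_bar

# Correlation coefficient and coefficient of determination.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Residuals e_i = y_i - y_hat_i.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual standard error: divide by n - 2 because two parameters
# (slope and intercept) were estimated from the data.
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print("residuals:", [round(e, 3) for e in residuals])
print(f"s_e = {s_e:.4f}")
```

The residuals of a least-squares line with an intercept always sum to zero (up to rounding), which makes a handy arithmetic check.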

Example 1: Height vs. Shoe Size 👟

🎯 Problem Setup

Research Question: Can we predict height from shoe size?

Data: 8 college students

Student	Shoe Size (x)	Height in inches (y)
1	7	65
2	8	67
3	9	69
4	10	71
5	11	73
6	12	75
7	9.5	70
8	8.5	68

Find: Regression equation and interpret the slope.

Step 1: Calculate means

\bar{x} = \frac{7+8+9+10+11+12+9.5+8.5}{8} = \frac{75}{8} = 9.375
\bar{y} = \frac{65+67+69+71+73+75+70+68}{8} = \frac{558}{8} = 69.75

Step 2: Calculate slope components

x	y	(x-\bar{x})	(y-\bar{y})	(x-\bar{x})(y-\bar{y})	(x-\bar{x})^2
7	65	-2.375	-4.75	11.281	5.641
8	67	-1.375	-2.75	3.781	1.891
9	69	-0.375	-0.75	0.281	0.141
10	71	0.625	1.25	0.781	0.391
11	73	1.625	3.25	5.281	2.641
12	75	2.625	5.25	13.781	6.891
9.5	70	0.125	0.25	0.031	0.016
8.5	68	-0.875	-1.75	1.531	0.766

Sums: \sum(x-\bar{x})(y-\bar{y}) = 36.75, \sum(x-\bar{x})^2 = 18.375

Step 3: Calculate slope and intercept

b = \frac{36.75}{18.375} = 2.0
a = 69.75 - 2.0(9.375) = 69.75 - 18.75 = 51.0

Step 4: Write equation

\hat{y} = 51.0 + 2.0x

Interpretation: For each additional shoe size, height increases by 2.0 inches on average.

🔍 Key Insights:

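Slope = 2.0: Each additional shoe size ↑ → predicted height 2 inches ↑
R² = 1.000: Every data point lies exactly on the line \hat{y} = 51 + 2x, so 100% of height variation is explained
Perfect positive linear relationship (r = 1); real data are rarely this clean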
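
A quick way to check hand arithmetic like this is to recompute everything directly from the data table. The sketch below does that for the shoe-size data and also reports the largest residual, which shows how closely the points follow the fitted line. The function name `fit_line` is an assumption for this illustration.

```python
# Shoe size (x) and height (y) for the 8 students in Example 1.
shoe_size = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a, b = fit_line(shoe_size, height)

# Residuals measure how far each observed height is from the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(shoe_size, height)]
max_abs_residual = max(abs(e) for e in residuals)

print(f"y_hat = {a:.1f} + {b:.1f}x")
print(f"largest |residual| = {max_abs_residual:.6f}")
```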

Example 2: Study Time vs. Test Score 📚

🎯 Problem Setup

Research Question: Does study time predict test performance?

Data: 10 students’ study hours and test scores

Hours (x)	Score (y)
2	70
3	75
4	80
5	82
6	85
7	88
8	90
1	65
9	92
10	95

Tasks:

1. Find the regression equation
2. Calculate R²
3. Predict score for 6.5 hours of study
4. Interpret the y-intercept

Step 1: Calculate basic statistics

n = 10
\bar{x} = 5.5 hours
\bar{y} = 82.2 points
s_x = 3.028, s_y = 9.796 (sample standard deviations, dividing by n - 1)

Step 2: Calculate correlation

Using the formula with sums:

\sum(x-\bar{x})(y-\bar{y}) = 263.0
\sum(x-\bar{x})^2 = 82.5
\sum(y-\bar{y})^2 = 863.6

r = \frac{263.0}{\sqrt{82.5 \times 863.6}} = \frac{263.0}{266.92} = 0.9853

Step 3: Calculate slope and intercept

b = r \cdot \frac{s_y}{s_x} = 0.9853 \times \frac{9.796}{3.028} = 3.19
a = \bar{y} - b\bar{x} = 82.2 - 3.19(5.5) = 64.7

Step 4: Write equation

\hat{y} = 64.7 + 3.19x

Step 5: Calculate R²

R^2 = r^2 = (0.9853)^2 = 0.971

Step 6: Make prediction

For x = 6.5: \hat{y} = 64.7 + 3.19(6.5) = 85.4 points

Step 7: Interpret y-intercept

When study time = 0 hours, predicted score = 64.7 points (baseline knowledge). Because 0 hours lies just outside the observed range of 1 to 10 hours, read this as a mild extrapolation.

📊 Analysis Summary:

Slope = 3.19: Each additional hour ↑ → predicted score 3.19 points ↑
R² = 0.971: 97.1% of score variation explained
Strong positive relationship (r = 0.985)
Practical significance: Study time matters!
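
The same steps can be retraced in Python's standard library, recomputing each quantity directly from the data table. This is a sketch under the assumption that `statistics.stdev` (the sample standard deviation, dividing by n - 1) is the intended s_x and s_y; the variable names are illustrative.

```python
import math
import statistics

# Study hours (x) and test scores (y) for the 10 students in Example 2.
hours = [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]
scores = [70, 75, 80, 82, 85, 88, 90, 65, 92, 95]

# Step 1: basic statistics.
x_bar = statistics.mean(hours)
y_bar = statistics.mean(scores)
s_x = statistics.stdev(hours)    # sample SD (divides by n - 1)
s_y = statistics.stdev(scores)

# Step 2: correlation from the deviation sums.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)
r = sxy / math.sqrt(sxx * syy)

# Step 3: slope via b = r * s_y / s_x, then the intercept.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Steps 5 and 6: R-squared and the prediction for 6.5 hours.
r_squared = r ** 2
predicted = a + b * 6.5

print(f"x_bar = {x_bar}, y_bar = {y_bar}, s_x = {s_x:.3f}, s_y = {s_y:.3f}")
print(f"Sxy = {sxy:.1f}, Sxx = {sxx:.1f}, Syy = {syy:.1f}, r = {r:.4f}")
print(f"y_hat = {a:.2f} + {b:.2f}x, R^2 = {r_squared:.3f}")
print(f"prediction at 6.5 hours = {predicted:.1f}")
```

Comparing the printed sums against a hand-built deviation table is a reliable way to catch arithmetic slips before they propagate into the slope and intercept.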

Understanding R² and Model Quality 📈

🎯 What is R²?

Coefficient of Determination (R²) measures the proportion of variation in the response variable explained by the explanatory variable.

Formula:

R^2 = r^2 (square of correlation)

Interpretation:

R^2 = 0: No linear relationship
R^2 = 0.25: Weak relationship (25% explained)
R^2 = 0.64: Moderate relationship (64% explained)
R^2 = 0.90: Strong relationship (90% explained)
R^2 = 1: Perfect linear relationship

Key Points:

Always between 0 and 1
Higher = better fit
Context matters for “good” values
Can’t determine causation
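
One way to see why R² reads as "proportion of variation explained" is to compare the variation left over in the residuals with the total variation in y. The identity below is a standard result for a least-squares line with an intercept, stated here without proof; the labels SSE and SST are shorthand introduced for this note only.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = r^2
```

When the residuals are all zero, SSE = 0 and R² = 1; when the line predicts no better than \bar{y} alone, SSE = SST and R² = 0.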

Practice Problem: Housing Prices 🏠

🎯 Your Turn!

Problem: A real estate agent wants to predict house prices based on square footage.

Data: 6 recent sales

Sq Ft (x)	Price ($000s) (y)
1200	180
1500	220
1800	260
2000	290
2200	320
1600	235

Tasks:

1. Calculate the regression equation
2. Find R² and interpret it
3. Predict price for 1900 sq ft house
4. What’s the price per square foot?

Step 1: Calculate means

\bar{x} = \frac{1200+1500+1800+2000+2200+1600}{6} = 1716.7 sq ft
\bar{y} = \frac{180+220+260+290+320+235}{6} = 250.8 ($000s)

Step 2: Calculate slope components

x	y	(x-\bar{x})	(y-\bar{y})	(x-\bar{x})(y-\bar{y})	(x-\bar{x})^2
1200	180	-516.7	-70.8	36,597	266,944
1500	220	-216.7	-30.8	6,681	46,944
1800	260	83.3	9.2	764	6,944
2000	290	283.3	39.2	11,097	80,278
2200	320	483.3	69.2	33,431	233,611
1600	235	-116.7	-15.8	1,847	13,611

Products and squares are computed from unrounded deviations, then rounded to whole numbers.

Sums: \sum(x-\bar{x})(y-\bar{y}) = 90,417, \sum(x-\bar{x})^2 = 648,333

Step 3: Calculate regression

b = \frac{90,417}{648,333} = 0.13946
a = 250.83 - 0.13946(1716.67) = 250.83 - 239.41 = 11.43

Equation: \hat{y} = 11.43 + 0.13946x

Keep the extra decimals in b: the x values are in the thousands, so rounding b to 0.139 would shift predictions by several hundred dollars.

Step 4: Calculate R²

\sum(y-\bar{y})^2 = 12,620.8
r = \frac{90,417}{\sqrt{648,333 \times 12,620.8}} = \frac{90,417}{90,457} = 0.9996
R^2 = r^2 = 0.999

Step 5: Interpretation

R² = 0.999: 99.9% of price variation explained by square footage
Very strong relationship

Step 6: Prediction for 1900 sq ft

\hat{y} = 11.43 + 0.13946(1900) = 276.4 ($276,400)

Step 7: Price per square foot

Prices are in $000s, so a slope of about 0.139 means each additional square foot adds about $139 to the predicted price.

🏠 Real Estate Insights:

Very strong fit: R² = 0.999
Predicted price increases about $139 per additional sq ft
1900 sq ft house ≈ $276,400
Useful for pricing decisions
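
For readers who want to check their practice-problem arithmetic, here is a hedged Python sketch that recomputes the fit from the six sales and converts the slope from thousands of dollars to dollars. It also shows how much the 1900 sq ft prediction moves when the slope is rounded to three decimals; the variable names are illustrative assumptions.

```python
# Square footage (x) and sale price in $000s (y) for the 6 sales.
sqft = [1200, 1500, 1800, 2000, 2200, 1600]
price_k = [180, 220, 260, 290, 320, 235]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price_k) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price_k))
sxx = sum((x - x_bar) ** 2 for x in sqft)
syy = sum((y - y_bar) ** 2 for y in price_k)

b = sxy / sxx                      # slope in $000s per square foot
a = y_bar - b * x_bar              # intercept in $000s
r_squared = sxy ** 2 / (sxx * syy)

# Prediction with the full-precision slope.
pred_1900 = a + b * 1900

# Same prediction if the slope is rounded to 3 decimals before use.
b_rounded = round(b, 3)
pred_1900_rounded = (y_bar - b_rounded * x_bar) + b_rounded * 1900

# The slope is in thousands of dollars, so multiply by 1000 for dollars.
dollars_per_sqft = b * 1000

print(f"y_hat = {a:.2f} + {b:.5f}x, R^2 = {r_squared:.4f}")
print(f"prediction at 1900 sq ft: {pred_1900:.1f} ($000s)")
print(f"with rounded slope:       {pred_1900_rounded:.1f} ($000s)")
print(f"slope in dollars per additional sq ft: {dollars_per_sqft:.0f}")
```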

Assumptions and Limitations ⚠️

🔍 Key Assumptions

1. Linearity

Relationship is actually linear
Check with a scatterplot

2. Independence

Observations are independent
No time trends or clustering

3. Normality of Residuals

Residuals follow an approximately normal distribution
Check with a histogram or normal Q-Q plot of the residuals

4. Constant Variance

Spread of residuals is consistent across x
No “fan” patterns in a plot of residuals against x

5. No Extreme Outliers

Outliers can heavily influence the line
Check for unusual points
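
A residual check can reveal a violated linearity assumption even when R² looks impressive. The sketch below fits a line to deliberately curved, hypothetical data (y = x²) and prints the sign of each residual in x order; a run of same-sign residuals in the middle with opposite signs at the ends is the text-only analogue of a curved residual plot.

```python
# Hypothetical curved data: y grows like x squared, not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Fit the straight line anyway.
b = sxy / sxx
a = y_bar - b * x_bar
r_squared = sxy ** 2 / (sxx * syy)

# Residuals listed in increasing x order.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"R^2 = {r_squared:.3f}")
print("residual signs in x order:", signs)
```

The takeaway is that a high R² alone does not confirm linearity; the pattern in the residuals does the diagnostic work.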

⚠️ Common Pitfalls

1. Correlation ≠ Causation

Strong correlation doesn’t prove cause
Could be confounding variables

2. Extrapolation Dangers

Don’t predict far outside data range
Relationships may change

3. Outlier Sensitivity

One extreme point can change everything
Always examine unusual observations

4. Non-linear Relationships

Linear regression assumes straight line
Curved relationships need different methods

5. Ecological Fallacy

Group-level patterns ≠ individual patterns
Be careful with aggregated data
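
To illustrate the outlier-sensitivity pitfall, the following sketch refits the shoe-size data from Example 1 after appending one hypothetical unusual student (shoe size 14, height 60 inches). That extra point is invented purely for this demonstration.

```python
# Shoe size and height data from Example 1.
shoe_size = [7, 8, 9, 10, 11, 12, 9.5, 8.5]
height = [65, 67, 69, 71, 73, 75, 70, 68]


def slope(x, y):
    """Least-squares slope from the deviation formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


slope_clean = slope(shoe_size, height)

# Hypothetical outlier: very large shoe size paired with a short height.
slope_with_outlier = slope(shoe_size + [14], height + [60])

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with outlier:    {slope_with_outlier:.3f}")
```

A single point far from the rest in x has high leverage, which is why one observation can flatten or even reverse the fitted slope.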

🎯 Best Practices

Always plot your data first
Check residuals
Consider context and causation
Report limitations honestly

Summary: Linear Regression Essentials 🎯

🧠 Key Formulas

Regression Line: \hat{y} = a + bx

Slope: b = r \cdot \frac{s_y}{s_x}

Intercept: a = \bar{y} - b\bar{x}

R-squared: R^2 = r^2

Residual: e_i = y_i - \hat{y}_i

🔍 Interpretation Skills

Slope: Predicted (average) change in y for a one-unit increase in x

Intercept: Predicted value of y when x = 0 (meaningful only if x = 0 is within or near the data range)

R²: % of y-variation explained by x

Always include units and context!

🛠️ Problem-Solving Steps

Plot the data (always start here!)
Calculate means \bar{x}, \bar{y}
Find correlation r
Calculate slope b = r \cdot s_y/s_x
Calculate intercept a = \bar{y} - b\bar{x}
Write equation \hat{y} = a + bx
Find R² = r²
Make predictions (within reason)
Interpret in context
Check assumptions
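
The numeric steps above can be bundled into one small helper. The sketch below is one possible arrangement, not a prescribed course tool: the `LineFit` class, the `fit_line` function, and the choice to refuse predictions outside the observed x range (to enforce "within reason") are all assumptions made for this example. It is demonstrated on the study-time data from Example 2.

```python
from dataclasses import dataclass
import math


@dataclass
class LineFit:
    intercept: float
    slope: float
    r: float
    x_min: float
    x_max: float

    @property
    def r_squared(self):
        return self.r ** 2

    def predict(self, x_new):
        """Predict y at x_new, refusing to extrapolate beyond the data."""
        if not (self.x_min <= x_new <= self.x_max):
            raise ValueError(
                f"x = {x_new} is outside the observed range "
                f"[{self.x_min}, {self.x_max}]"
            )
        return self.intercept + self.slope * x_new


def fit_line(x, y):
    """Means -> correlation -> slope -> intercept, as in the steps above."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    # r * s_y / s_x simplifies to r * sqrt(syy / sxx); the (n - 1) cancels.
    b = r * math.sqrt(syy / sxx)
    a = y_bar - b * x_bar
    return LineFit(a, b, r, min(x), max(x))


hours = [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]
scores = [70, 75, 80, 82, 85, 88, 90, 65, 92, 95]

model = fit_line(hours, scores)
print(f"y_hat = {model.intercept:.2f} + {model.slope:.2f}x, "
      f"R^2 = {model.r_squared:.3f}")
print(f"prediction at 6.5 hours: {model.predict(6.5):.1f}")
```

Plotting the data and checking assumptions still have to happen outside this helper; the range guard only addresses the extrapolation pitfall.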

🚨 Red Flags to Watch

Perfect correlation (r = ±1.0)
Extrapolating too far
Ignoring outliers
Assuming causation
Non-linear patterns
