Welcome to Lab 6 Solutions! This lab focuses on two fundamental areas of statistical analysis that you'll use throughout your data science journey: hypothesis testing and simple linear regression. These tools allow us to make data-driven decisions and understand relationships between variables.
What You'll Learn Today
By the end of this lab, you'll be able to:
Conduct hypothesis tests to determine if sample data provides evidence against a claim
Model relationships between variables using simple linear regression
Make predictions based on data patterns
Interpret statistical results in plain English for real-world applications
Getting Started
Estimated time: 5 minutes
Setup
Navigate to our class JupyterHub instance. Create a new notebook and rename it "lab6" (for detailed instructions, see Lab 1).
First, let's load our tools! Copy the code below to get started. We'll be using the following core libraries:
NumPy: Fundamental package for fast array-based numerical computing.
Matplotlib (pyplot): Primary library for creating static 2D plots and figures.
SciPy (stats): Collection of scientific algorithms, including probability distributions and statistical tests.
Pandas: High-performance data structures (DataFrame) and tools for data wrangling and analysis.
Statsmodels: Econometric and statistical modeling for regression analysis, time series, and more.
Seaborn: Data visualization library built on Matplotlib, with a high-level interface for drawing attractive and informative statistical graphics.
# Install any missing packages (will skip those already installed)
#!%pip install --quiet numpy matplotlib scipy pandas statsmodels seaborn

# Load our tools (libraries)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import statsmodels.api as sm
import seaborn as sns

# Make our graphs look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Set random seed for reproducible results
np.random.seed(42)
print("All tools loaded successfully!")

All tools loaded successfully!
Task 1: One-Sample T-Test
Estimated time: 20 minutes
What is a One-Sample T-Test?
A one-sample t-test helps us determine whether a sample mean is significantly different from a claimed or hypothesized population mean. It's one of the most common statistical tests you'll encounter.
Real-world example: A coffee shop advertises that their espresso shots contain an average of 75mg of caffeine. As a health-conscious consumer (or maybe a caffeine researcher!), you want to test this claim. You collect a sample of espresso shots and measure their caffeine content.
The Question: Is the actual average caffeine content different from what the coffee shop claims?
Scenario
A coffee shop claims their average espresso shot contains 75 mg of caffeine. You suspect it's actually higher. You measure 20 shots and test the claim at the \(\alpha = 0.05\) significance level.
Your Goal: Determine if there's sufficient evidence that the actual caffeine content exceeds the coffee shop's claim.
Step 1: Explore the Data
# Generate caffeine data for our analysis
np.random.seed(123)
caffeine_data = np.random.normal(78, 8, 20)  # Sample data: n=20 espresso shots

print("Coffee Shop Caffeine Analysis")
print("="*40)
print(f"Sample size: {len(caffeine_data)}")
print(f"Sample mean: {np.mean(caffeine_data):.2f} mg")
print(f"Sample std dev: {np.std(caffeine_data, ddof=1):.2f} mg")
print(f"Coffee shop's claim: 75 mg")

# Let's look at our raw data
print(f"\nFirst 10 caffeine measurements:")
print([f"{x:.1f}" for x in caffeine_data[:10]])
Step 2: Set Up Hypotheses
Think about this carefully:
What does the coffee shop claim? (This becomes your null hypothesis)
What do you suspect? (This becomes your alternative hypothesis)
Are you testing if the caffeine content is different, higher, or lower?

print("STEP 2: Setting Up Hypotheses")
print("="*35)

# SOLUTION: Complete these hypotheses
print("$H_0$ (Null Hypothesis): $\\mu$ = 75 mg")  # Coffee shop's claim
print("$H_1$ (Alternative Hypothesis): $\\mu$ > 75 mg")  # We suspect it's higher

# SOLUTION: What type of test is this?
print("Test type: RIGHT-tailed test")  # Testing if mean is greater than 75

print("\nExplanation:")
print("• $H_0$ represents the coffee shop's claim (status quo)")
print("• $H_1$ represents what we suspect is actually true")
print("• We use $\\alpha$ = 0.05 as our significance level")

STEP 2: Setting Up Hypotheses
===================================
$H_0$ (Null Hypothesis): $\mu$ = 75 mg
$H_1$ (Alternative Hypothesis): $\mu$ > 75 mg
Test type: RIGHT-tailed test
Explanation:
• $H_0$ represents the coffee shop's claim (status quo)
• $H_1$ represents what we suspect is actually true
• We use $\alpha$ = 0.05 as our significance level
Answer Key:
\(H_0\): \(\mu\) = 75 mg (coffee shop's claim)
\(H_1\): \(\mu\) > 75 mg (we suspect it's higher)
Right-tailed test (testing if mean is greater than 75)
Step 3: Calculate the Test Statistic
The t-statistic formula is: \(t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\)
print("STEP 3: Calculating Test Statistic")
print("="*38)

# Calculate the components
sample_mean = np.mean(caffeine_data)
sample_std = np.std(caffeine_data, ddof=1)  # ddof=1 for sample std dev
n = len(caffeine_data)
claimed_mean = 75

print(f"Sample mean ($\\bar{{x}}$): {sample_mean:.3f} mg")
print(f"Sample std dev (s): {sample_std:.3f} mg")
print(f"Sample size (n): {n}")
print(f"Claimed mean ($\\mu_0$): {claimed_mean} mg")

# SOLUTION: Calculate the t-statistic using the formula above
t_statistic = (sample_mean - claimed_mean) / (sample_std / np.sqrt(n))
degrees_freedom = n - 1

print(f"\nFormula: $t = \\frac{{\\bar{{x}} - \\mu_0}}{{s / \\sqrt{{n}}}}$")
print(f"Calculation: t = ({sample_mean:.3f} - {claimed_mean}) / ({sample_std:.3f} / √{n})")
print(f"t-statistic: {t_statistic:.3f}")
print(f"Degrees of freedom: {degrees_freedom}")

STEP 3: Calculating Test Statistic
======================================
Sample mean ($\bar{x}$): 78.915 mg
Sample std dev (s): 10.060 mg
Sample size (n): 20
Claimed mean ($\mu_0$): 75 mg
Formula: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Calculation: t = (78.915 - 75) / (10.060 / √20)
t-statistic: 1.741
Degrees of freedom: 19
Step 4: Find the P-Value
For a right-tailed test, the p-value is the probability of getting a t-statistic as extreme or more extreme than what we observed.
What exactly is a p-value?
Loosely speaking, the p-value answers the question:
"If the null hypothesis were true, how surprising would my sample be?"
Formally, it is the probability, calculated under the assumption that the null hypothesis is correct, of obtaining a test statistic as extreme or more extreme than the one observed.
Small p-value (e.g., < 0.05) → data are rare under \(H_0\) → strong evidence against \(H_0\).
Large p-value → data are plausible under \(H_0\) → little or no evidence against \(H_0\).
Important: A p-value does not give the probability that the null hypothesis is true; it quantifies how incompatible your data are with \(H_0\).
print("STEP 4: Finding the P-Value")
print("="*32)

# SOLUTION: Calculate p-value for right-tailed test
# For right-tailed test, p-value = 1 - stats.t.cdf(t_statistic, df)
p_value = 1 - stats.t.cdf(t_statistic, degrees_freedom)

print(f"P-value calculation:")
print(f"   P(t > {t_statistic:.3f}) = {p_value:.4f}")

print(f"\nInterpretation:")
print(f"   If the coffee shop's claim is true ($\\mu$ = 75),")
print(f"   there's a {p_value:.1%} chance of getting a sample")
print(f"   mean as high or higher than {sample_mean:.2f} mg")

STEP 4: Finding the P-Value
================================
P-value calculation:
P(t > 1.741) = 0.0490
Interpretation:
If the coffee shop's claim is true ($\mu$ = 75),
there's a 4.9% chance of getting a sample
mean as high or higher than 78.92 mg
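One way to make the p-value concrete is to simulate many samples from a world where the null hypothesis is true and count how often the t-statistic is at least as large as the observed one. The block below is a minimal sketch of that idea, not part of the lab's solution: it regenerates the caffeine sample with the same seed, and the simulation size (100,000) and the 8 mg spread used under \(H_0\) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Assumed setup: same seed and parameters as the lab's caffeine sample.
np.random.seed(123)
caffeine_data = np.random.normal(78, 8, 20)
n = len(caffeine_data)
claimed_mean = 75

# Observed t-statistic and its analytical right-tailed p-value.
t_obs = (caffeine_data.mean() - claimed_mean) / (caffeine_data.std(ddof=1) / np.sqrt(n))
p_exact = stats.t.sf(t_obs, df=n - 1)  # sf(t) = 1 - cdf(t)

# Simulate samples from a population where H0 holds (mean = 75).
# The spread of 8 mg is an illustrative choice; for normal data the
# distribution of the t-statistic does not depend on it.
rng = np.random.default_rng(0)
sims = rng.normal(claimed_mean, 8, size=(100_000, n))
t_sim = (sims.mean(axis=1) - claimed_mean) / (sims.std(axis=1, ddof=1) / np.sqrt(n))

# Share of simulated t-statistics at least as large as the observed one.
p_sim = np.mean(t_sim >= t_obs)

print(f"Analytical p-value: {p_exact:.4f}")
print(f"Simulated p-value:  {p_sim:.4f}")
```

The two numbers should agree closely, which is the sense in which the p-value measures how surprising the sample would be if \(H_0\) were true.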
Step 5: Make Your Decision
Compare your p-value to \(\alpha = 0.05\) and make a statistical decision.
print("STEP 5: Making the Decision")
print("="*31)

alpha = 0.05
print(f"Significance level ($\\alpha$): {alpha}")
print(f"P-value: {p_value:.4f}")
print(f"Decision rule: Reject $H_0$ if p-value < $\\alpha$")

print(f"\nComparison:")
if p_value < alpha:
    print(f"   {p_value:.4f} < {alpha}")
    print(f"   Decision: **REJECT $H_0$**")
    print(f"   Conclusion: There IS sufficient evidence that")
    print(f"   the average caffeine content > 75 mg")
    print(f"   The coffee shop's claim appears to be FALSE")
else:
    print(f"   {p_value:.4f} ≥ {alpha}")
    print(f"   Decision: **FAIL TO REJECT $H_0$**")
    print(f"   Conclusion: There is NOT sufficient evidence that")
    print(f"   the average caffeine content > 75 mg")
    print(f"   We cannot conclude the coffee shop's claim is false")

# SOLUTION: Write conclusion in plain English
print(f"\nConclusion in plain English:")
print(f"   Based on our sample of 20 espresso shots, we found")
print(f"   statistically significant evidence that the coffee shop's")
print(f"   claim of 75mg caffeine is too low. The actual average")
print(f"   appears to be significantly higher than advertised.")

STEP 5: Making the Decision
===============================
Significance level ($\alpha$): 0.05
P-value: 0.0490
Decision rule: Reject $H_0$ if p-value < $\alpha$
Comparison:
0.0490 < 0.05
Decision: **REJECT $H_0$**
Conclusion: There IS sufficient evidence that
the average caffeine content > 75 mg
The coffee shop's claim appears to be FALSE
Conclusion in plain English:
Based on our sample of 20 espresso shots, we found
statistically significant evidence that the coffee shop's
claim of 75mg caffeine is too low. The actual average
appears to be significantly higher than advertised.
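The same decision can also be reached with the critical-value approach: instead of comparing the p-value to \(\alpha\), compare the t-statistic to the t-value that cuts off the top 5% of the distribution. This is a small sketch that plugs in the t-statistic and degrees of freedom reported in the output above; it is an alternative view of the same test, not an extra step required by the lab.

```python
from scipy import stats

alpha = 0.05
degrees_freedom = 19   # n - 1 for the 20 espresso shots
t_statistic = 1.741    # value reported in the lab output above

# Critical value: the t-value that leaves alpha in the right tail.
t_critical = stats.t.ppf(1 - alpha, degrees_freedom)

# Right-tailed rule: reject H0 when the statistic exceeds the critical value.
reject = t_statistic > t_critical

print(f"Critical value (df={degrees_freedom}): {t_critical:.3f}")
print("Decision:", "REJECT H0" if reject else "FAIL TO REJECT H0")
```

The statistic only narrowly clears the critical value of about 1.729, which matches a p-value sitting just under 0.05.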
Step 6: Verify with Python
Let's double-check our work using Python's built-in statistical functions.
# Create visualizations to understand our test
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Sample data histogram with means
ax1.hist(caffeine_data, bins=8, density=True, alpha=0.7, color='lightblue',
         edgecolor='black', label='Sample Data')
ax1.axvline(sample_mean, color='red', linestyle='-', linewidth=3,
            label=f'Sample Mean = {sample_mean:.1f}mg')
ax1.axvline(claimed_mean, color='orange', linestyle='--', linewidth=3,
            label=f'Claimed Mean = {claimed_mean}mg')
ax1.set_xlabel('Caffeine Content (mg)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Sample vs Claimed Caffeine Content', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: t-distribution with test statistic and p-value
x = np.linspace(-4, 4, 1000)
y = stats.t.pdf(x, degrees_freedom)
ax2.plot(x, y, 'b-', linewidth=2, label=f't-distribution (df={degrees_freedom})')
ax2.fill_between(x, y, alpha=0.3, color='lightblue')

# Shade the p-value area (right tail beyond our t-statistic)
x_reject = x[x >= t_statistic]
y_reject = stats.t.pdf(x_reject, degrees_freedom)
ax2.fill_between(x_reject, y_reject, alpha=0.7, color='red',
                 label=f'p-value = {p_value:.4f}')
ax2.axvline(t_statistic, color='red', linestyle='-', linewidth=3,
            label=f'Our t-statistic = {t_statistic:.3f}')
ax2.set_xlabel('t-value', fontsize=12)
ax2.set_ylabel('Density', fontsize=12)
ax2.set_title('T-Distribution with Test Statistic', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
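For a numerical cross-check of the manual calculation, SciPy's `stats.ttest_1samp` runs the same test in a single call. The sketch below assumes SciPy 1.6 or later, where the `alternative` argument is available, and regenerates the caffeine sample with the same seed so that it runs on its own.

```python
import numpy as np
from scipy import stats

# Assumed setup: same seed and parameters as the lab's caffeine sample.
np.random.seed(123)
caffeine_data = np.random.normal(78, 8, 20)
claimed_mean = 75

# Manual calculation, following the formula t = (x_bar - mu_0) / (s / sqrt(n)).
n = len(caffeine_data)
t_manual = (caffeine_data.mean() - claimed_mean) / (caffeine_data.std(ddof=1) / np.sqrt(n))
p_manual = stats.t.sf(t_manual, df=n - 1)  # right-tailed p-value

# Built-in one-sample t-test; alternative='greater' makes it right-tailed.
result = stats.ttest_1samp(caffeine_data, popmean=claimed_mean, alternative='greater')

print(f"Manual:   t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"Built-in: t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Without `alternative='greater'`, `ttest_1samp` returns a two-sided p-value, which would be twice as large for this sample.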
Reflection Questions - SOLUTIONS
Answer these questions to check your understanding:
Hypotheses: What were your null and alternative hypotheses? Why did you choose a right-tailed test?
Answer: \(H_0\): \(\mu\) = 75 mg, \(H_1\): \(\mu\) > 75 mg. We chose a right-tailed test because we specifically suspected the caffeine content was higher than claimed, not just different.
Test Choice: Why did you use a t-test instead of a z-test for this problem?
Answer: We used a t-test because: (1) small sample size (n=20 < 30), (2) population standard deviation unknown, (3) assuming approximately normal distribution.
Results: What was your t-statistic and p-value? What do these numbers mean?
Answer: t ≈ 1.74, p ≈ 0.049. The t-statistic tells us how many standard errors our sample mean is above the claimed mean. The p-value tells us there's only a 4.9% chance of seeing a result at least this extreme if the true mean were 75 mg.
Decision: What was your final conclusion at \(\alpha = 0.05\)? Do you reject or fail to reject the null hypothesis?
Answer: We REJECT \(H_0\) because the p-value (0.049) < \(\alpha\) (0.05). There's sufficient evidence that the actual caffeine content exceeds 75 mg, although the result is close to the threshold.
Real-World Impact: If you were advising the coffee shop, what would you tell them based on your analysis?
Answer: "Your espresso shots appear to contain significantly more caffeine than advertised. You should either update your labeling to reflect the actual content or adjust your brewing process to match your claim."
Task 2: Simple Linear Regression
Estimated time: 25 minutes
What is Simple Linear Regression?
Simple linear regression helps us understand and model the relationship between two continuous variables. Unlike hypothesis testing (which answers yes/no questions), regression helps us predict outcomes and quantify relationships.
Real-world example: As a student, you've probably wondered: "If I study more hours, how much will my exam score improve?" Linear regression can help answer this question by finding the relationship between study time and exam performance.
The Question: Can we predict exam scores based on hours studied? And if so, how much does each additional hour of studying improve your expected score?
At a glance: what you'll do
Explore & visualize the data
Measure correlation (r) and \(R^2\)
Fit the regression line \(\hat{y} = \beta_0 + \beta_1 x\)
Test if the slope is significant
Predict new values & quantify error
Check model assumptions
Visualize diagnostics
Write a plain-English conclusion
Key Concepts:
Correlation: How strongly two variables move together (-1 to +1)
Slope: How much \(Y\) changes when \(X\) increases by \(1\) unit
Intercept: The predicted value of \(Y\) when \(X = 0\)
\(R^2\): What percentage of the variation in \(Y\) is explained by \(X\)
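These four quantities are tied together by a few standard formulas: the slope equals \(r \cdot s_y / s_x\), the intercept equals \(\bar{y} - \text{slope} \cdot \bar{x}\), and in simple linear regression \(R^2 = r^2\). The sketch below computes them by hand on a tiny hypothetical dataset (the numbers are made up for illustration and are not the lab's data).

```python
import numpy as np

# Hypothetical toy data, for illustration only (not the lab's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Correlation: how strongly x and y move together (-1 to +1).
r = np.corrcoef(x, y)[0, 1]

# Slope and intercept of the least-squares line.
slope = r * y.std(ddof=1) / x.std(ddof=1)
intercept = y.mean() - slope * x.mean()

# R^2: share of the variation in y explained by the line.
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R^2 = {r_squared:.3f} (r^2 = {r**2:.3f})")
```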
Important
Remember: Correlation does not imply causation! Just because two variables are related doesn't mean one causes the other.
Scenario
You want to investigate the relationship between study hours and exam performance. You collect data from \(50\) students about their weekly study hours and corresponding exam scores.
Your Goal: Create a statistical model to predict exam scores based on study hours and determine how much each additional hour of studying helps.
Step 1: Explore the Data
# Generate realistic study data
np.random.seed(101)
n_students = 50

# Study hours (predictor variable X)
study_hours = np.random.uniform(1, 20, n_students)

# Exam scores with linear relationship plus noise
# True relationship: score = 65 + 2*hours + noise
true_intercept = 65
true_slope = 2
noise = np.random.normal(0, 8, n_students)
exam_scores = true_intercept + true_slope * study_hours + noise

# Create DataFrame for easier handling
study_data = pd.DataFrame({
    'hours_studied': study_hours,
    'exam_score': exam_scores
})

print("Study Hours vs Exam Scores Analysis")
print("="*45)
print(f"Sample size: {len(study_data)} students")
print(f"Study hours range: {study_hours.min():.1f} to {study_hours.max():.1f} hours")
print(f"Exam scores range: {exam_scores.min():.1f} to {exam_scores.max():.1f} points")
print(f"\nFirst 10 students:")
print(study_data.head(10).round(2))

Study Hours vs Exam Scores Analysis
=============================================
Sample size: 50 students
Study hours range: 1.5 to 19.9 hours
Exam scores range: 60.1 to 111.6 points
First 10 students:
hours_studied exam_score
0 10.81 88.53
1 11.84 104.66
2 1.54 60.14
3 4.26 75.09
4 14.02 83.95
5 16.84 98.69
6 6.83 86.87
7 17.98 99.70
8 14.71 94.17
9 4.61 79.42
Quick Questions:
Do you see any obvious pattern in the data?
Answer: Yes! As study hours increase, exam scores tend to increase too.
Which variable is the predictor (X) and which is the response (Y)?
Answer: Study hours is the predictor (X), exam scores is the response (Y).
Step 2: Calculate and Interpret Correlation
Correlation measures how strongly two variables move together.
print("STEP 2: Measuring the Relationship")
print("="*40)

# SOLUTION: Calculate the correlation coefficient
correlation = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Correlation coefficient: r = {correlation:.3f}")

# SOLUTION: Interpret the correlation strength
print(f"\nInterpretation:")
if abs(correlation) < 0.3:
    strength = "weak"
elif abs(correlation) < 0.7:
    strength = "moderate"
else:
    strength = "strong"
direction = "positive" if correlation > 0 else "negative"
print(f"   This indicates a {strength} {direction} relationship")
print(f"   between study hours and exam scores.")

print(f"\nWhat this means:")
print(f"   • r = {correlation:.3f} means the variables are {strength}ly related")
print(f"   • As study hours increase, exam scores tend to increase")
print(f"   • About {correlation**2:.1%} of the variation in scores")
print(f"     can be explained by study hours alone")

STEP 2: Measuring the Relationship
========================================
Correlation coefficient: r = 0.753
Interpretation:
This indicates a strong positive relationship
between study hours and exam scores.
What this means:
• r = 0.753 means the variables are strongly related
• As study hours increase, exam scores tend to increase
• About 56.8% of the variation in scores
can be explained by study hours alone
Check Your Understanding:
What does r = 0.8 vs r = 0.3 tell you?
Answer: r = 0.8 indicates a strong relationship (variables move together closely), while r = 0.3 indicates a weak relationship (more scattered, less predictable).
If r = -0.9, what would that mean?
Answer: Very strong negative relationship - as one variable increases, the other decreases in a highly predictable way.
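To get a feel for what different values of r look like, it can help to simulate data with a chosen correlation and compare. The sketch below uses a standard construction, \(y = r\,x + \sqrt{1 - r^2}\,\varepsilon\) with standard-normal \(x\) and \(\varepsilon\), which gives a population correlation of \(r\). The helper name `simulate_y`, the sample size, and the seed are illustrative choices rather than anything from the lab.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.standard_normal(n)

def simulate_y(x, target_r, rng):
    """Hypothetical helper: return y whose population correlation with x is target_r."""
    noise = rng.standard_normal(len(x))
    return target_r * x + np.sqrt(1 - target_r ** 2) * noise

# Compare a strong positive, a weak positive, and a very strong negative relationship.
sample_r = {}
for target_r in (0.8, 0.3, -0.9):
    y = simulate_y(x, target_r, rng)
    sample_r[target_r] = np.corrcoef(x, y)[0, 1]
    print(f"target r = {target_r:+.1f} -> sample r = {sample_r[target_r]:+.3f}, "
          f"r^2 = {sample_r[target_r] ** 2:.1%} of variation shared")
```

Plotting each simulated `y` against `x` with `plt.scatter` shows the point cloud tightening around a line as |r| grows.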
Step 3: Fit the Linear Regression Model
Now we'll find the "line of best fit" through our data points.
print("STEP 3: Fitting the Regression Line")
print("="*42)

# Set up the regression (add constant for intercept)
X = sm.add_constant(study_hours)  # Add intercept term

# SOLUTION: Fit the OLS (Ordinary Least Squares) model
model = sm.OLS(exam_scores, X).fit()

print(f"Regression Equation:")
print(f"   Exam Score = $\\beta_0$ + $\\beta_1$ × Hours Studied")
print(f"   Exam Score = {model.params[0]:.2f} + {model.params[1]:.2f} × Hours")

print(f"\nModel Coefficients:")
print(f"   Intercept ($\\beta_0$): {model.params[0]:.3f}")
print(f"   Slope ($\\beta_1$): {model.params[1]:.3f}")
print(f"   R-squared ($R^2$): {model.rsquared:.3f}")

# SOLUTION: Complete these interpretations
print(f"\nWhat These Numbers Mean:")
print(f"   Intercept ({model.params[0]:.1f}): Expected score with 0 hours of study")
print(f"   Slope ({model.params[1]:.2f}): Each additional hour increases score by {model.params[1]:.2f} points")
print(f"   $R^2$ ({model.rsquared:.3f}): Study hours explain {model.rsquared:.1%} of score variation")

STEP 3: Fitting the Regression Line
==========================================
Regression Equation:
Exam Score = $\beta_0$ + $\beta_1$ × Hours Studied
Exam Score = 69.94 + 1.67 × Hours
Model Coefficients:
Intercept ($\beta_0$): 69.936
Slope ($\beta_1$): 1.670
R-squared ($R^2$): 0.568
What These Numbers Mean:
Intercept (69.9): Expected score with 0 hours of study
Slope (1.67): Each additional hour increases score by 1.67 points
$R^2$ (0.568): Study hours explain 56.8% of score variation
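As a cross-check on the statsmodels fit, `scipy.stats.linregress` estimates the same line with a lighter interface. This is a minimal sketch that regenerates the study data with the same seed so it runs on its own; it is an alternative route to the same coefficients, not a replacement for the OLS model used in the rest of the lab.

```python
import numpy as np
from scipy import stats

# Assumed setup: regenerate the lab's study data with the same seed and draw order.
np.random.seed(101)
study_hours = np.random.uniform(1, 20, 50)
noise = np.random.normal(0, 8, 50)
exam_scores = 65 + 2 * study_hours + noise

# linregress returns slope, intercept, r, the slope's p-value and its standard error.
fit = stats.linregress(study_hours, exam_scores)

print(f"Intercept: {fit.intercept:.3f}")
print(f"Slope:     {fit.slope:.3f} (std err {fit.stderr:.3f})")
print(f"r:         {fit.rvalue:.3f}")
print(f"R^2:       {fit.rvalue ** 2:.3f}")
```

Note that the fitted coefficients differ from the true values used to generate the data (65 and 2) because of random noise in a sample of 50 students.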
Step 4: Test Statistical Significance
Is the relationship we found statistically significant, or could it be due to chance?
print("STEP 4: Testing Statistical Significance")
print("="*46)

# Check if the slope is significantly different from zero
slope_pvalue = model.pvalues[1]  # p-value for the slope
alpha = 0.05

print(f"Hypothesis Test for Slope:")
print(f"   $H_0$: $\\beta_1$ = 0 (no relationship)")
print(f"   $H_1$: $\\beta_1$ ≠ 0 (there is a relationship)")
print(f"   $\\alpha$ = {alpha}")

print(f"\nTest Results:")
print(f"   Slope p-value: {slope_pvalue:.6f}")

# SOLUTION: Make the decision
if slope_pvalue < alpha:
    print(f"   Decision: REJECT $H_0$")
    print(f"   Conclusion: The relationship IS statistically significant")
    significance = "IS"
else:
    print(f"   Decision: FAIL TO REJECT $H_0$")
    print(f"   Conclusion: The relationship is NOT statistically significant")
    significance = "IS NOT"

print(f"\nBottom Line:")
print(f"   Study hours {significance} a significant predictor of exam scores")

# Show confidence intervals
conf_int = model.conf_int(alpha=0.05)
print(f"\n95% Confidence Intervals:")
print(f"   Intercept: [{conf_int[0,0]:.2f}, {conf_int[0,1]:.2f}]")
print(f"   Slope: [{conf_int[1,0]:.2f}, {conf_int[1,1]:.2f}]")

STEP 4: Testing Statistical Significance
==============================================
Hypothesis Test for Slope:
$H_0$: $\beta_1$ = 0 (no relationship)
$H_1$: $\beta_1$ ≠ 0 (there is a relationship)
$\alpha$ = 0.05
Test Results:
Slope p-value: 0.000000
Decision: REJECT $H_0$
Conclusion: The relationship IS statistically significant
Bottom Line:
Study hours IS a significant predictor of exam scores
95% Confidence Intervals:
Intercept: [64.81, 75.06]
Slope: [1.25, 2.09]
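The slope's p-value and confidence interval come from a t-test with \(n - 2\) degrees of freedom: \(t = b_1 / SE(b_1)\), where \(SE(b_1) = s / \sqrt{S_{xx}}\) and \(s\) is the residual standard error. The sketch below works through that calculation by hand, regenerating the study data with the same seed; variable names such as `s_xx` and `se_slope` are illustrative.

```python
import numpy as np
from scipy import stats

# Assumed setup: regenerate the lab's study data with the same seed and draw order.
np.random.seed(101)
study_hours = np.random.uniform(1, 20, 50)
noise = np.random.normal(0, 8, 50)
exam_scores = 65 + 2 * study_hours + noise

n = len(study_hours)
x_bar, y_bar = study_hours.mean(), exam_scores.mean()
s_xx = np.sum((study_hours - x_bar) ** 2)

# Least-squares estimates.
slope = np.sum((study_hours - x_bar) * (exam_scores - y_bar)) / s_xx
intercept = y_bar - slope * x_bar

# Residual standard error uses n - 2 because two parameters were estimated.
residuals = exam_scores - (intercept + slope * study_hours)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# t-test for H0: slope = 0 (two-sided), plus a 95% confidence interval.
se_slope = s / np.sqrt(s_xx)
t_slope = slope / se_slope
p_slope = 2 * stats.t.sf(abs(t_slope), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low, ci_high = slope - t_crit * se_slope, slope + t_crit * se_slope

print(f"slope = {slope:.3f}, SE = {se_slope:.3f}, t = {t_slope:.3f}")
print(f"p-value = {p_slope:.2e}")
print(f"95% CI for slope: [{ci_low:.2f}, {ci_high:.2f}]")
```

A p-value that prints as 0.000000 with six decimals is not literally zero; scientific notation shows how small it actually is.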
Step 5: Make Predictions
Now let's use our model to predict exam scores for different study scenarios.
print("STEP 5: Making Predictions")
print("="*32)

# SOLUTION: Calculate predictions for different study hours
example_hours = [5, 10, 15, 20]
print(f"Prediction Examples:")
for hours in example_hours:
    # SOLUTION: Calculate predicted score
    pred_score = model.params[0] + model.params[1] * hours
    print(f"   {hours:2d} hours → Predicted score: {pred_score:.1f} points")

print(f"\nYour Turn:")
# SOLUTION: Pick your own study hours and make a prediction
your_hours = 12  # Enter a number between 1-20
your_prediction = model.params[0] + model.params[1] * your_hours
print(f"   {your_hours} hours → Predicted score: {your_prediction:.1f} points")

# Calculate residuals for analysis
y_predicted = model.predict(X)
residuals = exam_scores - y_predicted
residual_std = np.std(residuals, ddof=2)  # ddof=2: two estimated parameters

print(f"\nPrediction Accuracy:")
print(f"   Typical prediction error (residual std dev): ±{residual_std:.1f} points")
print(f"   Roughly two-thirds of predictions fall within ±{residual_std:.1f} points of actual scores")

STEP 5: Making Predictions
================================
Prediction Examples:
5 hours → Predicted score: 78.3 points
10 hours → Predicted score: 86.6 points
15 hours → Predicted score: 95.0 points
20 hours → Predicted score: 103.3 points
Your Turn:
12 hours → Predicted score: 90.0 points
Prediction Accuracy:
Typical prediction error (residual std dev): ±8.2 points
Roughly two-thirds of predictions fall within ±8.2 points of actual scores
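A point prediction can be paired with a 95% prediction interval for a single new student, using the textbook formula \(\hat{y}_0 \pm t_{0.975,\,n-2} \cdot s \sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0 - \bar{x})^2}{S_{xx}}}\). The sketch below applies it at 12 study hours; it regenerates the data with the same seed and fits the line with `np.polyfit` so that it runs on its own, which is a convenience choice rather than the lab's method.

```python
import numpy as np
from scipy import stats

# Assumed setup: regenerate the lab's study data with the same seed and draw order.
np.random.seed(101)
study_hours = np.random.uniform(1, 20, 50)
noise = np.random.normal(0, 8, 50)
exam_scores = 65 + 2 * study_hours + noise

n = len(study_hours)
x_bar = study_hours.mean()
s_xx = np.sum((study_hours - x_bar) ** 2)

# Least-squares fit and residual standard error (n - 2 degrees of freedom).
slope, intercept = np.polyfit(study_hours, exam_scores, 1)
residuals = exam_scores - (intercept + slope * study_hours)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# 95% prediction interval for one new student who studies x0 hours.
x0 = 12
pred = intercept + slope * x0
t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
lower, upper = pred - half_width, pred + half_width

print(f"Predicted score at {x0} hours: {pred:.1f}")
print(f"95% prediction interval: [{lower:.1f}, {upper:.1f}]")
```

The interval is roughly twice the residual standard deviation on each side, and it widens as \(x_0\) moves away from the average study time.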
Step 6: Check Model Assumptions
Before trusting our model, we need to verify it meets the assumptions of linear regression.
print("STEP 6: Checking Model Assumptions")
print("="*42)

print("Linear Regression Assumptions:")
print("   1. Linear relationship between X and Y")
print("   2. Residuals are normally distributed")
print("   3. Residuals have constant variance (homoscedasticity)")
print("   4. Residuals are independent")

# Calculate residuals
y_predicted = model.predict(X)
residuals = exam_scores - y_predicted

print(f"\nResidual Analysis:")
print(f"   Mean residual: {np.mean(residuals):.6f} (should be ≈ 0)")
print(f"   Std of residuals: {np.std(residuals, ddof=2):.3f}")

# SOLUTION: Check normality of residuals using Shapiro-Wilk test
from scipy.stats import shapiro
shapiro_stat, shapiro_p = shapiro(residuals)

print(f"\nNormality Test (Shapiro-Wilk):")
print(f"   p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("   Residuals appear normally distributed")
else:
    print("   Warning: residuals may not be normally distributed")

STEP 6: Checking Model Assumptions
==========================================
Linear Regression Assumptions:
1. Linear relationship between X and Y
2. Residuals are normally distributed
3. Residuals have constant variance (homoscedasticity)
4. Residuals are independent
Residual Analysis:
Mean residual: -0.000000 (should be ≈ 0)
Std of residuals: 8.167
Normality Test (Shapiro-Wilk):
p-value: 0.4928
Residuals appear normally distributed
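The Shapiro-Wilk test covers only the normality assumption. As a rough numerical complement for the other two residual assumptions, the sketch below computes the Durbin-Watson statistic by hand (values near 2 suggest no first-order autocorrelation) and a Spearman correlation between the fitted values and the absolute residuals (values near 0 are consistent with constant variance). The Spearman check is an informal heuristic chosen for illustration, not a formal heteroscedasticity test; the data are regenerated with the same seed.

```python
import numpy as np
from scipy import stats

# Assumed setup: regenerate the lab's study data with the same seed and draw order.
np.random.seed(101)
study_hours = np.random.uniform(1, 20, 50)
noise = np.random.normal(0, 8, 50)
exam_scores = 65 + 2 * study_hours + noise

# Fit the line and compute residuals.
slope, intercept = np.polyfit(study_hours, exam_scores, 1)
fitted = intercept + slope * study_hours
residuals = exam_scores - fitted

# Independence: Durbin-Watson statistic, always between 0 and 4.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Constant variance (heuristic): does the residual size trend with the fitted value?
rho, rho_p = stats.spearmanr(fitted, np.abs(residuals))

print(f"Durbin-Watson: {durbin_watson:.3f} (near 2 suggests independent residuals)")
print(f"Spearman rho between fitted and |residuals|: {rho:.3f} (p = {rho_p:.3f})")
```

These numbers are best read alongside the diagnostic plots rather than in place of them.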
Step 7: Visualize Your Results
# Create comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Scatter plot with regression line
ax1.scatter(study_hours, exam_scores, alpha=0.6, color='blue', s=60, label='Student Data')
sorted_hours = np.sort(study_hours)
sorted_predictions = model.params[0] + model.params[1] * sorted_hours
ax1.plot(sorted_hours, sorted_predictions, color='red', linewidth=3,
         label=f'y = {model.params[0]:.1f} + {model.params[1]:.2f}x')
ax1.set_xlabel('Study Hours', fontsize=12)
ax1.set_ylabel('Exam Score', fontsize=12)
ax1.set_title(f'Study Hours vs Exam Scores\n$R^2$ = {model.rsquared:.3f}',
              fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Residuals vs Fitted values
ax2.scatter(y_predicted, residuals, alpha=0.6, color='purple', s=50)
ax2.axhline(y=0, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Fitted Values', fontsize=12)
ax2.set_ylabel('Residuals', fontsize=12)
ax2.set_title('Residuals vs Fitted\n(Should show no pattern)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Plot 3: Q-Q plot for normality of residuals
stats.probplot(residuals, dist="norm", plot=ax3)
ax3.set_title('Q-Q Plot of Residuals\n(Should be roughly linear)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

# Plot 4: Histogram of residuals
ax4.hist(residuals, bins=12, density=True, alpha=0.7, color='lightgreen', edgecolor='black')
ax4.set_xlabel('Residuals', fontsize=12)
ax4.set_ylabel('Density', fontsize=12)
ax4.set_title('Distribution of Residuals\n(Should look normal)', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

# Overlay normal curve
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
y_norm = stats.norm.pdf(x_norm, np.mean(residuals), np.std(residuals))
ax4.plot(x_norm, y_norm, 'r-', linewidth=2, label='Normal curve')
ax4.legend()

plt.tight_layout()
plt.show()
Step 8: Interpret Your Model
print("FINAL INTERPRETATION")
print("="*25)

print(f"Our Model: Exam Score = {model.params[0]:.1f} + {model.params[1]:.2f} × Study Hours")

print(f"\nKey Findings:")
print(f"   Strong positive relationship (r = {correlation:.3f})")
print(f"   Study hours explain {model.rsquared:.1%} of score variation")
print(f"   Each extra hour → {model.params[1]:.1f} point increase")
print(f"   Relationship is statistically significant (p < 0.001)")

print(f"\nPractical Insights:")
print(f"   Going from 5 to 10 hours of study:")
pred_5 = model.params[0] + model.params[1] * 5
pred_10 = model.params[0] + model.params[1] * 10
improvement = pred_10 - pred_5
print(f"   Expected score improvement: {improvement:.1f} points")

print(f"\nImportant Limitations:")
print(f"   • Correlation ≠ Causation")
print(f"   • Model only explains {model.rsquared:.1%} of variation")
print(f"   • Other factors matter too (sleep, prior knowledge, etc.)")
print(f"   • Predictions have uncertainty: ±{residual_std:.1f} points")

# Show full model summary
print(f"\nFull Statistical Summary:")
print("="*30)
print(model.summary())

FINAL INTERPRETATION
=========================
Our Model: Exam Score = 69.9 + 1.67 × Study Hours
Key Findings:
Strong positive relationship (r = 0.753)
Study hours explain 56.8% of score variation
Each extra hour → 1.7 point increase
Relationship is statistically significant (p < 0.001)
Practical Insights:
Going from 5 to 10 hours of study:
Expected score improvement: 8.4 points
Important Limitations:
• Correlation ≠ Causation
• Model only explains 56.8% of variation
• Other factors matter too (sleep, prior knowledge, etc.)
• Predictions have uncertainty: ±8.2 points
Full Statistical Summary:
==============================
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.568
Model: OLS Adj. R-squared: 0.559
Method: Least Squares F-statistic: 63.01
Date: Tue, 29 Jul 2025 Prob (F-statistic): 2.73e-10
Time: 20:58:50 Log-Likelihood: -174.93
No. Observations: 50 AIC: 353.9
Df Residuals: 48 BIC: 357.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 69.9358 2.551 27.415 0.000 64.807 75.065
x1 1.6702 0.210 7.938 0.000 1.247 2.093
==============================================================================
Omnibus: 2.245 Durbin-Watson: 2.594
Prob(Omnibus): 0.325 Jarque-Bera (JB): 1.440
Skew: 0.139 Prob(JB): 0.487
Kurtosis: 2.217 Cond. No. 26.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
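Several numbers in the summary table are linked by simple identities that are easy to verify by hand. In simple linear regression the F-statistic equals the square of the slope's t-statistic, and adjusted \(R^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}\) with \(k = 1\) predictor. The short sketch below plugs in the rounded values printed in the table above.

```python
# Values copied (rounded) from the OLS summary table above.
t_slope = 7.938      # t-statistic for x1
r_squared = 0.568    # R-squared
n = 50               # No. Observations
k = 1                # Df Model (number of predictors)

# With a single predictor, F = t^2.
f_from_t = t_slope ** 2

# Adjusted R^2 penalizes R^2 for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"F from t^2:   {f_from_t:.2f}")       # summary reports 63.01
print(f"Adjusted R^2: {adj_r_squared:.3f}")  # summary reports 0.559
```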
Reflection Questions - SOLUTIONS
Test your understanding by answering these questions:
Correlation vs Causation:
What was your correlation coefficient?
Answer: r ≈ 0.75 (strong positive correlation)
Does this prove that studying more causes higher exam scores? Why or why not?
Answer: No! Correlation ≠ causation. While there's a strong relationship, other factors could explain both variables (intelligence, motivation, time management skills), or the relationship could run in reverse (students who are doing well might be motivated to study more).
Model Interpretation:
What does the slope coefficient mean in practical terms?
Answer: Each additional hour of study is associated with an increase of about 1.7 points in exam score on average.
What does the intercept represent, and does it make sense?
Answer: The intercept (~70) represents the predicted exam score for 0 hours of study. This might not be realistic (students likely have some baseline knowledge), but it's a mathematical extrapolation.
Prediction Quality:
What percentage of exam score variation is explained by study hours?
Answer: About 57% (R² ≈ 0.57)
How accurate are your predictions (what's the typical error)?
Answer: Typical prediction error is about ±8 points.
Statistical Significance:
Is the relationship statistically significant?
Answer: Yes, the p-value for the slope is much less than 0.05.
What would it mean if the p-value for the slope was 0.20?
Answer: We would fail to reject \(H_0\) and conclude there's insufficient evidence of a relationship between study hours and exam scores.
Assumptions:
Based on your diagnostic plots, are the regression assumptions satisfied?
Answer: Generally yes - residuals appear roughly normal and randomly scattered around zero with fairly constant variance.
What would you do if the assumptions were violated?
Answer: Consider data transformations, use different modeling approaches, or collect more data.
Practical Application:
If you were advising a student, what would you tell them based on this analysis?
Answer: "Study time appears to have a strong positive relationship with exam performance. Each extra hour might improve your score by about 1.7 points on average. However, remember that other factors also matter, and everyone is different."
What other variables might improve your prediction model?
Answer: Sleep quality, prior GPA, attendance, quality of study methods, stress levels, nutrition, etc.
Lab Summary
Congratulations! You've successfully completed Lab 6 and learned fundamental statistical analysis techniques:
What You Accomplished
One-Sample T-Test: Tested a coffee shop's caffeine claims using hypothesis testing
Simple Linear Regression: Modeled the relationship between study hours and exam performance
Statistical Interpretation: Translated statistical results into practical insights
Critical Thinking: Distinguished between correlation and causation