Making Decisions with Data: From Questions to Statistical Evidence
2025-07-30
“The goal is not to eliminate uncertainty, but to make informed decisions despite it”
When:
- 📅 Date: Friday, July 25
- ⏰ Window: 7 AM – 12 AM
- ⏳ Duration: 1 hour once started
Where: 💻 Online via Canvas
What is Hypothesis Testing?
Hypothesis testing lets us use sample data to weigh competing claims about a population.
Workflow
The goal is not to prove anything with certainty, but to judge whether the evidence tips the scale away from H₀.
Hypothesis testing helps us answer: “Is what we observed in our sample strong enough evidence to conclude something about the population?”
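To make the workflow concrete before the analogy below, here is a minimal sketch in Python (the data and the hypothesized mean of 100 are made up for illustration): state the hypotheses, compute a test statistic from a sample, and compare the p-value to α.

```python
# Workflow sketch: H0: mu = 100 vs H1: mu != 100 (illustrative numbers only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=103, scale=15, size=30)   # step 1: collect sample data

alpha = 0.05                                      # step 2: choose significance level
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)  # step 3: statistic + p-value

# step 4: decision
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```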
Criminal Justice System
| Aspect | Criminal Court |
|---|---|
| Starting Position | Defendant is innocent |
| Burden of Proof | Prosecution must prove guilt |
| Evidence Standard | Beyond reasonable doubt |
| Decision Options | Guilty or Not Guilty |
| Type I Error | Convict an innocent person |
| Type II Error | Acquit a guilty person |
| Consequences | Balance justice vs. protecting innocent |
Statistical Hypothesis Testing
| Aspect | Hypothesis Testing |
|---|---|
| Starting Position | Null hypothesis (H₀) is true |
| Burden of Proof | Data must support alternative (H₁) |
| Evidence Standard | p ≤ α (usually 0.05) |
| Decision Options | Reject H₀ or Fail to reject H₀ |
| Type I Error | Reject a true null (false positive) |
| Type II Error | Fail to reject a false null (false negative) |
| Consequences | Balance discovery vs. false claims |
Key Insight 💡 Just like in court, we never “prove” innocence or accept the null hypothesis—we only decide whether the evidence is strong enough to reject it.
What is a p-value?
The p-value answers the question
“If the null hypothesis were true, how likely is a result at least this extreme?”
Formally, for an observed test statistic $T_{\text{obs}}$,

$$p = P\bigl(|T| \ge |T_{\text{obs}}| \;\big|\; H_0\bigr).$$
Smaller p-values → data less compatible with H₀ → stronger evidence against H₀.
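The definition translates directly into code via the t distribution's survival function. A small sketch (the observed statistic and degrees of freedom below are placeholders):

```python
from scipy import stats

t_obs, df = 2.5, 24   # placeholder values for illustration

# Two-sided: P(|T| >= |t_obs|) under H0; sf(x) = 1 - cdf(x)
p_two_sided = 2 * stats.t.sf(abs(t_obs), df)

# One-sided (H1: mu > mu0): P(T >= t_obs) under H0
p_one_sided = stats.t.sf(t_obs, df)

print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```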
Interpretation Cheat-Sheet
| p-value | Evidence against H₀ |
|---|---|
| p > 0.10 | Little / none |
| 0.05 < p ≤ 0.10 | Weak |
| 0.01 < p ≤ 0.05 | Moderate |
| p ≤ 0.01 | Strong |
(Guidelines, not iron-clad laws.)
Common Pitfalls
- The p-value is not the probability that H₀ is true; it is computed assuming H₀ is true.
- A large p-value does not prove H₀; it only means the data provide insufficient evidence against it.
- A small p-value signals statistical significance, not practical importance.

Example: In our treatment test, t = 2.5 gave p = 0.006. If H₀ were true, such an extreme outcome would occur only 0.6% of the time, which is compelling evidence favoring the new treatment.
Error Types Matrix
| Decision ↓ / Reality → | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✔ Correct (Power) |
| Fail to Reject H₀ | ✔ Correct | Type II Error (β) |
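To see that α really is the long-run false-positive rate, simulate many experiments in which H₀ is true and count how often it gets rejected. A sketch with arbitrary parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sims = 0.05, 30, 10_000

# H0 is true in every simulated experiment: the population mean really is 0
false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    false_positives += p <= alpha

print(f"Estimated Type I error rate: {false_positives / n_sims:.3f}")  # ≈ alpha
```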
Real‑World Consequences
| Context | Type I Error (False Positive) | Type II Error (False Negative) |
|---|---|---|
| Medical Test | Treat healthy patient | Miss actual disease |
| Drug Approval | Approve ineffective drug | Reject effective drug |
| Fire Alarm | False alarm evacuation | Fail to detect real fire |
Note
Balancing the risks
Lowering the significance level α reduces the chance of Type I errors but increases the risk of Type II errors, unless you gather more data or target a larger effect.
Choose α based on which error would be more costly in your scenario.
Key Insight: Higher power means you’re more likely to detect a true effect when it exists. Aim for power ≥ 0.80!
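Power can be estimated by simulation under a specific alternative. In this sketch (the effect size, n, and trial count are arbitrary choices), power is the fraction of simulated experiments that correctly reject H₀:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, true_effect = 0.05, 25, 0.5   # arbitrary illustration values
n_sims = 10_000

# H0 (mu = 0) is false: the true mean is shifted by `true_effect` SDs
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += p <= alpha

print(f"Estimated power: {rejections / n_sims:.3f}")  # well below the 0.80 target here
```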
Research Question: A new study technique claims to improve test scores. The current average is 75. We test 25 students using the new method.
```
📊 Sample Data Summary:
========================================
Sample size (n): 25
Sample mean (x̄): 77.19
Sample std (s): 7.65
Current average (μ₀): 75

🎯 DETAILED RESULTS:
==================================================
Test Statistic: t = 1.432
P-value: 0.0825
Critical Value: 1.711
Effect Size (Cohen's d): 0.286

❌ DECISION: Fail to reject H₀
📊 CONCLUSION: There is insufficient evidence (p = 0.0825) that the new study method improves test scores.
💡 PRACTICAL IMPACT: The observed difference could reasonably be due to chance.

🐍 PYTHON IMPLEMENTATION:
========================================
Method 1: scipy.stats.ttest_1samp
t-statistic: 1.432
p-value (two-tailed): 0.1650
p-value (one-tailed): 0.0825

Method 2: Manual with 95% Confidence Interval
95% CI: (74.03, 80.35)
Interpretation: We're 95% confident the true mean is between 74.0 and 80.4

Effect Size (Cohen's d): 0.286
Effect size interpretation: small effect
```
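Because only summary statistics are printed above (the raw scores aren't shown), one way to reproduce these numbers is directly from the formulas; small rounding differences from the printed output are expected since the summary values are themselves rounded:

```python
import numpy as np
from scipy import stats

n, x_bar, s, mu0 = 25, 77.19, 7.65, 75   # summary statistics from above

se = s / np.sqrt(n)                       # standard error of the mean
t_stat = (x_bar - mu0) / se
df = n - 1

p_one_tailed = stats.t.sf(t_stat, df)     # H1: mu > 75
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)

t_crit = stats.t.ppf(0.975, df)           # for the 95% CI
ci = (x_bar - t_crit * se, x_bar + t_crit * se)
d = (x_bar - mu0) / s                     # Cohen's d

print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), d = {d:.3f}")
```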
Research Question: Compare effectiveness of two teaching methods
```
📊 TWO-GROUP COMPARISON:
========================================
Method A (Traditional):
  n = 30, mean = 75.45, std = 11.87
Method B (New):
  n = 28, mean = 82.72, std = 14.82

Difference in means: 7.27 points
```
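The output above doesn't show the test itself, but SciPy can run the comparison straight from these summary statistics. A sketch using Welch's t-test (which doesn't assume the two groups share a variance):

```python
from scipy import stats

# Summary statistics from the comparison above
res = stats.ttest_ind_from_stats(
    mean1=75.45, std1=11.87, nobs1=30,   # Method A (Traditional)
    mean2=82.72, std2=14.82, nobs2=28,   # Method B (New)
    equal_var=False,                     # Welch's t-test
)
print(f"t = {res.statistic:.3f}, two-sided p = {res.pvalue:.4f}")
```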
```
🔑 KEY LESSON: Statistical Significance ≠ Practical Importance
============================================================
Left: Tiny effect (0.02) is statistically significant because the sample is large
Right: Large effect (8.7) is not statistically significant because the sample is small

💡 Always consider BOTH statistical significance AND effect size!
```
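A quick simulation makes this lesson concrete. Echoing the left/right comparison above (the means, spreads, and sample sizes below are made up to match its spirit), a tiny effect can reach significance with a huge sample while a large effect fails to with a tiny one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Tiny effect (0.02 points) with a huge sample: typically significant
big = rng.normal(loc=100.02, scale=1.0, size=1_000_000)
print("tiny effect, n = 1,000,000:", stats.ttest_1samp(big, popmean=100).pvalue)

# Large effect (8.7 points) with a tiny sample: typically not significant
small = rng.normal(loc=108.7, scale=15.0, size=4)
print("large effect, n = 4:", stats.ttest_1samp(small, popmean=100).pvalue)
```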
```python
# =============================================================================
# COMPLETE HYPOTHESIS TESTING TOOLKIT
# =============================================================================
import numpy as np
from scipy import stats

# -----------------------------------------------------------------------------
# Template 1: One-Sample t-test
# -----------------------------------------------------------------------------
def one_sample_ttest(data, null_value, alpha=0.05, alternative='two-sided'):
    """
    Perform a one-sample t-test with complete analysis.

    Parameters
    ----------
    data : array-like
        Sample data.
    null_value : float
        Hypothesized population mean.
    alpha : float
        Significance level (default 0.05).
    alternative : str
        'two-sided', 'greater', or 'less'.
    """
    # Descriptive statistics
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)          # sample standard deviation
    se = s / np.sqrt(n)               # standard error of the mean

    # Test statistic
    t_stat = (x_bar - null_value) / se
    df = n - 1

    # P-value for the requested alternative
    if alternative == 'two-sided':
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    elif alternative == 'greater':
        p_value = 1 - stats.t.cdf(t_stat, df)
    elif alternative == 'less':
        p_value = stats.t.cdf(t_stat, df)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    # Effect size (Cohen's d)
    cohens_d = (x_bar - null_value) / s

    # Two-sided confidence interval for the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci_lower = x_bar - t_crit * se
    ci_upper = x_bar + t_crit * se

    return {
        'test_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'effect_size': cohens_d,
        'confidence_interval': (ci_lower, ci_upper),
        'reject_null': p_value <= alpha,
        'sample_mean': x_bar,
        'sample_std': s,
        'sample_size': n,
    }

# -----------------------------------------------------------------------------
# Template 2: Two-Sample t-test
# -----------------------------------------------------------------------------
def two_sample_ttest(group1, group2, alpha=0.05, equal_var=True):
    """Perform a two-sample t-test with complete analysis."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = np.mean(group1), np.mean(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)

    if equal_var:
        # Pooled-variance (Student's) t-test
        pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        se = np.sqrt(s1**2 / n1 + s2**2 / n2)
        df = (s1**2 / n1 + s2**2 / n2)**2 / (
            (s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))

    t_stat = (mean1 - mean2) / se
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # two-sided

    # Effect size (Cohen's d), using the pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    cohens_d = (mean1 - mean2) / pooled_std

    return {
        'test_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'effect_size': cohens_d,
        'reject_null': p_value <= alpha,
        'group1_stats': {'mean': mean1, 'std': s1, 'n': n1},
        'group2_stats': {'mean': mean2, 'std': s2, 'n': n2},
    }

# -----------------------------------------------------------------------------
# Template 3: Power Analysis
# -----------------------------------------------------------------------------
def power_analysis(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size for the desired power (two-sided z-approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

# -----------------------------------------------------------------------------
# Example Usage
# -----------------------------------------------------------------------------
# Generate sample data
np.random.seed(42)
sample_data = np.random.normal(105, 15, 25)

# Perform one-sample t-test
results = one_sample_ttest(sample_data, null_value=100, alternative='greater')
print("One-Sample t-test Results:")
print(f"Test statistic: {results['test_statistic']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Effect size (d): {results['effect_size']:.3f}")
print(f"95% CI: ({results['confidence_interval'][0]:.2f}, "
      f"{results['confidence_interval'][1]:.2f})")
print(f"Reject null: {results['reject_null']}")

# Power analysis
required_n = power_analysis(effect_size=0.5, power=0.8)
print(f"\nRequired sample size for d=0.5, power=0.8: {required_n}")
```

Remember: Hypothesis testing is a tool for making informed decisions under uncertainty. Use it wisely, report honestly, and always consider the practical implications of your statistical conclusions.
“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” - John Tukey
Next class: Regression Analysis! 📊