Hypothesis Testing: Making Decisions with Data

From Questions to Statistical Evidence

Narjes Mathlouthi

2025-07-30

Welcome to Hypothesis Testing

Making Decisions with Data: From Questions to Statistical Evidence

“The goal is not to eliminate uncertainty, but to make informed decisions despite it”

📢 Quick Announcements

📝 Quiz 2 Reminder

When:
- 📅 Date: Friday, July 25
- ⏰ Window: 7 AM – 12 AM
- ⏳ Duration: 1 hour once started

Where: 💻 Online via Canvas

📚 Today’s Focus

Foundation: Logic of hypothesis testing
Practice: Real examples with Python
Skills: Making statistical decisions
Applications: From medicine to marketing

Learning Journey Today 🎯

🧠 Conceptual Goals

Understand the logic of hypothesis testing
Get familiar with the language of statistical decisions
Recognize different types of errors and their consequences
Connect to confidence intervals from last lecture

🛠️ Practical Skills

Formulate hypotheses from research questions
Calculate and interpret p-values correctly
Perform hypothesis tests in Python
Make informed decisions using statistical evidence
Communicate results effectively

What is Hypothesis Testing?

What is Hypothesis Testing?

Hypothesis testing lets us use sample data to weigh competing claims about a population.

Workflow

State H₀ (null) – the status‑quo or “no‑effect” position
State H₁ (alternative) – the research claim you hope to support
Choose α – the tolerable Type I error rate (e.g., 0.05)
Compute a test statistic – compress the data into one number
Find the p‑value – “How unusual is this statistic if H₀ were true?”
Make a decision – reject H₀ if p ≤ α; otherwise fail to reject

The goal is not to prove anything with certainty, but to judge whether the evidence tips the scale away from H₀.

Hypothesis testing helps us answer: “Is what we observed in our sample strong enough evidence to conclude something about the population?”

The Courtroom Analogy ⚖️

Criminal Justice System

Aspect	Criminal Court
Starting Position	Defendant is innocent
Burden of Proof	Prosecution must prove guilt
Evidence Standard	Beyond reasonable doubt
Decision Options	Guilty or Not Guilty
Type I Error	Convict an innocent person
Type II Error	Acquit a guilty person
Consequences	Balance justice vs. protecting innocent

Statistical Hypothesis Testing

Aspect	Hypothesis Testing
Starting Position	Null hypothesis (H₀) is true
Burden of Proof	Data must support alternative (H₁)
Evidence Standard	p ≤ α (usually 0.05)
Decision Options	Reject H₀ or Fail to reject H₀
Type I Error	Reject a true null (false positive)
Type II Error	Fail to reject a false null (false negative)
Consequences	Balance discovery vs. false claims

Key Insight 💡 Just like in court, we never “prove” innocence or accept the null hypothesis—we only decide whether the evidence is strong enough to reject it.

The Six Steps of Hypothesis Testing 📋

Types of Alternative Hypotheses 🎯

What is a p-value?

What is a p-value?

The p-value answers the question
“If the null hypothesis were true, how likely is a result at least this extreme?”

Formally, for an observed test statistic T_{\text{obs}},

p = P\bigl(|T| \ge |T_{\text{obs}}| \;\big|\; H_0\bigr).

Smaller p-values → data less compatible with H_0 → stronger evidence against H_0.

Interpretation Cheat-Sheet

p-value	Evidence vs H_0
p > 0.10	Little / none
0.05 < p \le 0.10	Weak
0.01 < p \le 0.05	Moderate
p \le 0.01	Strong

(Guidelines, not iron-clad laws.)

Common Pitfalls

p is not the probability that (H_0) is true
p is not the probability the result occurred “by chance”
A non-significant p does not prove H_0
Statistical significance ≠ practical importance

Example: In our treatment test, t = 2.5 gave p = 0.006.
If H_0 were true, such an extreme outcome would occur only 0.6 % of the time,compelling evidence favoring the new treatment.

Types of Errors: The Trade-off 🎲

Error Types Matrix

Decision ↓ / Reality →	H₀ True	H₀ False
Reject H₀	Type I Error (α)	✔ Correct (Power)
Fail to Reject H₀	✔ Correct	Type II Error (β)

Real‑World Consequences

Context	Type I Error (False Positive)	Type II Error (False Negative)
Medical Test	Treat healthy patient	Miss actual disease
Drug Approval	Approve ineffective drug	Reject effective drug
Fire Alarm	False alarm evacuation	Fail to detect real fire

Note

Balancing the risks

Lowering the significance level \alpha reduces the chance of Type I mistakes but increases the risk of Type II errors unless you gather more data or target a larger effect.
Choose $$ based on which error would be more costly in your scenario.

Statistical Power: Detecting True Effects 💪

Key Insight: Higher power means you’re more likely to detect a true effect when it exists. Aim for power ≥ 0.80!

Example 1: One-Sample t-test 📊

Problem Setup

Research Question: A new study technique claims to improve test scores. The current average is 75. We test 25 students using the new method.

📊 Sample Data Summary:
========================================
Sample size (n): 25
Sample mean (x̄): 77.19
Sample std (s): 7.65
Current average (μ₀): 75

Complete Hypothesis Test


🎯 DETAILED RESULTS:
==================================================
Test Statistic: t = 1.432
P-value: 0.0825
Critical Value: 1.711
Effect Size (Cohen's d): 0.286

❌ DECISION: Fail to reject H₀
📊 CONCLUSION: There is insufficient evidence (p = 0.0825) that the new study method improves test scores.
💡 PRACTICAL IMPACT: The observed difference could reasonably be due to chance.

Using Python’s Built-in Functions

🐍 PYTHON IMPLEMENTATION:
========================================
Method 1: scipy.stats.ttest_1samp
t-statistic: 1.432
p-value (two-tailed): 0.1650
p-value (one-tailed): 0.0825

Method 2: Manual with 95% Confidence Interval
95% CI: (74.03, 80.35)
Interpretation: We're 95% confident the true mean is between 74.0 and 80.4

Effect Size (Cohen's d): 0.286
Effect size interpretation: small effect

Example 2: Two-Sample t-test 📊

Problem Setup

Research Question: Compare effectiveness of two teaching methods

📊 TWO-GROUP COMPARISON:
========================================
Method A (Traditional):
  n = 30, mean = 75.45, std = 11.87

Method B (New):
  n = 28, mean = 82.72, std = 14.82

Difference in means: 7.27 points

Statistical vs Practical Significance 🤔

🔑 KEY LESSON: Statistical Significance ≠ Practical Importance
============================================================
Left: Tiny effect (0.02) but significant due to large sample
Right: Large effect (8.7) but significant with small sample

💡 Always consider BOTH statistical significance AND effect size!

Best Practices Summary 📋

Python Code Templates 💻

# =============================================================================
# COMPLETE HYPOTHESIS TESTING TOOLKIT
# =============================================================================

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# -----------------------------------------------------------------------------
# Template 1: One-Sample t-test
# -----------------------------------------------------------------------------
def one_sample_ttest(data, null_value, alpha=0.05, alternative='two-sided'):
    """
    Perform one-sample t-test with complete analysis
    
    Parameters:
    -----------
    data : array-like
        Sample data
    null_value : float
        Hypothesized population mean
    alpha : float
        Significance level (default 0.05)
    alternative : str
        'two-sided', 'greater', or 'less'
    """
    
    # Calculate statistics
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)
    se = s / np.sqrt(n)
    
    # Test statistic
    t_stat = (x_bar - null_value) / se
    df = n - 1
    
    # P-value calculation
    if alternative == 'two-sided':
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    elif alternative == 'greater':
        p_value = 1 - stats.t.cdf(t_stat, df)
    elif alternative == 'less':
        p_value = stats.t.cdf(t_stat, df)
    
    # Effect size (Cohen's d)
    cohens_d = (x_bar - null_value) / s
    
    # Confidence interval
    t_crit = stats.t.ppf(1 - alpha/2, df)
    ci_lower = x_bar - t_crit * se
    ci_upper = x_bar + t_crit * se
    
    # Results
    results = {
        'test_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'effect_size': cohens_d,
        'confidence_interval': (ci_lower, ci_upper),
        'reject_null': p_value <= alpha,
        'sample_mean': x_bar,
        'sample_std': s,
        'sample_size': n
    }
    
    return results

# -----------------------------------------------------------------------------
# Template 2: Two-Sample t-test
# -----------------------------------------------------------------------------
def two_sample_ttest(group1, group2, alpha=0.05, equal_var=True):
    """
    Perform two-sample t-test with complete analysis
    """
    
    # Calculate statistics
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = np.mean(group1), np.mean(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    
    if equal_var:
        # Pooled variance
        pooled_var = ((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2)
        se = np.sqrt(pooled_var * (1/n1 + 1/n2))
        df = n1 + n2 - 2
    else:
        # Welch's t-test
        se = np.sqrt(s1**2/n1 + s2**2/n2)
        df = (s1**2/n1 + s2**2/n2)**2 / ((s1**2/n1)**2/(n1-1) + (s2**2/n2)**2/(n2-1))
    
    # Test statistic
    t_stat = (mean1 - mean2) / se
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    
    # Effect size (Cohen's d)
    if equal_var:
        pooled_std = np.sqrt(pooled_var)
    else:
        pooled_std = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    
    cohens_d = (mean1 - mean2) / pooled_std
    
    results = {
        'test_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'effect_size': cohens_d,
        'reject_null': p_value <= alpha,
        'group1_stats': {'mean': mean1, 'std': s1, 'n': n1},
        'group2_stats': {'mean': mean2, 'std': s2, 'n': n2}
    }
    
    return results

# -----------------------------------------------------------------------------
# Template 3: Power Analysis
# -----------------------------------------------------------------------------
def power_analysis(effect_size, alpha=0.05, power=0.8):
    """
    Calculate required sample size for desired power
    """
    from scipy.stats import norm
    
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(power)
    
    n = ((z_alpha + z_beta) / effect_size) ** 2
    
    return int(np.ceil(n))

# -----------------------------------------------------------------------------
# Example Usage
# -----------------------------------------------------------------------------

# Generate sample data
np.random.seed(42)
sample_data = np.random.normal(105, 15, 25)

# Perform one-sample t-test
results = one_sample_ttest(sample_data, null_value=100, alternative='greater')

print("One-Sample t-test Results:")
print(f"Test statistic: {results['test_statistic']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Effect size (d): {results['effect_size']:.3f}")
print(f"95% CI: ({results['confidence_interval'][0]:.2f}, {results['confidence_interval'][1]:.2f})")
print(f"Reject null: {results['reject_null']}")

# Power analysis
required_n = power_analysis(effect_size=0.5, power=0.8)
print(f"\nRequired sample size for d=0.5, power=0.8: {required_n}")

Summary: Key Takeaways 🎯

🧠 Core Concepts

Hypothesis testing provides a framework for making decisions under uncertainty
P-values quantify how surprising our data would be if H₀ were true
Statistical significance ≠ practical importance - always consider effect size
Type I and II errors represent different kinds of mistakes with different costs
Power is the ability to detect true effects when they exist

🛠️ Practical Skills

Plan before you analyze - specify hypotheses and α level in advance
Check assumptions and use appropriate tests
Report effect sizes and confidence intervals, not just p-values
Consider practical significance alongside statistical significance
Be honest about limitations and acknowledge uncertainty

Resources

Thank You! 🎉

Remember: Hypothesis testing is a tool for making informed decisions under uncertainty. Use it wisely, report honestly, and always consider the practical implications of your statistical conclusions.

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” - John Tukey

Next class: Regression Analysis! 📊

🏠 Back to Main Page