Descriptive Statistics Part II

Variability, Position, Shape & Visualization Lecture

Narjes Mathlouthi

2025-06-26

Learning Objectives

By the end of this lecture, you will be able to:

  • Calculate and interpret measures of variability (range, variance, standard deviation)
  • Understand and compute measures of position (percentiles, quartiles, z-scores)
  • Assess distribution shape using skewness and kurtosis
  • Create and interpret histograms with appropriate bin widths
  • Construct and analyze boxplots for data exploration
  • Identify trends, patterns, and outliers in data visualizations
  • Apply Python for comprehensive descriptive analysis and visualization

Lecture Outline

Part I: Measures of Variability (25 min)

  • Range, Variance, Standard Deviation

  • Coefficient of Variation

  • Python Implementation

Part II: Measures of Position (20 min)

  • Percentiles and Quartiles

  • Z-scores and Standardization

Part III: Distribution Shape (10 min)

  • Skewness and Kurtosis

Part IV: Data Visualization (20 min)

  • Histograms and Bin Width Selection

  • Boxplots and Interpretation

Part V: Identifying Patterns (5 min)

Measures of Variability

25 minutes

What is Variability?

🎯 Definition: Variability (or dispersion) measures how spread out or scattered the data points are around the center.

Why Variability Matters

  • Two datasets can have the same mean but very different spreads
  • Variability indicates consistency and predictability
  • Essential for risk assessment and quality control
  • Helps determine confidence in our central tendency measures

Comparing Datasets with Same Mean

Dataset A: 98, 99, 100, 101, 102 (Mean = 100)

Dataset B: 80, 90, 100, 110, 120 (Mean = 100)

Both have the same mean (100), but Dataset B is much more variable!

This is why we need measures of variability.
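A quick check in Python (a sketch assuming NumPy; the arrays reuse the two datasets above) makes the contrast concrete:

import numpy as np

# Same mean, very different spread
a = np.array([98, 99, 100, 101, 102])
b = np.array([80, 90, 100, 110, 120])

print(np.mean(a), np.mean(b))                # 100.0 100.0
print(np.std(a, ddof=1), np.std(b, ddof=1))  # ~1.58 vs ~15.81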

Range

Range = Maximum value - Minimum value

Example

Data: 12, 15, 18, 22, 25, 30, 35

Range = 35 - 12 = 23

When to Use Range

Use range when:

  • Need a quick, simple measure of spread

  • Working with small datasets

  • Communicating to non-technical audiences

Avoid range when:

  • Outliers are present

  • Need detailed information about variability

  • Working with large datasets

Variance and Standard Deviation

15 minutes

Variance Definition

🎯 Definition: Variance measures the average squared deviation from the mean.

Population vs Sample Variance

Understanding the mathematical foundations


Side-by-Side Comparison

Population Variance

\[\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\]

\(\sigma^2\) = population variance
\(x_i\) = value of \(i^{th}\) element
\(\mu\) = population mean
\(N\) = population size

Sample Variance

\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]

\(s^2\) = sample variance
\(x_i\) = value of \(i^{th}\) element
\(\bar{x}\) = sample mean
\(n\) = sample size

Key Differences

Key Difference: Sample variance uses \((n-1)\) instead of \(N\) in the denominator

Why \((n-1)\)?

  • When we use sample mean \(\bar{x}\) to estimate population mean \(\mu\)

  • We lose one degree of freedom

  • Called Bessel’s correction

  • Makes sample variance an unbiased estimator

When to Use Each Formula

Population Variance (\(\sigma^2\))

  • You have data for the entire population

  • You know the true population mean \(\mu\)

  • Example: Test scores for all students in a small class

Sample Variance (\(s^2\))

  • You have data from a sample only
  • Want to estimate population variance
  • Example: Survey responses from 100 people out of 10,000

Degrees of Freedom Concept

🎯 Definition: Degrees of Freedom = Number of independent pieces of information available for estimating a parameter

Understanding Degrees of Freedom

📊 Population Case

All observations are independent

  • We know the true population mean \(\mu\)
  • Each of the \(N\) observations provides independent information
  • No constraints on the data

\[\text{Degrees of Freedom} = N\]

📈 Sample Case

Constraint introduced by sample mean

  • We must estimate \(\mu\) using \(\bar{x}\)
  • Once we know \(\bar{x}\) and \((n-1)\) observations, the last one is determined
  • We “lose” one degree of freedom

\[\text{Degrees of Freedom} = n-1\]

Why Does This Matter?

Key Insight: Using \(\bar{x}\) instead of \(\mu\) creates dependency among observations

🧠 Intuitive Explanation

Imagine you have 5 numbers that must average to 10:

If the mean is fixed at 10:

  • Choose any 4 numbers freely: 8, 12, 7, 15

  • The 5th number is forced: \(5 \times 10 - (8+12+7+15) = 8\)

  • Only 4 degrees of freedom, not 5!

This is exactly what happens with sample variance

Sample Mean Constraint

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Rearranging: \[x_n = n\bar{x} - (x_1 + x_2 + \cdots + x_{n-1})\]

Once we know \(\bar{x}\) and the first \((n-1)\) values, \(x_n\) is completely determined!
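A two-line sketch (reusing the numbers from the intuitive example above) shows the constraint in action:

# If the mean and the first n-1 values are known, the last value is forced
known = [8, 12, 7, 15]           # first n-1 values, chosen freely
n, x_bar = len(known) + 1, 10    # sample size and the fixed mean
x_last = n * x_bar - sum(known)  # 5*10 - 42 = 8
print(x_last)                    # 8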

Degrees of Freedom in Other Contexts

📚 General Pattern

Degrees of Freedom = Sample Size - Number of Parameters Estimated

Examples:

  • Sample mean: \(df = n - 0 = n\) (no parameters estimated from data)

  • Sample variance: \(df = n - 1\) (estimate \(\mu\) with \(\bar{x}\))

  • Linear regression: \(df = n - p - 1\) (estimate \(p\) coefficients + intercept)

  • Chi-square test: \(df = (rows-1) \times (cols-1)\)

Common Questions

❓ “Why can’t we just use the population mean?”

Answer: In practice, we rarely know \(\mu\). We’re estimating it from the same data we’re using to calculate variance.

❓ “When does the difference matter most?”

Answer: Small samples! The correction becomes negligible as \(n\) increases.

❓ “What if I know the true population mean?”

Answer: Then you can use \(n\) in the denominator: \(\frac{\sum(x_i - \mu)^2}{n}\)

Example Calculation

Data: 3, 7, 2, 8, 5

If this is the entire population:

  • \(\mu = \frac{3+7+2+8+5}{5} = 5\)

  • \(\sigma^2 = \frac{(3-5)^2+(7-5)^2+(2-5)^2+(8-5)^2+(5-5)^2}{5} = \frac{26}{5} = 5.2\)

If this is a sample:

  • \(\bar{x} = 5\) (same calculation)

  • \(s^2 = \frac{26}{5-1} = \frac{26}{4} = 6.5\)
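To verify in Python (a sketch assuming NumPy; the ddof argument controls the denominator):

import numpy as np

data = [3, 7, 2, 8, 5]
print(np.var(data))          # ddof=0, population variance: 5.2
print(np.var(data, ddof=1))  # ddof=1, sample variance: 6.5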

Variance Properties

Variance measures:

  • Average squared deviation from the mean

  • Always non-negative: \(\sigma^2 \geq 0\), \(s^2 \geq 0\)

  • Units: (original units)²

Standard Deviation:

  • \(\sigma = \sqrt{\sigma^2}\) (population)

  • \(s = \sqrt{s^2}\) (sample)

  • Same units as original data

Bias and Unbiasedness

Population variance:

  • True parameter value

  • No estimation involved

Sample variance with \((n-1)\):

  • \(E[s^2] = \sigma^2\) (unbiased)

  • On average, equals population variance

Sample variance with \(n\):

  • \(E\left[\frac{\sum(x_i - \bar{x})^2}{n}\right] = \frac{n-1}{n}\sigma^2\) (biased)

  • Systematically underestimates population variance
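A small simulation shows the bias directly (a minimal sketch; the population with SD 2, hence variance 4.0, is an arbitrary choice for the demo):

import numpy as np

rng = np.random.default_rng(0)

# Many small samples (n = 5) from a population with variance 4.0
samples = rng.normal(0, 2, size=(100_000, 5))
print(np.var(samples, axis=1, ddof=0).mean())  # ~3.2 = (4/5) * 4.0, biased
print(np.var(samples, axis=1, ddof=1).mean())  # ~4.0, unbiased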


Implementation

Calculators:

  • Most use \((n-1)\) by default for sample standard deviation

  • Check your calculator’s documentation

Software:

  • R: var() and sd() both use \((n-1)\)

  • Excel: VAR.S() uses \((n-1)\), VAR.P() uses \(n\)

  • Python: np.var() defaults to \(n\) (ddof=0); np.var(data, ddof=1) uses \((n-1)\)

Practice Problem

Dataset: Number of hours studied by 6 students: 2, 4, 3, 5, 6, 4

Calculate both:

  1. Population variance (assuming this is the entire population)

  2. Sample variance (assuming this is a sample)

Solution:

  • Mean: \(\bar{x} = \frac{24}{6} = 4\)

  • Sum of squared deviations: \((2-4)^2+(4-4)^2+(3-4)^2+(5-4)^2+(6-4)^2+(4-4)^2 = 10\)

  • Population variance: \(\sigma^2 = \frac{10}{6} \approx 1.67\)

  • Sample variance: \(s^2 = \frac{10}{5} = 2.0\)
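Python's standard library statistics module computes both directly, as a quick check:

import statistics

hours = [2, 4, 3, 5, 6, 4]
print(statistics.pvariance(hours))  # population variance: 1.666...
print(statistics.variance(hours))   # sample variance: 2.0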

Summary

Key Takeaways

  • Population variance uses \(N\) (entire population)
  • Sample variance uses \((n-1)\) (Bessel’s correction)
  • Sample variance is unbiased estimator of population variance
  • Difference matters more for small samples
  • Always check which formula your software uses!

Standard Deviation Definition

🎯 Definition: Standard Deviation is the square root of variance.

Sample Standard Deviation

\[s = \sqrt{s^2} = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}\]

Population Standard Deviation

\[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum(x_i - \mu)^2}{N}}\]

Example

📊 Complete Population Data (Test Scores)

We have test scores from 100 students arranged in a grid:

A B C D E F G H I J K L M N O P Q R S T
1 24 96 30 69 85 60 55 18 30 66 64 99 92 95 84 55 72 38 86 32
2 53 81 30 89 42 94 31 26 53 78 38 60 93 90 82 85 89 54 30 58
3 62 67 75 47 99 25 32 63 49 45 30 97 57 32 37 62 33 16 11 41
4 95 74 28 73 82 97 65 88 56 95 85 44 70 65 34 85 58 15 64 84
5 76 46 83 56 98 16 76 77 35 19 97 42 90 79 73 28 82 92 90 22

🎯 Random Sample Selection

We randomly select 5 scores from different positions in our population:

Our Sample: 82, 95, 83, 60, 92

Detailed Calculation

Sample Size: n = 5

These 5 values will be used for our standard deviation calculation

Sample Mean Calculation

\[\bar{x} = \frac{82 + 95 + 83 + 60 + 92}{5} = \frac{412}{5} = 82.4\]

| \(x_i\) | \(x_i - \bar{x}\)    | \((x_i - \bar{x})^2\)              |
|---------|----------------------|------------------------------------|
| 82      | \(82 - 82.4 = -0.4\) | \((-0.4)^2 = 0.16\)                |
| 95      | \(95 - 82.4 = 12.6\) | \((12.6)^2 = 158.76\)              |
| 83      | \(83 - 82.4 = 0.6\)  | \((0.6)^2 = 0.36\)                 |
| 60      | \(60 - 82.4 = -22.4\)| \((-22.4)^2 = 501.76\)             |
| 92      | \(92 - 82.4 = 9.6\)  | \((9.6)^2 = 92.16\)                |
| Sum     |                      | \(\sum (x_i - \bar{x})^2 = 753.2\) |

Final Calculations

Sample Variance

\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\] \[s^2 = \frac{753.2}{5-1} = \frac{753.2}{4} = 188.3\]

Sample Standard Deviation

\[s = \sqrt{s^2}\] \[s = \sqrt{188.3} = 13.72\]
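One line of NumPy (a sketch) confirms the hand calculation:

import numpy as np

sample = [82, 95, 83, 60, 92]
print(np.std(sample, ddof=1))  # ~13.72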

Properties of Standard Deviation

  • Same units as the original data
  • Always non-negative
  • Zero only when all values are identical
  • Larger values indicate more variability
  • Approximately 68% of data within 1 SD of mean (for normal distributions)
  • Approximately 95% of data within 2 SD of mean

Empirical Rule (68-95-99.7 Rule)

For approximately normal distributions:

  • 68% of data falls within 1 standard deviation of the mean
  • 95% of data falls within 2 standard deviations of the mean
  • 99.7% of data falls within 3 standard deviations of the mean

This rule helps us understand what constitutes “typical” vs “unusual” values.
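You can check the rule empirically with simulated normal data (a sketch; the mean of 100 and SD of 15 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(100, 15, 100_000)
m, s = x.mean(), x.std()
for k in (1, 2, 3):
    # Fraction of values within k standard deviations of the mean
    print(k, np.mean(np.abs(x - m) <= k * s))  # ~0.683, ~0.954, ~0.997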

Coefficient of Variation

5 minutes

Definition and Purpose

🎯 Definition: Coefficient of Variation (CV) = \(\frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%\)

Why Use CV?

  • Compares variability across different units or scales
  • Relative measure of variability
  • Useful when means differ substantially

Example: Comparing Variability

Stock A: Mean return = $50, SD = $10, CV = 20%

Stock B: Mean return = $500, SD = $50, CV = 10%

Important

Stock B has higher absolute variability ($50 vs $10) but lower relative variability (10% vs 20%)

Stock B is relatively less risky per dollar invested.

Python Implementation - Variability

import numpy as np
import pandas as pd

# Sample data
data = [10, 12, 14, 16, 18, 22, 25]

# Calculate measures of variability
range_val = np.max(data) - np.min(data)
variance_sample = np.var(data, ddof=1)  # Sample variance
std_sample = np.std(data, ddof=1)       # Sample standard deviation
cv = (std_sample / np.mean(data)) * 100

print(f"Range: {range_val}")
print(f"Variance: {variance_sample:.2f}")
print(f"Standard Deviation: {std_sample:.2f}")
print(f"Coefficient of Variation: {cv:.1f}%")
Range: 15
Variance: 28.90
Standard Deviation: 5.38
Coefficient of Variation: 32.2%

Measures of Position

20 minutes

What are Measures of Position?

Measures of position tell us where a particular value stands relative to the rest of the data.

They answer questions like:

  • “What percentage of students scored below 85?”

  • “Is this value typical or unusual?”

  • “How does this observation compare to others?”

Percentiles and Quartiles

10 minutes

Percentiles Definition

The k-th percentile is the value below which k% of the data falls.

Examples:

  • 50th percentile = Median (50% of data below this value)
  • 90th percentile = 90% of data falls below this value
  • 25th percentile = 25% of data falls below this value

Quartiles

Quartiles divide the data into four equal parts:

  • Q1 (First Quartile) = 25th percentile
  • Q2 (Second Quartile) = 50th percentile = Median
  • Q3 (Third Quartile) = 75th percentile

Interquartile Range (IQR)

IQR = Q3 - Q1

Properties of IQR:

  • Contains the middle 50% of the data
  • Resistant to outliers
  • Used in boxplot construction
  • Useful for outlier detection
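The standard fence-based check flags values beyond 1.5 × IQR from the quartiles. A minimal sketch (the value 120 is added here purely to create an outlier):

import numpy as np

data = np.array([12, 15, 18, 22, 25, 28, 30, 35, 40, 120])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [120]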

Five-Number Summary

The five-number summary provides a complete picture of data distribution:

  1. Minimum
  2. Q1 (25th percentile)
  3. Median (50th percentile)
  4. Q3 (75th percentile)
  5. Maximum

This forms the basis for boxplots!

Z-scores and Standardization

10 minutes

Z-score Definition

🎯 Definition

Z-score tells us how many standard deviations a value is from the mean.

\[z = \frac{x - \mu}{\sigma} \text{ or } z = \frac{x - \bar{x}}{s}\]

Interpreting Z-scores

  • z = 0: Value equals the mean
  • z = 1: Value is 1 standard deviation above the mean
  • z = -2: Value is 2 standard deviations below the mean
  • |z| > 2: Often considered “unusual” (outside the central ~95% of a normal distribution)
  • |z| > 3: Very unusual (outside the central ~99.7%)

Z-score Example

Student’s test score: 85. Class mean: 78, class standard deviation: 6.

\[z = \frac{85 - 78}{6} = \frac{7}{6} = 1.17\]

Interpretation: This student scored 1.17 standard deviations above the class average.
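In code this is a one-liner:

# z-score for a score of 85 in a class with mean 78 and SD 6
z = (85 - 78) / 6
print(round(z, 2))  # 1.17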

Benefits of Standardization

  • Compare across different scales (test scores vs income)
  • Identify outliers systematically
  • Combine different variables meaningfully
  • Prepare data for certain statistical methods

Distribution Shape

10 minutes

Skewness

Skewness measures the asymmetry of a distribution.

Types of Skewness:

  • Symmetric (Skewness ≈ 0): Mean ≈ Median ≈ Mode
  • Right-skewed (Positive skewness): Mean > Median, long tail to the right
  • Left-skewed (Negative skewness): Mean < Median, long tail to the left

Examples of Skewness

Right-skewed (Income data):

  • Most people earn moderate amounts

  • Few people earn very high amounts

  • Mean > Median

Left-skewed (Test scores with ceiling effect):

  • Most students score high

  • Few students score very low

  • Mean < Median

Kurtosis

Kurtosis measures the “tailedness” of a distribution: the degree of peakedness or flatness compared to the normal distribution.

Types:

  • Mesokurtic (Normal-like): Kurtosis ≈ 3
  • Leptokurtic (Heavy tails): Kurtosis > 3, more peaked
  • Platykurtic (Light tails): Kurtosis < 3, flatter

Excess Kurtosis = Kurtosis - 3 (makes normal distributions have excess kurtosis of 0)
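Note that scipy.stats.kurtosis returns excess kurtosis by default (the Fisher definition). A quick sketch of the two conventions:

from scipy import stats
import numpy as np

x = np.random.default_rng(1).normal(size=100_000)
print(stats.kurtosis(x))                # excess kurtosis, ~0 for normal data
print(stats.kurtosis(x, fisher=False))  # Pearson definition, ~3 for normal data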

Python Implementation - Position & Shape

import numpy as np
import pandas as pd
from scipy import stats

data = [12, 15, 18, 22, 25, 28, 30, 35, 40, 45]

# Percentiles and quartiles
q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Z-scores
z_scores = stats.zscore(data)

# Shape measures
skewness = stats.skew(data)
kurt = stats.kurtosis(data)  # excess kurtosis (normal distribution = 0) by default

print(f"Q1: {q1}, Median: {median}, Q3: {q3}")
print(f"IQR: {iqr}")
print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurt:.3f}")
Q1: 19.0, Median: 26.5, Q3: 33.75
IQR: 14.75
Skewness: 0.243
Kurtosis: -1.023

Data Visualization

20 minutes

Histograms and Bin Width Selection

10 minutes

What is a Histogram?

A histogram displays the distribution of a continuous variable by dividing data into bins and showing the frequency of observations in each bin.

Key Components:

  • X-axis: Variable values (continuous)
  • Y-axis: Frequency or density
  • Bins: Intervals that group the data
  • Bars: Height represents frequency in each bin

Choosing Bin Width: Critical Decision

Bin width dramatically affects histogram interpretation!

Too Few Bins (Wide bins):

  • Oversmoothing - lose important details
  • May hide multimodality
  • Distribution appears simpler than it is

Too Many Bins (Narrow bins):

  • Undersmoothing - too much noise
  • May create artificial gaps
  • Hard to see overall pattern

Bin Width Guidelines

Rule of Thumb Methods:

  1. Square Root Rule: Number of bins ≈ \(\sqrt{n}\)
  2. Sturges’ Rule: Number of bins = \(1 + \log_2(n)\)
  3. Scott’s Rule: Bin width = \(\frac{3.5 \times \text{SD}}{n^{1/3}}\)
  4. Freedman-Diaconis Rule: Bin width = \(\frac{2 \times \text{IQR}}{n^{1/3}}\)

Best practice: Try multiple bin widths and choose based on the story your data tells!
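NumPy implements several of these rules directly via np.histogram_bin_edges (a sketch; the simulated data mirrors the example on the next slide):

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

for rule in ("sqrt", "sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(data, bins=rule)
    print(rule, len(edges) - 1)  # number of bins each rule chooses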

Python Histogram Examples

import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)

# Same data, three bin choices: too few, reasonable, too many
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, bins in zip(axes, [5, 15, 50]):
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} bins")
plt.show()

Interpreting Histograms

What to Look For:

  1. Shape: Normal, skewed, uniform, bimodal?
  2. Center: Where is the “typical” value?
  3. Spread: How variable is the data?
  4. Outliers: Any unusual values?
  5. Gaps: Are there missing values in certain ranges?
  6. Multiple peaks: Suggests multiple subgroups

Boxplots and Interpretation

10 minutes

Anatomy of a Boxplot

Boxplot Components Explained

The Box:

  • Left edge: Q1 (25th percentile)
  • Middle line: Median (Q2, 50th percentile)
  • Right edge: Q3 (75th percentile)
  • Box width: IQR (contains middle 50% of data)

The Whiskers:

  • Extend to: Most extreme values within 1.5 × IQR from box edges
  • Lower whisker: Minimum value within Q1 - 1.5×IQR
  • Upper whisker: Maximum value within Q3 + 1.5×IQR

Outliers:

  • Points beyond whiskers: Values > Q3 + 1.5×IQR or < Q1 - 1.5×IQR
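A minimal boxplot sketch with matplotlib (whis=1.5 matches the default fence multiplier described above; the simulated data is illustrative):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 10, 200)

# Whiskers extend to the last data point within 1.5 * IQR of the box
plt.boxplot(data, vert=False, whis=1.5)
plt.xlabel("Value")
plt.show()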

What Boxplots Tell Us

Distribution Shape:

  • Symmetric: Median in center of box, whiskers equal length
  • Right-skewed: Median closer to Q1, longer upper whisker
  • Left-skewed: Median closer to Q3, longer lower whisker

Variability:

  • Wide box: High variability in middle 50%
  • Long whiskers: High overall variability
  • Many outliers: Extreme variability

Comparing Groups with Boxplots

Advanced Boxplot Interpretations

Comparing Boxplots:

  • Median differences: Which group has higher typical values?
  • IQR differences: Which group is more consistent?
  • Outlier patterns: Which group has more extreme values?
  • Overlap: Do the groups have similar ranges?

Business Applications:

  • Quality control: Compare product batches
  • Performance analysis: Compare team/department performance
  • Customer segmentation: Compare customer groups
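Side-by-side boxplots make these comparisons immediate. A sketch (the three simulated segments and their parameters are invented for illustration):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
# Three groups with different centers and spreads
groups = [rng.normal(mu, sd, 150) for mu, sd in [(50, 5), (55, 10), (48, 3)]]

plt.boxplot(groups)
plt.xticks([1, 2, 3], ["Segment A", "Segment B", "Segment C"])
plt.ylabel("Value")
plt.show()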

Common Patterns in Data

Distribution Patterns:

  1. Normal/Bell-shaped: Symmetric, single peak
  2. Uniform: All values equally likely
  3. Bimodal: Two distinct peaks (suggests subgroups)
  4. Multimodal: Multiple peaks
  5. U-shaped: High values at extremes, low in middle

Outlier Patterns:

  • Individual outliers: Data entry errors, measurement errors
  • Clustered outliers: Distinct subpopulation
  • Systematic outliers: May indicate process changes

Red Flags in Data Visualization

Warning Signs:

  1. Gaps in histograms: Missing data or measurement limitations
  2. Heaping: Values cluster at round numbers (10, 50, 100)
  3. Truncation: Data cut off at certain values
  4. Digit preference: People prefer certain ending digits
  5. Multiple modes: Hidden subgroups in your data

Key Takeaways

Essential Concepts to Remember

Variability:

  • Standard deviation is preferred over range for most analyses
  • CV allows comparison across different scales
  • IQR is resistant to outliers

Position:

  • Percentiles and quartiles provide relative position
  • Z-scores standardize across different distributions
  • Five-number summary gives complete overview

Visualization:

  • Bin width choice is critical for histogram interpretation
  • Boxplots excel at comparing groups and identifying outliers
  • Multiple visualizations provide different insights

Practical Guidelines

  1. Always visualize before calculating statistics
  2. Use multiple measures - no single statistic tells the whole story
  3. Consider the context - what makes sense for your data?
  4. Check for outliers - they can drastically affect your analysis
  5. Compare distributions using standardized measures when appropriate

Practice Problems

Try These Exercises

  1. Calculate the five-number summary for: 12, 15, 18, 22, 25, 28, 30, 35, 40, 45

  2. Create histograms with 5, 15, and 50 bins for the same dataset. What patterns do you observe?

  3. Interpret this scenario: Dataset A has mean=50, SD=5. Dataset B has mean=100, SD=10. Which is more variable?

  4. A student scores 85 on a test where the class mean is 78 and SD is 6. Calculate and interpret the z-score.

  5. Design a boxplot comparison for three different customer segments in your business.


Questions?

Thank you for your attention!

Remember: The goal of descriptive statistics is to understand your data story - let the visualizations guide your insights!