Understanding Data (PSTAT 5A) – Week 1
2025-06-25
By the end of this lecture you should be able to:
Define descriptive statistics and explain its importance in data analysis (Section 1)
Distinguish between different types of data and measurement scales (Section 2)
Calculate and interpret measures of central tendency (mean, median, mode) (Section 3)
Understand when to use each measure of central tendency
Apply Python to compute descriptive statistics (Section 9)
Interpret basic descriptive statistics in real-world contexts (Section 10)
10 minutes
Facts are stubborn, but statistics are more pliable.
Mark Twain
Descriptive statistics are numerical and graphical methods used to summarize, organize, and describe data in a meaningful way.
Descriptive Statistics | Inferential Statistics |
---|---|
Describes what the data shows | Makes predictions about populations |
Summarizes sample data | Uses sample data to make generalizations |
No conclusions beyond the data | Draws conclusions beyond the immediate data |
Examples: mean, median, graphs | Examples: hypothesis testing, confidence intervals |
15 minutes
Understanding data types is crucial because:
graph TD
A[Variable Types]
A --> B[Categorical]
A --> C[Numerical]
B --> B1["Nominal<br/>e.g., gender, major"]
B --> B2["Ordinal<br/>e.g., rating, education level"]
C --> C1["Discrete<br/>e.g., count (# of students)"]
C --> C2["Continuous<br/>e.g., measurements like height, weight"]
classDef root fill:#e1f5fe,stroke:#01579b,stroke-width:3px
classDef categorical fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef numerical fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
classDef nominal fill:#fce4ec,stroke:#880e4f,stroke-width:2px
classDef ordinal fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef discrete fill:#e0f2f1,stroke:#00695c,stroke-width:2px
classDef continuous fill:#f1f8e9,stroke:#33691e,stroke-width:2px
class A root
class B categorical
class C numerical
class B1 nominal
class B2 ordinal
class C1 discrete
class C2 continuous
Scale | Type | Properties | Examples | Appropriate Statistics |
---|---|---|---|---|
Nominal | Categorical | Categories only | Gender, Color | Mode, Frequency |
Ordinal | Categorical | Order matters | Rankings, Grades | Mode, Median |
Interval | Numerical | Equal intervals, no true zero | Temperature (°C) | Mean, Median, Mode |
Ratio | Numerical | Equal intervals, true zero | Height, Weight, Income | All statistics |
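The table above can be read as a lookup from measurement scale to suitable summary measures. A minimal sketch of that idea follows, restricted to the three measures of central tendency covered in this lecture. The names `APPROPRIATE_MEASURES` and `appropriate_measures` are illustrative and not part of any library.

```python
# Central-tendency measures per measurement scale, transcribed from the
# table above (only mean, median, and mode are considered here).
APPROPRIATE_MEASURES = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}


def appropriate_measures(scale: str) -> list[str]:
    """Return the central-tendency measures that suit a measurement scale."""
    key = scale.strip().lower()
    if key not in APPROPRIATE_MEASURES:
        raise ValueError(f"Unknown measurement scale: {scale!r}")
    return APPROPRIATE_MEASURES[key]


print(appropriate_measures("Ordinal"))  # ['mode', 'median']
print(appropriate_measures("Ratio"))    # ['mode', 'median', 'mean']
```

Each scale keeps every measure allowed for the scale before it, which is why the lists only grow from nominal to ratio.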
35 minutes
%%{init: {
"flowchart": { "nodeSpacing": 300, "rankSpacing": 300 },
"themeVariables": {
"fontSize": "50px",
"fontFamily": "Arial"
},
"width": 1200,
"height": 900
}}%%
graph TD
A["Measures of Central Tendency"]
A --> B["Mean (Average)"]
A --> C["Median (Middle Value)"]
A --> D["Mode (Most Frequent)"]
B --> B1["Sum of all values ÷<br/>Number of values"]
B --> B2["Best for:<br/>Symmetric distributions"]
B --> B3["Affected by:<br/>Outliers"]
C --> C1["Middle value when<br/>data is ordered"]
C --> C2["Best for:<br/>Skewed distributions"]
C --> C3["Resistant to:<br/>Outliers"]
D --> D1["Value that appears<br/>most frequently"]
D --> D2["Best for:<br/>Categorical data"]
D --> D3["Can have: Multiple modes<br/>or no mode"]
classDef main fill:#e3f2fd,stroke:#0d47a1,stroke-width:4px
classDef mean fill:#fff3e0,stroke:#e65100,stroke-width:4px
classDef median fill:#e8f5e8,stroke:#2e7d32,stroke-width:4px
classDef mode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:4px
classDef details fill:#fafafa,stroke:#616161,stroke-width:4px
class A main
class B mean
class C median
class D mode
class B1,B2,B3,C1,C2,C3,D1,D2,D3 details
Central tendency describes the center or typical value of a dataset.
It answers the question: “What is a representative value for this data?”
12 minutes
🎯 Definition: The mean is the sum of all values divided by the number of values.
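\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Where:
\bar{x} = the mean
x_i = the i-th value in the dataset
n = the number of values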
Student test scores: 85, 90, 78, 92, 88
Mean = \frac{85 + 90 + 78 + 92 + 88}{5} = \frac{433}{5} = 86.6
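The arithmetic for the test scores can be checked in a few lines of Python. This sketch uses only the standard library (`statistics.mean`), and the variable names are illustrative.

```python
import statistics

scores = [85, 90, 78, 92, 88]

total = sum(scores)       # 85 + 90 + 78 + 92 + 88 = 433
n = len(scores)           # 5 values
mean_manual = total / n   # 433 / 5

# The standard library gives the same result
mean_stdlib = statistics.mean(scores)

print(f"Sum = {total}, n = {n}, mean = {mean_manual}")  # Sum = 433, n = 5, mean = 86.6
```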
✅ Use the mean when:
12 minutes
🎯 Definition: The median is the middle value when data is arranged in ascending or descending order.
Data: 12, 15, 18, 20, 25
What is the median here?
Median = 18 (middle value)
Data: 10, 15, 20, 25, 30, 35
What is the median here?
Median = \frac{20 + 25}{2} = 22.5
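Both cases can be verified with `statistics.median` from the Python standard library. The sketch below also prints the two middle values that are averaged in the even case. The variable names are illustrative.

```python
import statistics

odd_data = [12, 15, 18, 20, 25]        # n = 5, one middle value
even_data = [10, 15, 20, 25, 30, 35]   # n = 6, two middle values

median_odd = statistics.median(odd_data)
median_even = statistics.median(even_data)

# For an even count, the median is the average of the two middle values
mid = len(even_data) // 2
middle_pair = sorted(even_data)[mid - 1 : mid + 1]

print(median_odd)    # 18
print(middle_pair)   # [20, 25]
print(median_even)   # 22.5
```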
✅ Use the median when:
Consider household incomes: $30,000, $32,000, $35,000, $38,000, $2,000,000
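To see how much a single extreme value matters, here is a small sketch comparing the mean and the median of these five incomes. It uses only the standard library, and the variable names are illustrative.

```python
import statistics

incomes = [30_000, 32_000, 35_000, 38_000, 2_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the $2,000,000 value
median_income = statistics.median(incomes)  # middle value of the sorted list

print(f"Mean income:   ${mean_income:,.0f}")    # Mean income:   $427,000
print(f"Median income: ${median_income:,.0f}")  # Median income: $35,000

# Count the households that earn less than the mean
below_mean = sum(income < mean_income for income in incomes)
print(f"{below_mean} of {len(incomes)} households earn less than the mean")
```

Four of the five households earn far less than the mean of $427,000, while the median of $35,000 sits in the middle of the typical incomes.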
11 minutes
🎯 Definition: The mode is the value that appears most frequently in a dataset.
%%{init: {'flowchart': {'nodeSpacing': 100, 'rankSpacing': 100}, 'width': 600, 'height': 400}}%%
graph TD
A["Mode Types"]
A --> B["Unimodal<br/>One peak"]
A --> C["Bimodal<br/>Two peaks"]
A --> D["Multimodal<br/>Multiple peaks"]
A --> E["No Mode<br/>No repeated values"]
classDef main fill:#e1f5fe,stroke:#01579b,stroke-width:3px,font-size:16px
classDef types fill:#f5f5f5,stroke:#424242,stroke-width:2px,font-size:14px
class A main
class B,C,D,E types
Data: 2, 3, 3, 4, 5, 5, 5, 6, 7
Analysis:
Data: 1, 2, 2, 3, 4, 4, 5
Analysis:
Data: 1, 2, 3, 4, 5
Analysis:
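The three datasets above can be analyzed with a short counting routine. This is a sketch under two assumptions: the input is non-empty, and a dataset in which every value occurs exactly once is treated as having no mode. The helper names `find_modes` and `classify` are illustrative.

```python
from collections import Counter


def find_modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


def classify(values):
    """Return (modes, label) using the mode types from the diagram above."""
    modes = find_modes(values)
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return modes, labels.get(len(modes), "multimodal")


print(classify([2, 3, 3, 4, 5, 5, 5, 6, 7]))  # ([5], 'unimodal')
print(classify([1, 2, 2, 3, 4, 4, 5]))        # ([2, 4], 'bimodal')
print(classify([1, 2, 3, 4, 5]))              # ([], 'no mode')
```

In the first dataset, 5 appears three times. In the second, 2 and 4 each appear twice. In the third, no value repeats.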
Favorite colors: Red, Blue, Blue, Green, Blue, Red, Blue
Mode = Blue
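For categorical data, the mode comes straight from a frequency table. A minimal sketch using `collections.Counter` on the colors listed above:

```python
from collections import Counter

colors = ["Red", "Blue", "Blue", "Green", "Blue", "Red", "Blue"]

# Frequency table: how often each category occurs
frequencies = Counter(colors)

# most_common(1) returns a list holding the single most frequent (value, count) pair
mode_color, mode_count = frequencies.most_common(1)[0]

print(dict(frequencies))       # {'Red': 2, 'Blue': 4, 'Green': 1}
print(mode_color, mode_count)  # Blue 4
```

The mean and median cannot be computed here because the categories have no numeric value or order.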
For continuous data, finding the mode often requires grouping values into intervals or bins.
Example: Heights grouped into ranges
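The sketch below illustrates why binning helps. The heights are made-up values chosen for illustration, and the 5 cm bin width is an assumption. No height repeats exactly, yet one interval clearly contains the most observations.

```python
from collections import Counter

# Hypothetical heights in centimetres (illustrative values, not real data)
heights_cm = [158.2, 161.5, 162.2, 163.0, 164.8,
              166.1, 167.4, 168.9, 171.3, 176.0]
BIN_WIDTH = 5  # assumed bin width in cm


def bin_label(value, width=BIN_WIDTH):
    """Label the interval [lower, lower + width) that contains value."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width}"


# No raw value repeats, so the raw data has no mode
raw_has_mode = max(Counter(heights_cm).values()) > 1

# Count observations per bin and pick the fullest bin (the modal class)
bin_counts = Counter(bin_label(h) for h in heights_cm)
modal_bin, modal_count = bin_counts.most_common(1)[0]

print(raw_has_mode)            # False
print(modal_bin, modal_count)  # 160-165 4
```

A different bin width can change which interval is the modal class, so the width should be reported along with the result.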
✅ Use the mode when:
Measure | Best for Data Type | Strengths | Weaknesses | Affected by Outliers? |
---|---|---|---|---|
Mean | Interval, Ratio | Uses all data, mathematically tractable | Sensitive to outliers | Yes |
Median | Ordinal, Interval, Ratio | Robust to outliers, represents middle | Ignores extreme values | No |
Mode | All types | Works with categorical, identifies most common | May not exist/be unique | No |
15 minutes
Library | Purpose | Key Functions |
---|---|---|
NumPy | Numerical computing | np.mean() , np.median() , np.std() |
Pandas | Data manipulation | df.describe() , df.mean() , df.median() |
SciPy | Scientific computing | stats.mode() , stats.describe() |
Matplotlib | Basic plotting | plt.plot() , plt.hist() , plt.boxplot() |
Seaborn | Statistical viz | sns.histplot() , sns.boxplot() |
# Import numpy library for numeric operations
import numpy as np
# Import pandas library for data structures
import pandas as pd
# Define sample data as a list of test scores
data = [85, 90, 78, 92, 88, 91, 85, 87, 89, 86]
# Compute the arithmetic mean using NumPy
mean_np = np.mean(data)
print(f"Mean (NumPy): {mean_np:.2f}")
# Convert the data list into a pandas DataFrame
df = pd.DataFrame({'scores': data})
# Compute the arithmetic mean using Pandas
mean_pd = df['scores'].mean()
print(f"Mean (Pandas): {mean_pd:.2f}")
# Manually sum all scores and divide by count
manual_mean = sum(data) / len(data)
print(f"Mean (Manual): {manual_mean:.2f}")
Mean (NumPy): 87.10
Mean (Pandas): 87.10
Mean (Manual): 87.10
# Compute the median value using NumPy
median_np = np.median(data)
print(f"Median (NumPy): {median_np:.2f}")
# Compute the median value using Pandas
median_pd = df['scores'].median()
print(f"Median (Pandas): {median_pd:.2f}")
# Sort the data list in ascending order
sorted_data = sorted(data)
# Determine the number of elements
n = len(sorted_data)
# If even count, average the two middle elements; else take middle element
if n % 2 == 0:
    manual_median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    manual_median = sorted_data[n//2]
print(f"Median (Manual): {manual_median:.2f}")
Median (NumPy): 87.50
Median (Pandas): 87.50
Median (Manual): 87.50
# Import SciPy stats module to compute mode
from scipy import stats
# Use SciPy to find the most common value and its count
mode_result = stats.mode(data, keepdims=True)
print(f"Mode (SciPy): {mode_result.mode[0]}, Count: {mode_result.count[0]}")
# Use Pandas to get mode(s) from the DataFrame
mode_pd = df['scores'].mode()
print(f"Mode (Pandas): {mode_pd.values}")
# Import Counter for manual frequency counting
from collections import Counter
# Count occurrences of each score
counter = Counter(data)
# Find the highest frequency
max_count = max(counter.values())
# Identify all values that appear with that frequency
modes = [k for k, v in counter.items() if v == max_count]
print(f"Mode (Manual): {modes}, Count: {max_count}")
Mode (SciPy): 85, Count: 2
Mode (Pandas): [85]
Mode (Manual): [85], Count: 2
# import libraries
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
# define your data and compute statistics here
data = [85, 90, 78, 92, 88, 91, 85, 87, 89, 86]
mean_np = np.mean(data)
median_np = np.median(data)
# for mode, use Counter
counter = Counter(data)
max_count = max(counter.values())
modes = [k for k,v in counter.items() if v==max_count]
mode_val = modes[0]
5 minutes
Descriptive Statistics Part II will cover:
Calculate mean, median, and mode for: 12, 15, 18, 12, 20, 25, 12, 30
A dataset has mean = 50 and median = 45. What does this tell you about the distribution?
Why might median be preferred over mean for reporting household income?
Create a Python function to identify the most appropriate measure of central tendency for a given dataset.
Thank you for your attention!
Understanding Data - Descriptive Statistics I © 2025 Narjes Mathlouthi