Commit d4b34ca

travisjneuman and claude committed
feat: add Module 07 — Data Analysis curriculum
Five projects covering pandas basics, filtering/grouping, data cleaning, matplotlib visualization, and a complete analysis report pipeline. Includes realistic CSV data files (students, messy sales, transactions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1ee375f commit d4b34ca

23 files changed

Lines changed: 2006 additions & 0 deletions


Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# Module 07 / Project 01 — Pandas Basics

[README](../../../../README.md) · [Module Index](../README.md)

## Focus

- Loading CSV data with `pd.read_csv()`
- Exploring a DataFrame: `head()`, `tail()`, `shape`, `dtypes`, `info()`, `describe()`
- Selecting columns by name
- Sorting rows with `sort_values()`

## Why this project exists

Before you can analyze data, you need to know how to load it and look at it. This project teaches you how to get a CSV file into a pandas DataFrame and use built-in methods to understand what the data looks like: how many rows there are, what columns exist, what types the values are, and what the basic statistics tell you. These exploration steps are usually the first thing a data analyst does with a new data set.

## Run

```bash
cd projects/modules/07-data-analysis/01-pandas-basics
python project.py
```

## Expected output

```text
=== Loading student data ===
Loaded 30 rows and 4 columns from data/students.csv

=== First 5 rows (head) ===
            name  subject  grade  age
0     Alice Chen     Math     92   17
1   Bob Martinez  Science     78   16
2  Carol Johnson  English     85   17
3      David Kim     Math     67   16
4      Eva Patel  Science     91   18

=== Shape ===
Rows: 30, Columns: 4

=== Column types (dtypes) ===
name       object
subject    object
grade       int64
age         int64
dtype: object

=== Info ===
...

=== Summary statistics (describe) ===
           grade        age
count  30.000000  30.000000
mean   80.266667  16.866667
...

=== Selecting just name and grade columns ===
(first 5 rows)
            name  grade
0     Alice Chen     92
1   Bob Martinez     78
...

=== Sorted by grade (highest first) ===
(first 10 rows)
            name  subject  grade  age
18    Sam Turner     Math     96   16
...

Done.
```

The exact numbers come from `data/students.csv`. The `...` lines abbreviate the output here; your run prints the full `info()` summary, every `describe()` statistic, and all of the rows each `head()` call returns.

## Alter it

1. Change `head()` to `head(10)` and see what happens. Try `tail(3)`.
2. Sort by `age` instead of `grade`. What happens when two students have the same age?
3. Select three columns instead of two. What does `df[["name", "subject", "grade"]]` return?
4. Try `df["grade"].mean()` and `df["grade"].max()` — what do they return?

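
The four changes above can be tried together in one short session. This is a minimal sketch, not part of `project.py`: it assumes a six-row excerpt of `data/students.csv` pasted into a string so it runs without the data file, and the variable names (`sample_csv`, `by_age`, `three_cols`) are made up for illustration.

```python
import io

import pandas as pd

# Six rows copied from data/students.csv so the sketch runs on its own.
sample_csv = """name,subject,grade,age
Alice Chen,Math,92,17
Bob Martinez,Science,78,16
Carol Johnson,English,85,17
David Kim,Math,67,16
Eva Patel,Science,91,18
Frank Lopez,English,73,17
"""
df = pd.read_csv(io.StringIO(sample_csv))

# 1. head(n) and tail(n) take the number of rows to return.
print(df.head(10))  # asks for 10, but only 6 rows exist, so all 6 come back
print(df.tail(3))   # the last 3 rows

# 2. Sorting by age alone need not keep tied rows in their original order
#    with the default sort algorithm; a second sort key decides the ties.
by_age = df.sort_values(["age", "grade"], ascending=[True, False])
print(by_age)

# 3. A list of three names returns a DataFrame with those three columns.
three_cols = df[["name", "subject", "grade"]]
print(three_cols.shape)

# 4. Aggregations on a single column return one number each.
print(df["grade"].mean(), df["grade"].max())
```

Passing a list to `sort_values()` with a matching `ascending` list is one way to make the order of tied ages predictable; sorting by `age` alone is also valid and is what step 2 asks you to observe first.
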
## Break it

1. Change the filename in `read_csv()` to a file that does not exist. What error do you get?
2. Try selecting a column that does not exist: `df["score"]`. Read the error message.
3. Remove the `import pandas as pd` line. What happens?

## Fix it

1. Wrap `read_csv()` in a try/except that catches `FileNotFoundError` and prints a friendly message.
2. Before selecting a column, check if it exists: `if "score" in df.columns`.
3. Put the import back.

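
Fixes 1 and 2 could look something like the sketch below. The helper names `load_or_none` and `column_or_none`, the path `data/no_such_file.csv`, and the choice to return `None` on failure are all assumptions made for illustration; `project.py` does not define them, and other designs (re-raising, exiting the script) are just as reasonable.

```python
import pandas as pd


def load_or_none(filepath):
    """Return a DataFrame, or None with a friendly message if the file is missing."""
    try:
        return pd.read_csv(filepath)
    except FileNotFoundError:
        print(f"Could not find '{filepath}'. Check the path and the folder you ran from.")
        return None


def column_or_none(df, column):
    """Return the column as a Series, or None if the DataFrame has no such column."""
    if column in df.columns:
        return df[column]
    print(f"No column named '{column}'. Available: {list(df.columns)}")
    return None


# Hypothetical path that is assumed not to exist.
missing = load_or_none("data/no_such_file.csv")

df = pd.DataFrame({"name": ["Alice Chen", "Bob Martinez"], "grade": [92, 78]})
score = column_or_none(df, "score")  # prints the friendly message, returns None
grade = column_or_none(df, "grade")  # returns the grade Series
```
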
## Explain it

1. What is a DataFrame? How is it different from a list of dictionaries?
2. What does `describe()` tell you that `info()` does not?
3. Why does `dtypes` show `object` for the name and subject columns instead of `string`?
4. What is the difference between `df["grade"]` (one column) and `df[["grade"]]` (double brackets)?

97+
## Mastery check
98+
99+
You can move on when you can:
100+
101+
- Load any CSV file into a DataFrame from memory.
102+
- Use `head()`, `shape`, `dtypes`, `info()`, and `describe()` to explore a new data set.
103+
- Select one or more columns from a DataFrame.
104+
- Sort a DataFrame by any column, ascending or descending.
105+
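
One way to practice the exploration item on data you have not seen before is to run the same methods on a table with a gap in it. The sketch below uses an invented three-row sample with one missing grade (it is not taken from `data/students.csv`) to show where a missing value becomes visible.

```python
import io

import pandas as pd

# Invented sample with one missing grade; not part of data/students.csv.
sample_csv = """name,grade,age
Ana,90,17
Ben,,16
Cy,70,18
"""
df = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)       # (3, 3)
print(df.dtypes)      # grade is float64 here because of the missing value
df.info()             # prints directly; the grade column reports 2 non-null
print(df.describe())  # count for grade is 2.0, so its mean uses two values

# Count missing values per column explicitly.
missing_per_column = df.isna().sum()
print(missing_per_column)
```
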
## Next

[Project 02 — Filtering & Grouping](../02-filtering-grouping/)
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
name,subject,grade,age
Alice Chen,Math,92,17
Bob Martinez,Science,78,16
Carol Johnson,English,85,17
David Kim,Math,67,16
Eva Patel,Science,91,18
Frank Lopez,English,73,17
Grace Okafor,Math,88,16
Henry Wang,Science,82,17
Irene Novak,English,95,18
James Brown,Math,54,16
Karen Lee,Science,76,17
Leo Garcia,English,89,16
Maria Santos,Math,71,18
Nathan Green,Science,93,17
Olivia Reed,English,62,16
Peter Zhao,Math,84,17
Quinn Adams,Science,79,18
Rachel Hill,English,90,17
Sam Turner,Math,96,16
Tina Wilson,Science,68,17
Uma Desai,English,81,18
Victor Cruz,Math,77,16
Wendy Fox,Science,86,17
Xavier Bell,English,59,16
Yara Hussain,Math,94,18
Zane Porter,Science,72,17
Alice Turner,Math,83,16
Ben Okafor,English,91,17
Clara Reyes,Science,65,18
Derek Nash,Math,87,16
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Notes — Pandas Basics

## What I learned


## What confused me


## What I want to explore next

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
"""
Project 01 — Pandas Basics

This script loads a CSV file of student grades into a pandas DataFrame
and explores the data using built-in methods: head(), shape, dtypes,
info(), describe(), column selection, and sorting.

Data file: data/students.csv (30 rows with name, subject, grade, age)
"""

# pandas is the core library for data analysis in Python.
# The convention is to import it as "pd" so you type less.
# You installed it with: pip install pandas
import pandas as pd


def load_data(filepath):
    """
    Load a CSV file into a pandas DataFrame.

    pd.read_csv() reads a comma-separated file and returns a DataFrame:
    a table-like structure with labeled columns and numbered rows.
    Think of it as a spreadsheet you can manipulate with code.
    """
    df = pd.read_csv(filepath)
    print(f"Loaded {len(df)} rows and {len(df.columns)} columns from {filepath}")
    return df


def explore_head(df):
    """
    Show the first few rows of the DataFrame.

    head() returns the first 5 rows by default. This is the fastest way
    to see what your data looks like after loading it.
    """
    print("\n=== First 5 rows (head) ===")
    print(df.head())


def explore_shape(df):
    """
    Show the dimensions of the DataFrame.

    shape is a tuple (rows, columns). It tells you how big your data set
    is without printing all the data.
    """
    rows, cols = df.shape
    print("\n=== Shape ===")
    print(f"Rows: {rows}, Columns: {cols}")


def explore_dtypes(df):
    """
    Show the data type of each column.

    dtypes tells you whether each column holds numbers (int64, float64),
    text (object), dates, or other types. This matters because you cannot
    do math on text columns.

    "object" in pandas usually means the column contains strings.
    """
    print("\n=== Column types (dtypes) ===")
    print(df.dtypes)


def explore_info(df):
    """
    Show a concise summary of the DataFrame.

    info() prints the column names, non-null counts, and data types
    all in one view. It is especially useful for spotting missing values:
    if a column has fewer non-null entries than total rows, some values
    are missing.
    """
    print("\n=== Info ===")
    df.info()


def explore_describe(df):
    """
    Show summary statistics for numeric columns.

    describe() calculates count, mean, std, min, 25%, 50% (median),
    75%, and max for every numeric column. This gives you a quick
    sense of the distribution: are grades clustered around 80?
    Is the youngest student 14 or 18?
    """
    print("\n=== Summary statistics (describe) ===")
    print(df.describe())


def select_columns(df):
    """
    Select specific columns from the DataFrame.

    df["column_name"] returns a single column as a Series.
    df[["col1", "col2"]] returns multiple columns as a new DataFrame.
    Notice the double brackets: the inner list tells pandas which
    columns you want.
    """
    print("\n=== Selecting just name and grade columns ===")
    # Double brackets: pass a list of column names to get a DataFrame back.
    subset = df[["name", "grade"]]
    print("(first 5 rows)")
    print(subset.head())


def sort_by_grade(df):
    """
    Sort the DataFrame by the grade column, highest first.

    sort_values() returns a new DataFrame with rows reordered.
    ascending=False puts the highest values at the top.
    The original DataFrame is not changed.
    """
    print("\n=== Sorted by grade (highest first) ===")
    sorted_df = df.sort_values("grade", ascending=False)
    print("(first 10 rows)")
    print(sorted_df.head(10))


def main():
    print("=== Loading student data ===")

    # Step 1: Load the CSV into a DataFrame.
    # The file path is relative to where you run the script from.
    df = load_data("data/students.csv")

    # Step 2: Explore the data using built-in methods.
    # These are the first things you should do with any new data set.
    explore_head(df)
    explore_shape(df)
    explore_dtypes(df)
    explore_info(df)
    explore_describe(df)

    # Step 3: Select specific columns.
    select_columns(df)

    # Step 4: Sort the data.
    sort_by_grade(df)

    print("\nDone.")


# This pattern means: only run main() when this file is executed directly.
# If someone imports this file, main() will NOT run automatically.
if __name__ == "__main__":
    main()
