This project explores long-term patterns in MLB player development, salaries, careers, and physical characteristics using the Lahman Baseball Database (1871–2024).
It demonstrates relational modelling, analytical views, window functions, multi-step CTE pipelines, and Python-based EDA.
The project investigates four major analytical themes:
- Talent Pipelines — Which colleges produce MLB players, and how this has changed over time and geography.
- Team Salary & Payroll Dynamics — How payroll evolves, which teams invest most heavily, and how spending relates to postseason performance.
- Career Trajectories — Debut age, retirement age, career longevity, and the degree of team loyalty.
- Player Physical & Comparative Profiles — How height, weight, and handedness vary across eras, positions, leagues, and regions.
Advanced components include Hall of Fame comparisons, debut-to-final-team mapping, salary inequality metrics, era splits, and geography-linked physical analyses.
Dataset: Lahman Baseball Database
Version: 2024 release (1871–2024)
Website: https://sabr.org/lahman-database/
Repository data layout:
- data/core/ — Core Lahman CSV tables used in this project
- data/extra/ — Supplementary tables (not used)
- data/output/ - SQL outputs used for visualisation in the notebook.
- readme2024u.txt — Official data documentation
PostgreSQL setup:
Tables are created manually using sql/schema.sql.
Run this before loading data:
sql/schema.sql;
Analytical views may be stored in an optional schema:
CREATE SCHEMA IF NOT EXISTS mlb_analytics;
MLB_Analytics_Project/
│
├── data/
│ ├── core/ # Core CSV tables
│ ├── extra/ # Unused additional tables
│ ├── output/ # SQL output tables
│ └── readme2024u.txt # Official documentation
│
├── sql/
│ ├── schema.sql # Table creation
│ ├── views.sql # Analytical views
│ ├── analysis_queries.sql # Core business queries
│ ├── advanced_queries.sql # Extended analysis
│ └── optimised_queries.sql # View-based refined versions
│
├── docs/
│ ├── project_overview.md
│ ├── business_questions.md
│ └── schema_design.md
│
├── notebooks/
│ └── mlb_visual_analysis.ipynb # EDA notebook
│
└── README.md # This file
All questions analysed are documented in:
docs/business_questions.md
Topics include:
- College-to-MLB pipelines
- Debut & retirement age distributions
- Career length modelling
- Salary trends & payroll inequality
- Postseason success vs financial investment
- Hall of Fame vs non-HOF comparisons
- Physical traits across eras, leagues, and positions
- Ballpark geography and debut profiles
Questions follow the convention:
- (C) — Core question
- (advance query) — Extended portfolio analysis
Build an empty PostgreSQL database, then:
sql/schema.sql;
Load each CSV in data/core/ into its corresponding table.
Ensure header rows and delimiter settings are correct.
sql/views.sql;
This generates reusable analytical views.
Execute:
- sql/analysis_queries.sql
- sql/advanced_queries.sql
- sql/optimised_queries.sql
Open:
notebooks/mlb_visual_analysis.ipynb
Includes pandas–SQL integration and long-term visualisations.
- Manual relational schema creation
- Analytical view design
- Window functions (RANK, PERCENTILE, LAG/LEAD)
- Multi-step CTE pipelines
- Joining multi-table baseball records
- SQL performance tuning
- SQL + Python EDA workflows
This project builds a multi-layered, data-driven narrative of MLB’s evolution, spanning origins, economics, career outcomes, and physical traits. Although split into four sections, the analysis forms a continuous story about talent, opportunity, and athletic change.
Part I examines where MLB players come from and how the collegiate pipeline has changed across time.
- Around 28 percent of MLB players have a documented college affiliation, spread across 1,100+ schools, indicating a large and decentralised talent system.
- School representation increases sharply after 1950, driven by the growth of collegiate baseball and improved scouting.
- Geographic patterns shift: early talent comes primarily from the Northeast and Midwest, while modern pipelines favour the South and West.
- Schools with established baseball programs tend to produce more players and longer MLB careers.
Part II examines the financial landscape players enter as professionals.
- The salary dataset (1985–2016) captures the modern era of escalating team budgets.
- The New York Yankees maintain a historic lead in cumulative payroll spending.
- Other high-spending teams (Red Sox, Dodgers, Mets) invest heavily but do not match the Yankees' multi-decade consistency.
- The median payroll is roughly two-thirds of the top-tier payrolls, indicating structural financial inequality.
- Postseason success is closely associated with higher payrolls.
Part III explores what happens after players reach MLB.
- MLB careers are typically short, with only a minority sustaining decade-long tenures.
- Hall of Fame players debut younger, play longer, and accumulate more games than non-inductees.
- Team mobility is common; relatively few players spend their entire career with one franchise.
- Historically stable franchises anchor many long-tenured players.
Part IV focuses on athletic traits and debut context.
- Height and weight data are over 90 percent complete.
- Average debut height has risen from about 5 ft 8 in to more than 6 ft 1 in, with weight increases even larger.
- Pitchers debut slightly taller and heavier than non-pitchers.
- AL and NL players show nearly identical physical profiles in each era.
- Geographic patterns show players debuting in western or international parks tend to be slightly larger on average.
Across all sections, a coherent picture emerges:
- The MLB talent pipeline has broadened geographically and institutionally.
- Team financial capacity strongly shapes opportunity and competitive landscapes.
- Career outcomes depend on early debut age, longevity, mobility, and organisational context.
- Physical traits have evolved alongside strategic, developmental, and economic changes.
Together, these insights provide a comprehensive view of MLB’s evolution, from origins to opportunity, from financial investment to career trajectory, and from physical development to long-term outcomes.
This repository serves as both:
- A portfolio-grade SQL project demonstrating professional relational modelling, advanced query design, analytical view architecture, and performance optimisation.
- A substantive exploration of MLB history, spanning salaries, careers, development systems, physical traits, and postseason dynamics.
The Lahman Baseball Database is distributed under the
Creative Commons Attribution–ShareAlike 3.0 Unported License:
https://creativecommons.org/licenses/by-sa/3.0/



