Scrapes NHL play-by-play (PXP) data from the NHL Stats API and stores it in SQLite. Includes a normalized player analytics schema, a canonical shot events layer, xG feature extraction, and exploratory analysis notebooks.
- Fetch weekly schedules and play-by-play events from the NHL Stats API (`api-web.nhle.com`)
- Store each game's events in `nhl_data.db` as table `game_<game_id>`
- Resume interrupted scrapes without re-downloading completed games or skipping failed dates
- Maintain normalized player analytics tables for player/game modeling workflows
- Provide a canonical shot events schema with xG features for model development
- Backfill derived tables (shot events, game context) for any games missing them
```
data/
  nhl_data.db            SQLite database (not checked in; created at runtime)
src/
  main.py                Scrape loop with weekly pagination from 2007-10-03 to today
  nhl_api.py             NHL Stats API client (schedule + play-by-play endpoints)
  database.py            SQLite operations: raw events, collection tracking, player schema, xG schema
  xg_features.py         Pure feature-extraction functions (coordinates, score state, faceoff context, rest/travel)
  arena_reference.py     Static arena location data (lat/lon, UTC offset) for all 32 teams + historical
  backfill_status.py     CLI tool to inspect database completeness and backfill log progress
tests/
  conftest.py            Pytest path setup
  test_database.py       Schema, collection log, and data quality tests
  test_game_context.py   Game context table and backfill tests
  test_main.py           Scraper loop integration tests
  test_nhl_api.py        API parsing and error-path tests
  test_xg_features.py    xG feature extraction unit tests
  test_xg_schema.py      xG Phase 0 schema and validation tests
notebooks/
  faceoff_decay_analysis.ipynb   Phase 2 Area 3: shot quality decay by time since faceoff and zone
  rest_travel_analysis.ipynb     Phase 2 Area 1: rest days, back-to-back status, travel distance, timezone effects
  venue_bias_analysis.ipynb      Phase 2 Area 4: scorekeeper bias detection by venue
  zone_start_signal.ipynb        Phase 2 Area 2: zone deployment context from faceoff zone codes
docs/
  xg_model_roadmap.md            xG model development roadmap (main plan)
  xg_model_components/           Detailed component design docs (9 parts)
```
The scraper tracks progress in a `collection_log` table. Each date records how many games were found and how many were successfully collected:
- Idempotent completion: `completed_at` is only set when all games for a date succeed. Partial failures leave the date incomplete.
- Resume from failures: On restart, the scraper resumes from the earliest incomplete date, not the latest complete one, so no games are permanently skipped.
- Per-game deduplication: Already-collected games are skipped individually, so retrying an incomplete date only re-fetches the games that failed.
- Legacy data migration: `fix_incomplete_collection_log` runs at startup to correct any historical rows where `completed_at` was incorrectly set despite incomplete collection.
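The resume rule above can be sketched as a single query. This is an illustrative sketch, not the project's actual code; the `collection_log` column names here are assumptions based on the description above.

```python
# Sketch of "resume from the earliest incomplete date": a date is
# incomplete if completed_at is NULL or fewer games were collected
# than found. Table/column names are assumed for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collection_log (
        date TEXT PRIMARY KEY,
        games_found INTEGER,
        games_collected INTEGER,
        completed_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO collection_log VALUES (?, ?, ?, ?)",
    [
        ("2007-10-03", 4, 4, "2024-01-01T00:00:00"),  # complete
        ("2007-10-04", 6, 5, None),                   # partial failure
        ("2007-10-05", 3, 3, "2024-01-01T00:05:00"),  # complete
    ],
)

row = conn.execute("""
    SELECT date FROM collection_log
    WHERE completed_at IS NULL OR games_collected < games_found
    ORDER BY date
    LIMIT 1
""").fetchone()

print(row[0])  # resumes from 2007-10-04, even though 2007-10-05 completed
```

Note that the query orders by date ascending, which is why a later successful date never hides an earlier partial failure.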
Per-game `game_<game_id>` tables with a `UNIQUE(period, time, event, description)` constraint. Legacy tables without the constraint are automatically deduplicated and migrated on startup.
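The dedup guarantee follows from the documented `UNIQUE` constraint. A minimal sketch (illustrative column types, not the project's actual DDL):

```python
# With UNIQUE(period, time, event, description), re-inserting the same
# event via INSERT OR IGNORE is a no-op, so retries cannot duplicate rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE game_2007020001 (
        period INTEGER,
        time TEXT,
        event TEXT,
        description TEXT,
        UNIQUE (period, time, event, description)
    )
""")

event = (1, "05:12", "shot-on-goal", "Wrist shot from the slot")
for _ in range(3):  # retrying the same insert three times
    conn.execute("INSERT OR IGNORE INTO game_2007020001 VALUES (?, ?, ?, ?)", event)

count = conn.execute("SELECT COUNT(*) FROM game_2007020001").fetchone()[0]
print(count)  # 1
```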
Initialize with `ensure_player_database_schema(conn)`:
- `players`, `games`, `teams` — core dimension tables
- `player_game_stats` — one row per `(player_id, game_id)` with counting stats, TOI, and xG placeholders
- `player_game_features` — materialized rolling/rank features with `feature_set_version` tracking
Initialize with `ensure_xg_schema(conn)`:
- `shot_events` — canonical shot event table with normalized coordinates, shot type, distance/angle to goal, score state, manpower state, faceoff timing/zone, and `event_schema_version` for training reproducibility
- Data contracts: validated enums for shot types, manpower states, score states, and NHL rink coordinate bounds
- `validate_shot_events_quality()` — checks shot type, manpower/score state, coordinate ranges, `is_goal` values, time remaining, and duplicate events
- `game_context` — per-game metadata (teams, venue, venue lat/lon, UTC offset, home/away rest days, travel distance, timezone delta) derived from raw API data and arena reference data; includes `context_schema_version` for version-aware backfill
- `validate_player_game_stats_quality()` — duplicate keys, negative/excessive TOI, invalid position groups
- `validate_shot_events_quality()` — invalid enums, out-of-range coordinates, negative time, duplicate events
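One of the quality checks above, duplicate-event detection, reduces to a `GROUP BY ... HAVING` query. A hedged sketch on a toy table (the column names here are assumptions, not the project's actual `shot_events` schema):

```python
# Flag shot events that appear more than once for the same game,
# period, clock time, and shooter. Toy schema for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shot_events (
        game_id INTEGER, period INTEGER, game_time TEXT,
        shooter_id INTEGER, shot_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO shot_events VALUES (?, ?, ?, ?, ?)",
    [
        (2023020001, 1, "04:31", 8478402, "wrist"),
        (2023020001, 1, "04:31", 8478402, "wrist"),  # duplicate row
        (2023020001, 2, "10:02", 8477934, "slap"),
    ],
)

dupes = conn.execute("""
    SELECT game_id, period, game_time, shooter_id, COUNT(*) AS n
    FROM shot_events
    GROUP BY game_id, period, game_time, shooter_id, shot_type
    HAVING n > 1
""").fetchall()
print(len(dupes))  # one duplicate group found
```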
Pure Python functions (no DB or HTTP dependencies) for computing shot event features:
- Coordinate normalization — flips coordinates so the shooting team always attacks toward +x
- Distance and angle to goal — Euclidean distance and arc-tangent angle from normalized coordinates
- Score state classification — tied / up1 / up2 / up3plus / down1 / down2 / down3plus
- Manpower state classification — parses 4-digit situation codes into skater counts (5v5, 5v4, 4v5, etc.)
- Faceoff context — seconds since last faceoff, faceoff zone code, recency bin (immediate / early / mid / late / steady_state), zone-recency interaction feature
- Rest and travel — rest days between games, back-to-back flag, haversine travel distance, timezone delta
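Three of the computations above can be sketched in a few pure functions. This is a sketch under assumed conventions, not the project's `xg_features.py`: the goal is placed at x = 89 ft after normalization toward +x, the 4-digit situation code is read as away-goalie / away-skaters / home-skaters / home-goalie, and travel uses the standard haversine formula. All three conventions are common in NHL data work but may differ from the actual implementation.

```python
from math import atan2, degrees, hypot, radians, sin, cos, asin, sqrt

GOAL_X = 89.0  # assumed: feet from center ice to the goal line

def shot_distance_angle(x, y):
    """Distance (ft) and absolute angle (deg) to the goal from
    normalized coordinates where the shooter attacks toward +x."""
    dist = hypot(GOAL_X - x, y)
    angle = degrees(atan2(abs(y), GOAL_X - x))
    return dist, angle

def manpower_state(situation_code, home_is_shooting):
    """Skater counts from a 4-digit code, assumed layout
    <away goalie><away skaters><home skaters><home goalie>."""
    away_skaters = int(situation_code[1])
    home_skaters = int(situation_code[2])
    if home_is_shooting:
        return f"{home_skaters}v{away_skaters}"
    return f"{away_skaters}v{home_skaters}"

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (Earth radius ~6371 km)."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

d, a = shot_distance_angle(79.0, 10.0)       # 10 ft out, 10 ft off-center
print(round(d, 1), round(a, 1))              # 14.1 45.0
print(manpower_state("1541", home_is_shooting=True))  # 4v5 (home shorthanded)
print(round(haversine_km(42.366, -71.062, 40.750, -73.994)))  # Boston -> NYC, ~300 km
```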
Static lookup table mapping `team_id` to arena city, UTC offset (standard time), latitude, and longitude. Covers all 32 current NHL teams plus historical franchises (Atlanta Thrashers, original Phoenix Coyotes, etc.).
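The shape of such a lookup might be a plain dict. The entries below are illustrative only (NHL API team IDs 6 = Boston and 10 = Toronto, with approximate coordinates), not the project's actual table:

```python
# Illustrative lookup shape; values are approximate and for example only.
ARENA_REFERENCE = {
    6:  {"city": "Boston",  "utc_offset": -5, "lat": 42.366, "lon": -71.062},
    10: {"city": "Toronto", "utc_offset": -5, "lat": 43.643, "lon": -79.379},
}

def arena_for(team_id):
    return ARENA_REFERENCE[team_id]

print(arena_for(6)["city"])  # Boston
```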
CLI tool for inspecting database completeness:
```bash
cd src
python backfill_status.py [--log-path PATH] [--tail-lines N]
```

Reports the raw game table count; metadata, shot event, and game context row counts; how many games are missing derived data; the last completed collection date; and the tail of the backfill log.
- All dynamic SQL identifiers are validated through `_quote_identifier` (rejects non-word characters)
- Column names from dict keys in `insert_data` are validated before use in queries
- All values use parameterized queries (`?` placeholders)
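The identifier-validation idea can be sketched in a few lines. The helper name mirrors `_quote_identifier` from the list above, but this body is an assumption, not the project's actual code:

```python
# Accept only word characters (letters, digits, underscore) so a
# dynamically built table/column name can never smuggle in SQL.
import re

_IDENTIFIER_RE = re.compile(r"^\w+$")

def quote_identifier(name: str) -> str:
    if not _IDENTIFIER_RE.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f'"{name}"'

print(quote_identifier("game_2007020001"))      # "game_2007020001"
# quote_identifier("game_1; DROP TABLE players") raises ValueError
```

Values still go through `?` placeholders; this check only covers the identifiers that placeholders cannot parameterize.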
- Python 3.8+
- SQLite (included with standard Python)
- `requests` (runtime), `pytest` (testing) — see `requirements.txt`
```bash
pip install -r requirements.txt
cd src
python main.py
```

Scrapes from 2007-10-03 through today, storing results in `nhl_data.db`. The scraper automatically resumes from where it left off on subsequent runs.
The `notebooks/` directory contains Jupyter analysis notebooks for the xG model Phase 2 signal validation work. Each notebook connects to `nhl_data.db` and reads from the derived tables (`shot_events`, `game_context`). Run `backfill_status.py` first to confirm derived tables are populated before opening a notebook.
| Notebook | Phase | Topic |
|---|---|---|
| `rest_travel_analysis.ipynb` | Phase 2 Area 1 | Rest days, back-to-back, travel distance, timezone effects on shot quality |
| `zone_start_signal.ipynb` | Phase 2 Area 2 | Faceoff zone code as a zone-deployment proxy |
| `faceoff_decay_analysis.ipynb` | Phase 2 Area 3 | Shot quality decay by time since faceoff, separated by zone |
| `venue_bias_analysis.ipynb` | Phase 2 Area 4 | Scorekeeper bias detection by venue |
```bash
python3 -m venv /tmp/test-venv && /tmp/test-venv/bin/pip install -q pytest requests
/tmp/test-venv/bin/python -m pytest -q
```

Or if the venv already exists:

```bash
/tmp/test-venv/bin/python -m pytest -q
```

243 tests covering:
- Raw table creation, deduplication, and unique constraints
- Collection log idempotency (incomplete dates, retries, resume logic)
- Player-schema phases (dimensions, fact table, feature table, quality checks)
- xG shot events (DDL, validation paths, NULL coordinate handling, version-aware backfill)
- xG feature extraction (coordinate normalization, distance/angle, score/manpower state, faceoff recency, rest/travel)
- Game context extraction and backfill
- NHL API parsing, error paths, rate limiting, and session reuse
- Scraper loop pagination, date filtering, and resume behavior
No live NHL API calls are made during tests.
- A full historical scrape issues many API requests; the built-in rate limiter spaces game API calls by 2 seconds.
- HTTP connections are reused via `requests.Session` to reduce TCP/TLS overhead.
- All SQL identifiers from external input are validated before use.
- Derived tables (`shot_events`, `game_context`, `player_game_features`) store a schema version column so stale rows are automatically detected and replaced when the extraction logic changes.
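Version-aware staleness detection reduces to comparing the stored version against the current constant. A hedged sketch (table and column names assumed from the descriptions above, not the project's actual code):

```python
# Rows whose stored context_schema_version lags the current constant
# are selected for re-extraction during backfill.
import sqlite3

CURRENT_CONTEXT_SCHEMA_VERSION = 2  # assumed constant for illustration

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE game_context (
        game_id INTEGER PRIMARY KEY,
        context_schema_version INTEGER
    )
""")
conn.executemany(
    "INSERT INTO game_context VALUES (?, ?)",
    [(2023020001, 1), (2023020002, 2), (2023020003, 1)],
)

stale = [
    r[0]
    for r in conn.execute(
        "SELECT game_id FROM game_context "
        "WHERE context_schema_version < ? ORDER BY game_id",
        (CURRENT_CONTEXT_SCHEMA_VERSION,),
    )
]
print(stale)  # games needing re-extraction: [2023020001, 2023020003]
```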