Scrapes NHL play-by-play (PXP) data from the NHL Stats API and stores it in SQLite. Includes a normalized player analytics schema, a canonical shot events layer, xG feature extraction, and exploratory analysis notebooks.
- Fetch weekly schedules and play-by-play events from the NHL Stats API (`api-web.nhle.com`)
- Store each game's events in `nhl_data.db` as table `game_<game_id>`
- Resume interrupted scrapes without re-downloading completed games or skipping failed dates
- Maintain normalized player analytics tables for player/game modeling workflows
- Provide a canonical shot events schema with xG features for model development
- Backfill derived tables (shot events, game context) for any games missing them
```
data/
  nhl_data.db            SQLite database (not checked in; created at runtime)
src/
  main.py                Scrape loop with weekly pagination from 2007-10-03 to today
  nhl_api.py             NHL Stats API client (schedule + play-by-play endpoints)
  database.py            SQLite operations: raw events, collection tracking, player schema, xG schema
  xg_features.py         Pure feature-extraction functions (coordinates, score state, faceoff context, rest/travel)
  arena_reference.py     Static arena location data (lat/lon, UTC offset) for all 32 teams + historical
  backfill_status.py     CLI tool to inspect database completeness and backfill log progress
tests/
  conftest.py            Pytest path setup
  test_database.py       Schema, collection log, and data quality tests
  test_game_context.py   Game context table and backfill tests
  test_main.py           Scraper loop integration tests
  test_nhl_api.py        API parsing and error-path tests
  test_xg_features.py    xG feature extraction unit tests
  test_xg_schema.py      xG Phase 0 schema and validation tests
notebooks/
  faceoff_decay_analysis.ipynb   Phase 2 Area 3: shot quality decay by time since faceoff and zone
  rest_travel_analysis.ipynb     Phase 2 Area 1: rest days, back-to-back status, travel distance, timezone effects
  venue_bias_analysis.ipynb      Phase 2 Area 4: scorekeeper bias detection by venue
  zone_start_signal.ipynb        Phase 2 Area 2: zone deployment context from faceoff zone codes
docs/
  xg_model_roadmap.md            xG model development roadmap (main plan)
  xg_model_components/           Detailed component design docs (9 parts)
```
The scraper tracks progress in a `collection_log` table. Each date records how many games were found and how many were successfully collected:
- Idempotent completion: `completed_at` is only set when all games for a date succeed. Partial failures leave the date incomplete.
- Resume from failures: On restart, the scraper resumes from the earliest incomplete date, not the latest complete one, so no games are permanently skipped.
- Per-game deduplication: Already-collected games are skipped individually, so retrying an incomplete date only re-fetches the games that failed.
- Legacy data migration: `fix_incomplete_collection_log` runs at startup to correct any historical rows where `completed_at` was incorrectly set despite incomplete collection.
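The resume rule above can be sketched as a single query. This is an illustrative sketch, not the project's actual code; the `collection_log` column names here are assumptions based on the description above.

```python
# Sketch of "resume from the earliest incomplete date": a date is
# incomplete if completed_at is NULL or fewer games were collected
# than found. Table/column names are assumed for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collection_log (
        date TEXT PRIMARY KEY,
        games_found INTEGER,
        games_collected INTEGER,
        completed_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO collection_log VALUES (?, ?, ?, ?)",
    [
        ("2007-10-03", 4, 4, "2024-01-01T00:00:00"),  # complete
        ("2007-10-04", 6, 5, None),                   # partial failure
        ("2007-10-05", 3, 3, "2024-01-01T00:05:00"),  # complete
    ],
)

row = conn.execute("""
    SELECT date FROM collection_log
    WHERE completed_at IS NULL OR games_collected < games_found
    ORDER BY date
    LIMIT 1
""").fetchone()

print(row[0])  # resumes from 2007-10-04, even though 2007-10-05 completed
```

Note that the query orders by date ascending, which is why a later successful date never hides an earlier partial failure.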
Per-game `game_<game_id>` tables with a `UNIQUE(period, time, event, description)` constraint. Legacy tables without the constraint are automatically deduplicated and migrated on startup.
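The dedup guarantee follows from the documented `UNIQUE` constraint. A minimal sketch (illustrative column types, not the project's actual DDL):

```python
# With UNIQUE(period, time, event, description), re-inserting the same
# event via INSERT OR IGNORE is a no-op, so retries cannot duplicate rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE game_2007020001 (
        period INTEGER,
        time TEXT,
        event TEXT,
        description TEXT,
        UNIQUE (period, time, event, description)
    )
""")

event = (1, "05:12", "shot-on-goal", "Wrist shot from the slot")
for _ in range(3):  # retrying the same insert three times
    conn.execute("INSERT OR IGNORE INTO game_2007020001 VALUES (?, ?, ?, ?)", event)

count = conn.execute("SELECT COUNT(*) FROM game_2007020001").fetchone()[0]
print(count)  # 1
```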
Initialize with `ensure_player_database_schema(conn)`:
- `players`, `games`, `teams` — core dimension tables
- `player_game_stats` — one row per `(player_id, game_id)` with counting stats, TOI, and xG placeholders
- `player_game_features` — materialized rolling/rank features with `feature_set_version` tracking
Initialize with `ensure_xg_schema(conn)`:
- `shot_events` — canonical shot event table with normalized coordinates, shot type, distance/angle to goal, score state, manpower state, faceoff timing/zone, and `event_schema_version` for training reproducibility
- Data contracts: validated enums for shot types, manpower states, score states, and NHL rink coordinate bounds
- `validate_shot_events_quality()` — checks shot type, manpower/score state, coordinate ranges, `is_goal` values, time remaining, and duplicate events
- `game_context` — per-game metadata (teams, venue, venue lat/lon, UTC offset, home/away rest days, travel distance, timezone delta) derived from raw API data and arena reference data; includes `context_schema_version` for version-aware backfill
- `validate_player_game_stats_quality()` — duplicate keys, negative/excessive TOI, invalid position groups
- `validate_shot_events_quality()` — invalid enums, out-of-range coordinates, negative time, duplicate events
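One of the quality checks above, duplicate-event detection, reduces to a `GROUP BY ... HAVING` query. A hedged sketch on a toy table (the column names here are assumptions, not the project's actual `shot_events` schema):

```python
# Flag shot events that appear more than once for the same game,
# period, clock time, and shooter. Toy schema for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shot_events (
        game_id INTEGER, period INTEGER, game_time TEXT,
        shooter_id INTEGER, shot_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO shot_events VALUES (?, ?, ?, ?, ?)",
    [
        (2023020001, 1, "04:31", 8478402, "wrist"),
        (2023020001, 1, "04:31", 8478402, "wrist"),  # duplicate row
        (2023020001, 2, "10:02", 8477934, "slap"),
    ],
)

dupes = conn.execute("""
    SELECT game_id, period, game_time, shooter_id, COUNT(*) AS n
    FROM shot_events
    GROUP BY game_id, period, game_time, shooter_id, shot_type
    HAVING n > 1
""").fetchall()
print(len(dupes))  # one duplicate group found
```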
Pure Python functions (no DB or HTTP dependencies) for computing shot event features:
- Coordinate normalization — flips coordinates so the shooting team always attacks toward +x
- Distance and angle to goal — Euclidean distance and arc-tangent angle from normalized coordinates
- Score state classification — tied / up1 / up2 / up3plus / down1 / down2 / down3plus
- Manpower state classification — parses 4-digit situation codes into skater counts (5v5, 5v4, 4v5, etc.)
- Faceoff context — seconds since last faceoff, faceoff zone code, recency bin (immediate / early / mid / late / steady_state), zone-recency interaction feature
- Rest and travel — rest days between games, back-to-back flag, haversine travel distance, timezone delta
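Three of the computations above can be sketched in a few pure functions. This is a sketch under assumed conventions, not the project's `xg_features.py`: the goal is placed at x = 89 ft after normalization toward +x, the 4-digit situation code is read as away-goalie / away-skaters / home-skaters / home-goalie, and travel uses the standard haversine formula. All three conventions are common in NHL data work but may differ from the actual implementation.

```python
from math import atan2, degrees, hypot, radians, sin, cos, asin, sqrt

GOAL_X = 89.0  # assumed: feet from center ice to the goal line

def shot_distance_angle(x, y):
    """Distance (ft) and absolute angle (deg) to the goal from
    normalized coordinates where the shooter attacks toward +x."""
    dist = hypot(GOAL_X - x, y)
    angle = degrees(atan2(abs(y), GOAL_X - x))
    return dist, angle

def manpower_state(situation_code, home_is_shooting):
    """Skater counts from a 4-digit code, assumed layout
    <away goalie><away skaters><home skaters><home goalie>."""
    away_skaters = int(situation_code[1])
    home_skaters = int(situation_code[2])
    if home_is_shooting:
        return f"{home_skaters}v{away_skaters}"
    return f"{away_skaters}v{home_skaters}"

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (Earth radius ~6371 km)."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

d, a = shot_distance_angle(79.0, 10.0)       # 10 ft out, 10 ft off-center
print(round(d, 1), round(a, 1))              # 14.1 45.0
print(manpower_state("1541", home_is_shooting=True))  # 4v5 (home shorthanded)
print(round(haversine_km(42.366, -71.062, 40.750, -73.994)))  # Boston -> NYC, ~300 km
```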
Static lookup table mapping `team_id` to arena city, UTC offset (standard time), latitude, and longitude. Covers all 32 current NHL teams plus historical franchises (Atlanta Thrashers, original Phoenix Coyotes, etc.).
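The shape of such a lookup might be a plain dict. The entries below are illustrative only (NHL API team IDs 6 = Boston and 10 = Toronto, with approximate coordinates), not the project's actual table:

```python
# Illustrative lookup shape; values are approximate and for example only.
ARENA_REFERENCE = {
    6:  {"city": "Boston",  "utc_offset": -5, "lat": 42.366, "lon": -71.062},
    10: {"city": "Toronto", "utc_offset": -5, "lat": 43.643, "lon": -79.379},
}

def arena_for(team_id):
    return ARENA_REFERENCE[team_id]

print(arena_for(6)["city"])  # Boston
```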
CLI tool for inspecting database completeness:
```bash
cd src
python backfill_status.py [--log-path PATH] [--tail-lines N]
```

Reports the raw game table count; metadata, shot event, and game context row counts; how many games are missing derived data; the last completed collection date; and the tail of the backfill log.
- All dynamic SQL identifiers are validated through `_quote_identifier` (rejects non-word characters)
- Column names from dict keys in `insert_data` are validated before use in queries
- All values use parameterized queries (`?` placeholders)
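The identifier-validation idea can be sketched in a few lines. The helper name mirrors `_quote_identifier` from the list above, but this body is an assumption, not the project's actual code:

```python
# Accept only word characters (letters, digits, underscore) so a
# dynamically built table/column name can never smuggle in SQL.
import re

_IDENTIFIER_RE = re.compile(r"^\w+$")

def quote_identifier(name: str) -> str:
    if not _IDENTIFIER_RE.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f'"{name}"'

print(quote_identifier("game_2007020001"))      # "game_2007020001"
# quote_identifier("game_1; DROP TABLE players") raises ValueError
```

Values still go through `?` placeholders; this check only covers the identifiers that placeholders cannot parameterize.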
- Python 3.8+
- SQLite (included with standard Python)
- `requests` (runtime), `pytest` (testing) — see `requirements.txt`
```bash
pip install -r requirements.txt
cd src
python main.py
```

Scrapes from 2007-10-03 through today, storing results in `nhl_data.db`. The scraper automatically resumes from where it left off on subsequent runs.
The `notebooks/` directory contains Jupyter analysis notebooks for the xG model Phase 2 signal validation work. Each notebook connects to `nhl_data.db` and reads from the derived tables (`shot_events`, `game_context`). Run `backfill_status.py` first to confirm derived tables are populated before opening a notebook.
| Notebook | Phase | Topic |
|---|---|---|
| `rest_travel_analysis.ipynb` | Phase 2 Area 1 | Rest days, back-to-back, travel distance, timezone effects on shot quality |
| `zone_start_signal.ipynb` | Phase 2 Area 2 | Faceoff zone code as a zone-deployment proxy |
| `faceoff_decay_analysis.ipynb` | Phase 2 Area 3 | Shot quality decay by time since faceoff, separated by zone |
| `venue_bias_analysis.ipynb` | Phase 2 Area 4 | Scorekeeper bias detection by venue |
```bash
python3 -m venv /tmp/test-venv && /tmp/test-venv/bin/pip install -q pytest requests
/tmp/test-venv/bin/python -m pytest -q
```

Or if the venv already exists:

```bash
/tmp/test-venv/bin/python -m pytest -q
```

243 tests covering:
- Raw table creation, deduplication, and unique constraints
- Collection log idempotency (incomplete dates, retries, resume logic)
- Player-schema phases (dimensions, fact table, feature table, quality checks)
- xG shot events (DDL, validation paths, NULL coordinate handling, version-aware backfill)
- xG feature extraction (coordinate normalization, distance/angle, score/manpower state, faceoff recency, rest/travel)
- Game context extraction and backfill
- NHL API parsing, error paths, rate limiting, and session reuse
- Scraper loop pagination, date filtering, and resume behavior
No live NHL API calls are made during tests.
- A full historical scrape issues many API requests; the built-in rate limiter spaces game API calls by 2 seconds.
- HTTP connections are reused via `requests.Session` to reduce TCP/TLS overhead.
- All SQL identifiers from external input are validated before use.
- Derived tables (`shot_events`, `game_context`, `player_game_features`) store a schema version column so stale rows are automatically detected and replaced when the extraction logic changes.
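Version-aware staleness detection reduces to comparing the stored version against the current constant. A hedged sketch (table and column names assumed from the descriptions above, not the project's actual code):

```python
# Rows whose stored context_schema_version lags the current constant
# are selected for re-extraction during backfill.
import sqlite3

CURRENT_CONTEXT_SCHEMA_VERSION = 2  # assumed constant for illustration

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE game_context (
        game_id INTEGER PRIMARY KEY,
        context_schema_version INTEGER
    )
""")
conn.executemany(
    "INSERT INTO game_context VALUES (?, ?)",
    [(2023020001, 1), (2023020002, 2), (2023020003, 1)],
)

stale = [
    r[0]
    for r in conn.execute(
        "SELECT game_id FROM game_context "
        "WHERE context_schema_version < ? ORDER BY game_id",
        (CURRENT_CONTEXT_SCHEMA_VERSION,),
    )
]
print(stale)  # games needing re-extraction: [2023020001, 2023020003]
```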