NHL Play-by-Play Scraper

Scrapes NHL play-by-play (PXP) data from the NHL Stats API and stores it in SQLite. Includes a normalized player analytics schema, a canonical shot events layer, xG feature extraction, and exploratory analysis notebooks.

Scope

  • Fetch weekly schedules and play-by-play events from the NHL Stats API (api-web.nhle.com)
  • Store each game's events in nhl_data.db as table game_<game_id>
  • Resume interrupted scrapes without re-downloading completed games or skipping failed dates
  • Maintain normalized player analytics tables for player/game modeling workflows
  • Provide a canonical shot events schema with xG features for model development
  • Backfill derived tables (shot events, game context) for any games missing them

Project structure

data/
  nhl_data.db           SQLite database (not checked in; created at runtime)
src/
  main.py               Scrape loop with weekly pagination from 2007-10-03 to today
  nhl_api.py            NHL Stats API client (schedule + play-by-play endpoints)
  database.py           SQLite operations: raw events, collection tracking, player schema, xG schema
  xg_features.py        Pure feature-extraction functions (coordinates, score state, faceoff context, rest/travel)
  arena_reference.py    Static arena location data (lat/lon, UTC offset) for all 32 teams + historical
  backfill_status.py    CLI tool to inspect database completeness and backfill log progress
tests/
  conftest.py           Pytest path setup
  test_database.py      Schema, collection log, and data quality tests
  test_game_context.py  Game context table and backfill tests
  test_main.py          Scraper loop integration tests
  test_nhl_api.py       API parsing and error-path tests
  test_xg_features.py   xG feature extraction unit tests
  test_xg_schema.py     xG Phase 0 schema and validation tests
notebooks/
  faceoff_decay_analysis.ipynb   Phase 2 Area 3: shot quality decay by time since faceoff and zone
  rest_travel_analysis.ipynb     Phase 2 Area 1: rest days, back-to-back status, travel distance, timezone effects
  venue_bias_analysis.ipynb      Phase 2 Area 4: scorekeeper bias detection by venue
  zone_start_signal.ipynb        Phase 2 Area 2: zone deployment context from faceoff zone codes
docs/
  xg_model_roadmap.md              xG model development roadmap (main plan)
  xg_model_components/             Detailed component design docs (9 parts)

Collection tracking and resume

The scraper tracks progress in a collection_log table. Each date's row records how many games were found and how many were successfully collected:

  • Idempotent completion: completed_at is only set when all games for a date succeed. Partial failures leave the date incomplete.
  • Resume from failures: On restart, the scraper resumes from the earliest incomplete date, not the latest complete one, so no games are permanently skipped.
  • Per-game deduplication: Already-collected games are skipped individually, so retrying an incomplete date only re-fetches the games that failed.
  • Legacy data migration: fix_incomplete_collection_log runs at startup to correct any historical rows where completed_at was incorrectly set despite incomplete collection.
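The resume rule above reduces to a single query. A minimal sketch, assuming an illustrative collection_log(date, games_found, games_collected, completed_at) layout (the column names are assumptions, not the project's actual schema):

```python
import sqlite3

def earliest_incomplete_date(conn: sqlite3.Connection):
    """Return the earliest scrape date whose games were not all collected.

    A date is incomplete if completed_at was never set, or if fewer games
    were collected than were found (a partial failure).
    """
    row = conn.execute(
        "SELECT date FROM collection_log "
        "WHERE completed_at IS NULL OR games_collected < games_found "
        "ORDER BY date LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

Ordering by date ascending is what makes the scraper resume from the earliest incomplete date rather than the latest complete one.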

Database schema

Raw events layer

Per-game game_<game_id> tables with a UNIQUE(period, time, event, description) constraint. Legacy tables without the constraint are automatically deduplicated and migrated on startup.
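The constraint plus INSERT OR IGNORE is what makes re-scraping a game safe. An illustrative sketch (not the project's exact DDL):

```python
import sqlite3

def ensure_game_table(conn: sqlite3.Connection, game_id: int) -> str:
    """Create a per-game table guarded by the uniqueness constraint
    described above."""
    table = f"game_{int(game_id)}"  # int() rejects non-numeric ids
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" ('
        "period INTEGER, time TEXT, event TEXT, description TEXT, "
        "UNIQUE(period, time, event, description))"
    )
    return table

def insert_event(conn: sqlite3.Connection, table: str, row: tuple) -> None:
    # INSERT OR IGNORE makes re-fetching idempotent: rows that violate
    # the UNIQUE constraint are silently skipped, not duplicated.
    conn.execute(f'INSERT OR IGNORE INTO "{table}" VALUES (?, ?, ?, ?)', row)
```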

Player analytics (normalized)

Initialize with ensure_player_database_schema(conn):

  • players, games, teams — core dimension tables
  • player_game_stats — one row per (player_id, game_id) with counting stats, TOI, and xG placeholders
  • player_game_features — materialized rolling/rank features with feature_set_version tracking
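The "one row per (player_id, game_id)" guarantee comes from a composite primary key. A hypothetical slice of the fact-table DDL (column names beyond the key are illustrative):

```python
import sqlite3

PLAYER_GAME_STATS_DDL = """
CREATE TABLE IF NOT EXISTS player_game_stats (
    player_id   INTEGER NOT NULL,
    game_id     INTEGER NOT NULL,
    goals       INTEGER DEFAULT 0,
    assists     INTEGER DEFAULT 0,
    toi_seconds INTEGER,            -- time on ice
    xg_for      REAL,               -- xG placeholder, filled in later
    PRIMARY KEY (player_id, game_id)
)
"""
```

The composite key turns accidental double-inserts into hard errors rather than silent duplicate rows.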

xG shot events

Initialize with ensure_xg_schema(conn):

  • shot_events — canonical shot event table with normalized coordinates, shot type, distance/angle to goal, score state, manpower state, faceoff timing/zone, and event_schema_version for training reproducibility
  • Data contracts: validated enums for shot types, manpower states, score states, and NHL rink coordinate bounds
  • validate_shot_events_quality() — checks shot type, manpower/score state, coordinate ranges, is_goal values, time remaining, and duplicate events

Game context

  • game_context — per-game metadata (teams, venue, venue lat/lon, UTC offset, home/away rest days, travel distance, timezone delta) derived from raw API data and arena reference data
  • Includes context_schema_version for version-aware backfill

Data quality validation

  • validate_player_game_stats_quality() — duplicate keys, negative/excessive TOI, invalid position groups
  • validate_shot_events_quality() — invalid enums, out-of-range coordinates, negative time, duplicate events

xG feature extraction (xg_features.py)

Pure Python functions (no DB or HTTP dependencies) for computing shot event features:

  • Coordinate normalization — flips coordinates so the shooting team always attacks toward +x
  • Distance and angle to goal — Euclidean distance and arc-tangent angle from normalized coordinates
  • Score state classification — tied / up1 / up2 / up3plus / down1 / down2 / down3plus
  • Manpower state classification — parses 4-digit situation codes into skater counts (5v5, 5v4, 4v5, etc.)
  • Faceoff context — seconds since last faceoff, faceoff zone code, recency bin (immediate / early / mid / late / steady_state), zone-recency interaction feature
  • Rest and travel — rest days between games, back-to-back flag, haversine travel distance, timezone delta
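Three of these features can be sketched as pure functions. The goal-line x coordinate and the exact bucket spellings are assumptions taken from the bullets above; the real implementations in xg_features.py may differ:

```python
import math

GOAL_X = 89.0  # approximate NHL goal-line x in feet (200x85 ft rink)

def normalize_coords(x: float, y: float, attacking_right: bool):
    """Flip coordinates so the shooting team always attacks toward +x."""
    return (x, y) if attacking_right else (-x, -y)

def distance_and_angle(x: float, y: float):
    """Euclidean distance and angle (degrees) to the goal at (GOAL_X, 0),
    computed from already-normalized coordinates."""
    dx, dy = GOAL_X - x, 0.0 - y
    return math.hypot(dx, dy), math.degrees(math.atan2(abs(dy), dx))

def score_state(goal_diff: int) -> str:
    """Classify the shooter's goal differential into the buckets above:
    tied / up1 / up2 / up3plus / down1 / down2 / down3plus."""
    if goal_diff == 0:
        return "tied"
    side = "up" if goal_diff > 0 else "down"
    n = min(abs(goal_diff), 3)
    return f"{side}{n}" + ("plus" if n == 3 else "")
```

Keeping these free of DB and HTTP dependencies is what makes them unit-testable in isolation, as the test suite below exploits.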

Arena reference data (arena_reference.py)

Static lookup table mapping team_id to arena city, UTC offset (standard time), latitude, and longitude. Covers all 32 current NHL teams plus historical franchises (Atlanta Thrashers, original Phoenix Coyotes, etc.).
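The shape of that lookup, and the two derived quantities it feeds (timezone delta and haversine travel distance), might look like this. The dict entries below are approximate illustrations, not the values in arena_reference.py:

```python
import math

# Illustrative entries only; real data covers all 32 teams + historical.
ARENAS = {
    10: {"city": "Toronto",  "utc_offset": -5, "lat": 43.64, "lon": -79.38},
    22: {"city": "Edmonton", "utc_offset": -7, "lat": 53.55, "lon": -113.50},
}

def timezone_delta(home_team_id: int, away_team_id: int) -> int:
    """Hours of timezone shift the away team experiences at this venue."""
    return ARENAS[home_team_id]["utc_offset"] - ARENAS[away_team_id]["utc_offset"]

def haversine_miles(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two arenas in miles."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```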

Backfill status (backfill_status.py)

CLI tool for inspecting database completeness:

cd src
python backfill_status.py [--log-path PATH] [--tail-lines N]

Reports raw game table count, metadata/shot event/game context row counts, how many games are missing derived data, last completed collection date, and the tail of the backfill log.
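The argument surface shown in the usage line can be sketched with argparse; the defaults below are assumptions, not the tool's actual values:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the [--log-path PATH] [--tail-lines N] usage above."""
    p = argparse.ArgumentParser(
        description="Inspect database completeness and backfill log progress"
    )
    p.add_argument("--log-path", default="backfill.log",  # default is an assumption
                   help="path to the backfill log file")
    p.add_argument("--tail-lines", type=int, default=20,  # default is an assumption
                   help="number of trailing log lines to report")
    return p
```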

Security

  • All dynamic SQL identifiers are validated through _quote_identifier (rejects non-word characters)
  • Column names from dict keys in insert_data are validated before use in queries
  • All values use parameterized queries (? placeholders)
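The identifier check amounts to a whitelist regex plus quoting. A sketch of the rule described above (the project's _quote_identifier may differ in detail):

```python
import re

def quote_identifier(name: str) -> str:
    """Reject anything but word characters, then double-quote for SQLite.

    Values still go through ? placeholders; this guard exists only for
    identifiers (table/column names), which cannot be parameterized.
    """
    if not re.fullmatch(r"\w+", name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f'"{name}"'
```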

Requirements

  • Python 3.8+
  • SQLite (included with standard Python)
  • requests (runtime), pytest (testing) — see requirements.txt

Installation

pip install -r requirements.txt

Usage

cd src
python main.py

Scrapes from 2007-10-03 through today, storing results in nhl_data.db. The scraper automatically resumes from where it left off on subsequent runs.

Notebooks

The notebooks/ directory contains Jupyter analysis notebooks for the xG model Phase 2 signal validation work. Each notebook connects to nhl_data.db and reads from the derived tables (shot_events, game_context). Run backfill_status.py first to confirm derived tables are populated before opening a notebook.

Notebook                       Phase            Topic
rest_travel_analysis.ipynb     Phase 2 Area 1   Rest days, back-to-back, travel distance, timezone effects on shot quality
zone_start_signal.ipynb        Phase 2 Area 2   Faceoff zone code as a zone-deployment proxy
faceoff_decay_analysis.ipynb   Phase 2 Area 3   Shot quality decay by time since faceoff, separated by zone
venue_bias_analysis.ipynb      Phase 2 Area 4   Scorekeeper bias detection by venue

Testing

python3 -m venv /tmp/test-venv && /tmp/test-venv/bin/pip install -q pytest requests
/tmp/test-venv/bin/python -m pytest -q

Or if the venv already exists:

/tmp/test-venv/bin/python -m pytest -q

243 tests covering:

  • Raw table creation, deduplication, and unique constraints
  • Collection log idempotency (incomplete dates, retries, resume logic)
  • Player-schema phases (dimensions, fact table, feature table, quality checks)
  • xG shot events (DDL, validation paths, NULL coordinate handling, version-aware backfill)
  • xG feature extraction (coordinate normalization, distance/angle, score/manpower state, faceoff recency, rest/travel)
  • Game context extraction and backfill
  • NHL API parsing, error paths, rate limiting, and session reuse
  • Scraper loop pagination, date filtering, and resume behavior

No live NHL API calls are made during tests.

Notes

  • A full historical scrape issues many API requests; the built-in rate limiter spaces game API calls by 2 seconds.
  • HTTP connections are reused via requests.Session to reduce TCP/TLS overhead.
  • All SQL identifiers from external input are validated before use.
  • Derived tables (shot_events, game_context, player_game_features) store a schema version column so stale rows are automatically detected and replaced when the extraction logic changes.
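The 2-second spacing can be isolated into a small helper. A sketch of the behavior described above, not the actual client code (in the real client this would sit in front of a reused requests.Session, so TCP/TLS handshakes happen once):

```python
import time

class RateLimiter:
    """Space successive API calls by at least min_interval seconds."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        """Block until min_interval has elapsed since the last call."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

time.monotonic() is used instead of time.time() so the spacing survives system clock adjustments.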
