From 89fab0bd4fb619662379eb7734f54b83717d94d6 Mon Sep 17 00:00:00 2001 From: jevansnyc Date: Wed, 1 Apr 2026 13:17:15 -0500 Subject: [PATCH 1/4] Add JS Asset Auditor engineering spec Engineering spec for the /audit-js-assets . Covers sweep protocol, Chrome DevTools MCP tooling, heuristic filtering, slug generation, init and diff modes. Closes #606 --- .../2026-04-01-js-asset-auditor-design.md | 216 ++++++++++++++++++ 1 file changed, 216 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md new file mode 100644 index 00000000..d6168592 --- /dev/null +++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md @@ -0,0 +1,216 @@ +# JS Asset Auditor — Engineering Spec + +**Date:** 2026-04-01 +**Status:** Approved for engineering breakdown +**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) + +--- + +## Context + +The JS Asset Proxy requires a `js-assets.toml` file declaring which third-party JS assets to proxy. Without tooling, populating this file requires manually inspecting network requests in browser DevTools, extracting URLs, generating opaque slugs, and writing TOML — a tedious error-prone process that is a barrier to publisher onboarding. + +The Auditor eliminates this friction. It sweeps a publisher's page using the Chrome DevTools MCP, detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing. + +It also runs as a monitoring tool — `--diff` mode compares a new sweep against the existing config and surfaces new or removed assets, giving publishers ongoing visibility into their third-party JS footprint. + +**Implementation:** Pure Claude Code skill — no Rust, no compiled code, no additional dependencies. 
Uses the Chrome DevTools MCP already configured in `.claude/settings.json`. + +--- + +## Command Interface + +```bash +/audit-js-assets https://www.publisher.com # init — generate js-assets.toml +/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +``` + +--- + +## Sweep Protocol + +1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) +2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page` +3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`) +4. In parallel: + - `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain` + - `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs +5. Apply heuristic filter (see below) +6. For each surviving asset, generate a `[[js_assets]]` entry (see below) +7. Write output (init or diff mode) +8. Print terminal summary +9. Close page via `mcp__chrome-devtools__close_page` + +--- + +## Heuristic Filter + +The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed. + +| Category | Excluded origins | +|---|---| +| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` | +| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | +| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | +| Social embeds | `platform.twitter.com`, `connect.facebook.net` | + +**`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied. + +Everything else surfaces for operator review. 
+ +--- + +## Asset Entry Generation + +| Field | Derivation | +|---|---| +| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | +| `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected | +| `origin_url` | Full captured URL, with wildcard substitution applied if versioned | +| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | +| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | + +### Slug algorithm + +``` +publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url))) +asset_stem = filename_without_extension(origin_url) +slug = "{publisher_prefix}:{asset_stem}" +``` + +**Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time. + +**Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic. + +### Wildcard detection + +Path segments matching either pattern are replaced with `*`: +- Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`) +- Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators + +The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct. 
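A minimal sketch of the wildcard substitution described above, using the patterns as listed at this revision. The function names are hypothetical, and the anchoring is an assumption of this sketch: semver is matched as a segment prefix (so that `1.19.8-hcskhn` matches), while the hash-like patterns must match a whole path segment.

```javascript
// Patterns as listed above. Anchoring is an assumption of this sketch:
// semver is matched as a segment prefix, hash-like patterns as whole segments.
const SEMVER = /^\d+\.\d+[.\d-]*/;
const HEX_HASH = /^[a-f0-9]{6,}$/;
const ALNUM_HASH = /^[A-Za-z0-9]{8,}$/;

// True when a single path segment looks versioned or hash-like.
function isVersionedSegment(segment) {
  return SEMVER.test(segment) || HEX_HASH.test(segment) || ALNUM_HASH.test(segment);
}

// Replace every versioned path segment with "*", leaving scheme and host intact.
// Query and fragment are dropped because only origin + path are rebuilt.
function applyWildcards(rawUrl) {
  const url = new URL(rawUrl);
  const segments = url.pathname
    .split("/")
    .map((segment) => (isVersionedSegment(segment) ? "*" : segment));
  return url.origin + segments.join("/");
}

console.log(applyWildcards("https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js"));
```

Under these patterns a path segment consisting of a plain word of eight or more letters is also replaced, which is one reason the preserved original-URL comment is worth checking during review.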
+ +--- + +## Init Mode Output + +### `js-assets.toml` (written to repo root) + +```toml +# Generated by /audit-js-assets on 2026-04-01 +# Publisher: publisher.com +# Source URL: https://www.publisher.com + +[[js_assets]] +# https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js +slug = "aB3kR7mN:prebid-load" +path = "/sdk/aB3kR7mN.js" +origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js" +inject_in_head = true + +[[js_assets]] +# https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected) +slug = "xQ9pL2wY:raven" +path = "/raven-static/*" +origin_url = "https://raven-static.vendor.io/prod/*/raven.js" +inject_in_head = false +``` + +### Terminal summary + +``` +JS Asset Audit — publisher.com +──────────────────────────────── +Detected: 8 third-party JS requests +Filtered: 3 (cdnjs.cloudflare.com ×2, sentry.io ×1) +Surfaced: 5 assets → js-assets.toml + + aB3kR7mN inject_in_head=true web.prebidwrapper.com/.../prebid-load.js + xQ9pL2wY inject_in_head=false raven-static.vendor.io/prod/*/raven.js [wildcard] + zM4nK8vP inject_in_head=true googletagmanager.com/gtm.js + ... + +Review inject_in_head values and commit js-assets.toml when ready. +Diff mode: /audit-js-assets --diff +``` + +--- + +## Diff Mode Output + +Compares sweep results against the existing `js-assets.toml`. + +| Condition | Behavior | +|---|---| +| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | +| Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. | +| Asset in both | **Confirmed** — listed as present | + +New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting. 
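The three diff conditions above can be sketched as a set comparison. This is illustrative only: it assumes assets are matched by their `origin_url` string, which the spec does not state, and `classifyAssets` and `commentOut` are hypothetical helper names.

```javascript
// Classify assets into new / missing / confirmed.
// Assumption: an asset's identity is its origin_url string.
function classifyAssets(sweepUrls, fileUrls) {
  const sweep = new Set(sweepUrls);
  const file = new Set(fileUrls);
  return {
    added: [...sweep].filter((url) => !file.has(url)),     // in sweep, not in file
    missing: [...file].filter((url) => !sweep.has(url)),   // in file, not in sweep
    confirmed: [...sweep].filter((url) => file.has(url)),  // in both
  };
}

// Turn a TOML block into comments so appending it cannot activate anything.
function commentOut(tomlBlock) {
  return tomlBlock
    .split("\n")
    .map((line) => `# ${line}`)
    .join("\n");
}

const result = classifyAssets(
  ["https://a.example/x.js", "https://b.example/y.js"],
  ["https://b.example/y.js", "https://c.example/z.js"]
);
console.log(result);
console.log(commentOut('[[js_assets]]\nslug = "zM4nK8vP:gtm"'));
```

Missing entries are only reported here; nothing in the sketch removes them from the file, in line with the "never auto-removed" rule.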
+ +### `js-assets.toml` (new entry appended as comment) + +```toml +# --- NEW (detected by /audit-js-assets --diff on 2026-04-01, uncomment to activate) --- +# [[js_assets]] +# # https://googletagmanager.com/gtm.js +# slug = "zM4nK8vP:gtm" +# path = "/sdk/zM4nK8vP.js" +# origin_url = "https://googletagmanager.com/gtm.js" +# inject_in_head = true +``` + +### Terminal summary (diff mode) + +``` +JS Asset Audit (diff) — publisher.com +──────────────────────────────── +Confirmed: 4 assets still present on page +New: 1 asset detected (appended as comment to js-assets.toml) +Missing: 1 asset no longer seen on page ⚠ + + NEW zM4nK8vP googletagmanager.com/gtm.js → review in js-assets.toml + MISSING xQ9pL2wY raven-static.vendor.io/... → may have been removed or renamed +``` + +--- + +## Implementation + +The Auditor is a Claude Code skill file. No compiled code. + +**Skill location:** `.claude/skills/audit-js-assets.md` + +**MCP tools used:** +- `mcp__chrome-devtools__new_page` — open browser tab +- `mcp__chrome-devtools__navigate_page` — load publisher URL +- `mcp__chrome-devtools__wait_for` — settle after page load +- `mcp__chrome-devtools__list_network_requests` — capture JS requests +- `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query +- `mcp__chrome-devtools__close_page` — clean up tab + +**File tools used:** +- `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode) +- `Write` — write generated/updated `js-assets.toml` + +--- + +## Delivery Order + +The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins). + +See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md). 
+ +--- + +## Verification + +- Run `/audit-js-assets https://www.publisher.com` against a known test publisher page with identified third-party JS +- Verify generated entries match actual third-party JS observed on the page (cross-check in browser DevTools) +- Verify `inject_in_head = true` only for scripts that appear in `<head>` (not `<body>`) +- Verify wildcard detection fires for versioned path segments and not for stable paths +- Verify GTM (`googletagmanager.com`) is captured and not filtered +- Verify framework CDNs (`cdnjs.cloudflare.com` etc.) are filtered with reason in summary +- Run `--diff` against an unchanged page → all entries confirmed, no new/missing +- Run `--diff` after adding a new vendor script to the page → appears as `NEW` in summary +- Run `--diff` after removing a script → appears as `MISSING ⚠` in summary, file unchanged From d8a0d84c914261ecd3d6ffd1d4c95369b4a2de86 Mon Sep 17 00:00:00 2001 From: Christian Date: Fri, 10 Apr 2026 15:46:42 -0500 Subject: [PATCH 2/4] Address PR feedback on JS Asset Auditor spec Fix incorrect MCP tool name prefix, replace misused wait_for with evaluate_script setTimeout, correct list_network_requests filtering to use resourceTypes, resolve path derivation contradiction with consistent /js-assets/{prefix}/{stem}.js formula, pin slug separator and base62 charset, add URL Processing section with normalization rules and first-party boundary definition, tighten wildcard regex to require mixed character classes, and move skill location to .claude/commands/. 
--- .../2026-04-01-js-asset-auditor-design.md | 113 ++++++++++++------ 1 file changed, 74 insertions(+), 39 deletions(-) diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md index d6168592..aae3db57 100644 --- a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md +++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md @@ -2,7 +2,7 @@ **Date:** 2026-04-01 **Status:** Approved for engineering breakdown -**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) +**Related:** [JS Asset Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on `js-asset-proxy-spec` branch until merged)_ --- @@ -21,8 +21,9 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against ## Command Interface ```bash -/audit-js-assets https://www.publisher.com # init — generate js-assets.toml -/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +/audit-js-assets https://www.publisher.com # init — generate js-assets.toml +/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +/audit-js-assets https://www.publisher.com --settle 15000 # longer settle for ad-tech-heavy pages ``` --- @@ -30,16 +31,38 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against ## Sweep Protocol 1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) -2. Open Chrome via `mcp__chrome-devtools__new_page`, navigate to target URL via `mcp__chrome-devtools__navigate_page` -3. Wait for full page load + ~6s settle window for async script loads (`mcp__chrome-devtools__wait_for`) +2. Open Chrome via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page`, navigate to target URL via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` +3. 
Wait for page load settle: `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` with `await new Promise(r => setTimeout(r, SETTLE_MS))` where `SETTLE_MS` defaults to 6000 (configurable via `--settle `) 4. In parallel: - - `mcp__chrome-devtools__list_network_requests` → filter for requests where URL ends in `.js` or `Content-Type: application/javascript`, and origin ≠ `publisher.domain` - - `mcp__chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs -5. Apply heuristic filter (see below) + - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` with `resourceTypes: ["script"]` → post-filter to exclude first-party hosts (see URL Processing below) + - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs +5. Apply URL normalization (see below), then heuristic filter (see below) 6. For each surviving asset, generate a `[[js_assets]]` entry (see below) 7. Write output (init or diff mode) 8. Print terminal summary -9. Close page via `mcp__chrome-devtools__close_page` +9. Close page via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` + +**`inject_in_head` semantics:** The DOM snapshot in step 4 captures the final state of `<head>` after the settle window. Scripts that were briefly inserted and then removed by a loader will not appear. This is intentional — `inject_in_head = true` means "the script is present in `<head>` at page-stable state." If a loader removes it before the snapshot, the proxy should not re-inject it. + +--- + +## URL Processing + +### First-party boundary + +A network request is **first-party** if the request URL's host, after stripping a leading `www.`, matches `publisher.domain` (from `trusted-server.toml`) after the same stripping. Matching is exact on the resulting strings. 
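A small sketch of the first-party rule as stated above: strip a leading `www.` from both sides, then compare for exact equality. The helper names are hypothetical, and the lowercasing of the configured domain is an assumption added for safety.

```javascript
// Strip one leading "www." label. Lowercasing is an assumption of this sketch.
function stripWww(host) {
  return host.toLowerCase().replace(/^www\./, "");
}

// A request is first-party when its host equals publisher.domain after stripping.
function isFirstParty(requestUrl, publisherDomain) {
  return stripWww(new URL(requestUrl).hostname) === stripWww(publisherDomain);
}

console.log(isFirstParty("https://www.publisher.com/app.js", "publisher.com"));
console.log(isFirstParty("https://cdn.publisher.com/app.js", "publisher.com"));
```

Because the comparison is exact, a subdomain such as `cdn.publisher.com` does not match `publisher.com`.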
+ +Publisher-owned CDN subdomains (e.g., `cdn.publisher.com`, `static.publisher.com`) are treated as third-party by default. If the publisher wants to exclude them, they can be added to a `first_party_hosts` list in the command invocation (e.g., `--first-party cdn.publisher.com`). + +### URL normalization + +Applied to every captured script URL before slug generation and before persisting `origin_url`: + +1. Strip fragment (`#...`) +2. Strip all query parameters — cache-busters (`?v=123`, `?cb=timestamp`), consent params, and session tokens all live in query strings. JS asset versioning uses path segments, not query params. +3. Strip trailing slash from the path + +The normalized URL is what gets stored in `origin_url` and fed into the slug hash. --- @@ -47,12 +70,14 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against The following origin categories are excluded silently. The terminal summary reports what was filtered and why so operators can manually add entries if needed. -| Category | Excluded origins | -|---|---| +**Matching:** Filter entries match if the request URL's host ends with the filter entry, with a dot-boundary check. For example, `googletagmanager.com` in the filter matches `www.googletagmanager.com` but not `evil-googletagmanager.com`. 
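The dot-boundary suffix match can be sketched as follows. The function names are hypothetical and the origin list is only a sample of excluded hosts used for illustration.

```javascript
// A host matches a filter entry when it equals the entry or ends with "." + entry.
function hostMatchesEntry(host, entry) {
  return host === entry || host.endsWith("." + entry);
}

// Sample of excluded origins, for illustration only.
const EXCLUDED_ORIGINS = ["cdnjs.cloudflare.com", "sentry.io", "fonts.gstatic.com"];

// Return the matching filter entry (useful for the summary), or null if none match.
function filterReason(requestUrl) {
  const host = new URL(requestUrl).hostname;
  return EXCLUDED_ORIGINS.find((entry) => hostMatchesEntry(host, entry)) ?? null;
}

console.log(hostMatchesEntry("www.googletagmanager.com", "googletagmanager.com"));
console.log(hostMatchesEntry("evil-googletagmanager.com", "googletagmanager.com"));
console.log(filterReason("https://cdnjs.cloudflare.com/ajax/libs/example/1.0.0/example.min.js"));
```

Returning the matched entry rather than a boolean makes it straightforward to report what was filtered and why.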
+ +| Category | Excluded origins | +| -------------- | ------------------------------------------------------------------------------ | | Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` | -| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | -| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | -| Social embeds | `platform.twitter.com`, `connect.facebook.net` | +| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` | +| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` | +| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` | **`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied. @@ -62,31 +87,38 @@ Everything else surfaces for operator review. ## Asset Entry Generation -| Field | Derivation | -|---|---| -| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | -| `path` | `/{publisher_prefix}/{asset_stem}.js`, or wildcard variant if versioned path detected | -| `origin_url` | Full captured URL, with wildcard substitution applied if versioned | -| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | -| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | +| Field | Derivation | +| ---------------- | --------------------------------------------------------------------------------------------------- | +| `slug` | `{publisher_prefix}:{asset_stem}` — see slug algorithm below | +| `path` | Fixed: `/js-assets/{publisher_prefix}/{asset_stem}.js`. 
Wildcard: `/js-assets/{publisher_prefix}/*` | +| `origin_url` | Normalized URL (see URL Processing), with wildcard substitution applied if versioned | +| `ttl_sec` | Omitted — proxy defaults to 1800 (wildcard) or 3600 (fixed) | +| `stale_ttl_sec` | Omitted — proxy defaults to 86400 (24h) | +| `inject_in_head` | `true` if URL appeared in head script list from DOM evaluation, else `false` | ### Slug algorithm ``` -publisher_prefix = first_8_chars(base62(sha256(publisher.domain + origin_url))) +publisher_prefix = first_8_chars(base62(sha256(publisher.domain + "|" + origin_url))) asset_stem = filename_without_extension(origin_url) slug = "{publisher_prefix}:{asset_stem}" ``` +The pipe (`|`) separator is required — it cannot appear in domain names or at the start of a URL, so the hash input is unambiguous. The `origin_url` fed into the hash must be the normalized URL (see URL Processing). + +**base62 charset:** `0-9A-Za-z` (digits first, then uppercase, then lowercase). This matches the `base62` crate convention. + **Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time. **Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic. 
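Since the hash input must be the normalized URL, the three normalization rules from URL Processing can be sketched like this. `normalizeScriptUrl` is a hypothetical name, and the sketch relies on the standard WHATWG `URL` class, which also lowercases the host as a side effect.

```javascript
// Normalize a captured script URL before hashing or persisting it:
// 1. strip the fragment, 2. strip all query parameters, 3. strip a trailing slash.
function normalizeScriptUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = "";
  url.search = "";
  // Leave a bare "/" path alone; only strip a trailing slash from longer paths.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.href;
}

console.log(normalizeScriptUrl("https://cdn.vendor.io/lib/x.js?v=123#frag"));
```

Running the same function on sweep results and on any URL later used for lookup keeps the hash input identical in both places.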
### Wildcard detection -Path segments matching either pattern are replaced with `*`: +Path segments matching any of these patterns are replaced with `*`: + - Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`) -- Hash-like: `[a-f0-9]{6,}` or `[A-Za-z0-9]{8,}` between path separators +- Hex hash: `[a-f0-9]{8,}` between path separators (lowercase hex, minimum 8 characters) +- Mixed alphanumeric hash: `[A-Za-z0-9]{8,}` between path separators, **must contain at least one digit and at least one letter** — this excludes pure-alpha dictionary words like `analytics` or `bootstrap` The original URL is preserved as a comment above the generated entry so operators can verify the wildcard substitution is correct. @@ -104,14 +136,14 @@ The original URL is preserved as a comment above the generated entry so operator [[js_assets]] # https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js slug = "aB3kR7mN:prebid-load" -path = "/sdk/aB3kR7mN.js" +path = "/js-assets/aB3kR7mN/prebid-load.js" origin_url = "https://web.prebidwrapper.com/golf-WnLmpLyEjL/default-v2/prebid-load.js" inject_in_head = true [[js_assets]] # https://raven-static.vendor.io/prod/1.19.8-hcskhn/raven.js (wildcard detected) slug = "xQ9pL2wY:raven" -path = "/raven-static/*" +path = "/js-assets/xQ9pL2wY/*" origin_url = "https://raven-static.vendor.io/prod/*/raven.js" inject_in_head = false ``` @@ -140,11 +172,11 @@ Diff mode: /audit-js-assets --diff Compares sweep results against the existing `js-assets.toml`. -| Condition | Behavior | -|---|---| -| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | +| Condition | Behavior | +| --------------------------- | ----------------------------------------------------------------------- | +| Asset in sweep, not in file | **New** — appended to `js-assets.toml` as a commented-out block | | Asset in file, not in sweep | **Missing** — flagged in terminal summary with `⚠`. Never auto-removed. 
| -| Asset in both | **Confirmed** — listed as present | +| Asset in both | **Confirmed** — listed as present | New entries are appended as TOML comments so the file stays valid and nothing is activated without the operator explicitly uncommenting. @@ -155,7 +187,7 @@ New entries are appended as TOML comments so the file stays valid and nothing is # [[js_assets]] # # https://googletagmanager.com/gtm.js # slug = "zM4nK8vP:gtm" -# path = "/sdk/zM4nK8vP.js" +# path = "/js-assets/zM4nK8vP/gtm.js" # origin_url = "https://googletagmanager.com/gtm.js" # inject_in_head = true ``` @@ -179,17 +211,20 @@ Missing: 1 asset no longer seen on page ⚠ The Auditor is a Claude Code skill file. No compiled code. -**Skill location:** `.claude/skills/audit-js-assets.md` +**Skill location:** `.claude/commands/audit-js-assets.md` **MCP tools used:** -- `mcp__chrome-devtools__new_page` — open browser tab -- `mcp__chrome-devtools__navigate_page` — load publisher URL -- `mcp__chrome-devtools__wait_for` — settle after page load -- `mcp__chrome-devtools__list_network_requests` — capture JS requests -- `mcp__chrome-devtools__evaluate_script` — detect head-loaded scripts via DOM query -- `mcp__chrome-devtools__close_page` — clean up tab + +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page` — open browser tab +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` — load publisher URL +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` — capture JS requests +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` — settle window + detect head-loaded scripts via DOM query +- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` — clean up tab + +**Permission grants required:** `navigate_page`, `list_network_requests`, and `close_page` are not currently approved in `.claude/settings.json`. Add them to `permissions.allow` before running the skill, or expect interactive permission prompts on first run. 
**File tools used:** + - `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode) - `Write` — write generated/updated `js-assets.toml` @@ -199,7 +234,7 @@ The Auditor is a Claude Code skill file. No compiled code. The Auditor should be delivered **after Proxy Phase 1** (so `js-assets.toml` schema is defined) and **before Proxy Phase 2** (so engineering has real populated entries to test the cache pipeline against actual vendor origins). -See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md). +See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on `js-asset-proxy-spec` branch until merged)_. --- From 1370d0b23c90668eb3698cf0abfda58cd10c9e2c Mon Sep 17 00:00:00 2001 From: Christian Date: Mon, 13 Apr 2026 14:16:09 -0500 Subject: [PATCH 3/4] update asset auditor design doc --- .../2026-04-01-js-asset-auditor-design.md | 200 ++++++++++++++---- 1 file changed, 158 insertions(+), 42 deletions(-) diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md index aae3db57..a8d2c141 100644 --- a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md +++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md @@ -10,39 +10,50 @@ The JS Asset Proxy requires a `js-assets.toml` file declaring which third-party JS assets to proxy. Without tooling, populating this file requires manually inspecting network requests in browser DevTools, extracting URLs, generating opaque slugs, and writing TOML — a tedious error-prone process that is a barrier to publisher onboarding. -The Auditor eliminates this friction. It sweeps a publisher's page using the Chrome DevTools MCP, detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing. +The Auditor eliminates this friction. 
It sweeps a publisher's page using Playwright (headless Chromium), detects third-party JS assets, auto-generates `js-assets.toml` entries, and auto-detects `inject_in_head` from the page DOM. The operator's only remaining decision is reviewing the output before committing. It also runs as a monitoring tool — `--diff` mode compares a new sweep against the existing config and surfaces new or removed assets, giving publishers ongoing visibility into their third-party JS footprint. -**Implementation:** Pure Claude Code skill — no Rust, no compiled code, no additional dependencies. Uses the Chrome DevTools MCP already configured in `.claude/settings.json`. +**Implementation:** Claude Code plugin at `packages/js-asset-auditor/` containing a standalone Playwright CLI, a processing library, and a skill definition. No Rust, no compiled code. Can also be run directly without Claude Code. --- ## Command Interface ```bash -/audit-js-assets https://www.publisher.com # init — generate js-assets.toml -/audit-js-assets https://www.publisher.com --diff # diff — compare against existing file -/audit-js-assets https://www.publisher.com --settle 15000 # longer settle for ad-tech-heavy pages +# Via Claude Code plugin skill +/js-asset-auditor:audit-js-assets https://www.publisher.com # init — generate js-assets.toml +/js-asset-auditor:audit-js-assets https://www.publisher.com --diff # diff — compare against existing file +/js-asset-auditor:audit-js-assets https://www.publisher.com --settle 15000 # longer settle for ad-tech-heavy pages +/js-asset-auditor:audit-js-assets https://www.publisher.com --no-filter # bypass heuristic filtering +/js-asset-auditor:audit-js-assets https://www.publisher.com --headed # visible browser for debugging +/js-asset-auditor:audit-js-assets https://www.publisher.com --config # also generate trusted-server.toml + +# Direct CLI invocation (no Claude Code required) +node packages/js-asset-auditor/lib/audit.mjs https://www.publisher.com +node 
packages/js-asset-auditor/lib/audit.mjs https://www.publisher.com --domain publisher.com +node packages/js-asset-auditor/lib/audit.mjs https://www.publisher.com --diff --output js-assets.toml +node packages/js-asset-auditor/lib/audit.mjs https://www.publisher.com --config my-config.toml ``` --- ## Sweep Protocol -1. Read `trusted-server.toml` → extract `publisher.domain` (defines first-party boundary) -2. Open Chrome via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page`, navigate to target URL via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` -3. Wait for page load settle: `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` with `await new Promise(r => setTimeout(r, SETTLE_MS))` where `SETTLE_MS` defaults to 6000 (configurable via `--settle `) -4. In parallel: - - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` with `resourceTypes: ["script"]` → post-filter to exclude first-party hosts (see URL Processing below) - - `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` → `Array.from(document.head.querySelectorAll('script[src]')).map(s => s.src)` → collect head-loaded script URLs -5. Apply URL normalization (see below), then heuristic filter (see below) -6. For each surviving asset, generate a `[[js_assets]]` entry (see below) -7. Write output (init or diff mode) -8. Print terminal summary -9. Close page via `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` +The CLI (`packages/js-asset-auditor/lib/audit.mjs`) performs the full sweep: -**`inject_in_head` semantics:** The DOM snapshot in step 4 captures the final state of `<head>` after the settle window. Scripts that were briefly inserted and then removed by a loader will not appear. This is intentional — `inject_in_head = true` means "the script is present in `<head>` at page-stable state." If a loader removes it before the snapshot, the proxy should not re-inject it. +1. 
Resolve publisher domain: `--domain` flag → `trusted-server.toml` → infer from target URL +2. Launch headless Chromium via Playwright (visible with `--headed`) +3. Register a response listener for `resourceType() === 'script'` to capture all script network requests +4. Navigate to target URL (`page.goto`, 30s timeout, follows redirects transparently) +5. Wait for page load settle: `page.waitForTimeout(SETTLE_MS)` where `SETTLE_MS` defaults to 6000 (configurable via `--settle `) +6. Evaluate `document.head.querySelectorAll('script[src]')` to collect head-loaded script URLs +7. Close browser +8. Pass collected URLs to `processAssets()` from `lib/process.mjs` — applies URL normalization, first-party filtering, heuristic filtering, wildcard detection, slug generation +9. Write `js-assets.toml` output (init or diff mode) +10. Print JSON summary to stdout (progress lines go to stderr) + +**`inject_in_head` semantics:** The DOM snapshot in step 6 captures the final state of `<head>` after the settle window. Scripts that were briefly inserted and then removed by a loader will not appear. This is intentional — `inject_in_head = true` means "the script is present in `<head>` at page-stable state." If a loader removes it before the snapshot, the proxy should not re-inject it. --- @@ -50,9 +61,13 @@ It also runs as a monitoring tool — `--diff` mode compares a new sweep against ### First-party boundary -A network request is **first-party** if the request URL's host, after stripping a leading `www.`, matches `publisher.domain` (from `trusted-server.toml`) after the same stripping. Matching is exact on the resulting strings. +A network request is **first-party** if the request URL's host, after stripping a leading `www.`, matches the publisher domain after the same stripping. Matching is exact on the resulting strings. + +**Domain resolution order:** `--domain ` flag → `publisher.domain` from `trusted-server.toml` → inferred from the target URL's hostname. 
This makes the tool usable in any project — `trusted-server.toml` is not required. -Publisher-owned CDN subdomains (e.g., `cdn.publisher.com`, `static.publisher.com`) are treated as third-party by default. If the publisher wants to exclude them, they can be added to a `first_party_hosts` list in the command invocation (e.g., `--first-party cdn.publisher.com`). +**Auto-detection:** The target URL's hostname is automatically included as first-party, in addition to the resolved publisher domain. This ensures that auditing `https://golf.com` when `publisher.domain = "test-publisher.com"` correctly excludes `golf.com` scripts without requiring `--first-party golf.com`. + +Publisher-owned CDN subdomains (e.g., `cdn.publisher.com`, `static.publisher.com`) are treated as third-party by default. If the publisher wants to exclude them, they can be added via `--first-party cdn.publisher.com`. ### URL normalization @@ -72,15 +87,26 @@ The following origin categories are excluded silently. The terminal summary repo **Matching:** Filter entries match if the request URL's host ends with the filter entry, with a dot-boundary check. For example, `googletagmanager.com` in the filter matches `www.googletagmanager.com` but not `evil-googletagmanager.com`. 
-| Category | Excluded origins |
-| -------------- | ------------------------------------------------------------------------------ |
-| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` |
-| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` |
-| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` |
-| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` |
+| Category | Excluded origins |
+| ------------------- | --------------------------------------------------------------------------------------------- |
+| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` |
+| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` |
+| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` |
+| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` |
+| Google ad rendering | `pagead2.googlesyndication.com`, `tpc.googlesyndication.com`, `s0.2mdn.net`, |
+| | `googleads.g.doubleclick.net`, `www.googleadservices.com` |
+| Ad fraud detection | `adtrafficquality.google` |
+| Ad verification | `adsafeprotected.com`, `moatads.com`, `doubleverify.com` |
+| reCAPTCHA | `recaptcha.net`, `www.google.com/recaptcha/*`, `www.gstatic.com/recaptcha/*` |
+
+**Path-prefix matching:** Some hosts (e.g., `www.google.com`) serve both filterable and non-filterable resources. Entries with a path suffix (e.g., `www.google.com/recaptcha/*`) match only when the URL's path begins with the specified prefix. Plain host entries use dot-boundary suffix matching as before.

 **`googletagmanager.com` is not filtered** — GTM is ad tech and should be proxied.

+**`securepubads.g.doubleclick.net` is not filtered** — this is the GPT ad server SDK. Publishers deliberately place this tag. Its sub-resources (e.g., `pubads_impl.js`) are also intentional. The filter targets ad-rendering infrastructure (iframes, creatives, verification), not ad-serving SDKs.
+
+**`--no-filter`** bypasses heuristic filtering entirely, surfacing all non-first-party scripts. First-party filtering always applies.
+
 Everything else surfaces for operator review.

 ---

@@ -110,13 +136,13 @@ The pipe (`|`) separator is required — it cannot appear in domain names or at

 **Rationale:** Fully opaque and hash-derived — no human naming required, no ambiguity for cryptic vendor filenames. The KV metadata (`origin_url`, `content_type`, `asset_slug`) serves as the lookup table. Operators can query `js-asset:{slug}` in the KV store to retrieve full provenance. The terminal summary also prints slug → origin_url at generation time.

-**Important:** This algorithm must produce identical output to the Proxy's KV key derivation. Engineering should implement this as a shared utility (e.g., a small JS/TS helper in the skill, or a standalone `scripts/` utility) rather than duplicating the logic.
+**Important:** This algorithm must produce identical output to the Proxy's KV key derivation. The reference implementation lives in `packages/js-asset-auditor/lib/slug.mjs` (standalone CLI) and `packages/js-asset-auditor/lib/process.mjs` (processing library), with a copy in `scripts/js-asset-slug.mjs`. Any changes must be synchronized across all files and the Rust proxy.

 ### Wildcard detection

 Path segments matching any of these patterns are replaced with `*`:

-- Semver: `\d+\.\d+[\.\d-]*` (e.g., `1.19.8-hcskhn`)
+- Semver: `\d+\.\d+[\.\d\w-]*` (e.g., `1.19.8-hcskhn`)
 - Hex hash: `[a-f0-9]{8,}` between path separators (lowercase hex, minimum 8 characters)
 - Mixed alphanumeric hash: `[A-Za-z0-9]{8,}` between path separators, **must contain at least one digit and at least one letter** — this excludes pure-alpha dictionary words like `analytics` or `bootstrap`

@@ -207,26 +233,106 @@ Missing: 1 asset no longer seen on page ⚠

 ---

+## Integration Detection & Config Generation
+
+When invoked with `--config [path]`, the CLI also detects known integrations from the swept URLs and generates a `trusted-server.toml` with appropriate `[integrations.*]` sections.
+
+### Detection patterns
+
+Integration detection runs on raw URLs (before normalization) to preserve query parameters needed for field extraction.
+
+| URL Pattern | Integration | Extracted Fields |
+| -------------------------------------------------- | ---------------------- | ----------------------------------------- |
+| `securepubads.g.doubleclick.net/tag/js/gpt*` | `gpt` | `script_url` |
+| `www.googletagmanager.com/gtm.js?id=GTM-XXX` | `google_tag_manager` | `container_id` from `?id=` |
+| `sdk.privacy-center.org` | `didomi` | (defaults) |
+| `js.datadome.co` | `datadome` | (defaults) |
+| `aim.loc.kr/*identity-lockr*.js` | `lockr` | `sdk_url` |
+| `*.edge.permutive.app/*-web.js` | `permutive` | `organization_id`, `workspace_id` from URL |
+| `*/prebid.js`, `*/prebidjs.js` (+ .min variants) | `prebid` | (detect only) |
+| `c.amazon-adsystem.com/aax2/apstag*` | `aps` | (detect only) |
+
+### Field categories
+
+- **Full** — all config fields have defaults or are auto-extracted. Config section is ready to use.
+- **Partial** — some fields auto-extracted, others need manual input (marked with `# TODO:`).
+- **Detect only** — integration detected but key fields (e.g., `server_url`, `pub_id`) require manual input.
+
+### Config output
+
+```toml
+# Generated by js-asset-auditor on 2026-04-13
+# Source URL: https://www.publisher.com
+
+[publisher]
+domain = "publisher.com"
+# cookie_domain = ".publisher.com"
+# origin_url = "https://origin.publisher.com"
+# proxy_secret = "change-me"
+
+[integrations.gpt]
+enabled = true
+script_url = "https://securepubads.g.doubleclick.net/tag/js/gpt.js" # auto-detected
+# cache_ttl_seconds = 3600
+# rewrite_script = true
+
+[integrations.google_tag_manager]
+enabled = true
+container_id = "GTM-TRCJMD6" # auto-detected
+
+[integrations.lockr]
+enabled = true
+sdk_url = "https://aim.loc.kr/identity-lockr-trust-server.js" # auto-detected
+app_id = "" # TODO: set your Lockr Identity app_id
+# api_endpoint = "https://identity.loc.kr"
+```
+
+If the target file already exists, the CLI errors unless `--force` is passed.
+
+---
+
 ## Implementation

-The Auditor is a Claude Code skill file. No compiled code.
+The Auditor is packaged as a Claude Code plugin at `packages/js-asset-auditor/` with three components:
+
+```
+packages/js-asset-auditor/
+├── .claude-plugin/plugin.json # Plugin manifest
+├── skills/audit-js-assets/SKILL.md # Skill definition
+├── bin/audit-js-assets # Executable (added to PATH by Claude Code)
+├── lib/
+│   ├── audit.mjs # Playwright CLI — browser automation + orchestration
+│   ├── detect.mjs # Integration detection engine + config generation
+│   ├── process.mjs # Processing library — normalization, filtering, slugs, TOML
+│   └── slug.mjs # Standalone slug generator
+├── package.json # playwright dependency
+└── settings.json # Auto-grants Bash(audit-js-assets:*) permission
+```
+
+1. **Playwright CLI** (`lib/audit.mjs`) — Launches headless Chromium, navigates to the target URL, collects script network requests and head script DOM state, then calls `processAssets()`. Outputs TOML file + JSON summary. Can be run directly without Claude Code.
Can be run directly without Claude Code. +2. **Processing library** (`lib/process.mjs`) — Pure Node.js module (no external dependencies) that exports `processAssets()` and individual utility functions. Handles URL normalization, first-party filtering, heuristic filtering, wildcard detection, slug generation, and TOML formatting. +3. **Claude Code skill** (`skills/audit-js-assets/SKILL.md`) — Thin wrapper that invokes the CLI via the `bin/audit-js-assets` executable and formats the JSON summary. -**Skill location:** `.claude/commands/audit-js-assets.md` +**Plugin installation:** + +```bash +# Local testing (loads for one session) +claude --plugin-dir packages/js-asset-auditor -**MCP tools used:** +# Via marketplace (permanent installation) +/plugin marketplace add / +/plugin install js-asset-auditor +``` -- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__new_page` — open browser tab -- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__navigate_page` — load publisher URL -- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__list_network_requests` — capture JS requests -- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__evaluate_script` — settle window + detect head-loaded scripts via DOM query -- `mcp__plugin_chrome-devtools-mcp_chrome-devtools__close_page` — clean up tab +**Setup (one-time after install):** -**Permission grants required:** `navigate_page`, `list_network_requests`, and `close_page` are not currently approved in `.claude/settings.json`. Add them to `permissions.allow` before running the skill, or expect interactive permission prompts on first run. 
+```bash
+cd packages/js-asset-auditor && npm install && npx playwright install chromium
+```

-**File tools used:**
+**Standalone utilities:**

-- `Read` — read `trusted-server.toml` (publisher domain) and existing `js-assets.toml` (diff mode)
-- `Write` — write generated/updated `js-assets.toml`
+- `scripts/js-asset-slug.mjs` — Standalone slug generator for individual URLs (kept outside the plugin for backward compatibility)

 ---

@@ -240,12 +346,22 @@ See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on

 ## Verification

-- Run `/audit-js-assets https://www.publisher.com` against a known test publisher page with identified third-party JS
+- Run `node packages/js-asset-auditor/lib/audit.mjs https://www.publisher.com` against a known test publisher page
 - Verify generated entries match actual third-party JS observed on the page (cross-check in browser DevTools)
 - Verify `inject_in_head = true` only for scripts that appear in `<head>` (not `<body>`)
-- Verify wildcard detection fires for versioned path segments and not for stable paths
+- Verify wildcard detection fires for versioned path segments (e.g., `1.19.13-0fnlww`) and not for stable paths
 - Verify GTM (`googletagmanager.com`) is captured and not filtered
-- Verify framework CDNs (`cdnjs.cloudflare.com` etc.) are filtered with reason in summary
+- Verify Google ad rendering infra (`pagead2.googlesyndication.com`, `s0.2mdn.net` etc.) is filtered with reason in summary
+- Verify `securepubads.g.doubleclick.net` (GPT) is **not** filtered
+- Verify first-party auto-detection: auditing `golf.com` with `publisher.domain = "test-publisher.com"` excludes `golf.com` scripts
 - Run `--diff` against an unchanged page → all entries confirmed, no new/missing
 - Run `--diff` after adding a new vendor script to the page → appears as `NEW` in summary
 - Run `--diff` after removing a script → appears as `MISSING ⚠` in summary, file unchanged
+- Run `/js-asset-auditor:audit-js-assets ` via Claude Code plugin → identical results to direct CLI invocation
+- Run CLI without `trusted-server.toml` (using `--domain` or domain inference) → works in any project
+- Run with `--config` → generates `trusted-server.toml` with detected integrations
+- Verify GTM `container_id` is auto-extracted from `?id=GTM-XXXXX` query param
+- Verify integrations with TODO fields are marked with `# TODO:` comments
+- Verify `--config` without `--force` errors when target file exists
+- Verify JSON summary includes `integrations` array when `--config` is used
+

From 52b959dafb05b593851816f08906871dcd7e86ca Mon Sep 17 00:00:00 2001
From: Christian
Date: Mon, 13 Apr 2026 14:17:45 -0500
Subject: [PATCH 4/4] format

---
 .../2026-04-01-js-asset-auditor-design.md | 43 +++++++++----------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md
index a8d2c141..7db923c1 100644
--- a/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md
+++ b/docs/superpowers/specs/2026-04-01-js-asset-auditor-design.md
@@ -87,17 +87,17 @@ The following origin categories are excluded silently. The terminal summary repo

 **Matching:** Filter entries match if the request URL's host ends with the filter entry, with a dot-boundary check. For example, `googletagmanager.com` in the filter matches `www.googletagmanager.com` but not `evil-googletagmanager.com`.

-| Category | Excluded origins |
-| ------------------- | --------------------------------------------------------------------------------------------- |
-| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` |
-| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` |
-| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` |
-| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` |
-| Google ad rendering | `pagead2.googlesyndication.com`, `tpc.googlesyndication.com`, `s0.2mdn.net`, |
-| | `googleads.g.doubleclick.net`, `www.googleadservices.com` |
-| Ad fraud detection | `adtrafficquality.google` |
-| Ad verification | `adsafeprotected.com`, `moatads.com`, `doubleverify.com` |
-| reCAPTCHA | `recaptcha.net`, `www.google.com/recaptcha/*`, `www.gstatic.com/recaptcha/*` |
+| Category | Excluded origins |
+| ------------------- | ------------------------------------------------------------------------------ |
+| Framework CDNs | `cdnjs.cloudflare.com`, `ajax.googleapis.com`, `cdn.jsdelivr.net`, `unpkg.com` |
+| Error tracking | `sentry.io`, `bugsnag.com`, `rollbar.com` |
+| Font services | `fonts.googleapis.com`, `fonts.gstatic.com` |
+| Social embeds | `platform.twitter.com`, `platform.x.com`, `connect.facebook.net` |
+| Google ad rendering | `pagead2.googlesyndication.com`, `tpc.googlesyndication.com`, `s0.2mdn.net`, |
+| | `googleads.g.doubleclick.net`, `www.googleadservices.com` |
+| Ad fraud detection | `adtrafficquality.google` |
+| Ad verification | `adsafeprotected.com`, `moatads.com`, `doubleverify.com` |
+| reCAPTCHA | `recaptcha.net`, `www.google.com/recaptcha/*`, `www.gstatic.com/recaptcha/*` |

 **Path-prefix matching:** Some hosts (e.g., `www.google.com`) serve both filterable and non-filterable resources. Entries with a path suffix (e.g., `www.google.com/recaptcha/*`) match only when the URL's path begins with the specified prefix. Plain host entries use dot-boundary suffix matching as before.

@@ -241,16 +241,16 @@ When invoked with `--config [path]`, the CLI also detects known integrations fro

 Integration detection runs on raw URLs (before normalization) to preserve query parameters needed for field extraction.

-| URL Pattern | Integration | Extracted Fields |
-| -------------------------------------------------- | ---------------------- | ----------------------------------------- |
-| `securepubads.g.doubleclick.net/tag/js/gpt*` | `gpt` | `script_url` |
-| `www.googletagmanager.com/gtm.js?id=GTM-XXX` | `google_tag_manager` | `container_id` from `?id=` |
-| `sdk.privacy-center.org` | `didomi` | (defaults) |
-| `js.datadome.co` | `datadome` | (defaults) |
-| `aim.loc.kr/*identity-lockr*.js` | `lockr` | `sdk_url` |
-| `*.edge.permutive.app/*-web.js` | `permutive` | `organization_id`, `workspace_id` from URL |
-| `*/prebid.js`, `*/prebidjs.js` (+ .min variants) | `prebid` | (detect only) |
-| `c.amazon-adsystem.com/aax2/apstag*` | `aps` | (detect only) |
+| URL Pattern | Integration | Extracted Fields |
+| ------------------------------------------------ | -------------------- | ------------------------------------------ |
+| `securepubads.g.doubleclick.net/tag/js/gpt*` | `gpt` | `script_url` |
+| `www.googletagmanager.com/gtm.js?id=GTM-XXX` | `google_tag_manager` | `container_id` from `?id=` |
+| `sdk.privacy-center.org` | `didomi` | (defaults) |
+| `js.datadome.co` | `datadome` | (defaults) |
+| `aim.loc.kr/*identity-lockr*.js` | `lockr` | `sdk_url` |
+| `*.edge.permutive.app/*-web.js` | `permutive` | `organization_id`, `workspace_id` from URL |
+| `*/prebid.js`, `*/prebidjs.js` (+ .min variants) | `prebid` | (detect only) |
+| `c.amazon-adsystem.com/aax2/apstag*` | `aps` | (detect only) |

 ### Field categories

@@ -364,4 +364,3 @@ See [delivery order in the Proxy spec](2026-04-01-js-asset-proxy-design.md) _(on
 - Verify integrations with TODO fields are marked with `# TODO:` comments
 - Verify `--config` without `--force` errors when target file exists
 - Verify JSON summary includes `integrations` array when `--config` is used
-
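
As a companion to the verification checklist, the first-party auto-detection case (auditing `golf.com` while `publisher.domain = "test-publisher.com"`) can be sketched as follows. All function names here (`stripWww`, `buildFirstPartySet`, `isFirstParty`) are hypothetical and do not come from `lib/process.mjs`. The sketch also assumes that hosts passed via `--first-party` are compared exactly, after the same `www.` stripping as the publisher domain.

```javascript
// Hypothetical sketch of the first-party check; names are illustrative only.

// Lowercase the host and strip one leading "www.", as the spec describes.
function stripWww(host) {
  return host.toLowerCase().replace(/^www\./, "");
}

// Build the first-party set from the resolved publisher domain, the target
// URL's hostname (auto-detection), and any hosts passed via --first-party.
function buildFirstPartySet(publisherDomain, targetUrl, extraHosts = []) {
  const hosts = [publisherDomain, new URL(targetUrl).hostname, ...extraHosts];
  return new Set(hosts.map(stripWww));
}

// Exact match on the stripped host. Subdomains such as cdn.publisher.com
// remain third-party unless they were listed explicitly.
function isFirstParty(requestUrl, firstPartySet) {
  return firstPartySet.has(stripWww(new URL(requestUrl).hostname));
}

// Usage: the golf.com case from the checklist.
const firstParty = buildFirstPartySet("test-publisher.com", "https://golf.com/");
console.log(isFirstParty("https://www.golf.com/app.js", firstParty));
console.log(isFirstParty("https://cdn.golf.com/app.js", firstParty));
```

Keeping the comparison exact, rather than suffix-based, mirrors the spec's default of treating publisher-owned CDN subdomains as third-party.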