diff --git a/skills/hotdata-cli/SKILL.md b/skills/hotdata-cli/SKILL.md
index 3668669..6a2ab26 100644
--- a/skills/hotdata-cli/SKILL.md
+++ b/skills/hotdata-cli/SKILL.md
@@ -29,6 +29,20 @@ API URL defaults to `https://api.hotdata.dev/v1` or overridden via `HOTDATA_API_
 The `--workspace-id` flag is optional on all commands that accept it. If omitted, the active workspace is used. Use `hotdata workspaces set` to switch the active workspace interactively, or pass a workspace ID directly: `hotdata workspaces set <workspace-id>`. The active workspace is shown with a `*` marker in `hotdata workspaces list`. **Omit `--workspace-id` unless you need to target a specific workspace.**
+## Multi-step workflows (Model, Library, History, Chain, Indexes)
+
+These are **patterns** built from the commands below—not separate CLI subcommands:
+
+- **Model** — Markdown semantic map of your workspace (entities, keys, joins). Refresh using `connections`, `connections refresh`, `tables list`, and `datasets list`. For a **deep** modeling pass (connector enrichment, indexes, per-table detail), see [references/MODEL_BUILD.md](references/MODEL_BUILD.md).
+- **Library** — Curated **`hotdata queries`** entries for repeatable SQL (`queries create`, `queries run`, …).
+- **History** — Find prior **`hotdata results`** and saved queries (`results list`, `results <result-id>`, `queries list`).
+- **Chain** — Follow-ups via **`datasets create`** then `query` against `datasets.main.<table>`.
+- **Indexes** — Review SQL and schema, compare to existing indexes, create **sorted**, **bm25**, or **vector** indexes when it clearly helps; see [references/WORKFLOWS.md](references/WORKFLOWS.md#indexes).
+
+Full step-by-step procedures: [references/WORKFLOWS.md](references/WORKFLOWS.md).
+
+**Project-owned files:** Put `DATA_MODEL.md` or `data_model.md` (e.g. under `docs/`) in the **directory where you run `hotdata`**—your repo or project—not under `~/.claude/skills/` or other agent skill paths.
+Copy the template from [references/DATA_MODEL.template.md](references/DATA_MODEL.template.md) to start; use [references/MODEL_BUILD.md](references/MODEL_BUILD.md) when you need the full procedure.
+
 ## Available Commands

 ### List Workspaces
@@ -259,8 +273,11 @@ hotdata jobs [--workspace-id <id>] [--format table|json|yaml]
 ```
 hotdata auth         # Browser-based login
 hotdata auth status  # Check current auth status
+hotdata auth logout  # Remove saved auth for the default profile
 ```
+
+Other commands (not covered in detail above): `hotdata connections new` (interactive connection wizard), `hotdata skills install|status`, `hotdata completions <shell>`.
+
 ## Workflow: Running a Query

 1. List connections:
diff --git a/skills/hotdata-cli/references/DATA_MODEL.template.md b/skills/hotdata-cli/references/DATA_MODEL.template.md
new file mode 100644
index 0000000..a2b1526
--- /dev/null
+++ b/skills/hotdata-cli/references/DATA_MODEL.template.md
@@ -0,0 +1,89 @@
+# Data model — `<workspace name>`
+
+> Copy this file to your **project** directory (e.g. `./DATA_MODEL.md`, `./data_model.md`, or `./docs/DATA_MODEL.md`).
+> Do not commit workspace-specific content into agent skill folders.
+> For a **full** build (per-table detail, connector enrichment, index summary), follow [MODEL_BUILD.md](MODEL_BUILD.md) from the installed skill’s `references/` (or this repo’s `skills/hotdata-cli/references/`). Relative links to `MODEL_BUILD.md` below work only while this file lives next to those references; in your project, open that path separately if the link 404s.
+
+**Workspace (Hotdata):** `<workspace id>`
+**Last catalog refresh:** `<date>`
+
+## Overview
+
+What data exists, which business domains it covers, and who owns this document.
+_(Large workspaces: add a **table of contents** here—per connection, table counts.)_
+
+## Purpose
+
+Short description of what this workspace is for and how the model should be used for queries.
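[Editor's note] The "copy this file to your project directory" instruction above can be scripted. A minimal sketch, assuming nothing beyond POSIX shell: the `hd_model_init` name, the template path argument, and the order of candidate locations are this edit's inventions, not `hotdata` features.

```bash
# Hypothetical helper (not a hotdata subcommand): create the project's
# model file from a template unless one already exists.
# Usage: hd_model_init path/to/DATA_MODEL.template.md
hd_model_init() {
  template=$1
  # locations the skill text suggests, checked in order
  for f in DATA_MODEL.md data_model.md docs/DATA_MODEL.md; do
    if [ -f "$f" ]; then
      echo "model file already present: $f"
      return 0
    fi
  done
  mkdir -p docs
  cp "$template" docs/DATA_MODEL.md
  echo "created docs/DATA_MODEL.md"
}
```

Run it from the project root (the directory where you run `hotdata`), pointing the argument at wherever the installed skill's `references/DATA_MODEL.template.md` lives on your machine.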
+
+## Connections & sources
+
+| Connection ID | Name | Type | Role / domain |
+|---------------|------|------|---------------|
+| | | | |
+
+### Per-table detail (optional — use for deep models)
+
+_Use for important tables only, or expand all via [MODEL_BUILD.md](MODEL_BUILD.md). **Duplicate** this whole block (from the heading through the horizontal rule) for each table._
+
+#### `<connection>.<schema>.<table>`
+
+**Grain:** one row = one `…`
+**Description:**
+
+| Column | Type | Nullable | PK/FK | Notes |
+|--------|------|----------|-------|-------|
+
+**Relationships:** (PK, FKs, parent–child)
+**Queryability:** (filters, joins, caveats)
+
+---
+
+## Entities and grain (summary view)
+
+For each business entity:
+
+- **Entity:**
+- **Grain:** one row per …
+- **Primary tables:** `connection.schema.table`
+- **Key columns:**
+
+## Cross-connection joins
+
+Document safe join paths and caveats (fan-out, timing, different refresh cadence, type mismatches).
+
+## Search & index summary (optional)
+
+| Table | Column | Kind (vector / text / …) | Index status | Notes |
+|-------|--------|--------------------------|--------------|-------|
+| | | | | |
+
+_Use `hotdata indexes list -c <connection-id> --schema <schema> --table <table>` per table as needed._
+
+## Datasets (uploaded)
+
+Catalog from `hotdata datasets list` / `hotdata datasets <dataset-id>`:
+
+| Label | Table name (`datasets.main.…`) | Grain | Notes |
+|-------|-------------------------------|-------|-------|
+| | | | |
+
+## Derived tables (Chain)
+
+Stable `datasets.main.*` tables built for **Chain** workflows (not necessarily uploaded file datasets):
+
+| Table name | Built from | Purpose | Owner / TTL |
+|------------|------------|---------|-------------|
+| | | | |
+
+## Saved query index (Library)
+
+Link business questions to saved queries (ids/names from `hotdata queries list`):
+
+| Question / report | Saved query name | ID (optional) |
+|-------------------|------------------|---------------|
+| | | |
+
+## Notes
+
+Assumptions, known gaps, and refresh checklist.
diff --git a/skills/hotdata-cli/references/MODEL_BUILD.md b/skills/hotdata-cli/references/MODEL_BUILD.md
new file mode 100644
index 0000000..102b079
--- /dev/null
+++ b/skills/hotdata-cli/references/MODEL_BUILD.md
@@ -0,0 +1,125 @@
+# Building a workspace data model (advanced)
+
+Optional **deep pass** for a single authoritative markdown model. For a short checklist only, use the **Model** section in [WORKFLOWS.md](WORKFLOWS.md) and [DATA_MODEL.template.md](DATA_MODEL.template.md).
+
+**Output:** Save as `DATA_MODEL.md`, `data_model.md`, or `docs/DATA_MODEL.md` in the **project directory** where you run `hotdata` (not inside agent skill folders).
+
+---
+
+## 1. Discover connections
+
+```bash
+hotdata connections list
+```
+
+For each connection, record `id`, `name`, and `source_type`.
+
+---
+
+## 2. Enumerate tables, columns, and datasets
+
+If the catalog may be **stale** (recent DDL, new tables missing), run **`hotdata connections refresh <connection-id>`** for affected connections **before** relying on `tables list`.
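[Editor's note] The refresh-before-enumerate rule above can be wrapped in a small helper. A sketch only: `refresh_then_list` and its warning message are this edit's inventions; the `hotdata` invocations assume the commands documented in the main skill.

```bash
# Hypothetical wrapper: refresh the given connection ids, then list tables.
# Usage: refresh_then_list <connection-id> [<connection-id> ...]
refresh_then_list() {
  for cid in "$@"; do
    # refresh first so `tables list` reflects recent DDL
    hotdata connections refresh "$cid" ||
      echo "warn: refresh failed for $cid (continuing)" >&2
  done
  hotdata tables list
}
```

Matching the error-handling rules later in this file, a failed refresh is logged and skipped rather than aborting the whole pass.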
+
+**Per connection:**
+
+```bash
+hotdata tables list --connection-id <connection-id>
+```
+
+**Uploaded datasets:**
+
+```bash
+hotdata datasets list
+hotdata datasets <dataset-id>
+```
+
+Capture schema for each dataset (columns, types) from the detail view.
+
+You can also refresh after enumeration if you discover drift:
+
+```bash
+hotdata connections refresh <connection-id>
+```
+
+---
+
+## 3. Enrich beyond column names (optional but valuable)
+
+Use **connector and tooling docs** when `source_type` (or table shapes) match:
+
+- **Vendor / ELT docs** — Your loader or integration vendor’s published schemas for canonical tables, PKs/FKs, and field semantics (link what you use so a human can verify).
+- **dlt** — [verified sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) for normalized layouts.
+- **dlt-loaded data** — If you see `_dlt_id`, `_dlt_load_id`, `_dlt_parent_id`: treat as pipeline metadata; `_dlt_parent_id` often links flattened child rows to parents when no explicit FK exists. Exclude these from **grain** statements unless the question is specifically about loads.
+- **Vectors** — Columns typed as lists of floats (e.g. embedding columns) are candidates for vector search; note them.
+- **Well-known SaaS shapes** — Apply general patterns (e.g. Stripe charges/customers, HubSpot contacts/deals) only when naming and structure fit; **link** the doc you used so a human can verify.
+
+Do **not** invent facts: if context is missing, say so and suggest a small sample query:
+
+```bash
+hotdata query "SELECT * FROM <connection>.<schema>.<table> LIMIT 5"
+```
+
+---
+
+## 4. Infer relationships
+
+For each table, capture where reasonable:
+
+1. **Grain** — One row = one `…` (required per table; if unknown, say unknown).
+2. **Primary keys** — `id`, `_id`, or composite patterns from names + types.
+3. **Foreign keys** — `_id` / `_fk` / name matches to other tables; confirm with connector docs when possible.
+4. **Parent–child** — Flattened API/JSON tables (often nested names) and dlt parent keys.
+5. **Cross-connection** — Same logical entity in two connections (keys, type mismatches, caveats).
+
+For **small** schemas (e.g. ≤5 tables in a domain), a short **ASCII diagram** helps. For larger ones, group by domain in prose (e.g. billing, identity, product).
+
+---
+
+## 5. Search and index awareness
+
+For tables you care about:
+
+```bash
+hotdata indexes list -c <connection-id> --schema <schema> --table <table> [-w <workspace-id>]
+```
+
+Note:
+
+- **Vector**-friendly columns (embeddings) vs **BM25**-friendly text (`title`, `body`, `description`, …).
+- **Time** columns — event grain vs slowly changing dimensions.
+- **Facts vs dimensions** — for analytics-oriented workspaces.
+
+When suggesting a new index, use the same connection/schema/table/column names as in `tables list` and the main skill’s `indexes create` examples.
+
+---
+
+## 6. Document structure
+
+Start from [DATA_MODEL.template.md](DATA_MODEL.template.md) and extend as needed:
+
+- **Overview** — Domains and what the workspace is for.
+- **Per connection** — Optional subsection per source; for **deep** models, **repeat** one block per `connection.schema.table` (grain, column table with name/type/nullable/PK-FK/notes, relationships, queryability, caveats)—the template’s single `####` heading is a pattern to copy for each table.
+- **Datasets** — Same treatment as connection tables where relevant.
+- **Cross-connection joins** — Keys, semantics, type caveats.
+- **Search / index summary** — Table, column, index status, intended use.
+
+If the workspace has **many** tables (e.g. 50+), add a **table of contents** after the overview (connection → table counts).
+
+---
+
+## Error handling
+
+- If a CLI command fails, record the error in the doc and **continue** when possible.
+- Unreachable connections or empty table lists: note in the connections table (e.g. unreachable / no tables).
+- Do not abort the whole model for one bad connection.
+
+---
+
+## Rules (keep quality high)
+
+- Every table gets an explicit **grain** (or “unknown”).
+- Prefer **documented** connector semantics over guesswork; **link** external docs when you use them.
+- Flag **test/dev** tables (`test`, `tmp`, `dev`, `staging` in names) as non-production when applicable.
+- Note **Utf8-stored numbers** (numeric values stored as text) and the casts they require where relevant.
+- Do not leave column **Notes** empty when domain knowledge or docs apply; “—” is weak unless the column is opaque/internal.
+- Align table names with **`hotdata tables list`** output (`connection.schema.table`).
diff --git a/skills/hotdata-cli/references/WORKFLOWS.md b/skills/hotdata-cli/references/WORKFLOWS.md
new file mode 100644
index 0000000..392fb7c
--- /dev/null
+++ b/skills/hotdata-cli/references/WORKFLOWS.md
@@ -0,0 +1,212 @@
+# Hotdata CLI workflows
+
+Procedures for **Model**, **Library**, **History**, **Chain**, and **Indexes**. These compose existing `hotdata` commands; they are not separate subcommands.
+
+## Where files live
+
+| Concept | Location |
+|---------|----------|
+| **Model** | Your **project** root or `docs/` (e.g. `DATA_MODEL.md` / `data_model.md`). Never store workspace-specific model text inside agent skill directories. |
+| **Library** | Hotdata **saved queries** (`queries create` / `list` / `run`). Optional local index (e.g. `QUERIES.md`) listing names and intent. |
+| **History** | `hotdata results list` / `results <result-id>`; saved queries. Optional append-only log under `.hotdata/query-log.jsonl` if you add a wrapper. |
+| **Chain** | Intermediate tables in **`datasets.main.*`**; document stable ones in the Model file under **Derived tables (Chain)**. |
+| **Indexes** | Recommendations and decisions live in Hotdata (`indexes list` / `indexes create`). Optional project log (e.g. `INDEXES.md`) if you track rationale outside the catalog. |
+
+---
+
+## Model
+
+**Goal:** A markdown map of entities, keys, grain, and how connections relate—on top of the live **catalog** from Hotdata.
+
+### Initialize
+
+1. Copy `references/DATA_MODEL.template.md` from this skill bundle to your project as `DATA_MODEL.md` or `docs/DATA_MODEL.md`.
+2. Fill workspace-specific sections as you discover schema.
+
+### Deep model pass (optional)
+
+For a **full** catalog-style document—datasets, enrichment from connector or loader docs (e.g. dlt), relationships, search/index notes, and stricter documentation rules—follow **[MODEL_BUILD.md](MODEL_BUILD.md)**. Use it when the light template is not enough; skip it for small or fast-moving workspaces.
+
+### Refresh catalog facts (run from project root)
+
+When metadata may be **stale**, run `connections refresh` for affected connections **before** relying on `tables list` (same order as below).
+
+```bash
+hotdata workspaces list
+hotdata connections list
+# For each connection you care about:
+hotdata connections refresh <connection-id>  # after DDL / stale metadata
+hotdata tables list
+hotdata tables list --connection-id <connection-id>
+hotdata datasets list
+hotdata datasets <dataset-id>  # schema detail per dataset
+```
+
+Use output to update **Connections**, **Tables**, **Columns**, and **Datasets** in the model. Optional: small exploratory queries once names are known:
+
+```bash
+hotdata query "SELECT * FROM <connection>.<schema>.<table> LIMIT 5"
+```
+
+**Rule:** Use `hotdata tables list` for discovery; do not use `query` against `information_schema` for that (see main skill).
+
+---
+
+## Library
+
+**Goal:** Repeatable SQL as **saved queries** so agents use `queries run` instead of pasting ad hoc SQL.
+
+### Promote a query
+
+```bash
+hotdata queries create --name "Descriptive Name" --sql "SELECT ..." [--description "..."] [--tags "a,b"]
+```
+
+### Use the library
+
+```bash
+hotdata queries list
+hotdata queries <query-id>
+hotdata queries run <query-id>
+hotdata queries update <query-id> [--name ...] [--sql ...] [--tags ...] [--category ...] [--table-size ...]
+```
+
+**Suggestions** from past sessions are not generated by the CLI today; capture candidates manually or with your own tooling, then `queries create` after review.
+
+### Optional project index
+
+Maintain `QUERIES.md` (or a section in `DATA_MODEL.md`) mapping **business questions** → saved query name or id.
+
+---
+
+## History
+
+**Goal:** Find prior work: stored results and saved definitions.
+
+### Results
+
+```bash
+hotdata results list [-w <workspace-id>] [--limit N] [--offset N]
+hotdata results <result-id> [-w <workspace-id>]
+```
+
+Query footers include a `result-id` when applicable—record it for later. **Prefer `hotdata results <result-id>` over re-running identical heavy SQL.**
+
+### Saved queries as history
+
+```bash
+hotdata queries list
+hotdata queries <query-id>
+```
+
+**Limitation:** Ad-hoc `hotdata query "..."` text is not listed unless you still have the `result_id`, a saved query, or a **local log** (e.g. append JSON lines to `.hotdata/query-log.jsonl` from a wrapper script).
+
+---
+
+## Chain
+
+**Goal:** Follow-up analysis on a **bounded** intermediate without rescanning huge base tables.
+
+**Pattern:** materialize → query `datasets.main.*`.
+
+1. **Base** — run saved or ad hoc SQL:
+
+   ```bash
+   hotdata queries run <query-id>
+   # or
+   hotdata query "SELECT ..."
+   ```
+
+   If the CLI returns a `query_run_id`, poll:
+
+   ```bash
+   hotdata query status <query-run-id>
+   ```
+
+2. **Materialize** — land a table in datasets (pick one):
+
+   ```bash
+   hotdata datasets create --label "chain revenue slice" --sql "SELECT ..." [--table-name chain_revenue_slice]
+   hotdata datasets create --label "from saved" --query-id <query-id> [--table-name ...]
+   ```
+
+3. **Chain** — query the dataset:
+
+   ```bash
+   hotdata datasets list  # find table_name if needed
+   hotdata query "SELECT * FROM datasets.main.<table> WHERE ..."
+   ```
+
+**Naming:** Prefer predictable `--table-name` values, e.g. `chain_<subject>_<slice>`, and list long-lived chains in **Model → Derived tables (Chain)**.
+
+---
+
+## Indexes
+
+**Goal:** Find filters, joins, sorts, full-text, and vector access patterns that are **missing** indexes, then **create** them when the benefit is clear.
+
+### 1. Gather workload and schema
+
+- **Saved queries** — Inspect SQL for recurring `WHERE`, `JOIN`, `GROUP BY`, `ORDER BY`, and any use of full-text or vector access (e.g. SQL that calls `bm25_search`, or workloads you run via **`hotdata search`** — see main skill **Search**).
+
+  ```bash
+  hotdata queries list
+  hotdata queries <query-id>
+  ```
+
+- **Ad-hoc SQL** — Use the same lens on queries from session history or a local log, if you keep one (see **History**).
+
+- **Table/column types** — Confirm columns exist and types fit the index you plan:
+
+  ```bash
+  hotdata tables list --connection-id <connection-id>
+  ```
+
+High-cardinality **text** columns (`title`, `body`, `description`, …) may warrant **bm25** if you use or plan text search. **Embedding** / list-of-float columns may warrant **vector** (+ `--metric`). Equality/range/sort on discrete fields often map to **sorted** (the default index type)—confirm fit with your workload and product limits when in doubt.
+
+### 2. Compare to existing indexes
+
+For each `connection.schema.table` you care about:
+
+```bash
+hotdata indexes list -c <connection-id> --schema <schema> --table <table> [-w <workspace-id>]
+```
+
+Skip creating a duplicate: same table + overlapping columns + same purpose (e.g. another bm25 on the same column).
+
+### 3. Create indexes when justified
+
+Use stable names (e.g. `idx_<table>_<column>_<type>`). Examples:
+
+```bash
+# Sorted (default) — filters, joins, ordering on scalar columns
+hotdata indexes create -c <connection-id> --schema <schema> --table <table> \
+  --name idx_orders_created --columns created_at --type sorted
+
+# BM25 — full-text on one text column (required for bm25_search on that column)
+hotdata indexes create -c <connection-id> --schema <schema> --table <table> \
+  --name idx_posts_body_bm25 --columns body --type bm25
+
+# Vector — embeddings; requires --metric
+hotdata indexes create -c <connection-id> --schema <schema> --table <table> \
+  --name idx_chunks_embedding --columns embedding --type vector --metric l2
+```
+
+Large builds: add `--async` and track with **`hotdata jobs list`** / **`hotdata jobs <job-id>`** (see main skill **Indexes** and **Jobs**).
+
+### 4. Verify
+
+Re-run representative **`hotdata query`** or **`hotdata search`** workloads. Update **Model → Search & index summary** (if you maintain a data model doc) so future agents know what exists.
+
+### Guardrails
+
+- Prefer **evidence** (repeated predicates, slow queries, or planned search) over speculative indexes.
+- **Production:** get explicit approval before `indexes create` when impact or cost is uncertain.
+- Align **connection id**, **schema**, and **table** with `hotdata tables list` output.
+
+---
+
+## Cross-cutting
+
+- **Workspace:** Use the active workspace, or pass `-w` / `--workspace-id` when targeting a non-default workspace.
+- **Jobs:** For async work (indexes, some refreshes), `hotdata jobs list` and `hotdata jobs <job-id>`.