Skip to content

Add dataset tags to SDK for identification (DE-7033)#456

Open
vinay553 wants to merge 1 commit intomasterfrom
vinayparakala/de-7033-add-dataset-tags-to-sdk
Open

Add dataset tags to SDK for identification (DE-7033)#456
vinay553 wants to merge 1 commit intomasterfrom
vinayparakala/de-7033-add-dataset-tags-to-sdk

Conversation

@vinay553
Copy link
Copy Markdown
Contributor

@vinay553 vinay553 commented Apr 6, 2026

Summary

  • Add tags field to DatasetInfo model so dataset.info() returns dataset tags
  • Add get_tags(), add_tags(), remove_tags() methods to Dataset class for programmatic tag management
  • Enables customers (e.g. Redwoods) to identify via API whether a dataset was labeled by Scale or another vendor

Test plan

  • Verify dataset.info() returns tags for a dataset with tags set in the UI
  • Verify dataset.add_tags(["Labeled by: Scale"]) adds tags
  • Verify dataset.get_tags() returns the current tag list
  • Verify dataset.remove_tags(["Labeled by: Scale"]) removes tags
  • Verify backward compatibility — dataset.info() works against a server that doesn't yet return tags (defaults to [])

🤖 Generated with Claude Code

Greptile Summary

This PR adds a tags field to DatasetInfo and three new tag-management methods (get_tags, add_tags, remove_tags) to the Dataset class, enabling programmatic access to dataset tags. The implementation is clean, follows existing codebase patterns (DELETE-with-body, pydantic v1/v2 shim), and includes good backward-compatibility handling via the coerce_null_tags validator.

Confidence Score: 5/5

Safe to merge; all findings are P2 style suggestions that do not block correctness.

Both comments are P2: one is a defensive null-coercion improvement and the other is a test assertion robustness suggestion. Neither represents a present defect. The core logic, backward-compatibility handling, and test coverage are solid.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
nucleus/data_transfer_object/dataset_info.py Adds tags: List[str] = [] field with a coerce_null_tags validator for backward compatibility; pydantic v1/v2 import shim is consistent with existing patterns in pydantic_base.py.
nucleus/dataset.py Adds get_tags(), add_tags(), remove_tags() methods. DELETE-with-body pattern is established elsewhere in the codebase. get_tags() doesn't apply the same null-coercion as DatasetInfo, so a server returning {"tags": null} would silently return None to callers.
tests/test_dataset.py Good integration test coverage for all three new methods including idempotency and TypeError guard; assert remaining2 == remaining is order-sensitive and may flake if the API returns tags in non-deterministic order.

Sequence Diagram

sequenceDiagram
    participant Client as SDK Client
    participant Dataset as Dataset
    participant API as Nucleus API

    Client->>Dataset: dataset.info()
    Dataset->>API: GET dataset/{id}/info
    API-->>Dataset: JSON (may include tags: null)
    Dataset-->>Client: DatasetInfo (tags defaults to [])

    Client->>Dataset: dataset.get_tags()
    Dataset->>API: GET dataset/{id}/tags
    API-->>Dataset: {"tags": [...]}
    Dataset-->>Client: List[str]

    Client->>Dataset: dataset.add_tags(["Labeled by: Scale"])
    Dataset->>API: POST dataset/{id}/tags {"tags": [...]}
    API-->>Dataset: {"tags": [...]}
    Dataset-->>Client: Updated List[str]

    Client->>Dataset: dataset.remove_tags(["Labeled by: Scale"])
    Dataset->>API: DELETE dataset/{id}/tags {"tags": [...]}
    API-->>Dataset: {"tags": [...]}
    Dataset-->>Client: Remaining List[str]
Loading

Fix All in Cursor Fix All in Claude Code

Prompt To Fix All With AI
This is a comment left during a code review.
Path: nucleus/dataset.py
Line: 444

Comment:
**`get_tags()` not backward-compatible with null server response**

`DatasetInfo` uses a `coerce_null_tags` validator so a server returning `{"tags": null}` gracefully becomes `[]`. But `get_tags()` returns `response["tags"]` directly — if the server returns `null` for the `tags` value, callers receive `None` instead of `[]`, silently breaking any code that iterates the result. Consider aligning with the `DatasetInfo` pattern:

```suggestion
        return response.get("tags") or []
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/test_dataset.py
Line: 286

Comment:
**Order-sensitive equality assertion**

`remaining2 == remaining` requires the server to return tags in the same order each time. If the API returns a set-semantics list with non-deterministic ordering, this assertion will flake. Consider using a set comparison instead:

```suggestion
    assert set(remaining2) == set(remaining)
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (7): Last reviewed commit: "Add dataset tags to SDK for identificati..." | Re-trigger Greptile

@vinay553 vinay553 force-pushed the vinayparakala/de-7033-add-dataset-tags-to-sdk branch 2 times, most recently from 3eff859 to 8f06f7a Compare April 6, 2026 21:52
@vinay553 vinay553 requested a review from a team April 6, 2026 21:56
@vinay553 vinay553 force-pushed the vinayparakala/de-7033-add-dataset-tags-to-sdk branch from 8f06f7a to bf092a2 Compare April 6, 2026 21:57
Copy link
Copy Markdown
Contributor

@edwinpav edwinpav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some unit/integration tests that call this on the datasets used by the tests?

Expose dataset tags through the Python SDK so customers can identify
datasets labeled by Scale vs other vendors via the API.

- Add `tags` field to DatasetInfo model (returned by dataset.info())
- Add get_tags(), add_tags(), remove_tags() methods to Dataset class
- Use POST /tags/remove instead of DELETE to avoid proxy body-stripping
- Use pydantic v1/v2 compat shim for null-coercion validator
- Guard against passing a bare string instead of a list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vinay553 vinay553 force-pushed the vinayparakala/de-7033-add-dataset-tags-to-sdk branch from bf092a2 to ccb48bc Compare April 8, 2026 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants