Adding large payload support for the standalone SDK#280
Conversation
Pull request overview
This PR adds “large payload” support to the standalone Durable Task Java SDK by introducing an Azure Blob Storage–backed payload externalization module and wiring gRPC interceptors + orchestrator-response chunking into the core client/worker to avoid gRPC message size limits.
Changes:
- Introduces a new :azure-blob-payloads module implementing payload externalization/resolution via a gRPC ClientInterceptor and an Azure Blob–backed PayloadStore.
- Adds gRPC interceptor support to DurableTaskGrpcClientBuilder/DurableTaskGrpcWorkerBuilder, plus worker capability announcement and orchestrator completion chunking.
- Adds a runnable sample and unit/integration tests covering token handling, interceptor behavior, and chunking behavior.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| settings.gradle | Adds the new :azure-blob-payloads Gradle module to the build. |
| samples/src/main/java/io/durabletask/samples/LargePayloadSample.java | New sample demonstrating end-to-end large payload externalization with DTS + Azurite. |
| samples/build.gradle | Adds a runLargePayloadSample task and depends on :azure-blob-payloads. |
| internal/durabletask-protobuf/protos/orchestrator_service.proto | Updates protobuf contract (tags, rewind action, purge timeout). |
| internal/durabletask-protobuf/PROTO_SOURCE_COMMIT_HASH | Updates upstream proto source commit hash. |
| client/src/test/java/com/microsoft/durabletask/OrchestratorChunkingTest.java | Adds unit tests for worker chunking + action-size validation + capability announcement. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcWorkerBuilder.java | Adds interceptor registration, LP capability flag, and configurable chunk size. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcWorker.java | Applies interceptors, announces LP capability, and implements orchestrator-response chunking/validation. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcClientBuilder.java | Adds interceptor registration support to the client builder. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcClient.java | Applies registered interceptors to the client channel. |
| client/build.gradle | Adds grpc-inprocess for new in-process gRPC unit tests. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/PayloadStore.java | Introduces PayloadStore abstraction for out-of-band payload storage. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/PayloadStorageException.java | Defines exception type for permanent storage failures. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadWorkerExtensions.java | Adds worker-side helper methods to enable externalized payloads + capability flag. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadStorageOptions.java | Adds configuration options (threshold/max/container/auth/compression). |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadInterceptor.java | Implements interceptor that externalizes outbound payloads and resolves inbound tokens. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadClientExtensions.java | Adds client-side helper methods to enable externalized payloads. |
| azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/BlobPayloadStore.java | Implements Azure Blob Storage payload store with optional gzip compression. |
| azure-blob-payloads/src/test/java/com/microsoft/durabletask/azureblobpayloads/PayloadTokenTest.java | Unit tests for token encode/decode and token detection. |
| azure-blob-payloads/src/test/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadStorageOptionsTest.java | Unit tests for options defaults and validation behavior. |
| azure-blob-payloads/src/test/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadInterceptorTest.java | Unit tests for request externalization + response resolution across message types. |
| azure-blob-payloads/src/test/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadIntegrationTest.java | Integration tests requiring DTS emulator + Azurite for end-to-end validation. |
| azure-blob-payloads/src/test/java/com/microsoft/durabletask/azureblobpayloads/BlobPayloadStoreTest.java | Unit tests for blob upload/download/compression behavior using mocks. |
| azure-blob-payloads/spotbugs-exclude.xml | SpotBugs exclusions for the new module. |
| azure-blob-payloads/build.gradle | Build/test/spotbugs config and dependencies for the new module. |
YunchuWang
left a comment
Code Review Summary
Change intent: Add large payload externalization support to the standalone Durable Task Java SDK via a new azure-blob-payloads module. Payloads exceeding a configurable threshold are transparently uploaded to Azure Blob Storage through a gRPC ClientInterceptor, with automatic orchestrator response chunking for oversized gRPC messages.
Overall risk: Moderate
Merge recommendation: Safe with fixes — two performance issues (F1, F2) should be addressed before merge.
Architecture
This is a well-designed refactoring that moves payload externalization from application-level (manual per-call-site handling) down to the transport layer (gRPC interceptor). Key strengths:
- Transparency — application code (orchestrations, activities) is completely unaware of large payloads
- Separation of concerns — externalization logic concentrated in one interceptor
- Extensibility — the addInterceptor API is generic; future interceptors (encryption, compression) can be added
- Cross-SDK alignment — chunking logic matches .NET SDK behavior
- Test quality — comprehensive coverage with mock unit tests and Azurite-backed integration tests
Findings by Severity
- High: 2
- Medium: 3
- Low: 2
- Nit: 2
Proto changes note (N1): The proto file includes unrelated additions (ActivityRequest.tags, RewindOrchestrationAction, PurgeInstanceFilter.timeout). These appear to be upstream proto sync. No functional impact, but worth noting in the PR description if intentional.
Sample note (N2): LargePayloadSample.java line 123 hardcodes the blob:v1: token prefix check as a defensive assertion. Acceptable for a sample, but will need updating if the token format ever changes.
YunchuWang
left a comment
Additional Notes
[F6] Low / P3 — maybeResolve (download path) has no error handling: payloadStore.download() in maybeResolve can throw if a blob is deleted or there's a network failure. Unlike the upload path (which has try-catch in externalizeActivityResponse/externalizeOrchestratorResponse), the download path has no protection. An exception here propagates through onMessage, breaking the entire getWorkItems streaming call. Consider adding try-catch around download calls to at least log the error and let the orchestration fail explicitly rather than silently disconnecting the stream.
YunchuWang
left a comment
Code review — High & Medium findings
Focused review of the large-payload externalization + auto-chunking path. 3 High + 6 Medium issues below. Happy to discuss any of these.
[High] F1 — CompleteOrchestrationAction.carryoverEvents and TaskFailureDetails.stackTrace are not externalized
File: azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/LargePayloadInterceptor.java — externalizeOrchestratorAction, COMPLETEORCHESTRATION branch.
CompleteOrchestrationAction has two payload surfaces beyond result/details that can legitimately grow very large:
- repeated HistoryEvent carryoverEvents (continue-as-new buffers EventRaised/ExecutionStarted payloads for the next instance).
- TaskFailureDetails.stackTrace/errorMessage (effectively unbounded user content).
Only result and details go through maybeExternalize. When supportsLargePayloads=true, completeOrchestratorTaskWithChunking also skips the pre-send size validator, so an oversized CompleteOrchestrationAction (large carryover events on continue-as-new) will hit the wire and fail with RESOURCE_EXHAUSTED. The orchestration then loops and re-fails on every replay.
Recommendation: Walk carryoverEvents through the same event-payload externalization used on the inbound path, and run failureDetails.getStackTrace() through maybeExternalize. Apply the same treatment to TaskFailureDetails on ActivityResponse failures.
[High] F2 — Transient blob-resolve failure during WorkItem ingestion creates an unrecoverable replay loop
File: LargePayloadInterceptor.java — onMessage wrapper (response path).
Any exception in resolveResponsePayloads is converted to Status.INTERNAL and thrown from onMessage. For the streaming GetWorkItems RPC, this terminates the stream; the worker reconnects; the sidecar re-dispatches the same work item; we re-download the same (still-failing) blob; we crash again. No circuit breaker, no backoff, no way to fail a single poisoned WorkItem out.
This applies both to:
- Transient failures (429 / 5xx after 8-retry exhaustion), and
- Permanent failures (404 due to retention/cleanup/container rename, malformed token, container mismatch).
For permanent 404s the only recovery is a manual purge — a real foot-gun in production.
Recommendation:
- For permanent failures (404, malformed token, container mismatch), complete the work item as a non-retriable failure via completeActivityTask/completeOrchestratorTask with TaskFailureDetails.isNonRetriable=true rather than throwing from the stream listener.
- For transient failures, use Status.UNAVAILABLE (not INTERNAL) so sidecar-side backoff applies.
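The permanent/transient split above can be sketched as a small classifier. This is a hypothetical illustration, assuming failures surface with an HTTP status code (or none at all for malformed tokens); the names FailureKind and classify are not part of the PR.

```java
// Hypothetical sketch: classify a blob-resolve failure so permanent errors fail
// the single work item while transient errors surface as a retriable status.
public final class ResolveFailureClassifier {
    public enum FailureKind { PERMANENT, TRANSIENT }

    // httpStatus is the storage response code, or -1 for non-HTTP errors
    // such as a malformed token or a container mismatch.
    public static FailureKind classify(int httpStatus) {
        if (httpStatus == 404 || httpStatus == -1) {
            return FailureKind.PERMANENT;  // blob deleted, bad token, wrong container
        }
        if (httpStatus == 429 || httpStatus >= 500) {
            return FailureKind.TRANSIENT;  // throttling / server errors: retry with backoff
        }
        return FailureKind.PERMANENT;      // default to failing the work item explicitly
    }
}
```

A PERMANENT result would map to a non-retriable TaskFailureDetails completion; a TRANSIENT result would map to Status.UNAVAILABLE so the sidecar's backoff kicks in.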
[High] F3 — isKnownPayloadToken matches any user string starting with blob:v1:, causing false-positive download attempts
File: azure-blob-payloads/src/main/java/com/microsoft/durabletask/azureblobpayloads/BlobPayloadStore.java — isKnownPayloadToken.
The only check is value.startsWith(TOKEN_PREFIX). Any user-provided JSON value whose content happens to start with blob:v1: (configuration data, a URL fragment, a test fixture, or an attacker-controlled RaiseEvent input) will be treated as a token on the response path. download then either:
- Hits the container-mismatch check and throws IllegalArgumentException → flows into F2 and poisons the work item forever, or
- If the user string happens to match the configured container, issues a blob GET for an attacker-chosen blob name. Disclosure is impractical (128-bit UUID blob names), but DoS via a crafted event input is feasible.
Recommendation: Validate the full token grammar — require blob:v1:<container>:<32-hex-chars> — not just the prefix. Also reduces the blast radius of F2.
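A full-grammar check might look like the following sketch. The container-name rules and the 32-hex length are assumptions based on the recommendation above, not the PR's actual token code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: validate the full token grammar blob:v1:<container>:<32-hex>
// rather than just the prefix, so arbitrary user strings can't trigger downloads.
public final class PayloadTokenValidator {
    // Assumed container rule: 3-63 chars, lowercase letters/digits/hyphens,
    // starting and ending with a letter or digit (Azure container naming).
    private static final Pattern TOKEN = Pattern.compile(
        "^blob:v1:[a-z0-9](?:[a-z0-9-]{1,61}[a-z0-9]):[0-9a-f]{32}$");

    public static boolean isKnownPayloadToken(String value) {
        return value != null && TOKEN.matcher(value).matches();
    }
}
```

With this check, a user payload that merely starts with blob:v1: but lacks a valid container segment and 32-hex suffix is passed through untouched instead of triggering a blob GET.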
[Medium] F4 — Pre-send action-size validation skipped entirely when supportsLargePayloads=true, even for non-externalized fields
File: client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcWorker.java — completeOrchestratorTaskWithChunking.
The guard is if (!this.supportsLargePayloads) — i.e. trust the interceptor to shrink everything. But the interceptor only externalizes specific StringValue fields (result, input, details, reason, customStatus, …). A CompleteOrchestrationAction with huge carryoverEvents (see F1) or large failureDetails will not be reduced by the interceptor and will not be caught by the validator either.
Recommendation: Always run validateActionsSize. The interceptor sits on sendMessage and will already have had its chance by the time the blocking stub invokes gRPC. Alternatively, only bypass validation once every payload-carrying field is confirmed externalizable.
[Medium] F5 — New Java chunking depends exclusively on proto-deprecated isPartial / chunkIndex
File: internal/durabletask-protobuf/protos/orchestrator_service.proto + DurableTaskGrpcWorker.completeOrchestratorTaskWithChunking.
The proto comment in this PR says /* Chunking logic has since been deprecated and fields related to it are marked as such */, and both fields are [deprecated=true]. The new Java code depends entirely on them. If a newer sidecar drops or ignores these fields, chunked responses will be silently treated as a completion on chunk 0 and subsequent chunks dropped — very hard to diagnose.
Recommendation: Confirm with DTS backend owners that:
- Deprecated-but-supported semantics will hold for the supported lifecycle, and
- The long-term replacement is large-payload externalization (shipped in this same PR), with chunking as a fallback for users who can't configure Blob.
Add a javadoc comment in the Java code explicitly calling out the deprecation and the intended horizon.
[Medium] F6 — BlobPayloadStore.upload uses unconditional overwrite
File: BlobPayloadStore.java — both compressed (uploadWithResponse with null requestConditions) and uncompressed (blob.upload(stream, len, true)) paths.
Blob names are random UUIDs so collision is astronomically unlikely, but an If-None-Match: * precondition adds defense-in-depth at zero operational cost — e.g. against a future bug where a caller-supplied PayloadStore generates deterministic names, or a future refactor.
Recommendation: Use BlobRequestConditions.setIfNoneMatch("*") on uploads and convert 409 to a hard failure.
[Medium] F7 — Inconsistent null handling across LargePayloadStorageOptions setters
File: LargePayloadStorageOptions.java.
setConnectionString(null) silently converts to "", but setCredential(null) and setAccountUri(null) store null. A caller cannot reliably clear a previously-set connection string to switch to identity auth.
Recommendation: Pick one behavior and apply it to all three. Simplest is to accept null on all three and normalize in BlobPayloadStore where hasConnectionString / hasIdentityAuth is computed.
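One way the recommended "normalize at the point of use" option could look, as a minimal sketch; field and method names here are illustrative, not the PR's actual API.

```java
// Hypothetical sketch: setters accept null unchanged, and auth-mode checks
// (hasConnectionString / hasIdentityAuth) normalize in one place.
public final class StorageAuthConfig {
    private String connectionString; // may be null
    private String accountUri;       // may be null

    public StorageAuthConfig setConnectionString(String cs) {
        this.connectionString = cs;  // store null as-is; no silent "" conversion
        return this;
    }

    public StorageAuthConfig setAccountUri(String uri) {
        this.accountUri = uri;
        return this;
    }

    public boolean hasConnectionString() {
        return connectionString != null && !connectionString.isEmpty();
    }

    public boolean hasIdentityAuth() {
        return !hasConnectionString() && accountUri != null && !accountUri.isEmpty();
    }
}
```

This makes clearing a previously set connection string to switch to identity auth a one-liner: setConnectionString(null) followed by setAccountUri(...).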
[Medium] F8 — containerVerified uses get then set instead of compareAndSet
File: BlobPayloadStore.java — lazy container-ensure path.
Multiple concurrent first-uploads can each call createIfNotExists(). It's idempotent so this is harmless today, but the pattern is a classic TOCTOU that static analyzers (SpotBugs) may flag.
Recommendation:
if (this.containerVerified.compareAndSet(false, true)) {
    try {
        this.containerClient.createIfNotExists();
    } catch (BlobStorageException e) {
        if (e.getStatusCode() != 409) {
            this.containerVerified.set(false); // allow retry
            throw new PayloadStorageException(...);
        }
    }
}

[Medium] F9 — maxChunkSizeBytes default of 3.9 MB has no safety margin over gRPC's default 4 MB
File: DurableTaskGrpcWorkerBuilder.java — maxChunkSizeBytes = 4_089_446.
The size check uses response.getSerializedSize() (protobuf payload only). gRPC framing adds 5 bytes (compression flag + length prefix) and HTTP/2 framing adds more. The chunking loop also uses a 0.99 multiplier before falling to the accurate check, so chunks can approach the full 3.9 MB. If the sidecar runs with the gRPC default maxInboundMessageSize (4 MB = 4,194,304 B) the margin is only ~100 KB.
Recommendation: Either drop the default to ~3.5 MB, or document clearly in javadoc that the sidecar must accept ≥4 MB inbound messages and link to the .NET constant so the two SDKs can't drift silently.
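The arithmetic behind the margin claim, as a quick sketch. The two constants come from this review (the PR's maxChunkSizeBytes default and gRPC's default maxInboundMessageSize); the 5-byte figure is the standard gRPC message frame (1 compression flag byte + 4-byte length prefix).

```java
// Sketch of the F9 headroom calculation: how much room is left between the
// largest chunk the worker may emit and the sidecar's default inbound limit.
public final class ChunkMargin {
    static final int GRPC_DEFAULT_MAX_INBOUND = 4 * 1024 * 1024; // 4,194,304 B
    static final int MAX_CHUNK_SIZE_BYTES = 4_089_446;           // PR default
    static final int GRPC_MESSAGE_FRAME_OVERHEAD = 5;            // flag byte + length prefix

    public static int marginBytes() {
        // Headroom for HTTP/2 framing and any size-estimation error.
        return GRPC_DEFAULT_MAX_INBOUND - MAX_CHUNK_SIZE_BYTES - GRPC_MESSAGE_FRAME_OVERHEAD;
    }
}
```

That works out to 104,853 bytes (~102 KiB) before HTTP/2 framing is accounted for, which is why a ~3.5 MB default (or an explicit sidecar-side requirement in javadoc) is suggested.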
Out of scope for this comment set (tracked locally, not posted)
- F10 client-side sendMessage not wrapping upload errors.
- F11 azure managed test dependency.
- F12 unrelated proto additions (ActivityRequest.tags, RewindOrchestrationAction, PurgeInstanceFilter.timeout).
- F13 stray ;; on chunkIndex.
- F14 estimateChunkSerializedSize clone-per-check performance.
Happy to elaborate on any of these or pair on the fixes.
Issue describing the changes in this PR
Adding large payload support for the standalone SDK
resolves #issue_for_this_pr
Pull request checklist
CHANGELOG.md