Skip to content

Fix thread safety issues in Communicator, Session, and AsyncPromiseFulfillerDecorator#433

Open
neinhart wants to merge 1 commit intoChargeTimeEU:masterfrom
neinhart:fix/thread-safety
Open

Fix thread safety issues in Communicator, Session, and AsyncPromiseFulfillerDecorator#433
neinhart wants to merge 1 commit intoChargeTimeEU:masterfrom
neinhart:fix/thread-safety

Conversation

@neinhart
Copy link
Copy Markdown

@neinhart neinhart commented Apr 5, 2026

Problem

Several thread safety issues exist in ocpp-common that can cause data corruption
or resource exhaustion under concurrent load:

  1. Communicator: transactionQueue uses non-thread-safe ArrayDeque, but is
    accessed concurrently by sendCall() and the RetryRunner thread without
    synchronization. failedFlag is a plain boolean with no cross-thread
    visibility guarantee.

  2. AsyncPromiseFulfillerDecorator: Default executor is Executors.newCachedThreadPool()
    which creates threads without bound. When many chargers send requests simultaneously,
    this can exhaust memory with OutOfMemoryError: unable to create new native thread.

  3. Session.onCall(): The synchronized keyword serializes all incoming CALL messages
    per session, but all internal state is already thread-safe (ConcurrentHashMap,
    stateless Gson, immutable FeatureRepository). Additionally, pendingPromises
    entries are never removed after completion, causing unbounded map growth.

Changes

  • Replace ArrayDeque with ConcurrentLinkedDeque and boolean with AtomicBoolean
    in Communicator
  • Replace unbounded CachedThreadPool with bounded ThreadPoolExecutor in
    AsyncPromiseFulfillerDecorator (core=CPUs, max=CPUs×2, queue=1000,
    CallerRunsPolicy for natural backpressure). Existing setExecutor() API
    is preserved for custom configuration.
  • Remove unnecessary synchronized from Session.onCall()
  • Add whenComplete callback to auto-remove pendingPromises entries

Tests

Added 3 new test classes (10 tests total):

  • AsyncPromiseFulfillerDecoratorTest — async delegation, non-blocking, executor resilience
  • CommunicatorConcurrencyTest — concurrent sendCall, concurrent sendCall + reconnect
  • SessionConcurrencyTest — concurrent onCall, pendingPromises cleanup, concurrent completion

…lfillerDecorator

- Replace non-thread-safe ArrayDeque with ConcurrentLinkedDeque in
  Communicator transaction queue to fix race condition where RetryRunner
  thread and sendCall() concurrently access the queue without
  synchronization
- Replace boolean failedFlag with AtomicBoolean for cross-thread
  visibility between RetryRunner and EventHandler
- Replace unbounded CachedThreadPool with bounded ThreadPoolExecutor
  in AsyncPromiseFulfillerDecorator (core=CPUs, max=CPUs*2, queue=1000,
  CallerRunsPolicy) to prevent thread explosion when many chargers send
  requests simultaneously
- Remove unnecessary synchronized from Session.onCall() — all internal
  state is already thread-safe (ConcurrentHashMap, stateless Gson,
  immutable FeatureRepository). The lock serialized all incoming CALL
  messages per session despite OCPP supporting concurrent requests via
  unique message IDs
- Add automatic cleanup of pendingPromises via whenComplete callback
  to prevent unbounded map growth
@robert-s-ubi
Copy link
Copy Markdown
Contributor

Hi @neinhart, I haven't fully looked into all issues, but this one:

| Additionally, pendingPromises entries are never removed after completion, causing unbounded map growth.

does not seem correct to me. See Session#completePendingPromise() which removes pendingPromises when completing.

@neinhart
Copy link
Copy Markdown
Author

Hi @robert-s-ubi, thanks for the careful review and the correction — you're right that my wording in the PR description was inaccurate.

Let me trace through the code more precisely:

Normal success path — you're correct:

  1. SimplePromiseFulfiller.fulfill() calls eventHandler.handleRequest(request) and gets a Confirmation
  2. It then calls eventHandler.asyncCompleteRequest(messageId, conf) (SimplePromiseFulfiller.java:47)
  3. asyncCompleteRequest delegates to session.completePendingPromise() (Server.java:143)
  4. completePendingPromise removes the entry from the map. ✅

So in the successful path, entries are removed correctly as you pointed out.

Exception path — where I think there is still a leak:

In SimplePromiseFulfiller.fulfill() (lines 49-54), if eventHandler.handleRequest(request) throws an exception, the catch block calls promise.completeExceptionally(ex) directly on the promise, bypassing asyncCompleteRequest/completePendingPromise.

} catch (Exception ex) {
    logger.warn(\"fulfillPromise() failed\", ex);
    if (promise != null) {
        promise.completeExceptionally(ex);  // <-- bypasses completePendingPromise
    }
}

In this case, whenComplete(ConfirmationHandler) fires and sends back the error response, but the entry in pendingPromises is never cleaned up. Each exception from a handler leaks one entry.

The promise.whenComplete((result, error) -> pendingPromises.remove(id)) callback I added is a no-op in the success path (the entry is already gone) but handles the exception path as well.

That said, I agree the PR description overstates the issue — it's an edge case in the exception path, not a general leak. Would you prefer:

  1. I keep the whenComplete cleanup as defensive and update the PR description to accurately describe the exception-path scenario, or
  2. I remove this specific change from the PR (since it's a minor edge case) and the PR focuses only on the clear thread-safety fixes (ArrayDeque, AtomicBoolean, ThreadPoolExecutor, synchronized removal)?

Happy to go either way. Thanks again for taking the time to review!

@robert-s-ubi
Copy link
Copy Markdown
Contributor

That said, I agree the PR description overstates the issue — it's an edge case in the exception path, not a general leak. Would you prefer:

  1. I keep the whenComplete cleanup as defensive and update the PR description to accurately describe the exception-path scenario, or
  2. I remove this specific change from the PR (since it's a minor edge case) and the PR focuses only on the clear thread-safety fixes (ArrayDeque, AtomicBoolean, ThreadPoolExecutor, synchronized removal)?

@neinhart I'd prefer option 2, and put the promise leak in a separate PR. Something similar was already fixed with the promises for outgoing requests in this commit: 32903d4 - and that change replaced the existing promise removals with a single whenComplete() removal. I'd like to solve this issue the same way, if possible.

Copy link
Copy Markdown
Contributor

@robert-s-ubi robert-s-ubi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the tests yet. I suppose not all will apply when the completion leak fix is removed?

Comment on lines +271 to +272
"An error occurred. Sending this information: uniqueId {}: action: {}, errorCode: {},"
+ " errorDescription: {}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this unnecessary formatting change.

Comment on lines +208 to +209
"Payload for Action is syntactically correct but at least one of the fields violates"
+ " occurrence constraints";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this unnecessary formatting change.

Comment on lines +213 to +214
"An internal error occurred and the receiver was not able to process the requested Action"
+ " successfully";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this unnecessary formatting change.

if (request.validate()) {
CompletableFuture<Confirmation> promise = new CompletableFuture<>();
promise.whenComplete(new ConfirmationHandler(id, action, communicator));
promise.whenComplete((result, error) -> pendingPromises.remove(id));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert and make this change in a different PR.

Comment on lines +45 to +46
int coreSize = Runtime.getRuntime().availableProcessors();
int maxSize = coreSize * 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the effects of this. Note that this also affects the request handling on the charging station side, and I've seen CSMS "bomb" charging stations with parallel ChangeConfigurationRequests for all read-write configuration settings. Would this still work the same as before?

Also, I gather that once the threads are exhausted (which may be only 2 on a single-core charging station), the executor switches to executing the request handler within the onCall() caller's thread. But if then an exception is thrown in the request handler, it would not result in the same response as when it was thrown in a separate thread, would it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants