fix: handle direct model answers in ReACT loop #763
markstur wants to merge 3 commits into generative-computing:main
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
Note to reviewers: I do have concerns about my limited test environment and side effects. I see this as a good fix for when react_using_mellea with DuckDuckGo starts to find nothing. I suspect it could be affected by DDGS rate limiting, but I'm not sure why it looks like this. The bottom line is that there is a case where we have an answer in the step value (is_complete) but we miss our is_final handling. I wonder if there is any reason to only do this check after running out of iterations (last-ditch handling), but it seems more right to me to just use the value when this elif case happens.
jakelorocco left a comment
Hi @markstur, thanks for the PR! I think this might not be an ideal way to fix the issue. I do agree that our current version of the react thinking pattern does get stuck in loops (especially for simpler answers that can be accomplished in one response).
However, I don't think we should automatically assume that a step with no tool calls and a response is the final answer. There are moments where the model will output its thoughts as intermediate steps and then continue.
The issue I see the most (especially with smaller models) is that the model thinks its final tool has already been called. As a result, it just keeps repeating the same output and gets stuck until the loop exhausts.
I think there are a few potential solutions:
- We could change the requirements for calling a tool to finalize. A lot of react patterns just look for "final_answer:" and parse the output. We could also add this in addition to the current tool call approach.
- We could try to detect repetitions and prompt the model out of those situations. I'm not quite sure if the exact repetitions are applicable only to granite models / small models / our prompts though.
- We could add a subsequent LLM call after each step that is a requirement that validates if the question has been answered. This adds overhead to each loop iteration, but is likely relatively low since the context should be cached. Then, if that requirement is valid, we could do one more prompt to extract the final answer using the tool / the current approach.
Thanks @jakelorocco. Yes, I agree my naive approach is probably assuming too much. I was unsuccessful when I first tried to fix this with some alternative approaches, but I think I need to revisit, because the problem was flaky at the time. I think I can reproduce it better now (maybe). I'll see if I can get good results that align better with your bulleted suggestions.
planetf1 left a comment
Just one small comment on test classification, otherwise LGTM.
```
@pytest.mark.ollama
@pytest.mark.llm
```

Suggested change:

```
@pytest.mark.ollama
@pytest.mark.e2e
```

We updated our markers (llm->e2e).
this change is needed for tests
Consider @pytest.mark.qualitative if we're doing anything that depends on actual LLM results (and not just testing connectivity), i.e. values that we parse.
@planetf1 @jakelorocco I was letting this sit because I don't have a good solution, but I wanted to at least share this hack. It waits until the iteration loop ends and then asks the LLM if it has an answer before failing. It works for me in this ollama small-model loopy situation. So before I give up on this, I thought I'd share it and ask if you think there is a more Melleaic way of doing this. Tweaking the Jinja templates didn't do it for me, probably because this is a small model behaving badly. String parsing for "final answer" in a variety of formats would work sometimes, but it is also hacky and inconsistent. I like the idea of not doing it in every loop, but maybe, as @jakelorocco said, it could be fast enough with caching. Can/should we have a way to pass in an optional are-you-done test for the case where tools are not getting called, and call it either whenever tools are not called, or at the end, or after X loops?
I think having that callback makes sense, as it allows for any of the options described here to be implemented as a specific designed intent by the author. I think passing the iteration (int) makes sense -- the condition may use it to get progressively more flexible in accepting an answer. (Do we need to include max turns/budget, since that is eventually the limit for react? Though that is getting more complex.)
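The callback discussed here might look something like the sketch below. These names and parameters are my own assumptions based on the thread (iteration number plus loop budget), not the PR's actual signature:

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical "are-you-done" callback shape: (goal, step_value, turn_num,
# loop_budget) -> bool. None of these names are the actual Mellea API.
AnswerCheck = Callable[[str, str, int, int], Awaitable[bool]]

async def accept_late_answers(
    goal: str, step_value: str, turn_num: int, loop_budget: int
) -> bool:
    """Grow more permissive as the budget runs out: only accept a direct
    answer (no tool calls) in the final quarter of the loop budget."""
    return bool(step_value) and turn_num >= loop_budget * 3 // 4

print(asyncio.run(accept_late_answers("q", "42", 9, 10)))  # turn 9 of 10 -> True
```

Passing both `turn_num` and `loop_budget` lets a single callback implement "check every loop", "check only at the end", or "check after X loops" without any framework changes.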
The ReACT framework now properly handles cases where the model provides a direct answer without calling tools. Previously, these answers were ignored and the loop would continue until exhausting the budget. Added test coverage for both scenarios (no tools, unused tools). Fixes: generative-computing#762 Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
With some models (and Ollama for example) we get stuck where the model has the answer but won't call finalize. Before failing due to iteration limit, ask the model if it has the answer and if it responds True then use it. Note: This is only done at the end of iterations because it is questionable to penalize other models on each iteration. When failure is the only option, it seems to be worth a try. Fixes: generative-computing#762 Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
* optional callback
* example shows checking for "Final Answer" each iteration
* and also querying the LLM when loop budget is reached

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
Updated to a version with an optional callback. This is definitely useful and powerful. It's still a bit TBD what the right params and return should be; I like at least having the iteration number and loop budget. The main thing I'd like to do is handle more of the finalizing within the callback, but I stopped here because I'm feeling that is too complicated for implementers (and docs). I want to look a bit closer at that. Still need to settle on what the expected result and modifications should be.
```
from mellea.stdlib.context import ChatContext

class TrueOrFalse(pydantic.BaseModel):
```

We have TrueOrFalse in three places in this PR (since one is a test and one is an example, I get there's justification) -- this particular one isn't used? Do you expect to use it in the framework?
```
@pytest.mark.ollama
@pytest.mark.llm
```

this change is needed for tests
```
if loop_budget == -1 or turn_num < loop_budget:
    return False

content = mfuncs.chat(
```

We're calling a sync method from async code. Technically it works in this sample as there's nothing else happening. But if the user copied it and used it when running multiple react loops, or within something like FastAPI where there's a shared event loop, then this callback would freeze out other handlers while the LLM call is running.

Can we await the async variant instead, i.e. `await mfuncs.achat(...)`? Note also the method sig is slightly different - a tuple vs a list. I think this gives a better example, and as these are examples where we show appropriate usage, I do think it's worth changing.
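The starvation the reviewer describes can be demonstrated with a stdlib-only sketch, with `time.sleep` standing in for the blocking sync call and `asyncio.sleep` for the awaited variant (no Mellea code involved):

```python
import asyncio
import time

# Stand-in for a blocking sync call (like calling mfuncs.chat() from async code).
def blocking_call() -> None:
    time.sleep(0.2)  # the whole event loop is frozen for this sleep

# Stand-in for the async variant (like `await mfuncs.achat(...)`).
async def non_blocking_call() -> None:
    await asyncio.sleep(0.2)  # other tasks keep running during the wait

async def heartbeat(ticks: list[int]) -> None:
    # A concurrent task standing in for "other handlers" (e.g. FastAPI requests).
    for i in range(100):
        ticks.append(i)
        await asyncio.sleep(0.03)

async def main() -> tuple[int, int]:
    ticks: list[int] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)      # let the heartbeat get its first tick in
    blocking_call()             # heartbeat is starved: no new ticks appear
    ticks_during_block = len(ticks)
    await non_blocking_call()   # heartbeat keeps ticking while we wait
    ticks_during_await = len(ticks)
    task.cancel()
    return ticks_during_block, ticks_during_await

blocked, awaited = asyncio.run(main())
print(blocked, awaited)  # the sync call starves the loop: blocked stays at 1
```

During `blocking_call()` the heartbeat never runs, while the awaited variant lets it keep ticking, which is exactly why the example should use the async API.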
```
model_options: dict | None = None,
tools: list[AbstractMelleaTool] | None,
loop_budget: int = 10,
answer_check: Callable[
```

It's good we have the flexibility, though it's a lot of positional parameters - that may be rationale for a new type (or Protocol) - but we do neither elsewhere, given that callables often have only 2 params. Technically it's not an issue - the code works. It's more about usability, given it's an external API. Worth thinking about, but I wouldn't say it's a blocker.

Essential will be to get the docstring right at least.
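For reference, the Protocol idea floated above could look like this. It is a hypothetical sketch with made-up parameter names, not the PR's actual types; the point is that a Protocol names the positional parameters that a bare `Callable[...]` leaves anonymous:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class AnswerCheck(Protocol):
    # Hypothetical Protocol naming the callback's parameters, replacing a long
    # positional Callable[...] annotation. Not the actual Mellea API.
    async def __call__(
        self, goal: str, step_value: str, turn_num: int, loop_budget: int
    ) -> bool: ...

async def never_done(
    goal: str, step_value: str, turn_num: int, loop_budget: int
) -> bool:
    return False

# A plain async function satisfies the Protocol structurally.
print(isinstance(never_done, AnswerCheck))  # True
```

The trade-off is an extra public type to document, which is why it may be reasonable to stay with a plain `Callable` as the rest of the codebase does.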
```
@pytest.mark.ollama
@pytest.mark.llm
async def test_react_direct_answer_without_tools():
```

The control-flow change in react.py (the new `elif answer_check and step.value` branch) is fully testable without a live LLM using the ScriptedBackend already in test_react_framework.py.
We've been adding more unit/integration tests to reduce the need to run e2e with slower times/overhead locally, so having some mocking like this adds coverage for relevant code paths.
Suggest adding to test/stdlib/frameworks/test_react_framework.py (untested):

```python
@pytest.mark.asyncio
async def test_react_answer_check_terminates_on_direct_response():
    """answer_check returning True on a no-tool-call turn exits the loop."""
    backend = ScriptedBackend([_ScriptedTurn(value="42")])

    async def always_done(goal, step, ctx, backend, opts, turn, budget):
        return True

    result, _ = await react(
        goal="answer",
        context=ChatContext(),
        backend=backend,
        tools=None,
        loop_budget=5,
        answer_check=always_done,
    )
    assert result.value == "42"
```

```
answer_check: optional callable to determine if the agent has completed its task.
    Called every iteration when no tool calls are made and step.value exists (if provided).
    Receives (goal, step, context, backend, model_options, turn_num, loop_budget).
    Returns bool indicating if the task is complete.
    If None, no answer check is performed (loop continues until finalizer or budget exhausted).
```

Possible suggestion — the answer_check Args entry uses informal prose (`Receives (...)`) that Griffe renders as plain text. Following the project convention in requirement.py (inline prose with types):

```
answer_check: Optional async callable invoked each iteration when no tool
    calls are made and ``step.value`` is non-empty. Receives ``goal`` (str),
    ``step`` (ComputedModelOutputThunk[str]), ``context`` (ChatContext),
    ``backend`` (Backend), ``model_options`` (dict | None), ``turn_num`` (int),
    and ``loop_budget`` (int). Return ``True`` to accept ``step.value`` as the
    final answer, ``False`` to continue. If ``None``, the loop runs until
    ``final_answer`` is called or ``loop_budget`` is exhausted.
```
```
Returns:
    A (ModelOutputThunk, Context) if `return_sampling_results` is `False`, else returns a `SamplingResult`.
```

Possible suggestion — `return_sampling_results` is not a parameter on react(); this is stale and will render incorrectly in the API docs:

```
Returns:
    Tuple of ``(ComputedModelOutputThunk[str], ChatContext)`` — the final answer thunk and updated context.
```
```
@@ -0,0 +1,108 @@
"""Test ReACT framework handling of direct answers without tool calls."""
```

This file probably should be in test/stdlib/frameworks to match the source structure?