Changes related to running benchmark experiments for the paper: support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes, etc. #32
Open
ErlisLushtaku wants to merge 24 commits into main from
Conversation
- fix dependencies
- add structured output to prevent judge from not respecting the prompt
kargibora
reviewed
Apr 7, 2026
```diff
 [project.optional-dependencies]
-vllm = ["vllm==0.10.2", "transformers>=4.55.2,<5.0.0"]
+# vLLM on PyPI pins transformers<5; optional extra matches that so `uv lock` can resolve.
+vllm = ["vllm>=0.17.0,<1.0.0", "transformers>=4.56.0,<5.0.0"]
```
Collaborator
`vllm>=0.17.0,<1.0.0` is a very wide range. A few concerns:
- Was this tested with a prebuilt wheel or built from source? Building vLLM from source on cluster nodes often fails due to CUDA kernel compilation issues.
- Is the `StructuredOutputsParams` import path (`vllm.sampling_params`) stable across this entire range? It may have been introduced in 0.17 and could move; for example, `StructuredOutputParams` was a bit different when `vllm==0.11.0`. I think it makes more sense to pin a narrower, more stable range.
Collaborator
Author
Good point. I tightened the range; 0.18.1 was working, and I think `StructuredOutputParams` is stable across the new range.
Collaborator
Author
Changed it to v0.19+ so that we can use the thinking-token-limit parameter; they also have some fixes for Qwen3.5.
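For context on the structured-output change discussed above (switching the judge to a JSON schema constraint), here is a minimal sketch. The schema fields and the validator are illustrative assumptions, not the PR's actual schema; the `StructuredOutputsParams` usage in the comment follows the import path named in this thread and should be verified against the pinned vLLM version.

```python
import json

# Hypothetical judge-verdict schema; field names are illustrative,
# not the PR's actual schema.
JUDGE_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["A", "B", "tie"]},
        "explanation": {"type": "string"},
    },
    "required": ["verdict", "explanation"],
}

def validate_verdict(raw: str) -> dict:
    """Parse a judge response and check it against the schema by hand
    (stdlib only; a real setup may rely on the decoder constraint alone)."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("verdict must be a JSON object")
    for field in JUDGE_SCHEMA["required"]:
        if field not in obj:
            raise ValueError(f"missing required field: {field}")
    if obj["verdict"] not in JUDGE_SCHEMA["properties"]["verdict"]["enum"]:
        raise ValueError(f"invalid verdict: {obj['verdict']}")
    return obj

# With vLLM, the same schema could be attached at sampling time, e.g.
# (assumed API per this thread; check against your pinned version):
#   from vllm.sampling_params import SamplingParams, StructuredOutputsParams
#   params = SamplingParams(
#       structured_outputs=StructuredOutputsParams(json=JUDGE_SCHEMA))

print(validate_verdict('{"verdict": "A", "explanation": "A is more helpful."}'))
```

Constraining the decoder to a schema (rather than a fixed choice list) lets the judge return both a verdict and free-form reasoning while still guaranteeing parseable output.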
Force-pushed from ab3db1b to ef1c92c
- Switch from choice-based structured outputs to JSON schema constraint
- Tighten vllm version range from >=0.17.0,<1.0.0 to >=0.17.0,<0.19.0
…ench baseline from huggingface and update huggingface repo
…gex stripping since the structured output wasn't working for isolating thinking tokens anyway
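The commit above falls back to regex stripping for isolating thinking tokens. A minimal sketch of that kind of fallback, assuming Qwen-style `<think>…</think>` tags (the tag name and the unclosed-tag handling are assumptions, not the PR's exact implementation):

```python
import re

# Matches a complete thinking block; DOTALL lets it span newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def strip_thinking(text: str) -> tuple[str, str]:
    """Split a completion into (thinking, answer) by regex.
    An unclosed <think> block (model hit the token limit mid-thought)
    is treated as all-thinking with no answer."""
    blocks = THINK_RE.findall(text)
    answer = THINK_RE.sub("", text)
    if "<think>" in answer:  # unclosed tag: rest of the text is thinking
        head, _, tail = answer.partition("<think>")
        blocks.append(tail)
        answer = head
    return "\n".join(blocks).strip(), answer.strip()

print(strip_thinking("<think>weigh A vs B</think>Answer: A"))
# → ('weigh A vs B', 'Answer: A')
```

Handling the unclosed-tag case matters for truncation tracking: a response cut off mid-thought should count as truncated rather than as an empty answer.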
ErlisLushtaku
commented
Apr 17, 2026
…so that we have more customizability
- Introduced `truncate_judge_input_chars` and `max_judge_model_len` to `BaseCliArgs` for better control over judge-side input limits.
- Refactored baseline assignment for Arena-Hard datasets to support different baselines per category, matching the original benchmark.
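The per-category baseline refactor described above might look roughly like the following. Category names and baseline model names are placeholders, not the actual Arena-Hard configuration:

```python
# Hypothetical per-category baseline selection for Arena-Hard-style
# pairwise judging; the mapping below is illustrative only.
DEFAULT_BASELINE = "baseline-model-a"
CATEGORY_BASELINES = {
    "creative_writing": "baseline-model-b",  # assumed category name
}

def baseline_for(category: str) -> str:
    """Pick the comparison baseline for a question's category,
    falling back to the benchmark-wide default."""
    return CATEGORY_BASELINES.get(category, DEFAULT_BASELINE)

print(baseline_for("creative_writing"))  # → baseline-model-b
print(baseline_for("math"))              # → baseline-model-a
```

Keeping the mapping in one place means new categories only need a dict entry rather than changes to the judging loop.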
…ted token count for max_model_len