
Changes related to running benchmark experiments for the paper: Support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes, etc. (#32)

Open
ErlisLushtaku wants to merge 24 commits into main from erlislushtaku/fix/support-qwen-3.5

Conversation

Collaborator

@ErlisLushtaku ErlisLushtaku commented Apr 6, 2026

  • Updated dependencies to support Qwen3.5
  • Added a thinking token budget to prevent Qwen from spending the entire generation budget on thinking without outputting a verdict.
  • ...
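A later commit in this PR mentions regex stripping of thinking tokens. A minimal sketch of that kind of stripping, assuming Qwen-style `<think>…</think>` delimiters (the helper name is hypothetical, not from the repo):

```python
import re

# Hypothetical helper (not the repo's implementation): drop a Qwen-style
# <think>...</think> block so only the verdict text remains. Assumes the
# model emits at most one well-formed thinking block.
def strip_thinking(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>weighing both answers...</think>Verdict: A"))
# prints "Verdict: A"
```

The non-greedy `.*?` with `re.DOTALL` keeps the match from spanning past the first closing tag even when the thinking block contains newlines.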

@ErlisLushtaku ErlisLushtaku changed the title from "Support qwen 3.5" to "Support Qwen3.5" on Apr 6, 2026
Comment thread: pyproject.toml (Outdated)
  [project.optional-dependencies]
- vllm = ["vllm==0.10.2", "transformers>=4.55.2,<5.0.0"]
+ # vLLM on PyPI pins transformers<5; optional extra matches that so `uv lock` can resolve.
+ vllm = ["vllm>=0.17.0,<1.0.0", "transformers>=4.56.0,<5.0.0"]
Collaborator

vllm>=0.17.0,<1.0.0 is a very wide range. A few concerns:

  • Was this tested with a prebuilt wheel or built from source? Building vLLM from source on cluster nodes often fails due to CUDA kernel compilation issues.
  • Is the StructuredOutputsParams import path (vllm.sampling_params) stable across this entire range? It may have been introduced in 0.17 and could move; for example, StructuredOutputParams was a bit different in vllm==0.11.0. So I think a tighter, more stable version range makes more sense.
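The reviewer's worry is about which releases fall inside the proposed pin. A quick way to reason about that is a range check over dotted version tuples; a minimal sketch (`in_range` is a hypothetical helper, not part of the repo, and ignores pre-release suffixes):

```python
# Hypothetical helpers (not from the repo): check whether a version string
# falls inside a pin range like >=lower,<upper. Handles plain dotted
# versions only; pre-release tags would need a real parser such as
# packaging.version.
def _parse(v: str) -> tuple:
    return tuple(int(p) for p in v.split("."))

def in_range(version: str, lower: str, upper: str) -> bool:
    # Lower bound inclusive, upper bound exclusive, matching >=lower,<upper.
    return _parse(lower) <= _parse(version) < _parse(upper)

print(in_range("0.18.1", "0.17.0", "1.0.0"))  # prints True  (inside the wide range)
print(in_range("0.11.0", "0.17.0", "1.0.0"))  # prints False (predates the range)
```

Under the wide `>=0.17.0,<1.0.0` pin, every minor release up to 1.0 is fair game for the resolver, which is exactly why an API that "could move" is a risk.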

Collaborator Author

Good point. I tightened the range; 0.18.1 was working. I think StructuredOutputParams is stable across the new range.

Collaborator Author

Changed it to v0.19+ so that we can use the thinking token limit parameter; that release also has some fixes for Qwen3.5.

Comment thread: judgearena/evaluate.py (Outdated)
Comment thread: judgearena/evaluate.py
@ErlisLushtaku ErlisLushtaku force-pushed the erlislushtaku/fix/support-qwen-3.5 branch from ab3db1b to ef1c92c on April 7, 2026 14:19
- Switch from choice-based structured outputs to JSON schema constraint
- Tighten vllm version range from >=0.17.0,<1.0.0 to >=0.17.0,<0.19.0
…ench baseline from huggingface and update huggingface repo
@ErlisLushtaku ErlisLushtaku changed the title from "Support Qwen3.5" to "Support Qwen3.5, fix mt-bench runs and other fixes" on Apr 14, 2026
…gex stripping since the structured output wasn't working for isolating thinking tokens anyway
@ErlisLushtaku ErlisLushtaku changed the title from "Support Qwen3.5, fix mt-bench runs and other fixes" to "Changes related to running benchmark experiments for the paper: Support Qwen3.5, mt-bench, Skywork, and other changes" on Apr 17, 2026
Comment thread: judgearena/utils.py (Outdated)
ErlisLushtaku and others added 7 commits April 17, 2026 14:24
…so that we have more customizability

- Introduced `truncate_judge_input_chars` and `max_judge_model_len` to `BaseCliArgs` for better control over judge-side input limits.
- Refactored baseline assignment for Arena-Hard datasets to support different baselines per category, matching the original benchmark.
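The new `truncate_judge_input_chars` flag suggests a helper of roughly the following shape. A minimal sketch (the function name and its semantics are assumptions, not the repo's implementation); pairing the truncated text with a flag is what makes the PR's truncation tracking possible:

```python
# Hypothetical helper (only the flag name comes from the PR description):
# truncate judge-side input to a character budget and record whether
# truncation happened, so truncation events can be tracked downstream.
def truncate_judge_input(text: str, limit: int) -> tuple:
    # Assumption: a non-positive limit disables truncation entirely.
    if limit <= 0 or len(text) <= limit:
        return text, False
    return text[:limit], True

out, truncated = truncate_judge_input("x" * 100, 40)
print(len(out), truncated)  # prints "40 True"
```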
@ErlisLushtaku ErlisLushtaku changed the title Changes related to running benchmark experiments for the paper: Support Qwen3.5, mt-bench, Skywork, and other changes Changes related to running benchmark experiments for the paper: Support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes etc Apr 21, 2026