fix: support nested tags in structured compression regex (#201) #243

Open
ousamabenyounes wants to merge 1 commit into microsoft:main from ousamabenyounes:fix/issue-201

@ousamabenyounes commented Apr 11, 2026

What does this PR do?

Fixes #201

The regex pattern used in segment_structured_context to parse <llmlingua>…</llmlingua> blocks used [^<]+ for the content group. When a block contained an inner HTML/XML tag (e.g. <tag>nested</tag>), the < from the inner tag terminated the match early, silently dropping that segment and all subsequent ones.

Before (broken):

pattern = r"...(  [^<]+  )</llmlingua>"
# Input with <tag>…</tag> inside a block → last segment silently dropped
[('', 'False', '', '', 'Speaker 4:'), ..., ('', 'False', '', '', '\nSpeaker 4:')]  # missing last!

After (fixed):

pattern = r"...(  (?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?  )</llmlingua>"
# All 6 segments captured, inner tags treated as plain text
[('', 'False', '', '', 'Speaker 4:'), ..., ('0.6', '', '', '', ' We have <tag>…</tag> here.')]

The new content group (?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*? is a repeated group rather than a true alternation. It reads:

  • zero or more non-< characters, then optionally an inner tag that is not </llmlingua>
  • non-greedy so it still stops at the first </llmlingua>
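The behaviour described above can be checked in isolation with a minimal sketch. It assumes a simplified opening tag, <llmlingua[^>]*>, as a stand-in for the attribute-capturing prefix that is elided as "..." in the snippets above; the names OPEN, CLOSE, OLD_CONTENT and NEW_CONTENT are hypothetical and are not taken from the LLMLingua source. Only the two content groups are copied from this PR.

```python
import re

# Hypothetical, simplified opening/closing tags. The real pattern also
# captures rate/compress attributes (elided as "..." in the PR text).
OPEN = r"<llmlingua[^>]*>"
CLOSE = r"</llmlingua>"

# Content groups as quoted in this PR.
OLD_CONTENT = r"([^<]+)"
NEW_CONTENT = r"((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)"

old_pattern = re.compile(OPEN + OLD_CONTENT + CLOSE)
new_pattern = re.compile(OPEN + NEW_CONTENT + CLOSE)

text = (
    "<llmlingua, compress=False>Speaker 4:</llmlingua>"
    "<llmlingua, rate=0.6> We have <tag>nested</tag> here.</llmlingua>"
)

# With a single capturing group, findall returns the captured content strings.
old_segments = old_pattern.findall(text)
new_segments = new_pattern.findall(text)

print(old_segments)  # the block containing <tag>...</tag> is missing
print(new_segments)  # both blocks, inner tag kept as plain text
```

In this sketch the old group stops at the `<` of `<tag>` and the second block never matches, while the new group steps over `<tag>` and `</tag>` and ends at the first `</llmlingua>` because the negative lookahead refuses to consume the closing tag.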

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
    Issue: [Bug]: regex pattern does not handle nested tags in prompt (structured compression) #201
  • Did you make sure to update the documentation with your changes? (not applicable — internal parsing fix)
  • Did you write any new necessary tests?
    Added tests/test_nested_tag_regex.py with 4 unit tests: plain content regression, nested-tag fix, full multi-segment scenario, and rate-attribute preservation. Tests read the pattern directly from source to stay in sync with the code.

Who can review?

@iofu728 @QianhuiWu

Generated by Ora Studio
Vibe coded by ousamabenyounes

The content group used [^<]+ which fails when a <llmlingua> block
contains inner HTML/XML tags (e.g. <tag>...</tag>), silently dropping
all subsequent segments. Replaced with a non-greedy alternation that
skips over inner tags while still terminating at </llmlingua>.

Generated by Claude Code
Vibe coded by ousamabenyounes

Co-Authored-By: Claude <noreply@anthropic.com>

@ousamabenyounes (Author)

@microsoft-github-policy-service agree
