
fix: Do not escape regex in d2:validatePattern [DHIS2-21359] #100

Open
enricocolasante wants to merge 3 commits into main from DHIS2-21359

@enricocolasante enricocolasante commented Apr 23, 2026

Fix: preserve regex backslash escapes in d2:validatePattern DHIS2-21359

Problem

d2:validatePattern was silently misinterpreting regex patterns that contained backslash-escaped characters inside character classes.

String literals in expressions are stored as Utf8StringNode, whose decode() method strips backslashes from escape sequences (\XX). For most string usage this is correct, but a regex pattern handed to d2:validatePattern should reach the regex engine with its backslashes intact.

The result was that a pattern like

[a-zA-Z0-9À-ȕ\'\-\'\`\'\ ]+

(where the apostrophes include the typographic U+2018/U+2019) would be decoded to […'-'…]. Java then reads ' (U+0027), the now-unescaped hyphen, and U+2018 as a character range spanning code points 39–8216 (U+0027 to U+2018), which accidentally covers the entire Cyrillic block (e.g. И = U+0418 = 1048). As a result, inputs that should not match, such as Иван, incorrectly returned true.
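The range effect can be reproduced in isolation with java.util.regex. The following is a minimal sketch, not code from this PR: the class name RangeDemo and both constants are made up for illustration, and all non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Illustrative only: RangeDemo and its constants are hypothetical names, not part of the PR.
class RangeDemo {
    // What the regex engine should receive: the escaped hyphen is a literal hyphen.
    static final String INTENDED = "[a-zA-Z0-9\\u00C0-\\u0215'\\-\\u2018`\\u2019 ]+";

    // What it received once the backslashes were stripped:
    // apostrophe, hyphen, U+2018 now form a range from U+0027 to U+2018.
    static final String DECODED = "[a-zA-Z0-9\\u00C0-\\u0215'-\\u2018`\\u2019 ]+";

    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String ivan = "\u0418\u0432\u0430\u043D"; // the Cyrillic name from the description

        System.out.println(matches(DECODED, ivan));    // true: Cyrillic lies inside the accidental range
        System.out.println(matches(INTENDED, ivan));   // false: no range, so Cyrillic is rejected
        System.out.println(matches(INTENDED, "John")); // true: plain Latin still matches
    }
}
```

The only difference between the two constants is the backslash in front of the hyphen, which is enough to turn three literal characters into one range of more than 8,000 code points.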

A naive fix, passing the raw string (backslashes included) directly to the regex engine, solves the JVM case but breaks JS: the Kotlin/JS regex engine runs in unicode mode and rejects unknown backslash escapes such as \', \`, and backslash-space as invalid.

Fix

Added Utf8StringNode.decodeToRegex() alongside the existing decode(). It processes the raw expression-string value specifically for regex use:

  • Strips backslashes from expression-level escapes that are not valid regex escapes (\' → ', backslash-space → space, \ + U+2018 → U+2018, etc.)
  • Preserves standard regex escapes that are valid on all platforms (\d, \w, \s, \\, \uXXXX, \xXX, etc.)
  • Preserves \- as \- — both Java and JS unicode-mode regex recognise this as a literal hyphen inside a character class, preventing it from being interpreted as a range operator between two adjacent characters
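The three rules above can be sketched as a single pass over the raw string. This is a hedged illustration in Java rather than the PR's Kotlin implementation: the class name RegexDecodeSketch and the exact contents of KEEP_ESCAPED are assumptions, and the real decodeToRegex may keep or strip a different set of escapes.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the decoding rules; not the actual Utf8StringNode code.
class RegexDecodeSketch {
    // Assumed set of characters whose backslash is preserved: common regex
    // escape letters, the backslash itself, the hyphen, and regex syntax characters.
    private static final String KEEP_ESCAPED = "dDwWsSbBtnrux\\-.[](){}*+?^$|/";

    static String decodeToRegex(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(i + 1);
                if (KEEP_ESCAPED.indexOf(next) >= 0) {
                    out.append(c); // valid regex escape: keep the backslash
                }
                out.append(next);  // the escaped character itself is always kept
                i += 2;            // consume the pair so an escaped backslash is not re-read
            } else {
                out.append(c);     // ordinary character, or a lone trailing backslash
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw expression text: escaped apostrophe, hyphen, U+2018, backtick, U+2019 and space.
        String raw = "[a-zA-Z\\'\\-\\" + '\u2018' + "\\`\\" + '\u2019' + "\\ ]+";
        String regex = decodeToRegex(raw);

        System.out.println(Pattern.matches(regex, "John"));                     // true
        System.out.println(Pattern.matches(regex, "\u0418\u0432\u0430\u043D")); // false
    }
}
```

Because the escaped hyphen survives decoding, the apostrophe and U+2018 stay separate literals and no range is formed, while the escapes that JS unicode mode would reject are reduced to plain characters.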

Calculator.evalToRawString now calls decodeToRegex for STRING nodes and recursively unwraps ARGUMENT/PAR nodes. Non-literal arguments (variables, sub-expressions) fall back to evalToString so existing behaviour is unchanged.
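The dispatch described above might look roughly like the following. Everything here is a simplified assumption for illustration: the Node class, the Type enum, the single-child shape of ARGUMENT/PAR nodes, and the two function parameters standing in for decodeToRegex and evalToString are hypothetical and do not mirror the real Calculator API.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of the node dispatch; names and shapes are assumptions.
class RawStringSketch {
    enum Type { STRING, ARGUMENT, PAR, OTHER }

    static final class Node {
        final Type type;
        final String rawValue;     // only meaningful for STRING nodes
        final List<Node> children; // assumed: wrappers hold exactly one child

        Node(Type type, String rawValue, List<Node> children) {
            this.type = type;
            this.rawValue = rawValue;
            this.children = children;
        }
    }

    static Node string(String raw) {
        return new Node(Type.STRING, raw, Collections.<Node>emptyList());
    }

    static Node wrap(Type type, Node child) {
        return new Node(type, null, Collections.singletonList(child));
    }

    static Node other() {
        return new Node(Type.OTHER, null, Collections.<Node>emptyList());
    }

    static String evalToRawString(Node node,
                                  Function<String, String> decodeToRegex,
                                  Function<Node, String> evalToString) {
        switch (node.type) {
            case STRING:
                // String literal: decode for regex use, keeping valid escapes.
                return decodeToRegex.apply(node.rawValue);
            case ARGUMENT:
            case PAR:
                // Wrapper node: look through it to the wrapped expression.
                return evalToRawString(node.children.get(0), decodeToRegex, evalToString);
            default:
                // Variables and sub-expressions: regular evaluation, as before.
                return evalToString.apply(node);
        }
    }

    public static void main(String[] args) {
        Node literal = wrap(Type.PAR, wrap(Type.ARGUMENT, string("\\d+")));
        System.out.println(evalToRawString(literal, s -> "regex:" + s, n -> "evaluated"));
        System.out.println(evalToRawString(other(), s -> "regex:" + s, n -> "evaluated"));
    }
}
```

Only string literals take the regex-specific path, so any argument that is computed at evaluation time behaves exactly as it did before the change.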

Test plan

  • testValidatePattern_NoMatch — asserts Иван (Cyrillic) does not match a Latin/extended-Latin name pattern, on JVM and JS
  • testValidatePattern_Match — asserts John (ASCII Latin) does match the same pattern, on JVM and JS

@enricocolasante enricocolasante force-pushed the DHIS2-21359 branch 2 times, most recently from 11ea7dc to aafe0d0 Compare April 23, 2026 17:48
@enricocolasante enricocolasante marked this pull request as ready for review April 24, 2026 06:59
@enricocolasante enricocolasante requested a review from jbee April 24, 2026 06:59