Another template tokenizer implementation #2071

Open
jg-rp wants to merge 1 commit into Shopify:main from jg-rp:new-tokenizer

@jg-rp jg-rp commented Apr 6, 2026

This pull request demonstrates an alternative template tokenizer implementation.

Like the tokenizer from #2056, we use String#getbyte, String#byteindex and String#byteslice instead of a StringScanner.
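For readers unfamiliar with the byte-scanning style, here is a minimal sketch of the general idea. It is not the code from this PR: `byte_tokenize` is a hypothetical name, the handling of unterminated markup is simplified and does not reproduce Liquid's edge cases, and it assumes Ruby 3.2+ (for `String#byteindex`) and validly encoded input.

```ruby
# Hypothetical sketch of byte-based tokenizing; NOT the implementation in this PR.
# Requires Ruby 3.2+ for String#byteindex.

OPEN_CURLY = 123 # "{".ord
PERCENT = 37     # "%".ord

# Split source into text, output ({{ ... }}) and tag ({% ... %}) tokens.
def byte_tokenize(source)
  tokens = []
  pos = 0
  size = source.bytesize

  while pos < size
    # Find the next "{" that is followed by "{" or "%".
    open_at = source.byteindex("{", pos)
    while open_at
      following = source.getbyte(open_at + 1) # nil at end of string
      break if following == OPEN_CURLY || following == PERCENT

      open_at = source.byteindex("{", open_at + 1)
    end

    # No more markup: the remainder is plain text.
    if open_at.nil?
      tokens << source.byteslice(pos, size - pos)
      break
    end

    # Emit any text that precedes the markup.
    tokens << source.byteslice(pos, open_at - pos) if open_at > pos

    closer = source.getbyte(open_at + 1) == PERCENT ? "%}" : "}}"
    close_at = source.byteindex(closer, open_at + 2)

    # Simplification: unterminated markup becomes a single trailing token.
    if close_at.nil?
      tokens << source.byteslice(open_at, size - open_at)
      break
    end

    pos = close_at + 2
    tokens << source.byteslice(open_at, pos - open_at)
  end

  tokens
end
```

Every offset passed to `byteindex` sits just after an ASCII byte or at the start of the string, so it always lands on a character boundary, even when the surrounding text contains multibyte characters.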

On an M2 Mac Mini, running PHASE=tokenize bundle exec rake benchmark:strict2 gives:

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
           tokenize:   511.000 i/100ms
Calculating -------------------------------------
           tokenize:      5.112k (± 0.1%) i/s  (195.61 μs/i) -    102.711k in  20.091372s

The same benchmark for Liquid v5.12.0:

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
           tokenize:   265.000 i/100ms
Calculating -------------------------------------
           tokenize:      2.658k (± 0.1%) i/s  (376.29 μs/i) -     53.265k in  20.043212s

And for #2056:

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
           tokenize:   454.000 i/100ms
Calculating -------------------------------------
           tokenize:      4.567k (± 0.1%) i/s  (218.96 μs/i) -     91.708k in  20.080724s

Unlike v5.12.0 and #2056, the tokenizer in this PR also outperforms the old splitting regex approach from v5.6.0 when benchmarking without YJIT.


I did want to suggest a StringScanner implementation like this:

RE_START = /\{[{%]/
RE_OUT_BODY = /\{\{[^}]*\}\}?|\{\{.*%\}/
RE_TAG_BODY = /\{%[^%}]*%\}/

# @param scanner [StringScanner]
# @return [Array[String]]
def tokenize(scanner)
  tokens = [] # : Array[String]

  loop do
    case scanner.peek(2)
    when "{{"
      if (match = scanner.scan(RE_OUT_BODY))
        tokens << match
      else
        tokens << "{{"
        scanner.pos += 2
      end
    when "{%"
      if (match = scanner.scan(RE_TAG_BODY))
        tokens << match
      else
        tokens << "{%"
        scanner.pos += 2
      end
    else
      if (match = scanner.scan_until(RE_START))
        tokens << match.chop!.chop!
        scanner.pos -= 2
      else
        tokens << scanner.rest unless scanner.eos?
        break
      end
    end
  end

  tokens
end

This is a performance improvement over v5.12.0, but it does not come close to the byte-scanning approach, and it does not have the same performance characteristics without YJIT.

@jg-rp jg-rp marked this pull request as ready for review April 7, 2026 07:20