[core] Avoid key bytes OOM in ClusteringFileRewriter.sortAndRewriteFile#7642

Merged
JingsongLi merged 3 commits into apache:master from JingsongLi:avoid_memory on Apr 14, 2026
@JingsongLi JingsongLi commented Apr 14, 2026

Purpose

Avoid an OOM caused by accumulated key bytes in ClusteringFileRewriter.sortAndRewriteFile. This change removes the in-memory List<byte[]> collectedKeys and the batchPutIndex method, so key bytes no longer accumulate without bound while a file is rewritten.
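To illustrate why a collected key list is a memory risk, here is a minimal, self-contained sketch. It is not Paimon code: `KeyBufferingSketch`, `bufferedPeakBytes`, `streamingPeakBytes`, and `syntheticKeys` are hypothetical names, and the `Consumer<byte[]>` merely stands in for whatever consumes the keys. It only contrasts holding every key before a batch step with handing each key over as it is produced.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names exist in Paimon.
final class KeyBufferingSketch {

    // Collects every key first, then hands them to the consumer in one batch.
    // Returns the number of key bytes held at once, which grows with the input.
    public static long bufferedPeakBytes(Iterator<byte[]> keys, Consumer<byte[]> index) {
        List<byte[]> collectedKeys = new ArrayList<>();
        long heldBytes = 0;
        while (keys.hasNext()) {
            byte[] key = keys.next();
            collectedKeys.add(key);
            heldBytes += key.length;
        }
        for (byte[] key : collectedKeys) {
            index.accept(key);
        }
        return heldBytes;
    }

    // Hands each key to the consumer immediately and keeps no list.
    // Returns the largest number of key bytes held at once (a single key).
    public static long streamingPeakBytes(Iterator<byte[]> keys, Consumer<byte[]> index) {
        long peakBytes = 0;
        while (keys.hasNext()) {
            byte[] key = keys.next();
            index.accept(key);
            peakBytes = Math.max(peakBytes, key.length);
        }
        return peakBytes;
    }

    // Lazily generates `count` synthetic 10-byte keys such as "key-000042".
    public static Iterator<byte[]> syntheticKeys(int count) {
        return new Iterator<byte[]>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public byte[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return String.format("key-%06d", next++).getBytes(StandardCharsets.UTF_8);
            }
        };
    }

    public static void main(String[] args) {
        long buffered = bufferedPeakBytes(syntheticKeys(1000), k -> { });
        long streaming = streamingPeakBytes(syntheticKeys(1000), k -> { });
        System.out.println("buffered key bytes held:  " + buffered);
        System.out.println("streaming key bytes held: " + streaming);
    }
}
```

In the buffered variant the held bytes scale with the number of records in the file being rewritten, which is the kind of growth the PR description calls unbounded. The streaming variant holds one key at a time regardless of file size.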

Tests

Covered by existing tests.

* Sort and rewrite unsorted file by clustering columns. Reads all KeyValue records, sorts them
* using an external sort buffer, and writes to new level-1 files. Checks the key index inline
* during writing to handle deduplication (FIRST_ROW skips duplicates, DEDUPLICATE marks old
* positions in DV) and updates the index without re-reading the output files.
The doc no longer seems accurate: the file is now re-read when the index is rebuilt (rebuildIndex).
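For readers unfamiliar with the two modes named in the quoted Javadoc, the following is a simplified sketch of the described semantics: FIRST_ROW skips a record whose key is already indexed, while DEDUPLICATE writes the new record and marks the old position as deleted. Everything here is an assumption made for illustration: `InlineDedupSketch`, its fields, the use of a `HashMap` as the key index, and a single `BitSet` as the deletion vector are hypothetical and do not reflect Paimon's actual classes. It also does not model the re-read during rebuildIndex raised in the comment above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Paimon.
final class InlineDedupSketch {

    enum Mode { FIRST_ROW, DEDUPLICATE }

    // Stand-in for the key index: key -> position of the live record.
    final Map<String, Integer> keyIndex = new HashMap<>();
    // Stand-in for a deletion vector: set bits mark superseded positions.
    final BitSet deletionVector = new BitSet();
    final Mode mode;
    int nextPosition = 0;

    InlineDedupSketch(Mode mode) {
        this.mode = mode;
    }

    // Checks the index while writing. Returns true if the record was written.
    boolean write(String key) {
        Integer oldPosition = keyIndex.get(key);
        if (oldPosition != null) {
            if (mode == Mode.FIRST_ROW) {
                // Keep the first record for this key and skip the duplicate.
                return false;
            }
            // DEDUPLICATE: the newer record wins, so mark the old one deleted.
            deletionVector.set(oldPosition);
        }
        keyIndex.put(key, nextPosition++);
        return true;
    }

    public static void main(String[] args) {
        InlineDedupSketch dedup = new InlineDedupSketch(Mode.DEDUPLICATE);
        for (String key : new String[] {"a", "b", "a"}) {
            dedup.write(key);
        }
        System.out.println("deleted positions: " + dedup.deletionVector);
        System.out.println("live index: " + dedup.keyIndex);
    }
}
```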

@LsomeYeah LsomeYeah left a comment

+1

@JingsongLi JingsongLi merged commit 7c93bd7 into apache:master Apr 14, 2026
12 of 15 checks passed
2 participants