feat: Add support for Run-End Encoded arrays #308

Open
CurtHagenlocher wants to merge 6 commits into apache:main from CurtHagenlocher:run-end-encoding

@CurtHagenlocher
Contributor

What's Changed

This PR adds basic support for Run-End Encoded arrays by following established codebase patterns.

Notably:

  • New ArrowTypeId added.
  • New array type RunEndEncodedArray added.
  • New visitor method to handle the new array type.
  • New entry in the IPC serializer field type switch.
  • New RunEndEncodedType nested type.
  • Basic feature tests.
  • C API support.
  • Concatenation support.

Co-authored-by: Jorge Candeias <jorge.candeias@outcompute.com>

Supersedes #260

JorgeCandeias and others added 4 commits February 13, 2026 00:29
Introduced RunEndEncodedType and RunEndEncodedArray classes to represent run-end encoded arrays, including validation and logical length calculation. Integrated REE support into ArrowArrayFactory and IPC serialization/deserialization (ArrowStreamWriter, ArrowReaderImplementation, ArrowTypeFlatbufferBuilder, MessageSerializer). Added unit tests for REE array creation, validation, serialization, and indexing. This enables efficient handling of consecutive runs of the same value in Arrow .NET.
… API, the integration tests and the concatenator.
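
The commit description above mentions logical length calculation and indexing. As a rough illustration of what those terms mean for a run-end encoded array, here is a minimal sketch. `RunEndEncodingSketch` and its methods are hypothetical names chosen for this example; they are not the PR's `RunEndEncodedArray` API, and the sketch assumes 32-bit run ends held in a plain array with no slicing offset.

```csharp
using System;

// Hypothetical helper for illustration only; not the PR's implementation.
public static class RunEndEncodingSketch
{
    // The logical length of an unsliced REE array is its last run end
    // (or 0 when there are no runs).
    public static int LogicalLength(int[] runEnds) =>
        runEnds.Length == 0 ? 0 : runEnds[runEnds.Length - 1];

    // Maps a logical index to the index of the run that contains it:
    // the first run whose (exclusive) end is greater than logicalIndex.
    public static int FindPhysicalIndex(int[] runEnds, int logicalIndex)
    {
        if (logicalIndex < 0 || logicalIndex >= LogicalLength(runEnds))
        {
            throw new ArgumentOutOfRangeException(nameof(logicalIndex));
        }

        // Binary search over the strictly increasing run ends.
        int lo = 0;
        int hi = runEnds.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (runEnds[mid] > logicalIndex)
            {
                hi = mid;
            }
            else
            {
                lo = mid + 1;
            }
        }
        return lo;
    }
}
```

With run ends `[3, 5, 9]` and values `["a", "b", "c"]`, the logical sequence is `aaabbcccc`: the logical length is 9, logical index 3 falls in run 1, and logical index 8 falls in run 2.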
Copilot AI left a comment

Pull request overview

Adds first-class support for Run-End Encoded (REE) arrays across Apache.Arrow .NET, integrating the new logical type into core type/array modeling, IPC read/write, C Data interface import/export, concatenation, and test coverage.

Changes:

  • Introduces ArrowTypeId.RunEndEncoded, RunEndEncodedType, and RunEndEncodedArray, and wires them into visitors/factories.
  • Extends IPC serialization/deserialization and JSON integration parsing to recognize and handle REE schemas and arrays.
  • Adds concatenation support and new/updated tests covering REE behavior (including IPC roundtrip and concatenation scenarios).

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/Apache.Arrow.Tests/TestData.cs | Adds REE fields and array creation support in test schema/data generation. |
| test/Apache.Arrow.Tests/TableTests.cs | Updates expected column counts due to added REE test columns. |
| test/Apache.Arrow.Tests/RunEndEncodedArrayTests.cs | New unit tests for REE type/array creation, validation, IPC roundtrip, and factory build. |
| test/Apache.Arrow.Tests/ArrowReaderVerifier.cs | Extends array comparison visitor to support RunEndEncodedArray. |
| test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs | Adds concatenation tests for REE arrays (incl. sliced inputs and mismatch errors). |
| test/Apache.Arrow.IntegrationTest/JsonFile.cs | Adds JSON integration parsing and array creation support for REE. |
| src/Apache.Arrow/Types/RunEndEncodedType.cs | New nested type representing REE (run_ends + values) with run_ends type validation. |
| src/Apache.Arrow/Types/IArrowType.cs | Adds ArrowTypeId.RunEndEncoded. |
| src/Apache.Arrow/Ipc/MessageSerializer.cs | Adds IPC schema/type deserialization for REE field types. |
| src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs | Adds flatbuffer type emission for REE type. |
| src/Apache.Arrow/Ipc/ArrowStreamWriter.cs | Adds IPC record batch buffer/node traversal for RunEndEncodedArray. |
| src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs | Updates reader buffer-count logic for REE arrays (no top-level buffers). |
| src/Apache.Arrow/C/CArrowSchemaImporter.cs | Adds C Data interface schema import support for REE (+r). |
| src/Apache.Arrow/C/CArrowSchemaExporter.cs | Adds C Data interface schema export format for REE (+r). |
| src/Apache.Arrow/C/CArrowArrayImporter.cs | Adds C Data interface array import support for REE children handling. |
| src/Apache.Arrow/Arrays/RunEndEncodedArray.cs | New array implementation for REE with logical length derivation and physical-index lookup. |
| src/Apache.Arrow/Arrays/ArrowArrayFactory.cs | Enables building RunEndEncodedArray from ArrayData. |
| src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs | Adds concatenation logic for REE arrays (run_ends adjustment + values concatenation). |
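
The "run_ends adjustment + values concatenation" noted for ArrayDataConcatenator.cs can be pictured with a small sketch. This is not the PR's code: `ReeConcatSketch.Concat` is a hypothetical helper that assumes unsliced inputs represented as plain arrays, shifts each input's run ends by the logical length accumulated so far, and appends the values unchanged. It does not merge adjacent runs that happen to hold equal values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; assumes unsliced inputs.
public static class ReeConcatSketch
{
    public static (int[] RunEnds, T[] Values) Concat<T>(
        IReadOnlyList<(int[] RunEnds, T[] Values)> inputs)
    {
        var runEnds = new List<int>();
        var values = new List<T>();
        int logicalOffset = 0; // logical length of everything appended so far

        foreach (var (inRunEnds, inValues) in inputs)
        {
            if (inRunEnds.Length != inValues.Length)
            {
                throw new ArgumentException("Each input needs one value per run end.");
            }

            // Shift this input's run ends so they continue after the previous input.
            foreach (int end in inRunEnds)
            {
                runEnds.Add(checked(end + logicalOffset));
            }
            values.AddRange(inValues);

            if (inRunEnds.Length > 0)
            {
                logicalOffset = runEnds[runEnds.Count - 1];
            }
        }

        return (runEnds.ToArray(), values.ToArray());
    }
}
```

For example, concatenating run ends `[2, 5]` with run ends `[1, 4]` yields `[2, 5, 6, 9]`: the second input's run ends are each shifted by 5, the logical length of the first input.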

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.


Comment on lines 839 to 844
    long metadataLength = WriteMessage(Flatbuf.MessageHeader.RecordBatch,
        recordBatchOffset, recordBatchBuilder.TotalLength);

    long bufferLength = WriteBufferData(recordBatchBuilder.Buffers);
    recordBatchBuilder.DisposeDeferredArrays();

Copilot AI Apr 7, 2026

DisposeDeferredArrays() is only called after WriteBufferData. If WriteMessage/WriteBufferData throws, the normalized REE arrays held in _deferredDisposals will leak (and may retain native buffers) because they’re never disposed. Wrap the message/body write in a try/finally (or make ArrowRecordBatchFlatBufferBuilder IDisposable and dispose it in finally) so deferred arrays are always released.
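
A minimal sketch of the two options the comment suggests, using stand-in types rather than the real writer: `DeferredDisposalsSketch` and `WriterSketch` are hypothetical, and the write steps are passed in as delegates so the example stays self-contained. The first class shows the builder made `IDisposable`; `WriteBatch` shows the try/finally form.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a builder that tracks arrays to dispose later.
public sealed class DeferredDisposalsSketch : IDisposable
{
    private readonly List<IDisposable> _deferred = new List<IDisposable>();

    public int PendingCount => _deferred.Count;

    public void Defer(IDisposable item) => _deferred.Add(item);

    public void DisposeDeferredArrays()
    {
        foreach (IDisposable item in _deferred)
        {
            item.Dispose();
        }
        _deferred.Clear();
    }

    // Making the builder IDisposable lets callers use a `using` statement.
    public void Dispose() => DisposeDeferredArrays();
}

public static class WriterSketch
{
    // The delegates stand in for the message and buffer write steps.
    public static void WriteBatch(
        DeferredDisposalsSketch builder, Action writeMessage, Action writeBufferData)
    {
        try
        {
            writeMessage();
            writeBufferData();
        }
        finally
        {
            // Runs whether or not either write step throws.
            builder.DisposeDeferredArrays();
        }
    }
}
```

Either form releases the deferred arrays on the failure path; the `IDisposable` variant has the side benefit that a `using` block makes the cleanup hard to forget at new call sites.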

Comment on lines 879 to 885
    long metadataLength = await WriteMessageAsync(Flatbuf.MessageHeader.RecordBatch,
        recordBatchOffset, recordBatchBuilder.TotalLength,
        cancellationToken).ConfigureAwait(false);

    long bufferLength = await WriteBufferDataAsync(recordBatchBuilder.Buffers, cancellationToken).ConfigureAwait(false);
    recordBatchBuilder.DisposeDeferredArrays();

Copilot AI Apr 7, 2026

DisposeDeferredArrays() should run in a finally block here as well. As written, an exception from WriteMessageAsync/WriteBufferDataAsync will skip disposal of any deferred normalized arrays, causing a memory/resource leak.
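
For the async path, a `finally` block around awaited calls also runs when the awaited task faults or is cancelled. The sketch below illustrates that shape only; `AsyncWriterSketch` and its delegate parameters are hypothetical stand-ins for the writer's async steps and the disposal call, not the library's signatures.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in; the delegates represent the async write steps.
public static class AsyncWriterSketch
{
    public static async Task WriteBatchAsync(
        Func<CancellationToken, Task> writeMessageAsync,
        Func<CancellationToken, Task> writeBufferDataAsync,
        Action disposeDeferredArrays,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await writeMessageAsync(cancellationToken).ConfigureAwait(false);
            await writeBufferDataAsync(cancellationToken).ConfigureAwait(false);
        }
        finally
        {
            // Runs on success, on exceptions, and on cancellation.
            disposeDeferredArrays();
        }
    }
}
```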

Comment on lines 1029 to 1035

    long metadataLength = WriteMessage(Flatbuf.MessageHeader.DictionaryBatch,
        dictionaryBatchOffset, recordBatchBuilder.TotalLength);

    long bufferLength = WriteBufferData(recordBatchBuilder.Buffers);
    recordBatchBuilder.DisposeDeferredArrays();

Copilot AI Apr 7, 2026

Same issue as record batches: DisposeDeferredArrays() is not in a finally. If WriteMessage/WriteBufferData throws while writing a dictionary batch, deferred arrays (e.g., normalized REE slices) won’t be disposed.

Comment on lines 1055 to 1060
    long metadataLength = await WriteMessageAsync(Flatbuf.MessageHeader.DictionaryBatch,
        dictionaryBatchOffset, recordBatchBuilder.TotalLength, cancellationToken).ConfigureAwait(false);

    long bufferLength = await WriteBufferDataAsync(recordBatchBuilder.Buffers, cancellationToken).ConfigureAwait(false);
    recordBatchBuilder.DisposeDeferredArrays();

Copilot AI Apr 7, 2026

Async dictionary writing has the same leak hazard: DisposeDeferredArrays() should be guaranteed via try/finally so deferred normalized arrays are disposed even when WriteMessageAsync/WriteBufferDataAsync throws.

Comment on lines +67 to +80
    if (data.NullCount != 0)
    {
        throw new ArgumentException(
            $"Run-end encoded arrays have no top-level validity bitmap and must report null count 0, but got {data.NullCount}.",
            nameof(data));
    }

    ValidateRunEndsType(runEnds);

    if (runEnds.Length != values.Length)
    {
        throw new ArgumentException(
            $"Run ends array length ({runEnds.Length}) must equal values array length ({values.Length}).");
    }

Copilot AI Apr 7, 2026

When constructing from ArrayData, the top-level Length/Offset is not validated against the run_ends values. This allows malformed ArrayData where Offset+Length exceeds the last run_end, which will break Normalize()/FindPhysicalIndex semantics. Add a validation that the slice range (data.Offset + data.Length) does not exceed the logical length implied by the last run_end value.
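
One way the suggested check could look, as a sketch under stated assumptions: `ReeSliceValidationSketch.ValidateSlice` is a hypothetical helper, the run ends are taken as an already-materialized `long[]`, and the logical length is assumed to be the last run end. The real constructor would read these from the child `ArrayData` instead.

```csharp
using System;

// Hypothetical helper for illustration only; not the PR's constructor code.
public static class ReeSliceValidationSketch
{
    public static void ValidateSlice(int offset, int length, long[] runEnds)
    {
        if (offset < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(offset));
        }
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        // Logical length implied by the run ends: the last run end, or 0 if empty.
        long logicalLength = runEnds.Length == 0 ? 0 : runEnds[runEnds.Length - 1];

        // Widen before adding so offset + length cannot overflow int.
        long sliceEnd = (long)offset + length;
        if (sliceEnd > logicalLength)
        {
            throw new ArgumentException(
                $"Slice range end ({sliceEnd}) exceeds the logical length implied by the last run end ({logicalLength}).");
        }
    }
}
```

With run ends `[3, 5, 9]`, a slice with offset 4 and length 5 passes (it ends exactly at 9), while offset 4 and length 6 is rejected.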

3 participants