feat: Add support for Run-End Encoded arrays #308
CurtHagenlocher wants to merge 6 commits into apache:main
Introduced RunEndEncodedType and RunEndEncodedArray classes to represent run-end encoded arrays, including validation and logical length calculation. Integrated REE support into ArrowArrayFactory and IPC serialization/deserialization (ArrowStreamWriter, ArrowReaderImplementation, ArrowTypeFlatbufferBuilder, MessageSerializer). Added unit tests for REE array creation, validation, serialization, and indexing. This enables efficient handling of consecutive runs of the same value in Arrow .NET.
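To make the "logical length calculation" and indexing concrete, here is a minimal sketch on plain `int[]` run ends. It is not the PR's `RunEndEncodedArray` API: the class `RunEndEncodedSketch` and both method names are hypothetical, and the sketch assumes an unsliced array whose run ends are strictly increasing.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Apache.Arrow.
public static class RunEndEncodedSketch
{
    // Each run end is the exclusive logical end position of its run,
    // so the last run end equals the logical (decoded) length.
    public static int LogicalLength(int[] runEnds) =>
        runEnds.Length == 0 ? 0 : runEnds[runEnds.Length - 1];

    // Binary search for the first run whose end is greater than logicalIndex.
    // The returned position indexes into the values child array.
    public static int FindPhysicalIndex(int[] runEnds, int logicalIndex)
    {
        if (logicalIndex < 0 || logicalIndex >= LogicalLength(runEnds))
        {
            throw new ArgumentOutOfRangeException(nameof(logicalIndex));
        }

        int lo = 0;
        int hi = runEnds.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (runEnds[mid] > logicalIndex)
            {
                hi = mid;      // run `mid` may contain the index; keep it in range
            }
            else
            {
                lo = mid + 1;  // run `mid` ends at or before the index; skip it
            }
        }
        return lo;
    }
}
```

With run ends `[3, 5, 9]` and values `[a, b, c]`, the decoded sequence is `a a a b b c c c c`: the logical length is 9, and logical index 3 maps to physical index 1.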
… API, the integration tests and the concatenator.
Pull request overview
Adds first-class support for Run-End Encoded (REE) arrays across Apache.Arrow .NET, integrating the new logical type into core type/array modeling, IPC read/write, C Data interface import/export, concatenation, and test coverage.
Changes:
- Introduces `ArrowTypeId.RunEndEncoded`, `RunEndEncodedType`, and `RunEndEncodedArray`, and wires them into visitors/factories.
- Extends IPC serialization/deserialization and JSON integration parsing to recognize/run REE schemas and arrays.
- Adds concatenation support and new/updated tests covering REE behavior (including IPC roundtrip and concatenation scenarios).
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/Apache.Arrow.Tests/TestData.cs | Adds REE fields and array creation support in test schema/data generation. |
| test/Apache.Arrow.Tests/TableTests.cs | Updates expected column counts due to added REE test columns. |
| test/Apache.Arrow.Tests/RunEndEncodedArrayTests.cs | New unit tests for REE type/array creation, validation, IPC roundtrip, and factory build. |
| test/Apache.Arrow.Tests/ArrowReaderVerifier.cs | Extends array comparison visitor to support RunEndEncodedArray. |
| test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs | Adds concatenation tests for REE arrays (incl. sliced inputs and mismatch errors). |
| test/Apache.Arrow.IntegrationTest/JsonFile.cs | Adds JSON integration parsing and array creation support for REE. |
| src/Apache.Arrow/Types/RunEndEncodedType.cs | New nested type representing REE (run_ends + values) with run_ends type validation. |
| src/Apache.Arrow/Types/IArrowType.cs | Adds ArrowTypeId.RunEndEncoded. |
| src/Apache.Arrow/Ipc/MessageSerializer.cs | Adds IPC schema/type deserialization for REE field types. |
| src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs | Adds flatbuffer type emission for REE type. |
| src/Apache.Arrow/Ipc/ArrowStreamWriter.cs | Adds IPC record batch buffer/node traversal for RunEndEncodedArray. |
| src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs | Updates reader buffer-count logic for REE arrays (no top-level buffers). |
| src/Apache.Arrow/C/CArrowSchemaImporter.cs | Adds C Data interface schema import support for REE (+r). |
| src/Apache.Arrow/C/CArrowSchemaExporter.cs | Adds C Data interface schema export format for REE (+r). |
| src/Apache.Arrow/C/CArrowArrayImporter.cs | Adds C Data interface array import support for REE children handling. |
| src/Apache.Arrow/Arrays/RunEndEncodedArray.cs | New array implementation for REE with logical length derivation and physical-index lookup. |
| src/Apache.Arrow/Arrays/ArrowArrayFactory.cs | Enables building RunEndEncodedArray from ArrayData. |
| src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs | Adds concatenation logic for REE arrays (run_ends adjustment + values concatenation). |
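The "run_ends adjustment + values concatenation" step in the last row can be sketched on plain arrays. This is an illustration under assumptions, not the code in `ArrayDataConcatenator.cs`: `ReeConcatSketch` is a hypothetical name, inputs are assumed unsliced, and adjacent runs with equal values are not merged.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Apache.Arrow.
public static class ReeConcatSketch
{
    public static (int[] RunEnds, T[] Values) Concat<T>(
        params (int[] RunEnds, T[] Values)[] inputs)
    {
        var runEnds = new List<int>();
        var values = new List<T>();
        int logicalOffset = 0; // total logical length of the inputs appended so far

        foreach (var (ends, vals) in inputs)
        {
            if (ends.Length != vals.Length)
            {
                throw new ArgumentException("run_ends and values must have the same length.");
            }

            // Shift every run end by the logical length that precedes this input.
            foreach (int end in ends)
            {
                runEnds.Add(checked(end + logicalOffset));
            }

            // Values are appended unchanged.
            values.AddRange(vals);

            if (ends.Length > 0)
            {
                logicalOffset = checked(logicalOffset + ends[ends.Length - 1]);
            }
        }

        return (runEnds.ToArray(), values.ToArray());
    }
}
```

Concatenating run ends `[2, 5]` with run ends `[1, 4]` yields `[2, 5, 6, 9]`: the second input's run ends are each shifted by 5, the logical length of the first input.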
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

    long metadataLength = WriteMessage(Flatbuf.MessageHeader.RecordBatch,
        recordBatchOffset, recordBatchBuilder.TotalLength);

    long bufferLength = WriteBufferData(recordBatchBuilder.Buffers);
    recordBatchBuilder.DisposeDeferredArrays();

DisposeDeferredArrays() is only called after WriteBufferData. If WriteMessage/WriteBufferData throws, the normalized REE arrays held in _deferredDisposals will leak (and may retain native buffers) because they’re never disposed. Wrap the message/body write in a try/finally (or make ArrowRecordBatchFlatBufferBuilder IDisposable and dispose it in finally) so deferred arrays are always released.
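The try/finally shape suggested above could look roughly like the following. This is a self-contained sketch, not the PR's code: `DeferredDisposalBuilder`, `TrackedResource`, and `WriterSketch` are hypothetical stand-ins, and the two write steps are passed in as delegates so the example runs without Apache.Arrow.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a builder that holds temporary (normalized) arrays.
public sealed class DeferredDisposalBuilder
{
    private readonly List<IDisposable> _deferredDisposals = new List<IDisposable>();

    public void Defer(IDisposable resource) => _deferredDisposals.Add(resource);

    public void DisposeDeferredArrays()
    {
        foreach (IDisposable resource in _deferredDisposals)
        {
            resource.Dispose();
        }
        _deferredDisposals.Clear();
    }
}

// Hypothetical resource that records whether it was disposed.
public sealed class TrackedResource : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => IsDisposed = true;
}

public static class WriterSketch
{
    // writeMessage and writeBufferData stand in for the real write calls.
    public static void WriteBatch(
        DeferredDisposalBuilder builder, Action writeMessage, Action writeBufferData)
    {
        try
        {
            writeMessage();
            writeBufferData();
        }
        finally
        {
            // Runs whether or not either write throws, so nothing is leaked.
            builder.DisposeDeferredArrays();
        }
    }
}
```

If `writeBufferData` throws, the exception still propagates to the caller, but every deferred resource has been disposed by the time it does.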

    long metadataLength = await WriteMessageAsync(Flatbuf.MessageHeader.RecordBatch,
        recordBatchOffset, recordBatchBuilder.TotalLength,
        cancellationToken).ConfigureAwait(false);

    long bufferLength = await WriteBufferDataAsync(recordBatchBuilder.Buffers, cancellationToken).ConfigureAwait(false);
    recordBatchBuilder.DisposeDeferredArrays();

DisposeDeferredArrays() should run in a finally block here as well. As written, an exception from WriteMessageAsync/WriteBufferDataAsync will skip disposal of any deferred normalized arrays, causing a memory/resource leak.
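For the async path, `finally` works the same way around awaited calls. The sketch below assumes nothing about the real writer: `AsyncWriterSketch` and its delegate parameters are hypothetical placeholders for the awaited write calls and the disposal call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; the delegates stand in for the real async writes.
public static class AsyncWriterSketch
{
    public static async Task WriteBatchAsync(
        Func<CancellationToken, Task> writeMessageAsync,
        Func<CancellationToken, Task> writeBufferDataAsync,
        Action disposeDeferredArrays,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await writeMessageAsync(cancellationToken).ConfigureAwait(false);
            await writeBufferDataAsync(cancellationToken).ConfigureAwait(false);
        }
        finally
        {
            // Executes after a faulted or cancelled await as well as on success.
            disposeDeferredArrays();
        }
    }
}
```

A cancelled token surfaces as an exception from the awaited call, so cancellation is covered by the same `finally` block.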

    long metadataLength = WriteMessage(Flatbuf.MessageHeader.DictionaryBatch,
        dictionaryBatchOffset, recordBatchBuilder.TotalLength);

    long bufferLength = WriteBufferData(recordBatchBuilder.Buffers);
    recordBatchBuilder.DisposeDeferredArrays();

Same issue as record batches: DisposeDeferredArrays() is not in a finally. If WriteMessage/WriteBufferData throws while writing a dictionary batch, deferred arrays (e.g., normalized REE slices) won’t be disposed.

    long metadataLength = await WriteMessageAsync(Flatbuf.MessageHeader.DictionaryBatch,
        dictionaryBatchOffset, recordBatchBuilder.TotalLength, cancellationToken).ConfigureAwait(false);

    long bufferLength = await WriteBufferDataAsync(recordBatchBuilder.Buffers, cancellationToken).ConfigureAwait(false);
    recordBatchBuilder.DisposeDeferredArrays();

Async dictionary writing has the same leak hazard: DisposeDeferredArrays() should be guaranteed via try/finally so deferred normalized arrays are disposed even when WriteMessageAsync/WriteBufferDataAsync throws.

    if (data.NullCount != 0)
    {
        throw new ArgumentException(
            $"Run-end encoded arrays have no top-level validity bitmap and must report null count 0, but got {data.NullCount}.",
            nameof(data));
    }

    ValidateRunEndsType(runEnds);

    if (runEnds.Length != values.Length)
    {
        throw new ArgumentException(
            $"Run ends array length ({runEnds.Length}) must equal values array length ({values.Length}).");
    }

When constructing from ArrayData, the top-level Length/Offset is not validated against the run_ends values. This allows malformed ArrayData where Offset+Length exceeds the last run_end, which will break Normalize()/FindPhysicalIndex semantics. Add a validation that the slice range (data.Offset + data.Length) does not exceed the logical length implied by the last run_end value.
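The requested check amounts to comparing the end of the slice with the last run end. A minimal sketch on plain integers follows; `ReeSliceValidationSketch.ValidateSlice` is a hypothetical helper, and the real constructor would read the offset, length, and last run end from `ArrayData` and the run_ends child rather than take them as arguments.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Apache.Arrow.
public static class ReeSliceValidationSketch
{
    public static void ValidateSlice(int offset, int length, int[] runEnds)
    {
        if (offset < 0 || length < 0)
        {
            throw new ArgumentException("Offset and length must be non-negative.");
        }

        // The last run end is the logical length covered by the run_ends child.
        long logicalLength = runEnds.Length == 0 ? 0 : runEnds[runEnds.Length - 1];

        // Widen to long so offset + length cannot overflow.
        long sliceEnd = (long)offset + length;

        if (sliceEnd > logicalLength)
        {
            throw new ArgumentException(
                $"Slice end ({sliceEnd}) exceeds the logical length implied by the last run end ({logicalLength}).");
        }
    }
}
```

With run ends `[3, 5, 9]`, a slice with offset 4 and length 5 is accepted (it ends exactly at 9), while offset 4 and length 6 is rejected.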
What's Changed
This PR adds basic support for Run-End Encoded arrays by following established codebase patterns.
Notably:
- `ArrowTypeId` added.
- `RunEndEncodedArray` added.
- `RunEndEncodedType` nested type.

Co-authored-by: Jorge Candeias <jorge.candeias@outcompute.com>
Supersedes #260