feature: add selection, filtering, primitive comparison, formatting and limited casting by mobiusklein · Pull Request #257 · apache/arrow-dotnet

mobiusklein · 2026-02-08T19:17:05Z

What's Changed

I've added an Apache.Arrow.Operations module as recommended in #254 to hold some utilities for generically working with primitive arrays.

Included:

Array subset selection by index or boolean mask
Primitive array and string array equality and comparison
Limited numerical aggregations
A recursive, not quite complete pretty-printing routine I wrote for debugging.
A small suite of tests to exercise the fiddly bits of indices

Closes #254

I would like some early feedback to see if I am going in the right direction. As .NET isn't a language I work in routinely, I am likely missing common names or idioms for these operations. I also do not know the degree to which the JIT is doing loop unrolling to make these homogenous operations more performant, though if it needs to be hand rolled that is a strictly later problem.

…nd limited casting

CurtHagenlocher

Thanks so much for this work! I've left initial comments on the first two files and will try to return to it within the next day.

CurtHagenlocher · 2026-02-09T03:59:26Z

src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj

+    </PropertyGroup>
+
+    <PropertyGroup>
+        <TargetFrameworks>net8.0</TargetFrameworks>


Can we target netstandard2.0 also or do you expect it to be hard to support?

Okay, we obviously wouldn't be able to use the INumber-based methods and would have to replace them with individual implementations by type. That could perhaps be done another time.

DataFrame from the Microsoft.Data.Analysis project could serve as an example. They still have a T4 macro for netstandard2.0, but switched to INumber-based methods for other target frameworks.

https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.Data.Analysis/Computations/Arithmetic.netstandard.tt

CurtHagenlocher · 2026-02-09T04:00:50Z

src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj

+    <PropertyGroup>
+        <TargetFrameworks>net8.0</TargetFrameworks>
+        <ImplicitUsings>enable</ImplicitUsings>
+        <Nullable>enable</Nullable>


The other projects don't enable Nullable yet but there's no good reason not to enable it here. They also don't enable ImplicitUsings but I don't have the experience to know why we would want or not want it.

CurtHagenlocher · 2026-02-09T04:10:27Z

src/Apache.Arrow.Operations/Comparison.cs

+    {
+        var builder = new BooleanArray.Builder();
+        builder.Reserve(mask.Length);
+        foreach (var val in mask)


This can be done much more efficiently by getting the underlying ArrowBuffers from the source array and operating on them directly. If the validity buffer is set, it can be copied as-is (we don't have a way to share buffers, unfortunately). For the value buffer, we could allocate it with ArrowBuffer.Builder and then get its Span property. For both the ReadOnlySpan on the source buffer and the Span on the target buffer, we could then cast them to (ReadOnly)Span<ulong> and invert 64 bits at a time.

To some degree this depends on how interested you are in going down the optimization rabbit hole. You could also use Benchmark.NET and try it both ways to see what kind of difference it makes.

The extreme version of this kind of thing involves learning about some of the SIMD intrinsics that were added to .NET, like Avx2. I started down this path in 2023 but never ended up having the time for it ... .

The same approach could also be used for the binary operations, except that the validity buffers would need to be ANDed.

In general, I think it's best to implement these primitive operations in terms of ArrowBuffer and then have the Array-based methods call into those.

I think I probably mean Vector256 and not Avx2, at least initially.

If you were to go that route you should consider adding a dependency on System.Numerics.Tensors and for this method specifically TensorPrimivitives.OnesComplement:

byte[] dest = new byte[mask.Values.Length]; System.Numerics.Tensors.TensorPrimitives.OnesComplement<byte>(mask.Values, dest); var valueBuffer = new ArrowBuffer(dest); var validityBuffer = mask.Data.Buffers[0]; var data = new ArrayData( BooleanType.Default, mask.Length, mask.NullCount, mask.Offset, new[] { validityBuffer, valueBuffer }); return new BooleanArray(data);

The Microsoft docs state it's .NET 10+, but the code above compiles targeting .NET 8.0. I'm not sure, but it seems like a bug in the docs site.

Thank you. There's a lot of possible optimizations, some of which are probably beyond the scope of this first writing. I'll look into the bit inversion methods, although I am not familiar enough with the consequences of using the SIMD types.

it can be copied as-is (we don't have a way to share buffers, unfortunately)

@CurtHagenlocher could you please expand on what you mean by this? Is it that an array assumes it is the unique owner of its value and validity buffer and might mutate them directly?

Is it that an array assumes it is the unique owner of its value and validity buffer and might mutate them directly?

It is because an array assumes that it's the unique owner of its buffers, yes. So if you share the buffer and the first array is disposed, the second one will be holding an invalid buffer. To fix this, buffers would need something like a reference count. This would also be very helpful when exporting via the C API, but I think I've convinced myself that the functionality can't be added in a backwards-compatible way.

The library doesn't support mutable arrays either, which is also a gap if you wanted to optimize computation over arrays. A processing engine might determine, for instance that the expression a + b could just reuse a's storage as the target of the computation and avoid having to allocate a new array -- if a is never going to be used again. I'm skeptical that C# has the right semantics to implement this in a purely "safe" way but haven't put much effort into working anything out. (Of course, types like StringArray are extremely hard to mutate in place no matter what the capabilities of the implementation language.)

Okay, I rewrote this using the generic SIMD features for bitflag operations. I decided to do it by hand rather than add a new dependency since I don't understand the compatibility of Numerics.Tensor. Doing this for general-purpose numeric operations would be harder, at which point that newer library is probably the right choice. That kind of fast-path optimization would assume there are no nulls though.

src/Apache.Arrow.Operations/Comparison.cs

CurtHagenlocher · 2026-02-09T04:19:56Z

src/Apache.Arrow.Operations/Comparison.cs

+    /// <param name="rhs"></param>
+    /// <param name="allocator"></param>
+    /// <returns></returns>
+    public static BooleanArray Equal(StringArray lhs, string rhs, MemoryAllocator? allocator = null)


Should rhs be nullable to allow a comparison to null?

It's worth calling out in the documentation that this implements C# semantics, so null == null. These are different from SQL semantics where the value of null == X is null. I think that's the right thing to do, but it might be surprising to someone expecting more SQL-like behavior.

https://arrow.apache.org/docs/cpp/compute.html#comparisons
That is indeed surprising

Good point. Is it preferable to use SQL semantics? Alternatively, it could be configured with a flag argument, in which case the default behavior would still need to be documented.

src/Apache.Arrow.Operations/Comparison.cs

mobiusklein · 2026-02-10T23:32:13Z

Thank you for the initial feedback. I'll try to get back to this by next weekend. I'll see if I can unwrap the bitwise operations you mentioned, but I'm inclined to maintain the AVX/SIMD stuff as a strictly later problem.

adamreeve · 2026-02-23T03:33:42Z

src/Apache.Arrow.Operations/Select.cs

+    /// <param name="allocator">The memory allocator to build the new array from</param>
+    /// <returns></returns>
+    /// <exception cref="InvalidOperationException">If the mask and the array are not of equal size</exception>
+    public static Array Filter(Array array, BooleanArray mask, MemoryAllocator? allocator = null)


I haven't had a close look at this PR, but just wanted to mention that I've implemented the same operation at https://github.com/G-Research/ParquetSharp.Dataset/blob/main/ParquetSharp.Dataset/Filter/ArrayMaskApplier.cs.

It's implemented as an IArrowArrayVisitor with specialized implementations for different array types. It might be worth benchmarking the two approaches? Your approach has the advantage of working for any array type that ArrowArrayConcatenator supports, but I suspect there might be a lot of overhead with that when you have a large number of spans.

Yes, there would be. I had a hard time reading this implementation though, as it reaches into the internals of ArrayData, which don't think I have a good grasp of yet. I wrote the span slicing approach I did because it let me reliably delegate all that to the library.

It's not obvious, either, from reading your visitor what types it doesn't support?

I think I aimed to support all types at the time I wrote it, but it may be out of date now. If something like this was integrated into arrow-dotnet it might make sense to have a unit test to check that it works for all available types.

mobiusklein · 2026-03-21T20:32:09Z

Beyond the comparison of the Filter function, are there other pending changes requested?

CurtHagenlocher

Another partial review; sorry for the delay. This time I looked at everything but am not entirely done commenting. I do think we need a more efficient implementation of Filter/Select

CurtHagenlocher · 2026-03-29T02:26:06Z

src/Apache.Arrow.Operations/Comparison.cs

+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+


nit: please remove duplicate blank lines

CurtHagenlocher · 2026-03-29T02:26:28Z

src/Apache.Arrow.Operations/Comparison.cs

+
+namespace Apache.Arrow.Operations;
+
+public static class BitVectorOps


nit: Please move this to its own file

CurtHagenlocher · 2026-03-29T02:29:30Z

src/Apache.Arrow.Operations/Comparison.cs

+
+        while ((size - offset) >= 8)
+        {
+            if ((size - offset) >= 64)


Would it make more sense to promote this to the top level? That is,

while ((size - offset) >= 64) { ... } if ((size - offset) >= 32) { ... } if ((size - offset) >= 16) { ... } if ((size - offset) >= 8) { ... } if ((size - offset) > 0) { ... }

That should remove one conditional jump and shrink the size of the while loop.

The same applies to the other operations in this class.

CurtHagenlocher · 2026-03-29T02:32:32Z

src/Apache.Arrow.Operations/Comparison.cs

+        return builder.Build();
+    }
+
+    public static ArrowBuffer And(ArrowBuffer buffer, ArrowBuffer buffer2)


Consider naming the two inputs something a little more distinctive like left and right or x and y.

Or lhs and rhs like in the functions below.

CurtHagenlocher · 2026-03-29T02:32:48Z

src/Apache.Arrow.Operations/Comparison.cs

+
+    public static ArrowBuffer And(ArrowBuffer buffer, ArrowBuffer buffer2)
+    {
+        var builder = new ArrowBuffer.BitmapBuilder(buffer.Length * 8);


I think this should probably validate that both buffers have the same length.

The same applies to the other two-argument functions in this class.

Okay, I see that the length comparisons are being done in the calling functions. Perhaps BitVectorOps should be internal instead of public then?

There's the added complication when combining validity buffers that one or both may be null. I think for And and Or there should probably be a shortcut at the beginning which says "if lhs is empty then return rhs". For Xor, if both buffers are empty then it should return empty. If just one buffer is empty then it should probably be an error.

CurtHagenlocher · 2026-03-29T03:14:05Z

src/Apache.Arrow.Operations/Comparison.cs

+    {
+        if (lhs.Length != rhs.Length) throw new ArgumentException("Arrays must have the same length");
+        var combined = BitVectorOps.And(lhs.ValueBuffer, rhs.ValueBuffer);
+        var combinedMask = BitVectorOps.And(lhs.NullBitmapBuffer, rhs.NullBitmapBuffer);


The "nulls" buffer (aka the validity buffer) can be null, meaning "no nulls" (or "all valid"). I'll add a comment to the bitwise operations to call this out.

CurtHagenlocher · 2026-03-29T03:15:00Z

src/Apache.Arrow.Operations/Comparison.cs

+    }
+
+    /// <summary>
+    /// Performa a pairwise boolean OR operation.


Suggested change

/// Performa a pairwise boolean OR operation.

/// Perform a pairwise boolean OR operation.

CurtHagenlocher · 2026-03-29T03:17:19Z

src/Apache.Arrow.Operations/Comparison.cs

+    /// <param name="allocator"></param>
+    /// <param name="nullHandling"></param>
+    /// <returns></returns>
+    public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, T? rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T>


Should this also take a ComparisonNullHandling parameter?

CurtHagenlocher · 2026-03-29T03:18:59Z

src/Apache.Arrow.Operations/Comparison.cs

+    /// <param name="mask"></param>
+    /// <param name="allocator"></param>
+    /// <returns></returns>
+    public static BooleanArray Invert(BooleanArray mask, MemoryAllocator? allocator = null)


The allocator argument isn't used. Applies to many of these methods.

CurtHagenlocher · 2026-03-29T03:38:57Z

src/Apache.Arrow.Operations/Select.cs

+    /// <param name="allocator">The memory allocator to build the new array from</param>
+    /// <returns></returns>
+    /// <exception cref="InvalidOperationException"></exception>
+    public static Array Take(Array array, IList<(int, int)> spans, MemoryAllocator? allocator = null)


Take is usually used for the slicing operation (as in "skip/take"). Is there any reason this shouldn't just use the same name as Filter? And this verges on bikeshedding, but maybe Select is a good name for this operation?

This overload that takes an IList<(int, int)>, should this be private or is it more than an implementation detail of the other methods? I'm having trouble envisioning a scenario where this would be used.

CurtHagenlocher · 2026-03-29T04:04:06Z

src/Apache.Arrow.Operations/Comparison.cs

+            if ((size - offset) >= 64)
+            {
+                var part = buffer.Span.Slice(offset, 64);
+                Vector512<byte> vector = Vector512.Create(part);


I think these need to check Vector512.IsHardwareAccelerated before using it, and should have a fallback if it's false.

mobiusklein added 2 commits February 8, 2026 14:10

feature: add selection, filtering, primitive comparison, formatting a…

0ab5908

…nd limited casting

chore: add file header license

ea299d7

CurtHagenlocher reviewed Feb 9, 2026

View reviewed changes

adamreeve reviewed Feb 23, 2026

View reviewed changes

mobiusklein added 6 commits February 28, 2026 21:50

feature: add builders with capacity, IEnumerable impls

a66ad19

feature: refactor to use vector bits

9ab12a2

feature: rework more numerical casts

8f14079

chore: use ArgumentException

33e651a

feature: add sql-like null handling option in comparisons

3b1ed32

chore: reorder import

ebd1afb

CurtHagenlocher requested changes Mar 29, 2026

View reviewed changes

CurtHagenlocher reviewed Mar 29, 2026

View reviewed changes


		namespace Apache.Arrow.Operations;

		public static class BitVectorOps

	/// Performa a pairwise boolean OR operation.
	/// Perform a pairwise boolean OR operation.

Conversation

mobiusklein commented Feb 8, 2026

What's Changed

Uh oh!

CurtHagenlocher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wldevries Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mobiusklein commented Feb 10, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mobiusklein commented Mar 21, 2026

Uh oh!

CurtHagenlocher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

wldevries Feb 22, 2026 •

edited

Loading