feature: add selection, filtering, primitive comparison, formatting and limited casting#257
feature: add selection, filtering, primitive comparison, formatting and limited casting#257mobiusklein wants to merge 8 commits intoapache:mainfrom
Conversation
CurtHagenlocher
left a comment
There was a problem hiding this comment.
Thanks so much for this work! I've left initial comments on the first two files and will try to return to it within the next day.
| </PropertyGroup> | ||
|
|
||
| <PropertyGroup> | ||
| <TargetFrameworks>net8.0</TargetFrameworks> |
There was a problem hiding this comment.
Can we target netstandard2.0 also or do you expect it to be hard to support?
There was a problem hiding this comment.
Okay, we obviously wouldn't be able to use the INumber-based methods and would have to replace them with individual implementations by type. That could perhaps be done another time.
There was a problem hiding this comment.
DataFrame from the Microsoft.Data.Analysis project could serve as an example. They still have a T4 macro for netstandard2.0, but switched to INumber-based methods for other target frameworks.
| <PropertyGroup> | ||
| <TargetFrameworks>net8.0</TargetFrameworks> | ||
| <ImplicitUsings>enable</ImplicitUsings> | ||
| <Nullable>enable</Nullable> |
There was a problem hiding this comment.
The other projects don't enable Nullable yet but there's no good reason not to enable it here. They also don't enable ImplicitUsings but I don't have the experience to know why we would want or not want it.
| { | ||
| var builder = new BooleanArray.Builder(); | ||
| builder.Reserve(mask.Length); | ||
| foreach (var val in mask) |
There was a problem hiding this comment.
This can be done much more efficiently by getting the underlying ArrowBuffers from the source array and operating on them directly. If the validity buffer is set, it can be copied as-is (we don't have a way to share buffers, unfortunately). For the value buffer, we could allocate it with ArrowBuffer.Builder and then get its Span property. For both the ReadOnlySpan on the source buffer and the Span on the target buffer, we could then cast them to (ReadOnly)Span<ulong> and invert 64 bits at a time.
To some degree this depends on how interested you are in going down the optimization rabbit hole. You could also use Benchmark.NET and try it both ways to see what kind of difference it makes.
The extreme version of this kind of thing involves learning about some of the SIMD intrinsics that were added to .NET, like Avx2. I started down this path in 2023 but never ended up having the time for it ... .
There was a problem hiding this comment.
The same approach could also be used for the binary operations, except that the validity buffers would need to be ANDed.
In general, I think it's best to implement these primitive operations in terms of ArrowBuffer and then have the Array-based methods call into those.
There was a problem hiding this comment.
I think I probably mean Vector256 and not Avx2, at least initially.
There was a problem hiding this comment.
If you were to go that route you should consider adding a dependency on System.Numerics.Tensors and for this method specifically TensorPrimivitives.OnesComplement:
byte[] dest = new byte[mask.Values.Length];
System.Numerics.Tensors.TensorPrimitives.OnesComplement<byte>(mask.Values, dest);
var valueBuffer = new ArrowBuffer(dest);
var validityBuffer = mask.Data.Buffers[0];
var data = new ArrayData(
BooleanType.Default, mask.Length, mask.NullCount,
mask.Offset, new[] { validityBuffer, valueBuffer });
return new BooleanArray(data);The Microsoft docs state it's .NET 10+, but the code above compiles targeting .NET 8.0. I'm not sure, but it seems like a bug in the docs site.
There was a problem hiding this comment.
Thank you. There's a lot of possible optimizations, some of which are probably beyond the scope of this first writing. I'll look into the bit inversion methods, although I am not familiar enough with the consequences of using the SIMD types.
it can be copied as-is (we don't have a way to share buffers, unfortunately)
@CurtHagenlocher could you please expand on what you mean by this? Is it that an array assumes it is the unique owner of its value and validity buffer and might mutate them directly?
There was a problem hiding this comment.
Is it that an array assumes it is the unique owner of its value and validity buffer and might mutate them directly?
It is because an array assumes that it's the unique owner of its buffers, yes. So if you share the buffer and the first array is disposed, the second one will be holding an invalid buffer. To fix this, buffers would need something like a reference count. This would also be very helpful when exporting via the C API, but I think I've convinced myself that the functionality can't be added in a backwards-compatible way.
The library doesn't support mutable arrays either, which is also a gap if you wanted to optimize computation over arrays. A processing engine might determine, for instance that the expression a + b could just reuse a's storage as the target of the computation and avoid having to allocate a new array -- if a is never going to be used again. I'm skeptical that C# has the right semantics to implement this in a purely "safe" way but haven't put much effort into working anything out. (Of course, types like StringArray are extremely hard to mutate in place no matter what the capabilities of the implementation language.)
There was a problem hiding this comment.
Okay, I rewrote this using the generic SIMD features for bitflag operations. I decided to do it by hand rather than add a new dependency since I don't understand the compatibility of Numerics.Tensor. Doing this for general-purpose numeric operations would be harder, at which point that newer library is probably the right choice. That kind of fast-path optimization would assume there are no nulls though.
| /// <param name="rhs"></param> | ||
| /// <param name="allocator"></param> | ||
| /// <returns></returns> | ||
| public static BooleanArray Equal(StringArray lhs, string rhs, MemoryAllocator? allocator = null) |
There was a problem hiding this comment.
Should rhs be nullable to allow a comparison to null?
It's worth calling out in the documentation that this implements C# semantics, so null == null. These are different from SQL semantics where the value of null == X is null. I think that's the right thing to do, but it might be surprising to someone expecting more SQL-like behavior.
There was a problem hiding this comment.
https://arrow.apache.org/docs/cpp/compute.html#comparisons
That is indeed surprising
There was a problem hiding this comment.
Good point. Is it preferable to use SQL semantics? Alternatively, it could be configured with a flag argument, in which case the default behavior would still need to be documented.
|
Thank you for the initial feedback. I'll try to get back to this by next weekend. I'll see if I can unwrap the bitwise operations you mentioned, but I'm inclined to maintain the AVX/SIMD stuff as a strictly later problem. |
| /// <param name="allocator">The memory allocator to build the new array from</param> | ||
| /// <returns></returns> | ||
| /// <exception cref="InvalidOperationException">If the mask and the array are not of equal size</exception> | ||
| public static Array Filter(Array array, BooleanArray mask, MemoryAllocator? allocator = null) |
There was a problem hiding this comment.
I haven't had a close look at this PR, but just wanted to mention that I've implemented the same operation at https://github.com/G-Research/ParquetSharp.Dataset/blob/main/ParquetSharp.Dataset/Filter/ArrayMaskApplier.cs.
It's implemented as an IArrowArrayVisitor with specialized implementations for different array types. It might be worth benchmarking the two approaches? Your approach has the advantage of working for any array type that ArrowArrayConcatenator supports, but I suspect there might be a lot of overhead with that when you have a large number of spans.
There was a problem hiding this comment.
Yes, there would be. I had a hard time reading this implementation though, as it reaches into the internals of ArrayData, which don't think I have a good grasp of yet. I wrote the span slicing approach I did because it let me reliably delegate all that to the library.
It's not obvious, either, from reading your visitor what types it doesn't support?
There was a problem hiding this comment.
I think I aimed to support all types at the time I wrote it, but it may be out of date now. If something like this was integrated into arrow-dotnet it might make sense to have a unit test to check that it works for all available types.
|
Beyond the comparison of the |
CurtHagenlocher
left a comment
There was a problem hiding this comment.
Another partial review; sorry for the delay. This time I looked at everything but am not entirely done commenting. I do think we need a more efficient implementation of Filter/Select
| using Apache.Arrow.Memory; | ||
| using Apache.Arrow.Types; | ||
|
|
||
|
|
There was a problem hiding this comment.
nit: please remove duplicate blank lines
|
|
||
| namespace Apache.Arrow.Operations; | ||
|
|
||
| public static class BitVectorOps |
There was a problem hiding this comment.
nit: Please move this to its own file
|
|
||
| while ((size - offset) >= 8) | ||
| { | ||
| if ((size - offset) >= 64) |
There was a problem hiding this comment.
Would it make more sense to promote this to the top level? That is,
while ((size - offset) >= 64) { ... }
if ((size - offset) >= 32) { ... }
if ((size - offset) >= 16) { ... }
if ((size - offset) >= 8) { ... }
if ((size - offset) > 0) { ... }
That should remove one conditional jump and shrink the size of the while loop.
There was a problem hiding this comment.
The same applies to the other operations in this class.
| return builder.Build(); | ||
| } | ||
|
|
||
| public static ArrowBuffer And(ArrowBuffer buffer, ArrowBuffer buffer2) |
There was a problem hiding this comment.
Consider naming the two inputs something a little more distinctive like left and right or x and y.
There was a problem hiding this comment.
Or lhs and rhs like in the functions below.
|
|
||
| public static ArrowBuffer And(ArrowBuffer buffer, ArrowBuffer buffer2) | ||
| { | ||
| var builder = new ArrowBuffer.BitmapBuilder(buffer.Length * 8); |
There was a problem hiding this comment.
I think this should probably validate that both buffers have the same length.
There was a problem hiding this comment.
The same applies to the other two-argument functions in this class.
There was a problem hiding this comment.
Okay, I see that the length comparisons are being done in the calling functions. Perhaps BitVectorOps should be internal instead of public then?
There was a problem hiding this comment.
There's the added complication when combining validity buffers that one or both may be null. I think for And and Or there should probably be a shortcut at the beginning which says "if lhs is empty then return rhs". For Xor, if both buffers are empty then it should return empty. If just one buffer is empty then it should probably be an error.
| { | ||
| if (lhs.Length != rhs.Length) throw new ArgumentException("Arrays must have the same length"); | ||
| var combined = BitVectorOps.And(lhs.ValueBuffer, rhs.ValueBuffer); | ||
| var combinedMask = BitVectorOps.And(lhs.NullBitmapBuffer, rhs.NullBitmapBuffer); |
There was a problem hiding this comment.
The "nulls" buffer (aka the validity buffer) can be null, meaning "no nulls" (or "all valid"). I'll add a comment to the bitwise operations to call this out.
| } | ||
|
|
||
| /// <summary> | ||
| /// Performa a pairwise boolean OR operation. |
There was a problem hiding this comment.
| /// Performa a pairwise boolean OR operation. | |
| /// Perform a pairwise boolean OR operation. |
| /// <param name="allocator"></param> | ||
| /// <param name="nullHandling"></param> | ||
| /// <returns></returns> | ||
| public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, T? rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T> |
There was a problem hiding this comment.
Should this also take a ComparisonNullHandling parameter?
| /// <param name="mask"></param> | ||
| /// <param name="allocator"></param> | ||
| /// <returns></returns> | ||
| public static BooleanArray Invert(BooleanArray mask, MemoryAllocator? allocator = null) |
There was a problem hiding this comment.
The allocator argument isn't used. Applies to many of these methods.
| /// <param name="allocator">The memory allocator to build the new array from</param> | ||
| /// <returns></returns> | ||
| /// <exception cref="InvalidOperationException"></exception> | ||
| public static Array Take(Array array, IList<(int, int)> spans, MemoryAllocator? allocator = null) |
There was a problem hiding this comment.
Take is usually used for the slicing operation (as in "skip/take"). Is there any reason this shouldn't just use the same name as Filter? And this verges on bikeshedding, but maybe Select is a good name for this operation?
This overload that takes an IList<(int, int)>, should this be private or is it more than an implementation detail of the other methods? I'm having trouble envisioning a scenario where this would be used.
| if ((size - offset) >= 64) | ||
| { | ||
| var part = buffer.Span.Slice(offset, 64); | ||
| Vector512<byte> vector = Vector512.Create(part); |
There was a problem hiding this comment.
I think these need to check Vector512.IsHardwareAccelerated before using it, and should have a fallback if it's false.
What's Changed
I've added an
Apache.Arrow.Operationsmodule as recommended in #254 to hold some utilities for generically working with primitive arrays.Included:
Closes #254
I would like some early feedback to see if I am going in the right direction. As .NET isn't a language I work in routinely, I am likely missing common names or idioms for these operations. I also do not know the degree to which the JIT is doing loop unrolling to make these homogenous operations more performant, though if it needs to be hand rolled that is a strictly later problem.