Metal backend: Materialize non-packed tensor views in reinterpret_tensor #19033
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19033
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Cancelled Job, 2 Unrelated Failures
As of commit cf71b8e with merge base 1d37abd:
CANCELLED JOB - The following job was cancelled. Please retry:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR updates the Metal AOTI runtime shim to handle reinterpret_tensor views that have non-packed (holey) strides by materializing them into a newly allocated contiguous Metal buffer, aligning with ExecuTorch tensor construction requirements.
Changes:
- Added `is_packed_strides()` to detect when a strided view has “holes” (storage extent > `numel`).
- Added `materialize_packed()` to allocate a new Metal buffer and copy elements from the strided source view into a packed layout.
- Updated `aoti_torch__reinterpret_tensor` to materialize non-packed views, compute contiguous strides for the new buffer, and update ownership/refcount bookkeeping accordingly.
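The packed-stride check and the contiguous-stride computation described in this overview can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the PR's code: the helper names `numel_of`, `storage_extent`, `has_holes`, and `contiguous_strides` are hypothetical, strides are assumed non-negative, and everything is counted in elements rather than bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; names and signatures are assumptions.

// Total number of elements described by `sizes`.
int64_t numel_of(const std::vector<int64_t>& sizes) {
  int64_t n = 1;
  for (int64_t s : sizes) n *= s;
  return n;
}

// Number of storage elements spanned from the first to the last element
// reachable through the view (assumes non-negative strides).
int64_t storage_extent(const std::vector<int64_t>& sizes,
                       const std::vector<int64_t>& strides) {
  if (numel_of(sizes) == 0) return 0;
  int64_t extent = 1;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    extent += (sizes[i] - 1) * strides[i];
  }
  return extent;
}

// A view has "holes" when it spans more storage than it has elements.
bool has_holes(const std::vector<int64_t>& sizes,
               const std::vector<int64_t>& strides) {
  return storage_extent(sizes, strides) > numel_of(sizes);
}

// Row-major (contiguous) strides for a freshly allocated packed buffer.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size());
  int64_t running = 1;
  for (std::size_t i = sizes.size(); i-- > 0;) {
    strides[i] = running;
    running *= sizes[i];
  }
  return strides;
}
```

Under this definition a transposed view with sizes `{4, 2}` and strides `{1, 4}` is not flagged, because it is permuted but still covers its storage without gaps, whereas a half-width slice with sizes `{2, 4}` and strides `{8, 1}` is flagged.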
AOTI generates reinterpret_tensor views with non-packed strides (e.g. chunk/split for RoPE rotation) that have holes in memory. ExecuTorch's make_tensor_ptr requires densely packed layouts. When aoti_torch__reinterpret_tensor encounters non-packed strides, allocate a new contiguous Metal buffer and copy elements using strided access from the source. Authored with Claude.
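To make the chunk/split case concrete, here is a CPU-only sketch of copying a strided view into a dense buffer. It is an assumption-laden illustration, not the Metal implementation from this PR: `materialize_packed_cpu` is a hypothetical name, the element type is fixed to `float`, the offset and strides are in elements, and a `std::vector` stands in for the Metal buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for the copy step; the real change targets Metal
// buffers. Gathers the elements of a strided view into a packed vector.
std::vector<float> materialize_packed_cpu(const std::vector<float>& storage,
                                          int64_t offset,
                                          const std::vector<int64_t>& sizes,
                                          const std::vector<int64_t>& strides) {
  int64_t total = 1;
  for (int64_t s : sizes) total *= s;

  std::vector<float> out;
  if (total == 0) return out;
  out.reserve(static_cast<std::size_t>(total));

  // Multi-dimensional index, advanced like an odometer in row-major order.
  std::vector<int64_t> index(sizes.size(), 0);
  for (int64_t n = 0; n < total; ++n) {
    // Translate the logical index into a position in the source storage.
    int64_t src = offset;
    for (std::size_t d = 0; d < sizes.size(); ++d) {
      src += index[d] * strides[d];
    }
    out.push_back(storage[static_cast<std::size_t>(src)]);

    // Advance the innermost dimension first, carrying into outer ones.
    for (std::size_t d = sizes.size(); d-- > 0;) {
      if (++index[d] < sizes[d]) break;
      index[d] = 0;
    }
  }
  return out;
}
```

For example, splitting a `[2, 8]` row-major tensor in half along the last dimension gives two views with sizes `{2, 4}` and strides `{8, 1}`, at offsets 0 and 4. Each view skips four elements between rows, which is the kind of hole the commit message describes; the copy above returns the eight visible elements back to back.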
metascroy left a comment
Stamping, but maybe consider adopting SlimTensor to address this