Add torch.Tensor fast path for StridedMemoryView via AOTI tensor bridge #1894
leofang wants to merge 34 commits into NVIDIA:main
Conversation
Provide a fast path for constructing a StridedMemoryView from a
torch.Tensor by reading tensor metadata directly through PyTorch's
AOT Inductor (AOTI) stable C ABI, avoiding DLPack/CAI protocol
overhead (~10 ns per tensor via pointer arithmetic).
Key design:
- Vendored AOTI shim header (aoti_shim.h) with extern "C" wrapping
- _tensor_bridge.pyx loaded lazily (only when a torch.Tensor is first
passed) to avoid undefined AOTI symbols at import time
- RTLD_GLOBAL bootstrap via sys.modules["torch._C"] before loading
_tensor_bridge.so
- torch detection via type(obj).__module__.startswith("torch")
- PyTorch is NOT a build-time or run-time dependency of cuda.core
Closes NVIDIA#749
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pty .pxd

- Remove unused aoti_torch_get_numel and aoti_torch_get_storage_offset declarations from aoti_shim.h and _tensor_bridge.pyx
- Fix license headers on new files to 2026 (not 2024-2026)
- Delete the empty _tensor_bridge.pxd (nothing cimports from it)
- Defer numpy dtype resolution for torch tensors: store the raw AOTI dtype code in metadata, compute itemsize from a cheap lookup table, and resolve the full numpy dtype only on first .dtype access via get_dtype()
Instead of short-circuiting in __init__ and from_any_interface, add the AOTI fast path check to from_dlpack, from_cuda_array_interface, and from_array_interface. This ensures torch tensors always take the fast path regardless of which constructor the user calls. Simplify from_any_interface and _StridedMemoryViewProxy to just delegate to the from_* methods (which now handle torch internally).
When stream_ptr is not -1, establish stream ordering between PyTorch's current CUDA stream (the producer) and the consumer stream, using the same event record + stream wait pattern as the CAI path. Uses aoti_torch_get_current_cuda_stream to get the producer stream, matching what PyTorch's own __dlpack__ does internally.
Factor out stream ordering into a cpdef sync_torch_stream() helper in _tensor_bridge.pyx, callable from both C (view_as_torch_tensor) and Python (_memoryview.pyx). Apply the same stream ordering in view_as_cai for torch tensors: PyTorch's __cuda_array_interface__ reports version 2 and omits the "stream" field, so the standard CAI sync path is a no-op, leaving the consumer with no guarantee that the producer's work is visible. We now detect torch tensors in the CAI path and query PyTorch's current CUDA stream via AOTI to establish proper ordering.
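The event record + stream wait handshake used here can be modeled with plain Python objects. This is a toy model of the ordering semantics, not real CUDA code; in the actual implementation the event record and wait are CUDA driver calls on real streams:

```python
class FakeStream:
    """Minimal model of a CUDA stream: an ordered list of queued operations."""
    def __init__(self, name):
        self.name = name
        self.ops = []

    def launch(self, op):
        self.ops.append(op)

class FakeEvent:
    """Model of a CUDA event: a marker into a stream's queue at record time."""
    def __init__(self):
        self.marker = None

    def record(self, stream):
        # Capture "everything queued on `stream` so far".
        self.marker = (stream, len(stream.ops))

def stream_wait(consumer, event):
    # Consumer will not proceed past this point until the producer's work
    # up to the recorded marker has completed.
    producer, upto = event.marker
    consumer.ops.append(("wait", producer.name, upto))

# The producer/consumer handshake applied by the torch fast path:
producer, consumer = FakeStream("torch-current"), FakeStream("user")
producer.launch("kernel-A")          # producer's pending work
ev = FakeEvent()
ev.record(producer)                  # event now covers kernel-A
stream_wait(consumer, ev)            # consumer ordered after kernel-A
consumer.launch("kernel-B")          # guaranteed to see producer's writes
```

The same pattern works whether the producer stream comes from `aoti_torch_get_current_cuda_stream` (torch fast path) or from the CAI `stream` field (standard path).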
- Add check_aoti() inline helper to replace repetitive err/raise patterns for AOTI calls (one-liner per call)
- Change itemsize type from int to size_t
- Add test_torch_tensor_bridge_sliced_2d test case
- Revert itemsize back to int (size_t was unnecessary for small values)
- Memoize int(stream_ptr) to avoid a redundant Python operator conversion
Better Cython 3 performance: except?-1 avoids the overhead of except*, which always checks for exceptions.
The AOTI stable C ABI functions we use (get_dim, get_dtype, get_device_type, get_device_index, get_current_cuda_stream, complex dtype constants) were all introduced in PyTorch 2.3.0. Earlier versions are missing some or all of them. _is_torch_tensor now returns False when torch < 2.3, causing a graceful fallback to the standard DLPack/CAI paths. The version check result is memoized in a module-level variable. Also move `import ctypes, sys` from _get_tensor_bridge to module level.
Document the AOTI-based fast path for torch.Tensor in StridedMemoryView with ~10-20x speedup and stream ordering support.
The cdata field changed from MaybeOwned<at::Tensor> (2.3-2.9) to at::Tensor (2.10+). Both layouts are compatible with our offset trick.
Cache the result of the torch tensor type check (module + hasattr + version) keyed by type(obj). Subsequent calls for the same type are a single dict lookup (~76 ns) instead of the full check (~186 ns). Non-torch objects also benefit, as the cache returns False immediately after the first miss.
The pyobj_to_aten_handle trick and the AtenTensorHandle == at::Tensor* identity are undocumented internals that could change. Cap at the latest tested version so unknown future versions fall back to the standard DLPack/CAI paths. Bump after verifying each new release.
```cython
cdef inline AtenTensorHandle pyobj_to_aten_handle(object obj):
    """Extract AtenTensorHandle by offsetting past PyObject_HEAD.

    In PyTorch 2.3–2.9 the first field after PyObject_HEAD is
    ``c10::MaybeOwned<at::Tensor> cdata``; from 2.10 onward it is
    ``at::Tensor cdata``. In both cases the address of ``cdata``
    is usable as an ``AtenTensorHandle`` (``at::Tensor*``) for the
    AOTI stable C ABI functions.
    """
    return <AtenTensorHandle>(<char*><PyObject*>obj + sizeof(PyObject))
```
Note: I have filed a feature request to discuss if this API can be formalized in AOTI directly, so that we can relax upper bound safely and be forward compatible: pytorch/pytorch#180107.
In PyTorch 2.3–2.9, THPVariable::cdata is c10::MaybeOwned<at::Tensor>, whose first member is bool isBorrowed_ (padded to 8 bytes) before the at::Tensor union member. The previous code always offset by sizeof(PyObject), which pointed at the bool tag (0x0), causing a segfault when AOTI functions dereferenced it as at::Tensor*.

Add _get_cdata_extra_offset() that checks the torch version at runtime and adds 8 bytes for torch < 2.10 (the MaybeOwned era). The result is memoized after the first call.

Tested across PyTorch 2.3.1, 2.4.1, 2.5.1, 2.6.0, 2.7.1, 2.8.0, 2.9.1, 2.10.0, and 2.11.0 with CPU tensors (9 dtypes, sliced tensors, 0d/1d/4d shapes).
- Document why readonly=False is correct for torch tensors: PyTorch always reports tensors as writable via both DLPack (flags=0) and CAI (data=(ptr, False)), and the AOTI C ABI has no readonly query.
- Change the vendored aoti_shim.h SPDX from Apache-2.0 to BSD-3-Clause to match PyTorch's actual license.
Add bfloat16 as a pytest.param with a skipif mark for ml_dtypes, removing the separate test_torch_tensor_bridge_bfloat16 function.
Replace the NVIDIA SPDX header with PyTorch's original BSD-3-Clause copyright text (from PyTorch LICENSE lines 3-11), following the same pattern as the vendored dlpack.h. Add aoti_shim.h to .spdx-ignore to bypass the NVIDIA-specific copyright check.
On Windows, MSVC requires a .lib to resolve __declspec(dllimport) symbols at link time. The AOTI symbols live in torch_cpu.dll (loaded by `import torch` at runtime) but torch is not a build-time dependency. Add:

- aoti_shim.def: symbol list for generating the stub import library
- AOTI_SHIM_API macro in aoti_shim.h: expands to __declspec(dllimport) on Windows, empty on Linux/macOS
- build_hooks.py: on Windows, run `lib /DEF:... /OUT:...` to generate the stub .lib and link _tensor_bridge against it

The stub .lib (~1KB) contains no code; it tells the linker that the symbols will come from torch_cpu.dll. At runtime, `import torch` loads the DLL before our extension is imported.
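The build-hook step can be sketched as a small Python helper. The function name and the `platform` parameter are illustrative; the real logic lives in build_hooks.py and only runs inside the Windows build:

```python
import subprocess
import sys

def make_stub_import_lib(def_path="aoti_shim.def", out_path="aoti_shim.lib",
                         platform=None):
    """Sketch of the Windows-only stub import-library generation.

    Turns the symbol list in the .def file into a ~1 KB stub .lib so MSVC
    can link _tensor_bridge without torch installed at build time. On
    Linux/macOS no import library is needed: the dynamic loader resolves
    the AOTI symbols at load time (via the RTLD_GLOBAL bootstrap).
    """
    platform = platform or sys.platform
    if not platform.startswith("win"):
        return None  # nothing to do outside Windows
    # The stub contains no code; it only tells the linker that these
    # symbols will come from torch_cpu.dll at runtime.
    subprocess.check_call(
        ["lib", f"/DEF:{def_path}", f"/OUT:{out_path}", "/MACHINE:X64"])
    return out_path
```

`lib.exe` here is MSVC's library manager, available in a Visual Studio build environment; the command mirrors the one quoted in the aoti_shim.def header comment.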
The _tensor_bridge extension links against torch_cpu.dll via a stub import library. delvewheel tries to bundle this DLL into the wheel and fails because torch is not installed in the build environment. Exclude torch_cpu.dll and torch_python.dll with --no-dll so delvewheel skips them — they are provided by the user's torch install.
…ython into tensor-bridge-749
delvewheel uses --exclude (not --no-dll) and semicolons as path separators on Windows.
```python
raise RuntimeError(
    "torch._C is not loaded; cannot initialise the tensor bridge. "
    "Make sure PyTorch is imported before passing a torch.Tensor.")
ctypes.CDLL(torch_C.__file__, mode=ctypes.RTLD_GLOBAL)
```
question: if Windows is supported here, should this path handle Windows explicitly instead of always using mode=ctypes.RTLD_GLOBAL? There isn’t a direct Windows equivalent for RTLD_GLOBAL
As per the ctypes docs, on Windows mode is ignored.
rwgk left a comment:
Generated with Cursor GPT-5.4 Extra High Fast
I manually checked the _tensor_bridge.pyx code referenced below: the findings seem correct.
- High: `cuda_core/cuda/core/_memoryview.pyx:221` routes every recognized torch tensor into `view_as_torch_tensor()`, but `cuda_core/cuda/core/_tensor_bridge.pyx:318` only reads raw metadata and never re-applies PyTorch's export guardrails. That means tensors PyTorch intentionally rejects at the protocol boundary (notably `requires_grad`, conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) no longer fail the same way here. `cuda_core/cuda/core/_tensor_bridge.pyx:343` even says those cases are rejected, but there is no code that does it.
- Medium: `cuda_core/cuda/core/_memoryview.pyx:146`, `cuda_core/cuda/core/_memoryview.pyx:1028`, and `cuda_core/cuda/core/_memoryview.pyx:1036` make `-1` the explicit opt-out for synchronization and treat `stream_ptr=None` as ambiguous. The new torch path changes that: `cuda_core/cuda/core/_tensor_bridge.pyx:359` treats `None` as "do nothing", so `StridedMemoryView.from_dlpack(torch_cuda_tensor)` now returns an unsynchronized view instead of forcing the caller to choose a stream policy.
- Medium: The AOTI dtype tables at `cuda_core/cuda/core/_tensor_bridge.pyx:194` and `cuda_core/cuda/core/_tensor_bridge.pyx:227` only cover `uint8` on the unsigned side, while the existing DLPack conversion starting at `cuda_core/cuda/core/_memoryview.pyx:1108` supports `uint16`, `uint32`, and `uint64`. Because all torch tensors now short-circuit before the DLPack path at `cuda_core/cuda/core/_memoryview.pyx:221`, those torch dtypes regress from working via DLPack to `TypeError("Unsupported AOTI dtype code")`. The new dtype tests at `cuda_core/tests/test_utils.py:1000` only cover the narrower happy-path set.
```c
/* Opaque tensor handle -- corresponds to at::Tensor on the C++ side. */
struct AtenTensorOpaque;
typedef struct AtenTensorOpaque* AtenTensorHandle;
```
To help future maintainers and agents:
```c
/*
 * IMPORTANT: Keep the AOTI_SHIM_API declaration list below in sync with
 * aoti_shim.def. On Windows, build_hooks.py turns that .def file into the
 * stub import library that MSVC needs to link _tensor_bridge without making
 * PyTorch a build-time dependency. If you add, remove, or rename an imported
 * AOTI symbol here, update aoti_shim.def in the same change.
 */
```
```
; Stub import library definition for PyTorch's AOTI stable C ABI symbols.
; Used on Windows only: 'lib /DEF:aoti_shim.def /OUT:aoti_shim.lib /MACHINE:X64'
; generates a minimal import library that satisfies the MSVC linker.
; At runtime the symbols resolve from torch_cpu.dll (loaded by 'import torch').
```
Similar to the suggested comment in aoti_shim.h:
```
; IMPORTANT: Keep this export list in sync with the AOTI_SHIM_API declarations
; in aoti_shim.h. build_hooks.py turns this file into the stub import library
; that MSVC uses to link _tensor_bridge, so any added/removed/renamed AOTI
; symbol must be updated in both files.
```
I forgot to add: I spent a few minutes to answer: Why did we not need a
Summary
Add a fast path for constructing `StridedMemoryView` from `torch.Tensor` objects using PyTorch's AOT Inductor (AOTI) stable C ABI, bypassing the DLPack/CAI protocol overhead.

How it works
When a `torch.Tensor` is passed to any `from_*` classmethod (`from_dlpack`, `from_cuda_array_interface`, `from_array_interface`, or `from_any_interface`), the tensor metadata (data pointer, shape, strides, dtype, device) is read directly from the underlying C struct via AOTI function pointers, instead of going through the Python-level `__dlpack__()` or `__cuda_array_interface__` protocols.

The key technique (`pyobj_to_aten_handle`) extracts the `AtenTensorHandle` by offsetting past `PyObject_HEAD` in the `THPVariable` struct: pure C pointer arithmetic with zero Python API calls. The AOTI functions (`aoti_torch_get_data_ptr`, `aoti_torch_get_sizes`, etc.) then read tensor metadata through PyTorch's stable C ABI.

PyTorch is NOT a build-time or runtime dependency. The AOTI symbols are resolved lazily at runtime from `torch._C` (loaded with `RTLD_GLOBAL`) only when the user actually passes a `torch.Tensor`. The `_tensor_bridge` module is never imported at `cuda.core` load time.
Benchmarked with
%timeit(Python 3.12, PyTorch 2.11, NVIDIA RTX 6000 Ada):At the C level (no Python overhead), AOTI extracts all 7 metadata fields in ~14 ns — ~4x faster than the DLPack C exchange API (~60 ns) for the same metadata.
Stream ordering

- When `stream_ptr != -1`, establishes stream ordering between PyTorch's current CUDA stream (queried via `aoti_torch_get_current_cuda_stream`) and the consumer stream, using the same event-based pattern as the existing CAI path.
- PyTorch's `__cuda_array_interface__` reports version 2 with no `stream` field, making the standard CAI sync path a no-op. We detect torch tensors in the CAI path and apply AOTI-based stream ordering to fix this safety gap.
THPVariablestruct layout andAtenTensorHandle == at::Tensor*identity are undocumented internals)Files changed
cuda/core/_tensor_bridge.pyx(new): AOTI tensor bridge —pyobj_to_aten_handle,view_as_torch_tensor,sync_torch_stream, dtype/itemsize mappingcuda/core/_include/aoti_shim.h(new): Vendored subset of PyTorch's AOTI stable C ABI declarationscuda/core/_memoryview.pyx: Torch detection (_is_torch_tensorwith type cache + version bounds), lazy bridge loading, fast path in allfrom_*classmethods, CAI stream safety fix, lazy dtype resolutiontests/test_utils.py: 12 new test cases (dtypes, shapes, slicing, devices, decorator)docs/source/release/1.0.0-notes.rst: Release notes entryCloses #749
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Test plan