torch.split / torch.chunk cause "too complex strides" for Inductor + CUDA graph
When a model takes the output of torch.split / torch.chunk ops as input, or uses those ops internally, torch._debug_has_internal_overlap can report that the sizes and strides of the resulting view tensors make it too hard to determine whether memory overlap exists, preventing CUDA graphs from being enabled:
Example 1: using the output of torch.split as input to the graph
all_embs = torch.randn(8, 101168)
emb_split = [98400, 340, 40, 328, 380, 1320, 360]
split_emb = torch.split(all_embs, emb_split, dim=1)
split_emb now has tensors with the following sizes and strides:
torch.Size([8, 98400]), stride (101168, 1)
torch.Size([8, 340]), stride (101168, 1)
torch.Size([8, 40]), stride (101168, 1)
torch.Size([8, 328]), stride (101168, 1)
torch.Size([8, 380]), stride (101168, 1)
torch.Size([8, 1320]), stride (101168, 1)
torch.Size([8, 360]), stride (101168, 1)
Passing split_emb as input to the "Inductor + CUDA graph" backend causes CUDA graphs to be disabled.
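This is reproducible outside the compiler: every view returned by torch.split keeps the parent tensor's row stride of 101168 even though the row length shrank, so the views are non-contiguous and the overlap check gives up. A minimal sketch (the return codes 0 = no overlap, 1 = overlap, 2 = too hard mirror ATen's MemOverlap enum):

```python
import torch

all_embs = torch.randn(8, 101168)
emb_split = [98400, 340, 40, 328, 380, 1320, 360]
split_emb = torch.split(all_embs, emb_split, dim=1)

for t in split_emb:
    # Each piece is a view into all_embs' storage, so its stride stays
    # (101168, 1); it is neither contiguous nor zero-strided, and the
    # debug check returns 2 ("too hard to tell").
    print(t.size(), t.stride(), torch._debug_has_internal_overlap(t))
```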
Example 2: using torch.chunk within a graph
_2384 = torch.randn(8, 4096)
chunk_list = torch.chunk(_2384, 16, dim=1)
chunk_list now has 16 tensors of identical size and stride:
torch.Size([8, 256]), stride (4096, 1)
If we run this torch.chunk op within the graph and apply the "Inductor + CUDA graph" backend, CUDA graphs will be disabled.
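As a workaround sketch (not the fix the maintainers settled on), materializing each chunk with .contiguous() produces dense tensors that the overlap check accepts, at the cost of one copy per chunk:

```python
import torch

_2384 = torch.randn(8, 4096)
chunk_list = torch.chunk(_2384, 16, dim=1)

# Each chunk is a (8, 256) view with stride (4096, 1), which the
# overlap check cannot prove safe (2 = "too hard"):
assert torch._debug_has_internal_overlap(chunk_list[0]) == 2

# Copying the views makes them contiguous, so the check passes
# (0 = "no overlap"):
dense_chunks = [c.contiguous() for c in chunk_list]
assert all(torch._debug_has_internal_overlap(c) == 0 for c in dense_chunks)
```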
cc @ngimel
Issue Analytics
- Created: a year ago
- Reactions: 1
- Comments: 25 (18 by maintainers)
Top GitHub Comments
Yes, looks good, you can assume there are no negative strides.
We should stop using torch._debug_has_internal_overlap() and write our own check that is more selective. Perhaps:
Then switch our cudagraphs wrapper to just copy the underlying storage and use as_strided().
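A more selective check along the lines suggested above might look like the sketch below. views_cannot_overlap is a hypothetical helper, not a PyTorch API, and it assumes non-negative strides, as the comment permits: with non-negative strides, if each dimension's stride is at least the index span covered by all smaller-stride dimensions, no two indices can hit the same storage element.

```python
import torch

def views_cannot_overlap(t: torch.Tensor) -> bool:
    """Hypothetical, more selective replacement for
    torch._debug_has_internal_overlap(), assuming non-negative strides."""
    # Consider only dims with size > 1, ordered by ascending stride.
    dims = sorted(
        (p for p in zip(t.size(), t.stride()) if p[0] > 1),
        key=lambda p: p[1],
    )
    span = 1  # 1 + max storage offset reachable by the dims handled so far
    for size, stride in dims:
        if stride < span:
            return False  # overlap is possible; caller should fall back to a copy
        span = stride * (size - 1) + span
    return True

# torch.split views pass this stricter check even though
# torch._debug_has_internal_overlap() reports 2 ("too hard"):
piece = torch.split(torch.randn(8, 101168), [98400, 2768], dim=1)[1]
assert views_cannot_overlap(piece)
assert torch._debug_has_internal_overlap(piece) == 2

# A genuinely overlapping as_strided view is rejected:
bad = torch.as_strided(torch.randn(10), (3, 3), (1, 1))
assert not views_cannot_overlap(bad)
```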