
hidden_dim constraint in transformer cuda kernel


I found that there is a constraint on the dimensionality when we use the transformer CUDA kernel: https://github.com/microsoft/DeepSpeed/blob/d720fdb6857f4b71d922ca1e8efbe5271b5fb7c2/csrc/transformer/normalize_kernels.cu#L232-L250

I wonder what the reason behind it is. Is there any plan to support arbitrary dimensionality? Or, if I want to use hidden_dim=4096 or 8192, what do I need to do to make it work? Thanks.
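
For context, the linked lines in normalize_kernels.cu choose a launch configuration from a fixed switch over hidden_dim, which is the usual reason such a constraint exists: each supported size gets a hand-tuned threads-per-block value, and anything else is rejected. The sketch below only illustrates that general pattern; the sizes, thread counts, and kernel body are placeholders, not DeepSpeed's actual code.

```cuda
// Illustrative only: a launch helper that accepts a fixed set of hidden
// dimensions, each mapped to a hand-tuned threads-per-block value, and
// rejects everything else. Sizes and thread counts here are made up.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void layer_norm_stub(const float* input, float* output, int hidden_dim)
{
    // Placeholder body: each block would normalize one row of hidden_dim elements.
    int row = blockIdx.x;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
        output[row * hidden_dim + i] = input[row * hidden_dim + i];
}

void launch_layer_norm(const float* input, float* output,
                       int batch_size, int hidden_dim, cudaStream_t stream)
{
    int threads;
    switch (hidden_dim) {                    // only these sizes are supported
        case 768:  threads = 768;  break;
        case 1024: threads = 1024; break;
        case 2048: threads = 1024; break;    // two elements per thread
        default:
            std::fprintf(stderr, "Unsupported hidden_dim: %d\n", hidden_dim);
            return;                          // anything else is rejected
    }
    layer_norm_stub<<<batch_size, threads, 0, stream>>>(input, output, hidden_dim);
}
```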

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
RezaYazdaniAminabadi commented, Nov 24, 2020

Hi @szhengac

Sorry for the delay! I have modified the transformer kernel so that it can support different hidden dimensions. I have the code in a different branch of deepspeed: https://github.com/microsoft/DeepSpeed/tree/transformer-kernel/support-arbitrary-hidden. I have already tested this for various hidden dimensions from 128 to 8192 (https://github.com/microsoft/DeepSpeed/blob/transformer-kernel/support-arbitrary-hidden/tests/unit/test_cuda_forward.py#L216-L244).

Could you please try this and let me know if it works for your training environment?

Thank you, Reza
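
The comment above does not spell out how the branch removes the restriction, but the standard way to make a layer-norm kernel dimension-agnostic is to let each thread stride over its row and combine partial sums with a block-wide reduction, so the thread count no longer has to match hidden_dim. Below is a minimal sketch of that technique, assuming a fixed power-of-two block size; it is not the code in the support-arbitrary-hidden branch, and it omits the gamma/beta affine terms for brevity.

```cuda
// Minimal dimension-agnostic layer norm: one block per row, threads stride
// over the row, shared-memory reduction computes mean and variance.
#include <cuda_runtime.h>

__global__ void layer_norm_any_dim(const float* input, float* output,
                                   int hidden_dim, float epsilon)
{
    extern __shared__ float shmem[];               // blockDim.x floats
    const float* row_in  = input  + blockIdx.x * (size_t)hidden_dim;
    float*       row_out = output + blockIdx.x * (size_t)hidden_dim;

    // Partial sum over a strided slice of the row, then block-wide reduction.
    float sum = 0.f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
        sum += row_in[i];
    shmem[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) shmem[threadIdx.x] += shmem[threadIdx.x + s];
        __syncthreads();
    }
    float mean = shmem[0] / hidden_dim;
    __syncthreads();                               // done reading before reuse

    // Same pattern for the variance.
    float var_sum = 0.f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float d = row_in[i] - mean;
        var_sum += d * d;
    }
    shmem[threadIdx.x] = var_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) shmem[threadIdx.x] += shmem[threadIdx.x + s];
        __syncthreads();
    }
    float inv_std = rsqrtf(shmem[0] / hidden_dim + epsilon);

    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
        row_out[i] = (row_in[i] - mean) * inv_std;
}

// Launch with a fixed block size; hidden_dim can be any positive value.
void launch_layer_norm_any_dim(const float* input, float* output,
                               int batch_size, int hidden_dim, cudaStream_t stream)
{
    const int threads = 256;                       // independent of hidden_dim
    layer_norm_any_dim<<<batch_size, threads, threads * sizeof(float), stream>>>(
        input, output, hidden_dim, 1e-5f);
}
```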

1 reaction
RezaYazdaniAminabadi commented, Oct 28, 2020

Hi @szhengac

Thanks for pointing this out. We are currently working on supporting arbitrary dimensions. There will be a code update soon to add this feature. Please stay tuned! 😃

Thanks. Reza
