
Support for XGLM: How to achieve faster inference speed?

See original GitHub issue

Describe a requested feature

Thanks for releasing this great library! I am working on deploying facebook/xglm-7.5B, which is not currently supported by parallelformers.

POLICY.md provides a comprehensive guide for parallelizing my own models, but I am a little unsure of

  1. which weights should be parallelized, and
  2. how many GPUs should be used

to get better inference speed. (Rough sketches for both points follow below.)
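
On point 2, a rough back-of-the-envelope estimate (an editor's sketch, not a figure from the thread): XGLM-7.5B has about 7.5 billion parameters, so the fp16 weights alone occupy roughly 15 GB. Tensor parallelism splits those weights roughly evenly across devices, and each device still needs headroom for activations and the key/value cache during generation.

# Rough per-GPU memory for the fp16 weights of facebook/xglm-7.5B under
# tensor parallelism. Activations, the KV cache and framework overhead are
# ignored, so treat these numbers as lower bounds.
PARAMS = 7.5e9          # approximate parameter count of XGLM-7.5B
BYTES_PER_PARAM = 2     # fp16

for world_size in (1, 2, 4, 8):
    weights_gib = PARAMS * BYTES_PER_PARAM / world_size / 1024**3
    print(f"{world_size} GPU(s): ~{weights_gib:.1f} GiB of weights per device")

By this estimate, two 16 GB GPUs can already hold the sliced weights, while four devices leave more comfortable headroom for long sequences and larger batches.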

Architecture of XGLM-7.5B

root
├── model (XGLMModel)
│   ├── embed_tokens (Embedding) weight: [256008, 4096]
│   ├── embed_positions (XGLMSinusoidalPositionalEmbedding) weights: [2050, 4096]
│   ├── layers (ModuleList)
│   │   └── 0-31 (XGLMDecoderLayer)
│   │       ├── self_attn (XGLMAttention)
│   │       │   └── k_proj, v_proj, q_proj, out_proj (Linear) weight: [4096, 4096] bias: [4096]
│   │       ├── self_attn_layer_norm, final_layer_norm (LayerNorm) weight: [4096] bias: [4096]
│   │       ├── fc1 (Linear) weight: [16384, 4096] bias: [16384]
│   │       └── fc2 (Linear) weight: [4096, 16384] bias: [4096]
│   └── layer_norm (LayerNorm) weight: [4096] bias: [4096]
└── lm_head (Linear) weight: [256008, 4096]
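
On point 1, here is a minimal sketch of what an XGLM policy could look like, modeled on the Policy/Layer interface described in POLICY.md and on the library's existing decoder policies. The module paths (self_attn.q_proj, fc1, fc2, ...) come from the architecture dump above; the AllReduceLinear import path, the replace= field, and the config attribute names (d_model, attention_heads) are assumptions that should be checked against an existing policy (e.g. the BART one) in the installed version of parallelformers.

from parallelformers.policies.base import Layer, Policy
from parallelformers.utils.dist_utils import AllReduceLinear  # import path may differ by version
from transformers.models.xglm.modeling_xglm import XGLMDecoderLayer


class XGLMPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        # Each GPU only keeps its slice of the attention, so shrink the
        # per-layer width and head count accordingly.
        return {
            "self_attn.embed_dim": config.d_model // world_size,
            "self_attn.num_heads": config.attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        # Q/K/V projections are column-sliced across GPUs.
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        # The output projection is row-sliced and its partial results are
        # summed with an all-reduce.
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        # fc1 (column-sliced)
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        # fc2 (row-sliced, followed by all-reduce)
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        return XGLMDecoderLayer

The layer norms and the embedding / lm_head weights listed above are normally left replicated rather than sliced. Once a policy like this is registered (POLICY.md explains how to pass a custom policy to the library), the usual entry point is parallelize(model, num_gpus=..., fp16=True), so the GPU-count estimate above applies directly.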

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

4 reactions
hyunwoongko commented, Apr 7, 2022

@z-bookworm I’ll add it now

1 reaction
un-certainty commented, Apr 24, 2022

Hi @hyunwoongko, may I have an update on this feature?

Read more comments on GitHub >

