Run T5 on a CPU for inference
Hi, great work on T5!
I'm looking to run T5 on a CPU for interactive predictions only, no fine-tuning. The provided notebook gives great instructions for using T5 with a TPU, but I'm struggling to figure out how to use it with a CPU.
I've tried changing the notebook along these lines:

```python
model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=None,
    model_parallelism=model_parallelism,
    batch_size=train_batch_size,
    # Sometimes I include this, sometimes I don't - it doesn't seem to matter:
    layout_rules="ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch,model1:batch1,cpu:0",
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=keep_checkpoint_max if ON_CLOUD else None,
    iterations_per_loop=100,
)
```
But I get this error:

```
ValueError: Tensor dimension size not divisible by mesh dimension size: tensor_shape=Shape[outer_batch=1, batch=4, length=128] tensor_layout=TensorLayout(None, 0, None)
```
It seems likely that this has something to do with my TensorLayout being `None`. Would you mind giving me some tips on this? Thanks a bunch in advance.
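For context, the repository's GPU instructions configure `MtfModel` without a TPU by passing an explicit `mesh_shape` and `mesh_devices`. Below is a minimal sketch of the analogous CPU setup; the `"cpu:0"` device string and the 1x1 mesh are assumptions adapted from that GPU example, not a verified recipe.

```python
import t5

MODEL_DIR = "/path/to/model_dir"  # hypothetical local checkpoint directory

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=None,                      # no TPU: run on local devices
    model_parallelism=1,           # single device, so no model parallelism
    batch_size=1,
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
    mesh_shape="model:1,batch:1",  # 1x1 mesh: no dimension is split
    mesh_devices=["cpu:0"],        # assumption: place the whole mesh on the CPU
)
```

With a 1x1 mesh no tensor dimension has to be split, which avoids the divisibility error quoted above.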
`(None, 0, None)` indicates the sharding of the tensor dimensions. The first and third tensor dimensions are not split; the second tensor dimension (`batch=4`) is split across the 0th mesh dimension.
Say this is running with model_parallelism=1 (i.e., pure data parallelism) on an 8-core TPU. The batch would then need to be split 8 ways, but the batch size is 4, so this is impossible.
Solutions would be to either double the batch size or increase model_parallelism to 2 or 4.
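To make the constraint concrete, here is the arithmetic as a small sketch, using the numbers from the error above (batch=4, 8 TPU cores, model_parallelism=1):

```python
num_cores = 8                       # cores in the TPU mesh
model_parallelism = 1               # cores used to split the model
data_parallelism = num_cores // model_parallelism  # cores splitting the batch -> 8
batch_size = 4

# The batch dimension must divide evenly across the data-parallel mesh dimension:
print(batch_size % data_parallelism == 0)  # False: a batch of 4 cannot be split 8 ways

# The two fixes suggested above:
print(8 % (num_cores // 1) == 0)    # True: double the batch size to 8
print(4 % (num_cores // 2) == 0)    # True: model_parallelism=2 leaves 4 data-parallel cores
```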
Hi @nshazeer, thanks for the clarification.