Run T5 on a CPU for inference
Hi, great work on T5!
I'm looking to run T5 on a CPU for interactive predictions only, no fine-tuning. The provided notebook gives great instructions for using T5 with a TPU, but I'm struggling to figure out how to use it with a CPU.
I've tried changing the notebook along these lines:

```python
model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=None,
    model_parallelism=model_parallelism,
    batch_size=train_batch_size,
    # Sometimes I include this, sometimes I don't - it doesn't seem to matter:
    layout_rules="ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch,model1:batch1,cpu:0",
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=keep_checkpoint_max if ON_CLOUD else None,
    iterations_per_loop=100,
)
```
But I get this error:

```
ValueError: Tensor dimension size not divisible by mesh dimension size: tensor_shape=Shape[outer_batch=1, batch=4, length=128] tensor_layout=TensorLayout(None, 0, None)
```
It seems likely that this has something to do with my TensorLayout being `None`. Would you mind giving me some tips on this? Thanks a bunch in advance.
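For context, the repository's GPU instructions configure `MtfModel` without a TPU by passing an explicit `mesh_shape` and `mesh_devices`. Below is a minimal sketch of the analogous CPU setup; the `"cpu:0"` device string and the 1x1 mesh are assumptions adapted from that GPU example, not a verified recipe.

```python
import t5

MODEL_DIR = "/path/to/model_dir"  # hypothetical local checkpoint directory

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=None,                      # no TPU: run on local devices
    model_parallelism=1,           # single device, so no model parallelism
    batch_size=1,
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
    mesh_shape="model:1,batch:1",  # 1x1 mesh: no dimension is split
    mesh_devices=["cpu:0"],        # assumption: place the whole mesh on the CPU
)
```

With a 1x1 mesh no tensor dimension has to be split, which avoids the divisibility error quoted above.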
`(None, 0, None)` indicates the sharding of the tensor dimensions. The first and third tensor dimensions are not split; the second tensor dimension (`batch=4`) is split across the 0th mesh dimension.
Say this is running with model_parallelism=1 (i.e., pure data parallelism) on an 8-core TPU. The batch would then need to be split 8 ways, but the batch size is 4, so this is impossible.
Solutions would be to either double the batch size or increase model_parallelism to 2 or 4.
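To make the constraint concrete, here is the arithmetic as a small sketch, using the numbers from the error above (batch=4, 8 TPU cores, model_parallelism=1):

```python
num_cores = 8                       # cores in the TPU mesh
model_parallelism = 1               # cores used to split the model
data_parallelism = num_cores // model_parallelism  # cores splitting the batch -> 8
batch_size = 4

# The batch dimension must divide evenly across the data-parallel mesh dimension:
print(batch_size % data_parallelism == 0)  # False: a batch of 4 cannot be split 8 ways

# The two fixes suggested above:
print(8 % (num_cores // 1) == 0)    # True: double the batch size to 8
print(4 % (num_cores // 2) == 0)    # True: model_parallelism=2 leaves 4 data-parallel cores
```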
Hi @nshazeer, thanks for the clarification.