Cannot compile T5 for inferentia
System Info
- transformers version: 4.20.1
- Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.34
- Python version: 3.8.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.11.0+cu102 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The Inferentia example for Marian (Seq2Seq) is here
Here is the snippet of code I’m using:
Python Snippet for compiling T5 model
import os
import numpy as np
import torch
import torch.neuron
from torch.nn import functional as F
from transformers.generation_utils import GenerationMixin
from transformers.modeling_outputs import BaseModelOutput, Seq2SeqLMOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.models.t5.configuration_t5 import T5Config
from transformers.models.t5.modeling_t5 import T5Model
from transformers.models.t5.tokenization_t5 import T5Tokenizer
model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
num_texts = 1 # Number of input texts to decode
num_beams = 4 # Number of beams per input text
max_encoder_length = 32 # Maximum input token length
max_decoder_length = 32

def infer(model, tokenizer, text):
    # Truncate and pad to the max length so that the token count is compatible
    # with the fixed-size encoder (not necessary for pure CPU execution)
    batch = tokenizer(
        text,
        max_length=max_encoder_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    output = model.generate(
        **batch,
        max_length=max_decoder_length,
        num_beams=num_beams,
        num_return_sequences=num_beams,
    )
    results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]
    print("Texts:")
    for i, summary in enumerate(results):
        print(i + 1, summary)

def reduce(hidden, index):
    _, n_length, _ = hidden.shape
    # Create selection mask
    mask = torch.arange(n_length, dtype=torch.float32) == index
    mask = mask.view(1, -1, 1)
    # Broadcast mask
    masked = torch.multiply(hidden, mask)
    # Reduce along 1st dimension
    summed = torch.sum(masked, 1)
    return torch.unsqueeze(summed, 1)

class NeuronEncoder(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.encoder = model.encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)

class NeuronDecoder(torch.nn.Module):
    def __init__(self, model, max_length):
        super().__init__()
        self.weight = model.shared.weight.clone().detach()
        self.bias = model.final_logits_bias.clone().detach()
        self.decoder = model.decoder
        self.max_length = max_length

    def forward(self, input_ids, attention_mask, encoder_outputs, index):
        # Build a fixed sized causal mask for the padded decoder input ids
        mask = np.triu(np.ones((self.max_length, self.max_length)), 1)
        mask[mask == 1] = -np.inf
        causal_mask = torch.tensor(mask, dtype=torch.float)
        # Invoke the decoder
        (hidden,) = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_outputs,
            encoder_padding_mask=attention_mask,
            decoder_padding_mask=None,
            decoder_causal_mask=causal_mask,
            return_dict=False,
            use_cache=False,
        )
        # Reduce decoder outputs to the specified index (current iteration)
        hidden = reduce(hidden, index)
        # Compute final linear layer for token probabilities
        logits = F.linear(hidden, self.weight, bias=self.bias)
        return logits

class NeuronGeneration(PreTrainedModel, GenerationMixin):
    def trace(
        self, model, num_texts, num_beams, max_encoder_length, max_decoder_length
    ):
        """
        Traces the encoder and decoder modules for use on Neuron.

        This function fixes the network to the given sizes. Once the model has been
        compiled to a given size, the inputs to these networks must always be of
        fixed size.

        Args:
            model (GenerationMixin): The transformer-type generator model to trace
            num_texts (int): The number of input texts to translate at once
            num_beams (int): The number of beams to compute per text
            max_encoder_length (int): The maximum number of encoder tokens
            max_decoder_length (int): The maximum number of decoder tokens
        """
        self.config.max_decoder_length = max_decoder_length
        # Trace the encoder
        inputs = (
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
        )
        encoder = NeuronEncoder(model)
        self.encoder = torch.neuron.trace(encoder, inputs)
        # Trace the decoder (with expanded inputs)
        batch_size = num_texts * num_beams
        inputs = (
            torch.ones((batch_size, max_decoder_length), dtype=torch.long),
            torch.ones((batch_size, max_encoder_length), dtype=torch.long),
            torch.ones(
                (batch_size, max_encoder_length, model.config.d_model),
                dtype=torch.float,
            ),
            torch.tensor(0),
        )
        decoder = NeuronDecoder(model, max_decoder_length)
        self.decoder = torch.neuron.trace(decoder, inputs)

    # ------------------------------------------------------------------------
    # Beam Search Methods (Copied directly from transformers)
    # ------------------------------------------------------------------------
    def adjust_logits_during_generation(self, logits, cur_len, max_length):
        if cur_len == 1 and self.config.force_bos_token_to_be_generated:
            self._force_token_id_to_be_generated(logits, self.config.bos_token_id)
        elif cur_len == max_length - 1 and self.config.eos_token_id is not None:
            self._force_token_id_to_be_generated(logits, self.config.eos_token_id)
        return logits

    @staticmethod
    def _force_token_id_to_be_generated(scores, token_id) -> None:
        scores[:, [x for x in range(scores.shape[1]) if x != token_id]] = -float("inf")

    # ------------------------------------------------------------------------
    # Encoder/Decoder Invocation
    # ------------------------------------------------------------------------
    def prepare_inputs_for_generation(
        self,
        decoder_input_ids,
        encoder_outputs=None,
        attention_mask=None,
        **model_kwargs,
    ):
        # Pad the inputs for Neuron
        current_length = decoder_input_ids.shape[1]
        pad_size = self.config.max_decoder_length - current_length
        return dict(
            input_ids=F.pad(decoder_input_ids, (0, pad_size)),
            attention_mask=attention_mask,
            encoder_outputs=encoder_outputs.last_hidden_state,
            current_length=torch.tensor(current_length - 1),
        )

    def get_encoder(self):
        """Helper to invoke the encoder and wrap the results in the expected structure"""

        def encode(input_ids, attention_mask, **kwargs):
            (output,) = self.encoder(input_ids, attention_mask)
            return BaseModelOutput(
                last_hidden_state=output,
            )

        return encode

    def __call__(
        self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs
    ):
        """Helper to invoke the decoder and wrap the results in the expected structure"""
        logits = self.decoder(
            input_ids, attention_mask, encoder_outputs, current_length
        )
        return Seq2SeqLMOutput(logits=logits)

    # ------------------------------------------------------------------------
    # Serialization
    # ------------------------------------------------------------------------
    def save_pretrained(self, directory):
        if os.path.isfile(directory):
            print(f"Provided path ({directory}) should be a directory, not a file")
            return
        os.makedirs(directory, exist_ok=True)
        torch.jit.save(self.encoder, os.path.join(directory, "encoder.pt"))
        torch.jit.save(self.decoder, os.path.join(directory, "decoder.pt"))
        self.config.save_pretrained(directory)

    @classmethod
    def from_pretrained(cls, directory):
        config = T5Config.from_pretrained(directory)
        obj = cls(config)
        obj.encoder = torch.jit.load(os.path.join(directory, "encoder.pt"))
        obj.decoder = torch.jit.load(os.path.join(directory, "decoder.pt"))
        return obj

    @property
    def device(self):
        return torch.device("cpu")

model_cpu = T5Model.from_pretrained(model_id)
tokenizer_cpu = T5Tokenizer.from_pretrained(model_id)
model_neuron = NeuronGeneration(model_cpu.config)
# 1. Compile the model
# Note: This may take a couple of minutes since both the encoder/decoder will be compiled
model_neuron.trace(
    model=model_cpu,
    num_texts=num_texts,
    num_beams=num_beams,
    max_encoder_length=max_encoder_length,
    max_decoder_length=max_decoder_length,
)
# 2. Serialize an artifact
# After this call you will have an `encoder.pt`, `decoder.pt` and `config.json` in the neuron_name folder
model_neuron.save_pretrained(neuron_name)
tokenizer_cpu.save_pretrained(neuron_name)
model_neuron = NeuronGeneration.from_pretrained(neuron_name)
infer(model_neuron, tokenizer_cpu, sample_text)
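
The decoder wrapper above relies on three fixed-shape tricks: padding the decoder ids to max_decoder_length, building a constant causal mask, and selecting the current position with a multiply-and-sum instead of dynamic indexing. As a rough, dependency-free sketch of the same ideas on plain Python lists (the helper names pad_to_length, causal_mask and select_position are made up for this illustration and are not part of the script or of any library):

```python
import math


def pad_to_length(ids, length, pad_id=0):
    """Right-pad a list of token ids with pad_id up to a fixed length."""
    if len(ids) > length:
        raise ValueError("sequence is longer than the fixed length")
    return ids + [pad_id] * (length - len(ids))


def causal_mask(length):
    """Square mask: 0.0 on and below the diagonal, -inf strictly above it."""
    return [
        [-math.inf if col > row else 0.0 for col in range(length)]
        for row in range(length)
    ]


def select_position(hidden, index):
    """Pick hidden[index] using only a 0/1 mask, a multiply and a sum."""
    width = len(hidden[0])
    mask = [1.0 if pos == index else 0.0 for pos in range(len(hidden))]
    return [
        sum(mask[pos] * hidden[pos][dim] for pos in range(len(hidden)))
        for dim in range(width)
    ]


# Toy values: 3 generated tokens padded to a fixed length of 6,
# and a 3-position "hidden state" with 2 features per position.
padded = pad_to_length([5, 8, 2], 6)
mask = causal_mask(3)
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
picked = select_position(hidden, 1)
print(padded)
print(mask[0])
print(picked)
```

The point of the mask-based selection is that the traced graph keeps the same shape for every decoding step; only the value of the index changes.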
To set up the environment:
- Create a venv (Python 3.8 seems to work for neuron-cc, 3.9 does not seem to)
- Install neuron-cc:
pip install neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
- Install the rest:
pip install torch-neuron sagemaker transformers sentencepiece
- Run the code above
- Get the following output:
Traceback of the compilation
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Some weights of the model checkpoint at mrm8488/t5-base-finetuned-question-generation-ap were not used when initializing T5Model: ['lm_head.weight']
- This IS expected if you are initializing T5Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
INFO:Neuron:There are 2 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 627, fused = 587, percent fused = 93.62%
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$641; falling back to native python function call
ERROR:Neuron:No module named 'tensorflow'
Traceback (most recent call last):
File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 381, in op_converter
neuron_function = self.subgraph_compiler(
File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/decorators.py", line 67, in trace
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 627, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 49 [supported]
INFO:Neuron: => aten::ScalarImplicit: 2 [supported]
INFO:Neuron: => aten::abs: 1 [supported]
INFO:Neuron: => aten::add: 65 [supported]
INFO:Neuron: => aten::arange: 2 [supported]
INFO:Neuron: => aten::contiguous: 12 [supported]
INFO:Neuron: => aten::div: 2 [supported]
INFO:Neuron: => aten::dropout: 50 [supported]
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::full_like: 1 [supported]
INFO:Neuron: => aten::gt: 1 [supported]
INFO:Neuron: => aten::linear: 72 [supported]
INFO:Neuron: => aten::log: 1 [supported]
INFO:Neuron: => aten::lt: 1 [supported]
INFO:Neuron: => aten::matmul: 24 [supported]
INFO:Neuron: => aten::mean: 25 [supported]
INFO:Neuron: => aten::min: 1 [supported]
INFO:Neuron: => aten::mul: 53 [supported]
INFO:Neuron: => aten::permute: 1 [supported]
INFO:Neuron: => aten::pow: 25 [supported]
INFO:Neuron: => aten::relu: 12 [supported]
INFO:Neuron: => aten::rsqrt: 25 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::size: 14 [supported]
INFO:Neuron: => aten::slice: 4 [supported]
INFO:Neuron: => aten::softmax: 12 [supported]
INFO:Neuron: => aten::sub: 1 [supported]
INFO:Neuron: => aten::to: 41 [supported]
INFO:Neuron: => aten::transpose: 60 [supported]
INFO:Neuron: => aten::type_as: 12 [supported]
INFO:Neuron: => aten::unsqueeze: 5 [supported]
INFO:Neuron: => aten::view: 49 [supported]
INFO:Neuron: => aten::where: 1 [not supported]
Traceback (most recent call last):
File "compile_model.py", line 232, in <module>
model_neuron.trace(
File "compile_model.py", line 128, in trace
self.encoder = torch.neuron.trace(encoder, inputs)
File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 184, in trace
cu.stats_post_compiler(neuron_graph)
File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 492, in stats_post_compiler
raise RuntimeError(
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!
This seems to say that `.where` is not supported by neuron-cc (and thus by Inferentia), which means that we cannot run T5 models fast and cheaply…
Could we change the `.where` to something that is supported, so that models can be compiled for Inferentia?
Otherwise, should I raise the issue with neuron-cc so they can support it?
Thanks in advance, have a great day.
Expected behavior
Be able to compile a T5 model for Inferentia (with `.where` being supported).
Note: I also created an issue in the aws-neuron-sdk repo: https://github.com/aws/aws-neuron-sdk/issues/440
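
When reading a log like the one above, it can help to pull out only the operators tagged `[not supported]`, since every operator ends up in the "Not compiled" list once zero sub-graphs compile. The following is a small sketch that assumes the line format shown in the log; the function name unsupported_ops is hypothetical and not part of torch-neuron.

```python
import re

# Matches lines such as "INFO:Neuron: => aten::where: 1 [not supported]"
OP_LINE = re.compile(r"=>\s*(aten::\w+):\s*(\d+)\s*\[(supported|not supported)\]")


def unsupported_ops(log_text):
    """Return {operator: count} for every operator marked 'not supported'."""
    found = {}
    for line in log_text.splitlines():
        match = OP_LINE.search(line)
        if match and match.group(3) == "not supported":
            found[match.group(1)] = int(match.group(2))
    return found


# A few lines copied from the log above
sample_log = """\
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::linear: 72 [supported]
INFO:Neuron: => aten::where: 1 [not supported]
"""
result = unsupported_ops(sample_log)
print(result)
```

On these three lines it reports `aten::embedding` and `aten::where`. Note that the same log also contains an earlier `ModuleNotFoundError: No module named 'tensorflow'`, which is raised before any operator is compiled.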
Issue Analytics
- State:
- Created a year ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
Closing: The problem was on my side, thanks anyway for your help and time! 😃
All the “neuron packages” are installable, and wheels for 3.8 exist in the repo:
https://pip.repos.neuron.amazonaws.com
However, the installation leads to the wrong version of TensorFlow (neuron needs <2) that does not exist anymore. I cleaned everything, reinstalled with Python 3.7, and the two scripts above ran out of the box.
@Ierezell could you please try to set up your environment as described here. It feels like you are missing packages needed to successfully compile the models.
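
Since the root cause turned out to be the TensorFlow installation, a quick pre-flight check before tracing may save a compilation attempt. This is only a sketch using the standard library: the `< 2` threshold is taken from the comment above, the distribution name `tensorflow` is an assumption (a Neuron setup may ship it under another name), and both helper names are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def classify_tensorflow(version):
    """Classify a TensorFlow version string; None means not installed."""
    if version is None:
        return "missing"
    major = int(version.split(".")[0])
    # Assumption from the thread: the Neuron compiler needs TensorFlow < 2
    return "ok" if major < 2 else "too new"


status = classify_tensorflow(installed_version("tensorflow"))
print(f"Python {sys.version_info.major}.{sys.version_info.minor}, tensorflow: {status}")
```

A result of "missing" corresponds to the `ModuleNotFoundError` in the traceback, and "too new" to the wrong TensorFlow version described in the closing comment.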