
Cannot compile T5 for inferentia


System Info

transformers version: 4.20.1
Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.34
Python version: 3.8.13
Huggingface_hub version: 0.8.1
PyTorch version (GPU?): 1.11.0+cu102 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: No
Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

The Inferentia example for Marian (Seq2Seq) is here

Here is the snippet of code I’m using:

Python Snippet for compiling T5 model

import os

import numpy as np
import torch
import torch.neuron
from torch.nn import functional as F
from transformers.generation_utils import GenerationMixin
from transformers.modeling_outputs import BaseModelOutput, Seq2SeqLMOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.models.t5.configuration_t5 import T5Config
from transformers.models.t5.modeling_t5 import T5Model
from transformers.models.t5.tokenization_t5 import T5Tokenizer

model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
num_texts = 1  # Number of input texts to decode
num_beams = 4  # Number of beams per input text
max_encoder_length = 32  # Maximum input token length
max_decoder_length = 32


def infer(model, tokenizer, text):

    # Truncate and pad the max length to ensure that the token size is compatible with fixed-sized encoder (Not necessary for pure CPU execution)
    batch = tokenizer(
        text,
        max_length=max_encoder_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    output = model.generate(
        **batch,
        max_length=max_decoder_length,
        num_beams=num_beams,
        num_return_sequences=num_beams,
    )
    results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]

    print("Texts:")
    for i, summary in enumerate(results):
        print(i + 1, summary)


def reduce(hidden, index):
    _, n_length, _ = hidden.shape

    # Create selection mask
    mask = torch.arange(n_length, dtype=torch.float32) == index
    mask = mask.view(1, -1, 1)

    # Broadcast mask
    masked = torch.multiply(hidden, mask)

    # Reduce along 1st dimension
    summed = torch.sum(masked, 1)
    return torch.unsqueeze(summed, 1)


class NeuronEncoder(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.encoder = model.encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)


class NeuronDecoder(torch.nn.Module):
    def __init__(self, model, max_length):
        super().__init__()
        self.weight = model.shared.weight.clone().detach()
        self.bias = model.final_logits_bias.clone().detach()
        self.decoder = model.decoder
        self.max_length = max_length

    def forward(self, input_ids, attention_mask, encoder_outputs, index):

        # Build a fixed sized causal mask for the padded decoder input ids
        mask = np.triu(np.ones((self.max_length, self.max_length)), 1)
        mask[mask == 1] = -np.inf
        causal_mask = torch.tensor(mask, dtype=torch.float)

        # Invoke the decoder
        (hidden,) = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_outputs,
            encoder_padding_mask=attention_mask,
            decoder_padding_mask=None,
            decoder_causal_mask=causal_mask,
            return_dict=False,
            use_cache=False,
        )

        # Reduce decoder outputs to the specified index (current iteration)
        hidden = reduce(hidden, index)

        # Compute final linear layer for token probabilities
        logits = F.linear(hidden, self.weight, bias=self.bias)
        return logits


class NeuronGeneration(PreTrainedModel, GenerationMixin):
    def trace(
        self, model, num_texts, num_beams, max_encoder_length, max_decoder_length
    ):
        """
        Traces the encoder and decoder modules for use on Neuron.

        This function fixes the network to the given sizes. Once the model has been
        compiled to a given size, the inputs to these networks must always be of
        fixed size.

        Args:
            model (GenerationMixin): The transformer-type generator model to trace
            num_texts (int): The number of input texts to translate at once
            num_beams (int): The number of beams to compute per text
            max_encoder_length (int): The maximum number of encoder tokens
            max_decoder_length (int): The maximum number of decoder tokens
        """
        self.config.max_decoder_length = max_decoder_length

        # Trace the encoder
        inputs = (
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
        )
        encoder = NeuronEncoder(model)
        self.encoder = torch.neuron.trace(encoder, inputs)

        # Trace the decoder (with expanded inputs)
        batch_size = num_texts * num_beams
        inputs = (
            torch.ones((batch_size, max_decoder_length), dtype=torch.long),
            torch.ones((batch_size, max_encoder_length), dtype=torch.long),
            torch.ones(
                (batch_size, max_encoder_length, model.config.d_model),
                dtype=torch.float,
            ),
            torch.tensor(0),
        )
        decoder = NeuronDecoder(model, max_decoder_length)
        self.decoder = torch.neuron.trace(decoder, inputs)

    # ------------------------------------------------------------------------
    # Beam Search Methods (Copied directly from transformers)
    # ------------------------------------------------------------------------

    def adjust_logits_during_generation(self, logits, cur_len, max_length):
        if cur_len == 1 and self.config.force_bos_token_to_be_generated:
            self._force_token_id_to_be_generated(logits, self.config.bos_token_id)
        elif cur_len == max_length - 1 and self.config.eos_token_id is not None:
            self._force_token_id_to_be_generated(logits, self.config.eos_token_id)
        return logits

    @staticmethod
    def _force_token_id_to_be_generated(scores, token_id) -> None:
        scores[:, [x for x in range(scores.shape[1]) if x != token_id]] = -float("inf")

    # ------------------------------------------------------------------------
    # Encoder/Decoder Invocation
    # ------------------------------------------------------------------------

    def prepare_inputs_for_generation(
        self,
        decoder_input_ids,
        encoder_outputs=None,
        attention_mask=None,
        **model_kwargs,
    ):
        # Pad the inputs for Neuron
        current_length = decoder_input_ids.shape[1]
        pad_size = self.config.max_decoder_length - current_length
        return dict(
            input_ids=F.pad(decoder_input_ids, (0, pad_size)),
            attention_mask=attention_mask,
            encoder_outputs=encoder_outputs.last_hidden_state,
            current_length=torch.tensor(current_length - 1),
        )

    def get_encoder(self):
        """Helper to invoke the encoder and wrap the results in the expected structure"""

        def encode(input_ids, attention_mask, **kwargs):
            (output,) = self.encoder(input_ids, attention_mask)
            return BaseModelOutput(
                last_hidden_state=output,
            )

        return encode

    def __call__(
        self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs
    ):
        """Helper to invoke the decoder and wrap the results in the expected structure"""
        logits = self.decoder(
            input_ids, attention_mask, encoder_outputs, current_length
        )
        return Seq2SeqLMOutput(logits=logits)

    # ------------------------------------------------------------------------
    # Serialization
    # ------------------------------------------------------------------------

    def save_pretrained(self, directory):
        if os.path.isfile(directory):
            print(f"Provided path ({directory}) should be a directory, not a file")
            return
        os.makedirs(directory, exist_ok=True)
        torch.jit.save(self.encoder, os.path.join(directory, "encoder.pt"))
        torch.jit.save(self.decoder, os.path.join(directory, "decoder.pt"))
        self.config.save_pretrained(directory)

    @classmethod
    def from_pretrained(cls, directory):
        config = T5Config.from_pretrained(directory)
        obj = cls(config)
        obj.encoder = torch.jit.load(os.path.join(directory, "encoder.pt"))
        obj.decoder = torch.jit.load(os.path.join(directory, "decoder.pt"))
        return obj

    @property
    def device(self):
        return torch.device("cpu")


model_cpu = T5Model.from_pretrained(model_id)
tokenizer_cpu = T5Tokenizer.from_pretrained(model_id)
model_neuron = NeuronGeneration(model_cpu.config)

# 1. Compile the model
# Note: This may take a couple of minutes since both the encoder/decoder will be compiled
model_neuron.trace(
    model=model_cpu,
    num_texts=num_texts,
    num_beams=num_beams,
    max_encoder_length=max_encoder_length,
    max_decoder_length=max_decoder_length,
)

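# 2. Serialize an artifact
# After this call you will have an `encoder.pt`, `decoder.pt` and `config.json` in the neuron_name folder
# Note: `neuron_name` (output directory) and `sample_text` (input string) are not defined in the snippet as posted
model_neuron.save_pretrained(neuron_name)
tokenizer_cpu.save_pretrained(neuron_name)

# 3. Reload the compiled artifact and run inference
model_neuron = NeuronGeneration.from_pretrained(neuron_name)
infer(model_neuron, tokenizer_cpu, sample_text)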

To set up the environment:

Create a venv (Python 3.8 seems to work with neuron-cc, 3.9 does not seem to)
Install neuron-cc: pip install neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
Install the rest: pip install torch-neuron sagemaker transformers sentencepiece
Run the code above
The compilation then fails with the following output:

Traceback of the compilation

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Some weights of the model checkpoint at mrm8488/t5-base-finetuned-question-generation-ap were not used when initializing T5Model: ['lm_head.weight']
- This IS expected if you are initializing T5Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
INFO:Neuron:There are 2 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 627, fused = 587, percent fused = 93.62%
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$641; falling back to native python function call
ERROR:Neuron:No module named 'tensorflow'
Traceback (most recent call last):
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 381, in op_converter
    neuron_function = self.subgraph_compiler(
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/decorators.py", line 67, in trace
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 627, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 49 [supported]
INFO:Neuron: => aten::ScalarImplicit: 2 [supported]
INFO:Neuron: => aten::abs: 1 [supported]
INFO:Neuron: => aten::add: 65 [supported]
INFO:Neuron: => aten::arange: 2 [supported]
INFO:Neuron: => aten::contiguous: 12 [supported]
INFO:Neuron: => aten::div: 2 [supported]
INFO:Neuron: => aten::dropout: 50 [supported]
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::full_like: 1 [supported]
INFO:Neuron: => aten::gt: 1 [supported]
INFO:Neuron: => aten::linear: 72 [supported]
INFO:Neuron: => aten::log: 1 [supported]
INFO:Neuron: => aten::lt: 1 [supported]
INFO:Neuron: => aten::matmul: 24 [supported]
INFO:Neuron: => aten::mean: 25 [supported]
INFO:Neuron: => aten::min: 1 [supported]
INFO:Neuron: => aten::mul: 53 [supported]
INFO:Neuron: => aten::permute: 1 [supported]
INFO:Neuron: => aten::pow: 25 [supported]
INFO:Neuron: => aten::relu: 12 [supported]
INFO:Neuron: => aten::rsqrt: 25 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::size: 14 [supported]
INFO:Neuron: => aten::slice: 4 [supported]
INFO:Neuron: => aten::softmax: 12 [supported]
INFO:Neuron: => aten::sub: 1 [supported]
INFO:Neuron: => aten::to: 41 [supported]
INFO:Neuron: => aten::transpose: 60 [supported]
INFO:Neuron: => aten::type_as: 12 [supported]
INFO:Neuron: => aten::unsqueeze: 5 [supported]
INFO:Neuron: => aten::view: 49 [supported]
INFO:Neuron: => aten::where: 1 [not supported]
Traceback (most recent call last):
  File "compile_model.py", line 232, in <module>
    model_neuron.trace(
  File "compile_model.py", line 128, in trace
    self.encoder = torch.neuron.trace(encoder, inputs)
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 184, in trace
    cu.stats_post_compiler(neuron_graph)
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 492, in stats_post_compiler
    raise RuntimeError(
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!

This seems to say that .where is not supported by neuron-cc (and thus Inferentia), which would mean that we cannot run T5 models fast and cheaply…
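To double-check which operators the log actually flags, the operator summary lines can be filtered programmatically. The following is a minimal sketch that assumes the log format shown above (`=> <op>: <count> [<status>]`); the function name `unsupported_ops` and the `sample_log` excerpt are illustrative and not part of torch-neuron.

```python
import re

# Matches operator summary lines such as:
#   INFO:Neuron: => aten::where: 1 [not supported]
OP_LINE = re.compile(r"=> (?P<op>\S+): (?P<count>\d+) \[(?P<status>[^\]]+)\]")


def unsupported_ops(log_text):
    """Return {operator_name: count} for operators flagged '[not supported]'."""
    found = {}
    for line in log_text.splitlines():
        match = OP_LINE.search(line)
        if match and match.group("status") == "not supported":
            found[match.group("op")] = int(match.group("count"))
    return found


# Short excerpt copied from the compilation log above
sample_log = """\
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::linear: 72 [supported]
INFO:Neuron: => aten::where: 1 [not supported]
"""

print(unsupported_ops(sample_log))
```

Note that the same log also contains `ModuleNotFoundError: No module named 'tensorflow'`, and it reports 0 operators compiled even though most of them are marked as supported, so the unsupported operators may not be the only cause of the failure.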

Could we replace the .where call with something that is supported, so that models can be compiled for Inferentia? Otherwise, should I raise the issue with neuron-cc so they can support it?
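For illustration, one common way to express a `where` selection with only multiplications and additions is sketched below. The helper name `where_via_mask` is hypothetical, the rewrite is only valid for finite values (a 0 * inf product would give NaN), and whether neuron-cc compiles this form has not been verified here.

```python
import torch


def where_via_mask(condition, a, b):
    """Arithmetic stand-in for torch.where(condition, a, b).

    Assumes `a` and `b` hold finite values and share a dtype.
    """
    mask = condition.to(a.dtype)  # 1 where the condition is True, 0 elsewhere
    return mask * a + (1 - mask) * b


cond = torch.tensor([True, False, True])
a = torch.tensor([1, 2, 3])
b = torch.tensor([10, 20, 30])

result = where_via_mask(cond, a, b)
print(result.tolist())
```

On these inputs the result matches `torch.where(cond, a, b)`, which makes the helper easy to validate on CPU before attempting a trace.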

Thanks in advance, Have a great day.

Expected behavior

Be able to compile a T5 model for inferentia (.where being supported)

Note: I also created an issue in the aws-neuron-sdk repo: https://github.com/aws/aws-neuron-sdk/issues/440


Top GitHub Comments

Ierezell commented, Jul 8, 2022

Closing: The problem was on my side, thanks anyway for your help and time! 😃

All the “neuron packages” are installable, and wheels for Python 3.8 exist in the repo: https://pip.repos.neuron.amazonaws.com However, the installation leads to the wrong version of TensorFlow (Neuron needs <2, which does not exist anymore).

I cleaned everything, reinstalled with Python 3.7, and the two scripts above ran out of the box.
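Since the traceback in the issue shows torch_neuron importing tensorflow during compilation, a quick import check before tracing can surface this kind of environment problem early. This is a minimal sketch: the helper `missing_modules` is hypothetical, and the module list is an assumption taken from the import paths visible in the traceback.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed module names, taken from the paths and imports shown in the traceback
required = ["torch", "torch_neuron", "tensorflow"]

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print("Missing modules:", missing_modules(required) or "none")
```

This only checks that the modules can be found; it does not check their versions, so a TensorFlow 2.x install would still pass.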

philschmid commented, Jul 7, 2022

@Ierezell could you please try to set up your environment as described here? It feels like you are missing packages needed to successfully compile the models.
