
Cannot compile T5 for inferentia


System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.34
  • Python version: 3.8.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.11.0+cu102 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The Inferentia compilation example for Marian (Seq2Seq) is here.

Here is the snippet of code I'm using:

Python Snippet for compiling T5 model
import os

import numpy as np
import torch
import torch.neuron
from torch.nn import functional as F
from transformers.generation_utils import GenerationMixin
from transformers.modeling_outputs import BaseModelOutput, Seq2SeqLMOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.models.t5.configuration_t5 import T5Config
from transformers.models.t5.modeling_t5 import T5Model
from transformers.models.t5.tokenization_t5 import T5Tokenizer

model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
num_texts = 1  # Number of input texts to decode
num_beams = 4  # Number of beams per input text
max_encoder_length = 32  # Maximum input token length
max_decoder_length = 32  # Maximum output token length
neuron_name = "t5_neuron"  # Output directory for the compiled artifacts (example value)
sample_text = "answer: Paris  context: Paris is the capital of France."  # Example input (placeholder)


def infer(model, tokenizer, text):

    # Truncate and pad the max length to ensure that the token size is compatible with fixed-sized encoder (Not necessary for pure CPU execution)
    batch = tokenizer(
        text,
        max_length=max_encoder_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    output = model.generate(
        **batch,
        max_length=max_decoder_length,
        num_beams=num_beams,
        num_return_sequences=num_beams,
    )
    results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]

    print("Texts:")
    for i, summary in enumerate(results):
        print(i + 1, summary)


def reduce(hidden, index):
    _, n_length, _ = hidden.shape

    # Create selection mask
    mask = torch.arange(n_length, dtype=torch.float32) == index
    mask = mask.view(1, -1, 1)

    # Broadcast mask
    masked = torch.multiply(hidden, mask)

    # Reduce along 1st dimension
    summed = torch.sum(masked, 1)
    return torch.unsqueeze(summed, 1)


class NeuronEncoder(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.encoder = model.encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)


class NeuronDecoder(torch.nn.Module):
    def __init__(self, model, max_length):
        super().__init__()
        self.weight = model.shared.weight.clone().detach()
        self.bias = model.final_logits_bias.clone().detach()
        self.decoder = model.decoder
        self.max_length = max_length

    def forward(self, input_ids, attention_mask, encoder_outputs, index):

        # Build a fixed sized causal mask for the padded decoder input ids
        mask = np.triu(np.ones((self.max_length, self.max_length)), 1)
        mask[mask == 1] = -np.inf
        causal_mask = torch.tensor(mask, dtype=torch.float)

        # Invoke the decoder
        (hidden,) = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_outputs,
            encoder_padding_mask=attention_mask,
            decoder_padding_mask=None,
            decoder_causal_mask=causal_mask,
            return_dict=False,
            use_cache=False,
        )

        # Reduce decoder outputs to the specified index (current iteration)
        hidden = reduce(hidden, index)

        # Compute final linear layer for token probabilities
        logits = F.linear(hidden, self.weight, bias=self.bias)
        return logits


class NeuronGeneration(PreTrainedModel, GenerationMixin):
    def trace(
        self, model, num_texts, num_beams, max_encoder_length, max_decoder_length
    ):
        """
        Traces the encoder and decoder modules for use on Neuron.

        This function fixes the network to the given sizes. Once the model has been
        compiled to a given size, the inputs to these networks must always be of
        fixed size.

        Args:
            model (GenerationMixin): The transformer-type generator model to trace
            num_texts (int): The number of input texts to translate at once
            num_beams (int): The number of beams to compute per text
            max_encoder_length (int): The maximum number of encoder tokens
            max_decoder_length (int): The maximum number of decoder tokens
        """
        self.config.max_decoder_length = max_decoder_length

        # Trace the encoder
        inputs = (
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
            torch.ones((num_texts, max_encoder_length), dtype=torch.long),
        )
        encoder = NeuronEncoder(model)
        self.encoder = torch.neuron.trace(encoder, inputs)

        # Trace the decoder (with expanded inputs)
        batch_size = num_texts * num_beams
        inputs = (
            torch.ones((batch_size, max_decoder_length), dtype=torch.long),
            torch.ones((batch_size, max_encoder_length), dtype=torch.long),
            torch.ones(
                (batch_size, max_encoder_length, model.config.d_model),
                dtype=torch.float,
            ),
            torch.tensor(0),
        )
        decoder = NeuronDecoder(model, max_decoder_length)
        self.decoder = torch.neuron.trace(decoder, inputs)

    # ------------------------------------------------------------------------
    # Beam Search Methods (Copied directly from transformers)
    # ------------------------------------------------------------------------

    def adjust_logits_during_generation(self, logits, cur_len, max_length):
        if cur_len == 1 and self.config.force_bos_token_to_be_generated:
            self._force_token_id_to_be_generated(logits, self.config.bos_token_id)
        elif cur_len == max_length - 1 and self.config.eos_token_id is not None:
            self._force_token_id_to_be_generated(logits, self.config.eos_token_id)
        return logits

    @staticmethod
    def _force_token_id_to_be_generated(scores, token_id) -> None:
        scores[:, [x for x in range(scores.shape[1]) if x != token_id]] = -float("inf")

    # ------------------------------------------------------------------------
    # Encoder/Decoder Invocation
    # ------------------------------------------------------------------------

    def prepare_inputs_for_generation(
        self,
        decoder_input_ids,
        encoder_outputs=None,
        attention_mask=None,
        **model_kwargs,
    ):
        # Pad the inputs for Neuron
        current_length = decoder_input_ids.shape[1]
        pad_size = self.config.max_decoder_length - current_length
        return dict(
            input_ids=F.pad(decoder_input_ids, (0, pad_size)),
            attention_mask=attention_mask,
            encoder_outputs=encoder_outputs.last_hidden_state,
            current_length=torch.tensor(current_length - 1),
        )

    def get_encoder(self):
        """Helper to invoke the encoder and wrap the results in the expected structure"""

        def encode(input_ids, attention_mask, **kwargs):
            (output,) = self.encoder(input_ids, attention_mask)
            return BaseModelOutput(
                last_hidden_state=output,
            )

        return encode

    def __call__(
        self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs
    ):
        """Helper to invoke the decoder and wrap the results in the expected structure"""
        logits = self.decoder(
            input_ids, attention_mask, encoder_outputs, current_length
        )
        return Seq2SeqLMOutput(logits=logits)

    # ------------------------------------------------------------------------
    # Serialization
    # ------------------------------------------------------------------------

    def save_pretrained(self, directory):
        if os.path.isfile(directory):
            print(f"Provided path ({directory}) should be a directory, not a file")
            return
        os.makedirs(directory, exist_ok=True)
        torch.jit.save(self.encoder, os.path.join(directory, "encoder.pt"))
        torch.jit.save(self.decoder, os.path.join(directory, "decoder.pt"))
        self.config.save_pretrained(directory)

    @classmethod
    def from_pretrained(cls, directory):
        config = T5Config.from_pretrained(directory)
        obj = cls(config)
        obj.encoder = torch.jit.load(os.path.join(directory, "encoder.pt"))
        obj.decoder = torch.jit.load(os.path.join(directory, "decoder.pt"))
        return obj

    @property
    def device(self):
        return torch.device("cpu")


model_cpu = T5Model.from_pretrained(model_id)
tokenizer_cpu = T5Tokenizer.from_pretrained(model_id)
model_neuron = NeuronGeneration(model_cpu.config)

# 1. Compile the model
# Note: This may take a couple of minutes since both the encoder/decoder will be compiled
model_neuron.trace(
    model=model_cpu,
    num_texts=num_texts,
    num_beams=num_beams,
    max_encoder_length=max_encoder_length,
    max_decoder_length=max_decoder_length,
)

# 2. Serialize an artifact
# After this call you will have an `encoder.pt`, `decoder.pt` and `config.json` in the neuron_name folder
model_neuron.save_pretrained(neuron_name)
tokenizer_cpu.save_pretrained(neuron_name)

model_neuron = NeuronGeneration.from_pretrained(neuron_name)
infer(model_neuron, tokenizer_cpu, sample_text)

To set up the environment:

  1. Create a venv (Python 3.8 seems to work for neuron-cc, 3.9 does not)
  2. Install neuron-cc: pip install neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
  3. Install the rest: pip install torch-neuron sagemaker transformers sentencepiece
  4. Run the code above
  5. Get the following:
Traceback of the compilation
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Some weights of the model checkpoint at mrm8488/t5-base-finetuned-question-generation-ap were not used when initializing T5Model: ['lm_head.weight']
- This IS expected if you are initializing T5Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
INFO:Neuron:There are 2 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 627, fused = 587, percent fused = 93.62%
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$641; falling back to native python function call
ERROR:Neuron:No module named 'tensorflow'
Traceback (most recent call last):
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 381, in op_converter
    neuron_function = self.subgraph_compiler(
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/decorators.py", line 67, in trace
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 627, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 49 [supported]
INFO:Neuron: => aten::ScalarImplicit: 2 [supported]
INFO:Neuron: => aten::abs: 1 [supported]
INFO:Neuron: => aten::add: 65 [supported]
INFO:Neuron: => aten::arange: 2 [supported]
INFO:Neuron: => aten::contiguous: 12 [supported]
INFO:Neuron: => aten::div: 2 [supported]
INFO:Neuron: => aten::dropout: 50 [supported]
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::full_like: 1 [supported]
INFO:Neuron: => aten::gt: 1 [supported]
INFO:Neuron: => aten::linear: 72 [supported]
INFO:Neuron: => aten::log: 1 [supported]
INFO:Neuron: => aten::lt: 1 [supported]
INFO:Neuron: => aten::matmul: 24 [supported]
INFO:Neuron: => aten::mean: 25 [supported]
INFO:Neuron: => aten::min: 1 [supported]
INFO:Neuron: => aten::mul: 53 [supported]
INFO:Neuron: => aten::permute: 1 [supported]
INFO:Neuron: => aten::pow: 25 [supported]
INFO:Neuron: => aten::relu: 12 [supported]
INFO:Neuron: => aten::rsqrt: 25 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::size: 14 [supported]
INFO:Neuron: => aten::slice: 4 [supported]
INFO:Neuron: => aten::softmax: 12 [supported]
INFO:Neuron: => aten::sub: 1 [supported]
INFO:Neuron: => aten::to: 41 [supported]
INFO:Neuron: => aten::transpose: 60 [supported]
INFO:Neuron: => aten::type_as: 12 [supported]
INFO:Neuron: => aten::unsqueeze: 5 [supported]
INFO:Neuron: => aten::view: 49 [supported]
INFO:Neuron: => aten::where: 1 [not supported]
Traceback (most recent call last):
  File "compile_model.py", line 232, in <module>
    model_neuron.trace(
  File "compile_model.py", line 128, in trace
    self.encoder = torch.neuron.trace(encoder, inputs)
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 184, in trace
    cu.stats_post_compiler(neuron_graph)
  File "/mnt/Documents/test/sagemaker_test/.venv/lib/python3.8/site-packages/torch_neuron/convert.py", line 492, in stats_post_compiler
    raise RuntimeError(
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!

This seems to say that .where is not supported by neuron-cc (and thus Inferentia), which means that we cannot run T5 models fast and cheap…

Could we change the .where to something that is supported, so that we can compile these models for Inferentia? Or should I raise the issue with neuron-cc so they can support it?
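For illustration, here is the kind of rewrite the first question refers to. Judging from the other single-count ops in the log (aten::abs, aten::log, aten::min, aten::full_like), the lone aten::where most likely comes from T5's relative position bucketing in modeling_t5.py. A torch.where(cond, a, b) selection can often be expressed with arithmetic on the mask, using only operators the log reports as supported (aten::mul, aten::add, aten::sub). The helper where_via_mask below is hypothetical and not part of transformers; this is a sketch of the idea, not a proposed patch:

import torch


def where_via_mask(cond: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Same result as torch.where(cond, a, b): keep `a` where cond is True, `b` elsewhere.
    m = cond.to(a.dtype)
    return m * a + (1 - m) * b


# Toy example mimicking an integer bucket selection
is_small = torch.tensor([True, False, True, False])
small_bucket = torch.tensor([1, 2, 3, 4])
large_bucket = torch.tensor([10, 20, 30, 40])
print(where_via_mask(is_small, small_bucket, large_bucket))  # tensor([ 1, 20,  3, 40])
print(torch.where(is_small, small_bucket, large_bucket))     # identical result

One caveat: the arithmetic trick breaks if a branch contains inf (0 * inf is NaN), so masks built with -inf would need a large finite constant instead. Whether such a change is acceptable in the modeling code is exactly what the question above asks.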

Thanks in advance, Have a great day.

Expected behavior

Be able to compile a T5 model for Inferentia (with .where being supported)

Note: I also created an issue in the aws-neuron-sdk repo: https://github.com/aws/aws-neuron-sdk/issues/440

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ierezell commented, Jul 8, 2022

Closing: The problem was on my side, thanks anyway for your help and time! 😃

All the “neuron packages” are installable, and wheels for 3.8 exist in the repo https://pip.repos.neuron.amazonaws.com. However, the installation pulls in the wrong version of TensorFlow (neuron needs <2), which does not exist anymore.

I cleaned everything, reinstalled with Python 3.7, and the two scripts above ran out of the box.

1 reaction
philschmid commented, Jul 7, 2022

@Ierezell could you please try to set up your environment as described here. It feels like you are missing the packages needed to successfully compile the models.
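
For reference, a quick way to check the environment from Python before compiling; a minimal sketch, assuming the package names used in the setup steps above and the "TensorFlow < 2" constraint mentioned in the closing comment (the authoritative requirements are in the Neuron SDK docs):

import importlib.metadata as md  # available since Python 3.8

# Packages the compilation script relies on (per the setup steps above)
for pkg in ("neuron-cc", "torch-neuron", "torch", "transformers", "sentencepiece"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")

# torch.neuron's tracing imports tensorflow (see the traceback above);
# a missing or incompatible install reproduces those errors.
try:
    import tensorflow as tf
    print("tensorflow:", tf.__version__)
    if int(tf.__version__.split(".")[0]) >= 2:
        print("warning: the closing comment above reports neuron needs TensorFlow < 2")
except ImportError:
    print("tensorflow: NOT INSTALLED")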
