Problems with MTEncDecModel model in ONNX
There are two issues I am having when exporting MTEncDecModel to ONNX. They might be related in some way, so I am posting them together.
1. ONNX error in embedding in NeMo versions > 1.2:
If I try to run the example below in NeMo 1.5 I get the following error:
```
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION :
Non-zero status code returned while running Reshape node. Name:'Reshape_67' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:41
onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, std::vector<long int>&, bool)
gsl::narrow_cast<int64_t>(input_shape.Size()) == size was false.
The input tensor cannot be reshaped to the requested shape. Input shape:{1,44,1024}, requested shape:{2,16,16,64}
```
However, this error does not seem to be present when using an older version of NeMo.
2. Big performance discrepancy when using ONNX version of the model:
I notice a big performance drop when exporting the MTEncDecModel to ONNX. I am trying a minimal setup with greedy search and get about half the BLEU score compared to using MTEncDecModel directly with TopKSequenceGenerator(top_k=1).
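For reference, a minimal sketch of the kind of PyTorch-side baseline I mean (illustrative only, not my actual script; the beam_size attribute is an assumption about the built-in generator, and a beam of 1 behaves like top_k=1 greedy search):

```python
from nemo.collections.nlp.models import MTEncDecModel

# Hypothetical greedy baseline: load the pretrained model and shrink the built-in
# generator to a single hypothesis, similar to TopKSequenceGenerator(top_k=1).
model = MTEncDecModel.from_pretrained('nmt_en_de_transformer12x2')
model.beam_search.beam_size = 1  # assumption: the generator stores its beam width here
pytorch_translation = model.translate(['They are not even 100 metres apart.'])
print(pytorch_translation)
```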
Here is a full script to get the translation with ONNX; you would additionally need to install onnxruntime-gpu:
```python
import numpy as np
import onnxruntime
import torch

from nemo.collections.nlp.models import MTEncDecModel

# Load the model from pretrained:
model = MTEncDecModel.from_pretrained('nmt_en_de_transformer12x2')

# Export all model components to ONNX:
model.encoder.export('encoder.onnx')
model.decoder.export('decoder.onnx')
model.log_softmax.export('classifier.onnx')

# Initialise all the ONNX sessions:
encoder_session = onnxruntime.InferenceSession('encoder.onnx', providers=['CUDAExecutionProvider'])
decoder_session = onnxruntime.InferenceSession('decoder.onnx', providers=['CUDAExecutionProvider'])
classifier_session = onnxruntime.InferenceSession('classifier.onnx', providers=['CUDAExecutionProvider'])

# Preprocess the data using the original NeMo model for simplicity:
TEXT = ['They are not even 100 metres apart: On Tuesday, the new B 33 pedestrian lights in Dorfparkplatz in Gutach became operational - within view of the existing Town Hall traffic lights.']
src_ids, src_mask = model.prepare_inference_batch(TEXT)
src_ids = src_ids.cpu().numpy()                      # convert to numpy for use with ONNX
src_mask = src_mask.cpu().numpy().astype(np.int64)

# Compute the encoder hidden states:
encoder_input = {'input_ids': src_ids, 'encoder_mask': src_mask}
encoder_hidden_state = encoder_session.run(['last_hidden_states'], encoder_input)[0]

# Simple greedy search:
MAX_GENERATION_DELTA = 5
BOS = model.encoder_tokenizer.bos_id
EOS = model.encoder_tokenizer.eos_id
PAD = model.encoder_tokenizer.pad_id


def decode(tgt: np.ndarray, embedding: np.ndarray, src_mask: np.ndarray) -> np.ndarray:
    """Run one decoder + classifier pass over the target prefix generated so far."""
    decoder_input = {
        'input_ids': tgt,
        'decoder_mask': (tgt != PAD).astype(np.int64),
        'encoder_mask': src_mask,
        'encoder_embeddings': embedding,
    }
    decoder_hidden_state = decoder_session.run(['last_hidden_states'], decoder_input)[0]
    log_probs = classifier_session.run(['log_probs'], {'hidden_states': decoder_hidden_state})[0]
    return log_probs


max_out_len = encoder_hidden_state.shape[1] + MAX_GENERATION_DELTA
tgt = np.full(shape=(encoder_hidden_state.shape[0], max_out_len), fill_value=PAD)
tgt[:, 0] = BOS
for i in range(1, max_out_len):
    log_probs = decode(tgt[:, :i], encoder_hidden_state, src_mask)
    # NOTE: the ONNX decoder returns hidden states for every position, which is
    # different to the PyTorch version, so I take the last one
    # (this could be where the error is?)
    next_tokens = log_probs[:, -1].argmax(axis=1)
    tgt[:, i] = next_tokens
    # Stop once every sentence in the batch has produced an EOS token:
    if ((tgt == EOS).sum(axis=1) > 0).all():
        break

tgt_torch = torch.from_numpy(tgt).to('cuda:0')
onnx_translation = model.ids_to_postprocessed_text(tgt_torch, model.decoder_tokenizer, model.target_processor)
```
I have run the above against the newstest2014 set and got a BLEU score of 13 vs 29 when using the MTEncDecModel.translate method. It seems like the ONNX model works well for shorter sentences, but for longer ones it cuts off too soon. Could this be because the PyTorch model uses decoder_mems and the ONNX one doesn't?
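To make the comparison concrete, here is a sketch of the per-sentence sanity check I would run (illustrative, not part of my original script; it reuses TEXT, onnx_translation, and model from above):

```python
# Compare the ONNX greedy output against the built-in PyTorch generator sentence by
# sentence, to check whether the quality gap really grows with source length.
pytorch_translation = model.translate(TEXT)
for src, onnx_out, torch_out in zip(TEXT, onnx_translation, pytorch_translation):
    print(f'src words: {len(src.split())}')
    print(f'  onnx : {onnx_out}')
    print(f'  torch: {torch_out}')
```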
Is there maybe a better way to set up ONNX inference?
Environment overview (please complete the following information)
- Environment Location: Docker on Kubernetes
- Method of NeMo install: Helm chart using the nvcr.io/nvidia/nemo:1.2.0 / 1.5.1 containers
- Additional packages: pip install onnxruntime-gpu==1.10.0
Oh I see, @Vlados09, the issue you're facing is probably linked to a known ONNX export issue with PyTorch.
We recently pushed a workaround to handle it in #3422. Would it be possible for you to install NeMo from source using the main branch? This should most likely fix the error.
Also, we do not recommend using older NeMo versions for exporting NMT models, as there were quite a few critical export issues we've fixed in recent versions, including the need to provide decoder_mems.

What is the value of bsize, num_decoder_attention_layers, seqlen, embedding_dim for the first decode iteration?
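For anyone trying to answer this, a sketch of how those values could be read off using the names from the script above (the attribute path used for the layer count is an assumption about how MTEncDecModel nests its decoder; adjust to your NeMo version):

```python
# Print the quantities asked about, at the first decode step.
bsize, seqlen, embedding_dim = encoder_hidden_state.shape
print('bsize:', bsize)
print('seqlen (encoder):', seqlen)
print('embedding_dim:', embedding_dim)
print('tgt shape at first decode step:', tgt[:, :1].shape)
# Assumed attribute path to the stack of decoder layers:
print('num_decoder_attention_layers:', len(model.decoder.decoder.layers))
```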