Possibility to speed up inference of onnx models with transformers.pipeline
Problem Description
Using the model "Helsinki-NLP/opus-mt-es-en", I profiled where time is spent when running inference with the ONNX model versus the PyTorch model, and found that the main difference comes from one small step in beam search: scores = scores.masked_fill(banned_mask, -float("inf")). With the PyTorch model, this line takes only 0.10 ms per call, while with the ONNX model it takes close to 10 ms. My guess is that in the ONNX path each call pays some PyTorch initialization overhead. If this overhead could be reduced, inference with the ONNX model would become significantly more efficient.
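For reference, here is a minimal, self-contained sketch (not from the issue) for timing the masked_fill call in isolation; the vocabulary size, beam count, and banned-token set are assumptions chosen only for illustration, not values reported above.

import time
import torch

vocab_size = 65001           # assumption: roughly the opus-mt vocab size
num_beams = 4                # assumption
scores = torch.randn(num_beams, vocab_size)
banned_mask = torch.zeros(num_beams, vocab_size, dtype=torch.bool)
banned_mask[:, :100] = True  # pretend the first 100 token ids are banned

n_iters = 1000
start = time.perf_counter()
for _ in range(n_iters):
    masked = scores.masked_fill(banned_mask, -float("inf"))
per_call_ms = (time.perf_counter() - start) * 1000 / n_iters
print(f"masked_fill: {per_call_ms:.3f} ms per call")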
The following code also shows a significant timing difference between the two inference paths:
# https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/generation_logits_process.py
# Build a (1, vocab_size) boolean mask that is True at the ids of length-1 bad words,
# then move it to the same device as the scores.
static_bad_words_mask = torch.zeros(scores.shape[1])
static_bad_words_mask[self.bad_words_id_length_1] = 1
return static_bad_words_mask.unsqueeze(0).to(scores.device).bool()
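Since the issue asks whether "some measures can be taken to reduce the time here", one direction is to avoid rebuilding and transferring this mask at every scoring step. The sketch below is purely hypothetical (CachedBadWordsMask is my own name, not part of transformers): it constructs the boolean mask once and only applies masked_fill per step. Whether this actually closes the gap reported above would need to be confirmed by profiling.

import torch

class CachedBadWordsMask:
    # Hypothetical helper (not part of transformers): build the length-1 bad-words
    # mask once and reuse it at every decoding step instead of rebuilding it.
    def __init__(self, bad_words_id_length_1, vocab_size, device="cpu"):
        mask = torch.zeros(vocab_size)
        mask[bad_words_id_length_1] = 1
        self._mask = mask.unsqueeze(0).bool().to(device)  # shape (1, vocab_size)

    def apply(self, scores):
        # The mask broadcasts over the beam dimension; only masked_fill runs per step.
        return scores.masked_fill(self._mask, -float("inf"))

# usage sketch
masker = CachedBadWordsMask(bad_words_id_length_1=[3, 17], vocab_size=65001)
scores = torch.randn(4, 65001)
scores = masker.apply(scores)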
Model loading and inference
- pytorch model: model = AutoModelForSeq2SeqLM.from_pretrained("./bin_model"); result = model.generate(**model_inputs)
- onnx model: model = ORTModelForSeq2SeqLM.from_pretrained("./onnx_model", from_transformers=False); onnx_translation = pipeline("translation_es_to_en", model=model, tokenizer=tokenizer); result = onnx_translation(inputs) (a fuller sketch of both paths follows below)
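The sketch below expands the two loading paths above into runnable form. The local paths "./bin_model" and "./onnx_model" are taken from the issue as placeholders, the tokenizer checkpoint is assumed to be the same Helsinki-NLP/opus-mt-es-en model, and the input sentence is a hypothetical example.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
inputs = "Hola, ¿cómo estás?"  # hypothetical example sentence

# PyTorch path
pt_model = AutoModelForSeq2SeqLM.from_pretrained("./bin_model")  # local path from the issue
model_inputs = tokenizer(inputs, return_tensors="pt")
pt_result = pt_model.generate(**model_inputs)
print(tokenizer.batch_decode(pt_result, skip_special_tokens=True))

# ONNX path (optimum + transformers.pipeline)
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("./onnx_model", from_transformers=False)
onnx_translation = pipeline("translation_es_to_en", model=onnx_model, tokenizer=tokenizer)
print(onnx_translation(inputs))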
Machine configuration
36-core CPU, no GPU
Issue Analytics
- Created 10 months ago
- Reactions: 1
- Comments: 10 (7 by maintainers)
Top GitHub Comments
Sorry, it’s not on an AWS EC2 instance, but on my own machine, so I can’t provide more information.
Awesome thanks! Is it on an AWS EC2 instance? If so could you give me the name so that I can reproduce there?