BERT model is returning NaN logits values in output

Description: I'm able to deploy a fine-tuned "bert-base-uncased" model on Triton Inference Server using TensorRT, but during inference I am getting NaN logits values.

Converted the ONNX model to TensorRT using the command below:

trtexec --onnx=model.onnx --saveEngine=model.plan \
  --minShapes=input_ids:1x1,attention_mask:1x1,token_type_ids:1x1 \
  --optShapes=input_ids:16x128,attention_mask:16x128,token_type_ids:16x128 \
  --maxShapes=input_ids:128x128,attention_mask:128x128,token_type_ids:128x128 \
  --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt
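As a quick sanity check that the generated model.plan deserializes and exposes the expected bindings, something like the sketch below may help. It is not part of the original report and assumes the TensorRT 8.x Python binding API shipped in the 22.05 containers.

# Sketch: load the serialized engine and list its I/O bindings, shapes, and dtypes.
# Assumes the TensorRT 8.x Python API (binding calls were renamed in later releases).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    for i in range(engine.num_bindings):
        kind = "input" if engine.binding_is_input(i) else "output"
        print(kind, engine.get_binding_name(i),
              engine.get_binding_shape(i), engine.get_binding_dtype(i))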

Output logs:

logits: [[[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan] ............ [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]]]

Triton Information: Triton Server version 2.22.0, NVIDIA Release 22.05.

Using Triton container: ‘007439368137.dkr.ecr.us-east-2.amazonaws.com/sagemaker-tritonserver:22.05-py3’

To Reproduce

  1. Deploy the TensorRT model on Triton Inference Server.
  2. Send an inference request with the payload below:

text = "Published by HT Digital Content Services with permission from Construction Digital." batch_size = 1 payload = { "inputs": [ { "name": "TEXT", "shape": (batch_size,), "datatype": "BYTES", "data": [text], } ] }

Preprocessed the input text, got the input_ids and attention_mask from the tokenizer, then sent the input below to the model.
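For reference, the preprocessing step is roughly the following. This is a sketch assuming a Hugging Face tokenizer whose checkpoint matches the fine-tuned model; it is not taken from the original report.

# Sketch of the preprocessing step (assumes a Hugging Face tokenizer matching the fine-tuned model).
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(text, return_tensors="np")

# The TensorRT bindings here expect INT32 tensors.
model_input = {name: arr.astype(np.int32) for name, arr in enc.items()}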

Model input:

{'input_ids': array([[  101, 12414, 10151, 87651, 10764, 18491, 12238, 10171, 48822, 10195, 13154, 10764,   119,   102]], dtype=int32),
 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32),
 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)}

Then you will see that the model produces NaN logits values.
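A minimal sketch of sending these tensors directly with the Triton Python HTTP client and checking the output for NaNs might look like this. The model name "bert_trt" and output name "logits" are placeholders, not names from the actual repository; use the names from the deployed config.pbtxt.

# Sketch: query the deployed model directly and test the output for NaNs.
# "bert_trt" and "logits" are placeholder names; adjust to the real model config.
import numpy as np
import tritonclient.http as httpclient

model_input = {
    "input_ids": np.array([[101, 12414, 10151, 87651, 10764, 18491, 12238,
                            10171, 48822, 10195, 13154, 10764, 119, 102]], dtype=np.int32),
    "token_type_ids": np.zeros((1, 14), dtype=np.int32),
    "attention_mask": np.ones((1, 14), dtype=np.int32),
}

client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = []
for name, arr in model_input.items():
    inp = httpclient.InferInput(name, list(arr.shape), "INT32")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer("bert_trt", inputs,
                      outputs=[httpclient.InferRequestedOutput("logits")])
logits = result.as_numpy("logits")
print("any NaN in logits:", bool(np.isnan(logits).any()))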

Please find all the deployment files on Google Drive: https://drive.google.com/file/d/1uteEOgnSLwtfTonJtgukKjnDwycezFg3/view?usp=sharing

Expected behavior: I expect valid logits values from the BERT model instead of NaN.

Please help me with this issue. Thanks.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
rmccorm4 commented, Sep 9, 2022

It will take some time for me to run the model on Polygraphy because I didn’t use that earlier.

@Vinayaks117 Hopefully something like this will get you started (assuming you're in the same directory as your ONNX model; otherwise change the mount paths):

# Start latest TRT container that comes with polygraphy installed
docker run -ti --gpus all -v ${PWD}:/mnt -w /mnt nvcr.io/nvidia/tensorrt:22.08-py3

# Let polygraphy install dependencies as needed (onnxruntime, etc)
export POLYGRAPHY_AUTOINSTALL_DEPS=1

# Run the model with both onnxruntime and tensorrt, then compare the outputs
polygraphy run --validate --onnxrt --trt model.onnx

# For more details, config options, dynamic shape settings, etc.
polygraphy -h

# For example, to validate whether or not your TRT model is returning NaNs, you might try
polygraphy run --trt <trt plan or onnx file> --validate
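If the CLI is not convenient, roughly the same ONNX-Runtime vs. TensorRT comparison can be sketched with Polygraphy's Python API. This sketch is illustrative and was not part of the original comment; the shape profile mirrors the trtexec command from the report.

# Sketch: compare ONNX-Runtime and TensorRT (FP16) outputs with Polygraphy's Python API.
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, Profile, TrtRunner)
from polygraphy.comparator import Comparator

# Optimization profile mirroring the trtexec shapes above.
profile = Profile()
for name in ["input_ids", "attention_mask", "token_type_ids"]:
    profile.add(name, min=(1, 1), opt=(16, 128), max=(128, 128))

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("model.onnx"),
    config=CreateConfig(fp16=True, profiles=[profile]),
)

runners = [OnnxrtRunner(SessionFromOnnx("model.onnx")), TrtRunner(build_engine)]
run_results = Comparator.run(runners)

# Fails if the outputs diverge beyond default tolerances (NaNs will fail the check).
assert bool(Comparator.compare_accuracy(run_results))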

0 reactions
dyastremsky commented, Sep 30, 2022

Closing due to inactivity. Please let us know if you'd like to follow up, and we can reopen the issue.
