
[BUG] Triton Server with Kaldi Backend does not return final response to client.

See original GitHub issue

Description: Triton Server with the Kaldi backend does not return the final response (the one containing the lattice) to the gRPC client, even though the Kaldi backend does return it to Triton Server; or it returns it only after a very long time.

Triton Information: Triton Server r21.05 (v2.10.0), Kaldi backend r21.08

The same issue is present both in the version extracted from the Docker image and in the built-from-source version of Triton Server + Kaldi backend.

To Reproduce

  1. Create a batch of 300 utterances, each ~300 seconds long.
  2. Run it a 1st time: everything is OK, ~3:40 execution time.
  3. Run it a 2nd time: still OK, ~3:40 execution time.
  4. Run it a 3rd time: the gRPC client receives partial responses but does not receive the final (lattice) response even after more than 30 minutes. The log shows many messages about the WRITEREADY state for several corr_id values. (A minimal client sketch illustrating this send/receive pattern follows this list.)
Configuration files: config.pbtxt

name: "kaldi_online"
backend: "kaldi"
default_model_filename: "libkaldi-trtisbackend.so"
max_batch_size: 800
model_transaction_policy {
  decoupled: true
}
parameters: {
  key: "config_filename"
  value {
    string_value: "<repo_path>/models/kaldi_online/1/conf/config.conf"
  }
}
parameters: {
  key: "ivector_filename"
  value {
    string_value: ""
  }
}
parameters: {
  key: "nnet3_rxfilename"
  value {
    string_value: "<repo_path>/models/kaldi_online/1/final.mdl"
  }
}
parameters: {
  key: "fst_rxfilename"
  value {
    string_value: "<repo_path>/models/kaldi_online/1/HCLG.fst"
  }
}
parameters: {
  key: "word_syms_rxfilename"
  value {
    string_value: "<repo_path>/models/kaldi_online/1/words.txt"
  }
}
parameters: {
  key: "lattice_postprocessor_rxfilename"
  value {
    string_value: ""
  }
}
parameters: {
  key: "use_tensor_cores"
  value {
    string_value: "1"
  }
}
parameters: {
  key: "main_q_capacity"
  value {
    string_value: "30000"
  }
}
parameters: {
  key: "aux_q_capacity"
  value {
    string_value: "400000"
  }
}
parameters: [
  {
    key: "acoustic_scale"
    value {
      string_value: "1.0"
    }
  },
  {
    key: "frame_subsampling_factor"
    value {
      string_value: "3"
    }
  },
  {
    key: "max_active"
    value {
      string_value: "10000"
    }
  },
  {
    key: "lattice_beam"
    value {
      string_value: "7"
    }
  },
  {
    key: "beam"
    value {
      string_value: "10.0"
    }
  },
  {
    key: "num_worker_threads"
    value {
      string_value: "40"
    }
  },
  {
    key: "num_channels"
    value {
      string_value: "4000"
    }
  },
  {
    key: "max_execution_batch_size"
    value {
      string_value: "400"
    }
  }
]
sequence_batching {
  max_sequence_idle_microseconds: 1000000000
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          int32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          int32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "END"
      control [
        {
          kind: CONTROL_SEQUENCE_END
          int32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "CORRID"
      control [
        {
          kind: CONTROL_SEQUENCE_CORRID
          data_type: TYPE_UINT64
        }
      ]
    }
  ]
  oldest {
    max_candidate_sequences: 2200
    preferred_batch_size: [ 400 ]
    max_queue_delay_microseconds: 1000
  }
}

input [
  {
    name: "WAV_DATA"
    data_type: TYPE_FP32
    dims: [ 8160 ]
  },
  {
    name: "WAV_DATA_DIM"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "RAW_LATTICE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "CTM"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
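Note that model_transaction_policy { decoupled: true } means a single request may produce zero or more responses, which the server can only deliver over a gRPC stream; plain unary inference never sees the intermediate results. A quick way to confirm what the server actually loaded is a sketch like the following, assuming the tritonclient Python package and the default gRPC port:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
config = client.get_model_config("kaldi_online")

# The transaction policy should report decoupled: true, and the oldest-first
# sequence batcher should reflect the values from the config above.
print(config.config.model_transaction_policy)
print(config.config.sequence_batching.oldest)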

config.conf

--print-args=true
--feature-type=mfcc
--mfcc-config=<path_to_repo>/models/kaldi_online/1/conf/mfcc.conf
--minimize=false

mfcc.conf

# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--print-args=true
--use-energy=false   # use average of log energy, not energy.
--sample-frequency=8000 #  Switchboard is sampled at 8kHz
--num-mel-bins=40     # similar to Google's setup.
--num-ceps=40     # there is no dimensionality reduction.
--low-freq=40    # low cutoff frequency for mel bins
--high-freq=-200 # high cutoff frequency, relative to Nyquist of 4000 (=3800)
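For reference, WAV_DATA chunks of 8160 samples at the 8 kHz sample rate set here correspond to just over one second of audio, so each ~300-second utterance in the reproduction is streamed as roughly 295 chunks. A quick check, using only values from the configs above:

samples_per_chunk = 8160            # WAV_DATA dims in config.pbtxt
sample_rate = 8000                  # --sample-frequency in mfcc.conf

chunk_seconds = samples_per_chunk / sample_rate    # 1.02 s per chunk
chunks_per_utterance = 300 / chunk_seconds         # ~294 chunks per 300 s utterance
total_chunks = 300 * chunks_per_utterance          # ~88,000 chunks for the full batch
print(chunk_seconds, chunks_per_utterance, total_chunks)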

Expected behavior: The final response is returned in a much shorter time, as in the first and second iterations (the Kaldi backend does return it to Triton Server).

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
konstantin-sancom commented, Dec 17, 2021

Quoting an earlier maintainer comment: “I think having the model can help us reproduce the bug more easily. It would be great if you can share any model that produces this bug. Thank you.”

Hi @Tabrizian, we reproduced this issue on an open-source model: https://alphacephei.com/vosk/models/vosk-model-ru-0.22.zip

Please find the configs in the attachment: triton_vosk.tar.gz

We got “freezing” with 50 utterances of about 300 seconds each. To reproduce, run several iterations on the same batch of utterances.
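A sketch of that reproduction loop, reusing the hypothetical transcribe() helper from the sketch near the top of this issue. The file names and the sequential (rather than concurrent) sending are our simplifications; the reporter streams the whole batch at once:

import numpy as np

# Hypothetical file names; the real test data is the reporter's own.
utterances = [np.fromfile(f"utt_{i:02d}.f32", dtype=np.float32) for i in range(50)]

for iteration in range(4):                  # the freeze was observed on the 3rd pass
    base = iteration * len(utterances)      # keep sequence IDs unique across passes
    for n, audio in enumerate(utterances, start=1):
        print(iteration + 1, n, transcribe(audio, seq_id=base + n)[:60])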

0 reactions
Tabrizian commented, Dec 20, 2021

Thanks for providing the model. We’ll look into this.
