[BUG] Triton Server with Kaldi Backend does not return final response to client.
Description
Triton Server with the Kaldi backend does not return the final response (the response containing the lattice), which IS returned by the Kaldi backend, to the gRPC client, or returns it only after a very long time.
Triton Information
Triton Server r21.05 (v2.10.0)
Kaldi Backend r21.08
The same issue is present both in the version extracted from the Docker image and in the version built from source (Triton Server + Kaldi Backend).
To Reproduce
- Create a batch of 300 utterances, each ~300 seconds long.
- Run the 1st time: everything is OK, execution time ~3:40.
- Run the 2nd time: everything is OK, execution time ~3:40.
- Run the 3rd time: the gRPC client gets partial responses but does not get the final (with lattice) response for more than 30 minutes. The log shows many messages about the WRITEREADY state for several corr_id values. A minimal client sketch for such a run is shown below.
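For illustration only, here is a rough reproduction sketch using the Python tritonclient gRPC streaming API. The server address, the use of random noise in place of real audio, and the assumption that only the final response carries RAW_LATTICE are placeholders; a real client should follow the Kaldi backend example clients.

```python
# Hypothetical reproduction sketch (not the original client): stream chunked audio
# for 300 sequences and collect the final, lattice-carrying responses.
import numpy as np
import tritonclient.grpc as grpcclient

SERVER_URL = "localhost:8001"   # assumption: default Triton gRPC port
MODEL_NAME = "kaldi_online"
CHUNK_SAMPLES = 8160            # matches WAV_DATA dims in config.pbtxt
SAMPLE_RATE = 8000              # matches mfcc.conf

finals = {}                     # request_id of the last chunk -> RAW_LATTICE bytes

def on_response(result, error):
    if error is not None:
        print("stream error:", error)
        return
    # Assumption: partial responses carry only TEXT; the final response also
    # carries RAW_LATTICE.
    lattice = result.as_numpy("RAW_LATTICE")
    if lattice is not None:
        finals[result.get_response().id] = lattice

def send_utterance(client, corr_id, samples):
    # Split the utterance into fixed-size chunks and mark the sequence
    # boundaries via sequence_start / sequence_end.
    n_chunks = int(np.ceil(len(samples) / CHUNK_SAMPLES))
    for i in range(n_chunks):
        chunk = samples[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES]
        padded = np.zeros((1, CHUNK_SAMPLES), dtype=np.float32)
        padded[0, :len(chunk)] = chunk
        wav = grpcclient.InferInput("WAV_DATA", [1, CHUNK_SAMPLES], "FP32")
        wav.set_data_from_numpy(padded)
        dim = grpcclient.InferInput("WAV_DATA_DIM", [1, 1], "INT32")
        dim.set_data_from_numpy(np.array([[len(chunk)]], dtype=np.int32))
        client.async_stream_infer(
            MODEL_NAME,
            inputs=[wav, dim],
            request_id=f"{corr_id}_{i}",
            sequence_id=corr_id,
            sequence_start=(i == 0),
            sequence_end=(i == n_chunks - 1),
        )

client = grpcclient.InferenceServerClient(SERVER_URL)
client.start_stream(callback=on_response)
# 300 utterances of ~300 s each; random noise stands in for real audio here.
for corr_id in range(1, 301):
    audio = np.random.randn(300 * SAMPLE_RATE).astype(np.float32)
    send_utterance(client, corr_id, audio)
# In a real client, wait until every sequence has produced its final response
# before closing; stop_stream() here simply tears the stream down.
client.stop_stream()
client.close()
print(f"received {len(finals)} final responses")
```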
Configuration files: config.pbtxt
name: "kaldi_online"
backend: "kaldi"
default_model_filename: "libkaldi-trtisbackend.so"
max_batch_size: 800
model_transaction_policy {
decoupled: True
}
parameters: {
key: "config_filename"
value {
string_value:"<repo_path>/models/kaldi_online/1/conf/config.conf"
}
}
parameters: {
key: "ivector_filename"
value: {
string_value:""
}
}
parameters: {
key: "nnet3_rxfilename"
value: {
string_value: "<repo_path>/models/kaldi_online/1/final.mdl"
}
}
parameters: {
key: "fst_rxfilename"
value: {
string_value: "<repo_path>/models/kaldi_online/1/HCLG.fst"
}
}
parameters: {
key: "word_syms_rxfilename"
value: {
string_value:"<repo_path>/models/kaldi_online/1/words.txt"
}
}
parameters: {
key: "lattice_postprocessor_rxfilename"
value {
string_value: ""
}
}
parameters: {
key: "use_tensor_cores"
value {
string_value: "1"
}
}
parameters: {
key: "main_q_capacity"
value {
string_value: "30000"
}
}
parameters: {
key: "aux_q_capacity"
value {
string_value: "400000"
}
}
parameters: [
{
key: "acoustic_scale"
value: {
string_value:"1.0"
}
},
{
key: "frame_subsampling_factor"
value: {
string_value:"3"
}
},
{
key: "max_active"
value: {
string_value:"10000"
}
},
{
key: "lattice_beam"
value: {
string_value:"7"
}
},
{
key: "beam"
value: {
string_value:"10.0"
}
},
{
key: "num_worker_threads"
value: {
string_value:"40"
}
},
{
key: "num_channels"
value {
string_value: "4000"
}
},
{
key: "max_execution_batch_size"
value: {
string_value:"400"
}
}]
sequence_batching {
max_sequence_idle_microseconds:1000000000
control_input [
{
name: "START"
control [
{
kind: CONTROL_SEQUENCE_START
int32_false_true: [ 0, 1 ]
}
]
},
{
name: "READY"
control [
{
kind: CONTROL_SEQUENCE_READY
int32_false_true: [ 0, 1 ]
}
]
},
{
name: "END"
control [
{
kind: CONTROL_SEQUENCE_END
int32_false_true: [ 0, 1 ]
}
]
},
{
name: "CORRID"
control [
{
kind: CONTROL_SEQUENCE_CORRID
data_type: TYPE_UINT64
}
]
}
]
oldest {
max_candidate_sequences:2200
preferred_batch_size:[400]
max_queue_delay_microseconds:1000
}
},
input [
{
name: "WAV_DATA"
data_type: TYPE_FP32
dims: [ 8160 ]
},
{
name: "WAV_DATA_DIM"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
output [
{
name: "RAW_LATTICE"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "CTM"
data_type: TYPE_STRING
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
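With model_transaction_policy { decoupled: True }, responses for a sequence arrive asynchronously over the stream rather than one per request, which is why the client sketch above uses a streaming callback. Below is a small, hedged sketch (assuming the standard tritonclient gRPC API and the default port) for confirming which configuration the server actually loaded:

```python
# Hypothetical check: print the model config Triton actually loaded, to confirm
# the decoupled policy and sequence-batching limits match config.pbtxt.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")  # assumption: default port
config = client.get_model_config("kaldi_online").config
print("decoupled:", config.model_transaction_policy.decoupled)
print("max_batch_size:", config.max_batch_size)
print("max_candidate_sequences:",
      config.sequence_batching.oldest.max_candidate_sequences)
print("max_sequence_idle_microseconds:",
      config.sequence_batching.max_sequence_idle_microseconds)
client.close()
```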
config.conf
--print-args=true
--feature-type=mfcc
--mfcc-config=<path_to_repo>/models/kaldi_online/1/conf/mfcc.conf
--minimize=false
mfcc.conf
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--print-args=true
--use-energy=false # use average of log energy, not energy.
--sample-frequency=8000 # Switchboard is sampled at 8kHz
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=40 # low cutoff frequency for mel bins
--high-freq=-200 # high cutoff frequency, relative to Nyquist of 4000 (=3800)
Expected behavior
The final response is returned in a much shorter time, as in the first and second iterations (given that the Kaldi backend returns it to Triton Server).
Issue Analytics
- Created 2 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Hi @Tabrizian, we reproduced this issue on an open-source model: https://alphacephei.com/vosk/models/vosk-model-ru-0.22.zip
Please find the configs in the attachment: triton_vosk.tar.gz
We got “freezing” with 50 utterances of about 300 seconds each. To reproduce, run several iterations on the same batch of utterances.
Thanks for providing the model. We’ll look into this.