
Python SDK Doesn't Close Resources, Causes WS_ERROR_UNDERLYING_IO_ERROR


Describe the bug: The Python implementation of the SDK does not close its resources. Open file handles and TCP connections grow without bound until the SDK’s parent process is killed.

While these resources stay open and keep accumulating, the SDK returns "WebSocket operation failed. Internal error: 3. Error details: WS_ERROR_UNDERLYING_IO_ERROR" at an alarming rate (sometimes on more than 50% of requests).

To Reproduce

Steps to reproduce the behavior:

  1. Implement a simple gunicorn server that calls the speech-to-text SDK (like this example)
  2. Send a request to the server to trigger speech-to-text
  3. Check lsof and netstat: the number of open file handles grows with every request

lsof shows the following two files held open indefinitely by the gunicorn worker process. Every request adds new handles to the same two files, and the earlier handles are never closed.

python3.7/site-packages/azure/cognitiveservices/speech/libMicrosoft.CognitiveServices.Speech.core.so
python3.7/site-packages/azure/cognitiveservices/speech/libMicrosoft.CognitiveServices.Speech.extension.kws.so

Besides the underlying IO error from the SDK, this eventually causes a "Too many open files" system error when the SDK is used in a long-running API server.
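To watch the handle growth from inside a worker without running lsof, one option on Linux is to count the entries in /proc/<pid>/fd before and after a batch of requests. This is a minimal sketch assuming a Linux host; leaky_request and clean_request are hypothetical stand-ins for one speech-to-text request, not Speech SDK calls.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd.

    Linux-only: this reads the same per-process descriptor table that lsof reports.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_growth(action, repeats=5):
    """Run `action` several times and return how many descriptors stayed open."""
    before = count_open_fds()
    for _ in range(repeats):
        action()
    return count_open_fds() - before


if __name__ == "__main__":
    leaked = []

    def leaky_request():
        # Hypothetical request that keeps its handle alive and never closes it.
        leaked.append(open(os.devnull, "rb"))

    def clean_request():
        # The same work with the handle released deterministically.
        with open(os.devnull, "rb"):
            pass

    print("leaky:", fd_growth(leaky_request))  # grows by one per call
    print("clean:", fd_growth(clean_request))  # stays flat
```

A steadily positive result for real requests matches the growth the issue reports; a result of 0 means each request released what it opened.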

netstat also shows TCP connections that are left dangling indefinitely:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.200.11.137:47238     52.184.80.197:443       ESTABLISHED 30945/python3.7
tcp        0      0 10.200.11.137:47240     52.184.80.197:443       ESTABLISHED 30946/python3.7
tcp        0      0 10.200.11.137:47246     52.184.80.197:443       ESTABLISHED 30945/python3.7
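A small parser can turn netstat output like the above into per-worker connection counts, which makes the growth easy to track between requests. This sketch assumes the column layout shown in the issue (as printed by netstat -tnp) and only counts IPv4 "tcp" rows in the ESTABLISHED state to remote port 443.

```python
from collections import Counter

NETSTAT_SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.200.11.137:47238     52.184.80.197:443       ESTABLISHED 30945/python3.7
tcp        0      0 10.200.11.137:47240     52.184.80.197:443       ESTABLISHED 30946/python3.7
tcp        0      0 10.200.11.137:47246     52.184.80.197:443       ESTABLISHED 30945/python3.7
"""


def established_by_pid(netstat_text, remote_port="443"):
    """Count ESTABLISHED TCP connections per PID to a given remote port.

    Expected columns: Proto, Recv-Q, Send-Q, Local Address, Foreign Address,
    State, PID/Program name.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[0] != "tcp":
            continue  # header line or a row this sketch does not handle
        foreign, state, pid_prog = fields[4], fields[5], fields[6]
        if state == "ESTABLISHED" and foreign.rsplit(":", 1)[-1] == remote_port:
            pid = pid_prog.split("/", 1)[0]
            counts[pid] += 1
    return counts


print(established_by_pid(NETSTAT_SAMPLE))  # Counter({'30945': 2, '30946': 1})
```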
  4. Restart the server. This closes all of the TCP connections and open file handles.
  5. Without a restart, repeated requests make the SDK return "WebSocket operation failed. Internal error: 3. Error details: WS_ERROR_UNDERLYING_IO_ERROR" very frequently.

Expected behavior: one of the following.

  • Python’s SpeechRecognizer class should implement a close method that cleans up its resources.
  • stop_continuous_recognition should clean up resources.
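The requested close method corresponds to the standard Python resource pattern: a close() that releases the underlying connection, plus __enter__/__exit__ so that a with block calls it even when a request fails. The sketch below is hypothetical: ManagedRecognizer and FakeConnection are illustrative names, not part of the Speech SDK, and the factory stands in for whatever actually creates the recognizer and its WebSocket.

```python
class FakeConnection:
    """Hypothetical stand-in for a socket-like resource held by a recognizer."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ManagedRecognizer:
    """Hypothetical wrapper showing the requested close() behavior."""

    def __init__(self, factory):
        self._factory = factory  # builds the per-request resource
        self._resource = None

    def __enter__(self):
        self._resource = self._factory()
        return self._resource

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions raised inside the with block

    def close(self):
        # Idempotent: calling it twice (explicitly and via __exit__) is harmless.
        if self._resource is not None:
            self._resource.close()
            self._resource = None


if __name__ == "__main__":
    with ManagedRecognizer(FakeConnection) as conn:
        print("open during request:", not conn.closed)
    print("closed after request:", conn.closed)
```

For an object that already exposes close(), contextlib.closing from the standard library gives the same guarantee without a custom class.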

Version of the Cognitive Services Speech SDK

azure-cognitiveservices-speech==1.6.0 from Pip

Platform, Operating System, and Programming Language

  • OS: Debian Stretch, Amazon Linux 2, Ubuntu 19.04
  • Hardware: x64
    • CPU: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz (stepping 4, 2500.000 MHz, 33792 KB cache)
  • Programming language: Python 3.7

Additional context

  • All of my requests to the SDK use the same audio file
  • If I restart the parent worker, the TCP connections and open file handles are closed
    • The restart also stops the underlying IO error from happening
  • I have reproduced this issue on several different AWS EC2 instances, as well as on Docker machines running locally, so I don’t think it’s a system-level network issue.
  • I tried the following sysctl settings to fix the issue at the system level, to no avail:
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

The tcpdump output for SDK connections that fail with WS_ERROR_UNDERLYING_IO_ERROR ends with the following:

    52.184.80.197.https > ip-10-200-11-29.49110: Flags [F.], cksum 0xda84 (correct), seq 5031, ack 195333, win 1517, options [nop,nop,TS val 1505324782 ecr 3200938], length 0
05:07:45.758722 IP (tos 0x0, ttl 64, id 61588, offset 0, flags [DF], proto TCP (6), length 1438)
    ip-10-200-11-29.49110 > 52.184.80.197.https: Flags [.], cksum 0xa0f2 (incorrect -> 0xc34a), seq 197445:198831, ack 5032, win 343, options [nop,nop,TS val 3200950 ecr 1505324782], length 1386
05:07:45.806866 IP (tos 0x0, ttl 41, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    52.184.80.197.https > ip-10-200-11-29.49110: Flags [R], cksum 0x52c7 (correct), seq 189854933, win 0, length 0

It looks like Azure is closing the connection (a FIN, Flags [F.], followed by a reset, Flags [R]), but the SDK keeps sending data anyway.
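The sequence in the capture can be checked mechanically: parse each segment’s direction, TCP flags, and payload length, then look for client payload sent after the server’s FIN. This is a sketch that assumes the tcpdump -v line format shown above; the client address is taken from the capture, and the IP-header lines are skipped because they carry no "Flags [...]" field.

```python
import re

# Source, destination, TCP flags, and payload length of one tcpdump segment line.
SEGMENT = re.compile(
    r"(?P<src>\S+) > (?P<dst>\S+?): Flags \[(?P<flags>[^\]]*)\].*?length (?P<length>\d+)\s*$"
)

TRACE = """\
    52.184.80.197.https > ip-10-200-11-29.49110: Flags [F.], cksum 0xda84 (correct), seq 5031, ack 195333, win 1517, options [nop,nop,TS val 1505324782 ecr 3200938], length 0
05:07:45.758722 IP (tos 0x0, ttl 64, id 61588, offset 0, flags [DF], proto TCP (6), length 1438)
    ip-10-200-11-29.49110 > 52.184.80.197.https: Flags [.], cksum 0xa0f2 (incorrect -> 0xc34a), seq 197445:198831, ack 5032, win 343, options [nop,nop,TS val 3200950 ecr 1505324782], length 1386
05:07:45.806866 IP (tos 0x0, ttl 41, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    52.184.80.197.https > ip-10-200-11-29.49110: Flags [R], cksum 0x52c7 (correct), seq 189854933, win 0, length 0
"""

CLIENT = "ip-10-200-11-29.49110"


def summarize(trace, client):
    """List (sender, flags, payload length) for each TCP segment line."""
    events = []
    for line in trace.splitlines():
        m = SEGMENT.search(line)
        if m:  # IP-header lines do not match and are skipped
            sender = "client" if m["src"] == client else "server"
            events.append((sender, m["flags"], int(m["length"])))
    return events


def data_after_peer_fin(trace, client):
    """Return True if `client` sent payload bytes after the peer's FIN."""
    peer_fin_seen = False
    for sender, flags, length in summarize(trace, client):
        if sender == "server" and "F" in flags:
            peer_fin_seen = True
        elif sender == "client" and peer_fin_seen and length > 0:
            return True
    return False


print(summarize(TRACE, CLIENT))
print(data_after_peer_fin(TRACE, CLIENT))
```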

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

2 reactions
chlandsi commented, Sep 30, 2019

@Checkroth: The SDK update has been released. I’m closing this for now, please reopen/create a new issue if there are problems.

2 reactions
chlandsi commented, Aug 27, 2019

The bug in the current version can cause packets to be dropped or sent out of order under high network load. This breaks decryption on the server, which then aborts the connection. Slowing down the input reduces the network load; in the absence of this error, the SDK can accept input data at any rate, because it buffers the data internally and throttles it to the speed the service expects.

The fix will address the network problem and thus make the throttling of the stream unnecessary; the error should be gone after the update.
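Before the fix, the workaround described above is to slow the input down. A minimal sketch of pacing audio at real-time speed, assuming 16 kHz, 16-bit, mono PCM input; write is a hypothetical callback (for example, something that forwards bytes to a push stream), and sleep is injectable so the pacing can be tested without actually waiting.

```python
import time

# Assumed audio format: 16 kHz, 16-bit (2-byte) samples, mono PCM.
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 32000


def push_at_real_time(audio, write, chunk_ms=100, sleep=time.sleep):
    """Feed `audio` to the hypothetical `write` callback no faster than real time.

    Each chunk is followed by a pause equal to the audio duration it contains.
    Returns the number of bytes written.
    """
    chunk_size = BYTES_PER_SECOND * chunk_ms // 1000  # 3200 bytes per 100 ms
    sent = 0
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        write(chunk)
        sent += len(chunk)
        sleep(len(chunk) / BYTES_PER_SECOND)  # seconds of audio in this chunk
    return sent
```

Pausing for the duration of each chunk keeps the average send rate at the audio’s own rate. Per the maintainer, after the SDK update this pacing should no longer be needed, because the SDK buffers and throttles the data itself.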
