MemoryErrors Began Recently, Breaking our Python Function App

Investigative information

  • Timestamp: 2019/04/26 @ 3pm
  • Function App name: [PRIVATE] Contact if needed.
  • Function name(s) (as appropriate): [PRIVATE]
  • Core Tools version: 2.6.1048

Function App Overview

We built a system that processes audio files. Each input file is in a compressed format (m4a and mp3 are the most common). The main processing function uses ffmpeg, which is included as a static binary, to decompress the audio file into raw samples for processing. The system could unpack 90 minutes of audio, but fully unpacking a file that long took unreasonably long: the entire function needed ~4 minutes. Running the same process locally showed the Python process peaking at ~1.5-2 GB of allocated memory.

We were able to reduce this dramatically by streaming the decompression process. However, streaming has some processing overhead, so "shorter" audio files actually took longer to process overall with the streaming approach. We settled on ~10 minutes of audio as the point at which to switch modes: shorter files are unpacked in full, longer files are streamed.
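The mode switch described above could look roughly like the sketch below. It is an illustration only, not our actual code: the function name `choose_decode_mode`, the mode labels, and the exact 600-second cutoff are assumptions derived from the ~10-minute figure.

```python
# Hypothetical names; only the ~10-minute threshold comes from the description above.
STREAMING_THRESHOLD_S = 10 * 60  # assumed cutoff, in seconds


def choose_decode_mode(duration_s: float,
                       threshold_s: float = STREAMING_THRESHOLD_S) -> str:
    """Return "full" for short files and "streaming" for long ones."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    # Streaming has fixed overhead, so it only pays off for longer files.
    return "streaming" if duration_s >= threshold_s else "full"
```

With these assumptions, a ~3:30 file (210 seconds) is unpacked in full, while a 90-minute file (5400 seconds) is streamed.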

To make the most of each Function App instance, we make use of the Python multiprocessing module (via ProcessPoolExecutor). We use a single Process for processing audio data with ffmpeg (which itself is multi-threaded). We then use numCPUs - 1 Processes to handle downloading audio files to prepare for the processing phase. In this way, we can download multiple audio files to the function at a time while one audio file is being processed. Attempting to process multiple files simultaneously results in worse performance overall (we suspect that the multiple instances of ffmpeg have threads that fight each other for time on the CPU cores).
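As a rough sketch of the worker layout described above: one process reserved for ffmpeg work and numCPUs - 1 processes for downloads. The helper names `pool_sizes` and `build_pools` are hypothetical, and using two separate executors is an assumption made for illustration; the description only fixes the worker counts.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def pool_sizes(cpu_count=None):
    """Return (download_workers, processing_workers) for one instance."""
    cpus = cpu_count or os.cpu_count() or 1
    # Keep at least one download worker even on a single-core host.
    return max(1, cpus - 1), 1


def build_pools(cpu_count=None):
    """Create separate executors so downloads never take the ffmpeg slot."""
    download_workers, processing_workers = pool_sizes(cpu_count)
    download_pool = ProcessPoolExecutor(max_workers=download_workers)
    processing_pool = ProcessPoolExecutor(max_workers=processing_workers)
    return download_pool, processing_pool
```

Keeping the processing pool at a single worker mirrors the observation that several ffmpeg instances running at once perform worse overall.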

Things Broke On Us

This system worked with no issues as of March 24th, 2019: we could queue up as many audio files of whatever size we liked and the system would churn through them. Happy with the system, we moved on to another task. As our Function App is on a Linux Consumption Plan, we left everything “as-is”, expecting to return to it at a later date for integration into a larger system.

That later date was yesterday and we were surprised to find that something changed in the intervening month and broke our Function App. All of a sudden our Function App would stop responding to requests. Looking into the Monitor and Live Metrics Stream views, we noticed several things:

  • Monitor:
    1. Doesn’t show all logs anymore.
  • Live Metrics Stream:
    1. Only shows “1 Server Online”.
    2. Server ID in the Sample Telemetry stream is “@(none)”.
    3. The “Servers” section at the bottom shows no useful information (and only ever one server instance).

Previously we could rely on the Monitor section to show us logs of both successful and unsuccessful runs of the functions in our Function App. This is no longer the case.

Previously we could use the Live Metrics Stream to show when more than one server instance had spawned to handle the request load. Unless that capability is currently gone, this feature is broken.

What We’ve Discovered

Luckily for us, the Live Metrics Stream still works well enough to provide a sampling of log messages, which we used to try to identify the source of the issue. By watching this stream and inspecting the errors, we found the following:

  1. The first error to occur is always a MemoryError exception in Python. It comes from a call to np.ascontiguousarray, which we use to convert the audio samples from 16-bit integers to 32-bit floats. For large audio files, this array can be hundreds of MiB in size (if not over 1 GiB).
  2. Every subsequent run of the function fails to start the ffmpeg process. It looks like the system is getting an OSError, indicating that it was “trying to execute a non-existent file.”
  3. Once the Function App was shut down (from a timeout) and restarted by adding a request, the instances would again be able to access ffmpeg and process happily (when provided audio files at “the right size”).
  4. If lots of audio file processing requests hit the Function App at once and the system breaks due to the MemoryError, then ~10mins from the break the timeout will occur and messages may begin to process again (usually succeeding without issue).

Certain audio file sizes trigger the MemoryError 100% of the time. In some circumstances smaller audio files can trigger it too, though that is likely related to the memory state caused by simultaneously downloading files.
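To put numbers on the array sizes mentioned above, here is a back-of-the-envelope estimate of the int16-to-float32 conversion. The 44.1 kHz sample rate and the function name `conversion_memory_mib` are assumptions for illustration; the stereo and sample-width figures come from the description.

```python
def conversion_memory_mib(duration_s, sample_rate=44_100, channels=2):
    """Estimate MiB for the int16 source, the float32 copy, and both together."""
    samples = int(duration_s * sample_rate) * channels
    int16_bytes = samples * 2    # 16-bit integer samples
    float32_bytes = samples * 4  # 32-bit float samples
    mib = 1024 * 1024
    return (int16_bytes / mib,
            float32_bytes / mib,
            (int16_bytes + float32_bytes) / mib)
```

Under these assumptions, a 10-minute stereo file needs about 101 MiB as int16 and about 202 MiB as float32, so roughly 300 MiB if both arrays are alive at the same time. A 90-minute file needs about 1.8 GiB for the float32 array alone, which is in line with the ~1.5-2 GB we measured locally.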

Questions

This leads us to the following questions:

  1. What changed with memory handling in Function Apps [possibly Python-specific] to cause this issue?
  2. How do we fix/work around this issue?

Repro steps

Reproduction is proprietary, but we have two ways to do it:

  1. Inside a function, unpack a 10-minute stereo audio file into 32-bit samples. (Specifically: with ffmpeg via the audioread module.)
  2. Use our Function App to process a queue of multiple (~50) smaller (~3:30) audio files.

Expected behavior

The Function App continues to process audio files without interruption.

Actual behavior

The Function App fails due to a MemoryError and suddenly goes into a “Broken” state where the included ffmpeg static binary “goes missing”.
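One way to narrow down the "goes missing" symptom is to log the state of the binary path right before launching ffmpeg. The sketch below is a diagnostic aid only, not a fix; the helper name `describe_binary` is hypothetical and the path passed to it would be wherever the static binary is deployed.

```python
import os


def describe_binary(path):
    """Return a short status string for an executable path (diagnostic only)."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"
```

If this reports "ok" while launching the process still raises an OSError, the file itself is probably intact and the failure lies elsewhere in process creation.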

Known workarounds

No known workarounds at this time.

Related information

Triggers/Bindings Used:

  • HTTP - Used to add requests to the processing Queue.
  • Queue Storage - Used to trigger audio file processing.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 15 (1 by maintainers)

Top GitHub Comments

2 reactions
balag0 commented, May 15, 2019

Update - The updated memory config is available in East Asia now. It will be deployed to rest of the regions tomorrow.

Currently consumption sku doesn’t support > 1.5 GB workloads. App service plan function apps should though…

1 reaction
j08lue commented, May 13, 2019

We expect to have an additional ~600MB available to the container.

Please notify us, @balag0, when the memory allowance has been increased to 1.5 GB and Azure Functions becomes usable for us again. Or why not make it 3 GB, like λ has? 😉
