MemoryErrors Began Recently, Breaking our Python Function App
Investigative information
- Timestamp: 2019/04/26 @ 3pm
- Function App name: [PRIVATE] Contact if needed.
- Function name(s) (as appropriate): [PRIVATE]
- Core Tools version: 2.6.1048
Function App Overview
We built a system that processes audio files. Each input file is in a compressed format (m4a and mp3 are the most common). The main processing function uses ffmpeg, which is included as a static binary, to decompress the audio file into raw samples for processing. While the system was able to handle unpacking 90mins of audio, we found that the time needed for complete unpacking was unreasonable, causing the entire function to take ~4mins to process. Running this process locally showed that the Python process allocated a peak of ~1.5-2GB of memory.
We were able to reduce this dramatically by streaming the decompression process. However, the streaming had some processing overhead such that “shorter” audio files actually took longer to process overall with the streaming approach. We determined that ~10mins of audio is a reasonable point at which to switch modes.
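As a minimal sketch of the mode switch described above: the function name, the constant, and the choice to stream at exactly the threshold are illustrative assumptions, not our production code.

```python
# Assumed cut-over point: ~10 minutes of audio, expressed in seconds.
STREAMING_THRESHOLD_SECONDS = 10 * 60


def choose_decode_mode(duration_seconds):
    """Pick a decode strategy from the audio duration.

    Returns "stream" for long files (lower peak memory, some overhead)
    and "full" for short files (decode everything in one pass).
    """
    if duration_seconds >= STREAMING_THRESHOLD_SECONDS:
        return "stream"
    return "full"
```

A ~3:30 file (210 seconds) would be decoded in full, while a 90min file (5400 seconds) would be streamed.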
To make the most of each Function App instance, we make use of the Python multiprocessing module (via ProcessPoolExecutor). We use a single Process for processing audio data with ffmpeg (which itself is multi-threaded). We then use numCPUs - 1 Processes to handle downloading audio files to prepare for the processing phase. In this way, we can download multiple audio files to the function at a time while one audio file is being processed. Attempting to process multiple files simultaneously results in worse performance overall (we suspect that the multiple instances of ffmpeg have threads that fight each other for time on the CPU cores).
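The layout above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual code: download_audio and process_audio are hypothetical placeholders, and keeping at least one download worker on a single-core host is an assumption.

```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def pool_sizes(cpu_count=None):
    """Return (processing_workers, download_workers).

    One process is reserved for ffmpeg-based processing; the remaining
    numCPUs - 1 handle downloads (at least one, by assumption).
    """
    n = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return 1, max(1, n - 1)


def download_audio(url):
    # Hypothetical placeholder: fetch the file and return its local path.
    return url


def process_audio(path):
    # Hypothetical placeholder: decode with ffmpeg and analyze the samples.
    return path


def run_pipeline(urls):
    """Download in parallel while a single worker processes files one at a time."""
    n_processing, n_download = pool_sizes()
    with ProcessPoolExecutor(max_workers=n_download) as downloaders, \
            ProcessPoolExecutor(max_workers=n_processing) as processor:
        downloads = [downloaders.submit(download_audio, u) for u in urls]
        # Hand each file to the processing worker as soon as its download finishes.
        jobs = [processor.submit(process_audio, d.result())
                for d in as_completed(downloads)]
        return [j.result() for j in jobs]
```

On a 4-core instance this gives one processing worker and three download workers. Note that every download worker may hold a file in memory while the processing worker is decoding another.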
Things Broke On Us
This system worked with no issues as of March 24th, 2019: we could queue up as many audio files of whatever size we liked and the system would churn through them. Happy with the system, we moved on to another task. As our Function App is on a Linux Consumption Plan, we left everything “as-is”, expecting to return to it at a later date for integration into a larger system.
That later date was yesterday and we were surprised to find that something changed in the intervening month and broke our Function App. All of a sudden our Function App would stop responding to requests. Looking into the Monitor and Live Metrics Stream views, we noticed several things:
- Monitor:
  - Doesn’t show all logs anymore.
- Live Metrics Stream:
  - Only shows “1 Server Online”.
  - Server ID in the Sample Telemetry stream is “@(none)”.
  - The “Servers” section at the bottom shows no useful information (and only ever one server instance).
Previously we could rely on the Monitor section to show us logs of both successful and unsuccessful runs of the functions in our Function App. This is no longer the case.
Previously we could use the Live Metrics Stream to show when more than one server instance had spawned to handle the request load. Unless that capability is currently gone, this feature is broken.
What We’ve Discovered
Luckily for us, the Live Metrics Stream appears to work and provided us with a sampling of log messages that we could use to attempt to identify the source of the issue. By watching this stream and inspecting the Errors, we found the following:
- The first error to occur is always caused by a MemoryError Exception in Python. This comes from a call to np.ascontiguousarray which is used to convert the audio samples from 16bit integers to 32bit float values. For large audio files, the array size here can be hundreds of MiBs in size (if not over 1GiB).
- Every subsequent run of the function fails to start the ffmpeg process. It looks like the system is getting an OSError, indicating that it was “trying to execute a non-existent file.”
- Once the Function App was shut down (from a timeout) and restarted by adding a request, the instances would again be able to access ffmpeg and process happily (when provided audio files at “the right size”).
- If lots of audio file processing requests hit the Function App at once and the system breaks due to the MemoryError, then ~10mins from the break the timeout will occur and messages may begin to process again (usually succeeding without issue).
Certain audio file sizes can trigger the MemoryError to occur 100% of the time. In some circumstances, it seems that smaller audio files can trigger the MemoryError, though it is likely related to the memory state caused by simultaneously downloading files.
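To put rough numbers on the array sizes involved, here is a back-of-the-envelope sketch. The 44.1 kHz sample rate is an assumption (the actual rate depends on the input file), and the helper names are illustrative.

```python
def decoded_size_bytes(duration_s, sample_rate=44100, channels=2, bytes_per_sample=4):
    """Size of the fully decoded audio held as 32bit float samples."""
    return int(duration_s * sample_rate * channels * bytes_per_sample)


def peak_conversion_bytes(duration_s, sample_rate=44100, channels=2):
    """Memory held while converting 16bit integers to 32bit floats.

    The int16 source (2 bytes/sample) and the new float32 copy
    (4 bytes/sample) exist at the same time during the conversion.
    """
    return (decoded_size_bytes(duration_s, sample_rate, channels, 2)
            + decoded_size_bytes(duration_s, sample_rate, channels, 4))


MIB = 2 ** 20
print(decoded_size_bytes(10 * 60) / MIB)     # 10min stereo as float32
print(decoded_size_bytes(90 * 60) / MIB)     # 90min stereo as float32
print(peak_conversion_bytes(10 * 60) / MIB)  # 10min, source + converted copy
```

Under these assumptions a 10min stereo file is about 202 MiB as float32 and a 90min file is about 1.77 GiB, before counting the int16 source buffer or any files being downloaded in parallel.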
Questions
This leads us to the following questions:
- What changed with memory handling in Function Apps [possibly Python-specific] to cause this issue?
- How do we fix/work around this issue?
Repro steps
Reproduction is proprietary, but we have two ways to do it:
- Unpack a 10min long audio file into a function with 32bit samples (stereo). (Specifically: with ffmpeg via the audioread module.)
- Use our Function App to process a queue of multiple (~50) smaller (~3:30) audio files.
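Since the real reproduction is proprietary, the following is only a synthetic stand-in for the first case. It skips ffmpeg and audioread entirely and allocates a silent int16 buffer of the same shape, then performs the same kind of np.ascontiguousarray conversion. The function name, the 44.1 kHz sample rate, and the (frames, channels) layout are assumptions.

```python
import numpy as np


def convert_like_pipeline(duration_s, sample_rate=44100, channels=2):
    """Allocate silent 16bit PCM of the given length and convert it to float32.

    This mimics the memory pattern of the failing call only; it does not
    decode any real audio.
    """
    n_frames = int(duration_s * sample_rate)
    pcm = np.zeros((n_frames, channels), dtype=np.int16)
    # The dtype change forces a new float32 array to be allocated.
    return np.ascontiguousarray(pcm, dtype=np.float32)


if __name__ == "__main__":
    samples = convert_like_pipeline(10 * 60)  # ~10min of stereo audio
    print(samples.shape, samples.dtype, samples.nbytes)
```

Calling it with a 10min duration inside a function should exercise an allocation of roughly the same size as the one that fails for us.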
Expected behavior
The Function App continues to process audio files without interruption.
Actual behavior
The Function App fails due to a MemoryError and suddenly goes into a “Broken” state where the included ffmpeg static binary “goes missing”.
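One way to check whether the binary has really disappeared, rather than only failing to launch, is to log its state before each run. This is a small diagnostic sketch; the helper name and the example path are hypothetical.

```python
import os


def describe_binary(path):
    """Report whether `path` exists as a file and is executable by this process."""
    exists = os.path.isfile(path)
    return {
        "path": path,
        "exists": exists,
        "executable": exists and os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    # Hypothetical location of the bundled static binary.
    print(describe_binary("./bin/ffmpeg"))
```

If the file is reported as present and executable while the OSError still occurs, the failure is more likely in launching the process than in the file itself.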
Known workarounds
No known workarounds at this time.
Related information
Triggers/Bindings Used:
- HTTP - Used to add requests to the processing Queue.
- Queue Storage - Used to trigger audio file processing.
Top GitHub Comments
Update - The updated memory config is available in East Asia now. It will be deployed to rest of the regions tomorrow.
Currently consumption sku doesn’t support > 1.5 GB workloads. App service plan function apps should though…
Please notify us, @balag0, when the memory allowance has been increased to 1.5 GB and Azure Functions becomes usable for us again. Or why not make it 3 GB, like λ has? 😉