
Log files are still being cached causing ever-growing memory usage when scheduler is running


Apache Airflow version

2.4.1

What happened

My Airflow scheduler memory usage started to grow after I turned on the dag_processor_manager log by doing

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

see the red arrow below

[screenshot: scheduler memory usage graph; the red arrow marks where memory starts climbing after the dag_processor_manager log was enabled]

By looking closely at the memory usage as mentioned in https://github.com/apache/airflow/issues/16737#issuecomment-917677177, I discovered that it was the cache memory that kept growing:

[screenshot: container memory breakdown showing the cache component growing steadily]

After I turned the dag_processor_manager log off, memory usage returned to normal (no longer growing, steady at ~400 MB).

This issue is similar to #14924 and #16737. This time the culprit is the rotating logs under ~/logs/dag_processor_manager/dag_processor_manager.log*.

What you think should happen instead

Cache memory shouldn’t keep growing like this

How to reproduce

Turn on the dag_processor_manager log by doing

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

in the entrypoint.sh, then monitor the scheduler memory usage (one way to watch the page-cache portion is sketched below).
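To confirm that it is specifically the kernel page cache that grows (rather than the scheduler's own heap), one option is to poll the cgroup memory stats from inside the scheduler container. A minimal sketch, assuming cgroup v1 paths; on cgroup v2 the file is /sys/fs/cgroup/memory.stat and the field is named "file" instead of "cache":

import time

MEMORY_STAT = "/sys/fs/cgroup/memory/memory.stat"  # cgroup v1 path


def cached_bytes() -> int:
    # Return the page-cache bytes currently charged to this cgroup.
    with open(MEMORY_STAT) as f:
        for line in f:
            key, value = line.split()
            if key == "cache":
                return int(value)
    return 0


while True:
    print(f"page cache: {cached_bytes() / 1024 / 1024:.1f} MB")
    time.sleep(60)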

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

k8s

Anything else

I’m not sure why the previous fix https://github.com/apache/airflow/pull/18054 has stopped working 🤔
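For context, my understanding is that the earlier fix works by hinting to the kernel, when a log file is (re)opened, that its cached pages are not needed (posix_fadvise with POSIX_FADV_DONTNEED), so they can be dropped instead of accumulating as "cache" memory. A minimal sketch of that technique applied to a rotating handler; the class name here is illustrative and not necessarily what Airflow ships:

import logging.handlers
import os


class NonCachingRotatingFileHandler(logging.handlers.RotatingFileHandler):
    # Illustrative sketch only -- not necessarily the exact class Airflow uses.

    def _open(self):
        stream = super()._open()
        try:
            # Hint that cached pages for this log file are not needed, so the
            # kernel is free to drop them instead of letting cache memory grow.
            os.posix_fadvise(stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (AttributeError, OSError):
            # posix_fadvise is not available on every platform; ignore silently.
            pass
        return stream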

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 25 (25 by maintainers)

Top GitHub Comments

4 reactions
zachliu commented, Oct 29, 2022

Worse than a red herring, this is a mirage 😆 I can’t change the kernel’s behavior, so I’ll just change my own setup:

LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB
        "backupCount": 3,
    }
)

This caps the cache memory usage at around 40–50 MB.
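For anyone wanting to apply the same override, a snippet like this normally lives in a custom logging config module referenced by [logging] logging_config_class (or the AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS environment variable). A rough sketch, where the module name is just an example and which assumes CONFIG_PROCESSOR_MANAGER_LOGGER=True is set so the processor_manager handler actually exists in the default config:

# log_config.py -- referenced as log_config.LOGGING_CONFIG via
# AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS (module name is just an example)
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Cap the dag_processor_manager log: 3 backups x 10 MB bounds how much
# log data the kernel can ever keep cached for these files.
LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB
        "backupCount": 3,
    }
)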

1 reaction
Taragolis commented, Oct 27, 2022

BTW. I’ve heard VERY bad things about EFS when it is used to share DAGs. It has a profound impact on the stability and performance of Airflow if you have a big number of DAGs, unless you pay big bucks for IOPS. I’ve heard that from many people. This is the moment when I usually STRONGLY recommend GitSync instead: https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca

It always depends on configuration and monitoring. I personally hit this issue, possibly back in Airflow 2.1.x, and I do not know whether it was actually related to Airflow itself or to something else. Working with EFS definitely takes more effort than GitSync.

For anyone who finds this thread in the future while dealing with EFS performance degradation, the following might help:

Disable writing Python bytecode inside the NFS (AWS EFS) mount:

  • Mount it as read-only
  • Disable Python bytecode generation by setting PYTHONDONTWRITEBYTECODE=x
  • Or redirect the bytecode cache by setting PYTHONPYCACHEPREFIX, for example to /tmp/pycaches (a quick check is sketched below)
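A quick way to confirm those two settings actually took effect inside the container (sys.dont_write_bytecode and sys.pycache_prefix are standard in Python 3.8+):

import sys

# PYTHONDONTWRITEBYTECODE=x shows up as sys.dont_write_bytecode == True;
# PYTHONPYCACHEPREFIX=/tmp/pycaches shows up as sys.pycache_prefix.
print("dont_write_bytecode:", sys.dont_write_bytecode)
print("pycache_prefix:", sys.pycache_prefix)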

Bursting throughput mode looks like a miracle at first, but once the burst credits run out it can turn your life into hell. Each newly created EFS file system starts with roughly 2.1 TB of burst credit balance (the BurstCreditBalance metric).

What could be done here:

  • Switch to Provisioned Throughput mode permanently, which can cost a lot: something like 6 USD per MiB/s (excluding VAT), so 100 MiB/s would cost more than 600 USD per month.
  • Switch to Provisioned Throughput mode only when the burst credit balance drops below some threshold, say 0.5 TB, and switch back once it climbs back toward the ~2.1 TB ceiling. Unfortunately there is no built-in autoscaling for this, so it has to be done manually or with a combination of CloudWatch alarms + AWS Lambda (a rough sketch of such a function follows).
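A rough sketch of that CloudWatch + Lambda approach using boto3; the file system ID, thresholds and the 100 MiB/s figure are placeholders, and note that AWS only allows changing the throughput mode about once every 24 hours:

import datetime

import boto3

FILE_SYSTEM_ID = "fs-12345678"  # placeholder
LOW_CREDITS = 0.5e12   # switch to provisioned below ~0.5 TB of burst credits
HIGH_CREDITS = 2.0e12  # switch back once credits approach the ~2.1 TB ceiling

efs = boto3.client("efs")
cloudwatch = boto3.client("cloudwatch")


def burst_credit_balance() -> float:
    # Lowest average BurstCreditBalance (in bytes) over the last 15 minutes.
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
        StartTime=now - datetime.timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    return min(dp["Average"] for dp in datapoints) if datapoints else HIGH_CREDITS


def lambda_handler(event, context):
    fs = efs.describe_file_systems(FileSystemId=FILE_SYSTEM_ID)["FileSystems"][0]
    credits = burst_credit_balance()
    if fs["ThroughputMode"] == "bursting" and credits < LOW_CREDITS:
        efs.update_file_system(
            FileSystemId=FILE_SYSTEM_ID,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=100,  # placeholder
        )
    elif fs["ThroughputMode"] == "provisioned" and credits > HIGH_CREDITS:
        efs.update_file_system(FileSystemId=FILE_SYSTEM_ID, ThroughputMode="bursting")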


