Airflow 2.1.0 with Schedulers HA Failing
See the original GitHub issue; discussed in https://github.com/apache/airflow/discussions/17126
Originally posted by sorabhgit, July 21, 2021: Hello guys, I am also struggling with an issue while setting up scheduler HA with Airflow 2.1.0.
I’ve installed the Airflow scheduler on 2 separate nodes, both pointing to the same MySQL 8 database, but I get the error below in one of the scheduler logs:
Steps to reproduce:
- Install Airflow 2.1.0 on 2 nodes using MySQL 8.0.25.
- Set use_row_level_locking = True in airflow.cfg on both nodes.
- Start the scheduler, webserver, and Celery worker on node 1, and just the scheduler on node 2.
- Execute any example DAG; one of the schedulers will exit with the error below.
[2021-07-01 08:15:04,342] {scheduler_job.py:1302} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/mysql/connector/connection_cext.py", line 337, in get_rows
else self._cmysql.fetch_row()
_mysql_connector.MySQLInterfaceError: Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 277, in execute
self._handle_result(result)
File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 172, in _handle_result
self._handle_resultset()
File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 671, in _handle_resultset
self._rows = self._cnx.get_rows()[0]
File "/usr/local/lib64/python3.6/site-packages/mysql/connector/connection_cext.py", line 368, in get_rows
sqlstate=exc.sqlstate)
mysql.connector.errors.DatabaseError: 3572 (HY000): Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1284, in _execute
num_queued_tis = self._do_scheduling(session)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1546, in _do_scheduling
num_queued_tis = self._critical_section_execute_task_instances(session=session)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1142, in _critical_section_execute_task_instances
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 900, in _executable_task_instances_to_queued
pools = models.Pool.slots_stats(lock_rows=True, session=session)
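For context, the failing call is `Pool.slots_stats(lock_rows=True)`, which takes row locks without waiting; MySQL raises error 3572 when the lock is already held by the other scheduler. A minimal standard-library sketch of the same failure mode, with SQLite's zero busy-timeout standing in for MySQL's "do not wait for locks" behaviour (file paths are hypothetical):

```python
import os
import sqlite3
import tempfile

# Two connections to one database stand in for two schedulers sharing
# one metadata DB. timeout=0 means "fail instead of waiting for locks",
# roughly what NOWAIT does on the MySQL side.
db = os.path.join(tempfile.mkdtemp(), "metadata.db")
scheduler1 = sqlite3.connect(db, timeout=0, isolation_level=None)
scheduler2 = sqlite3.connect(db, timeout=0, isolation_level=None)

scheduler1.execute("BEGIN IMMEDIATE")  # scheduler 1 enters the critical section
try:
    scheduler2.execute("BEGIN IMMEDIATE")  # scheduler 2 cannot wait -> fails at once
except sqlite3.OperationalError as exc:
    print("scheduler 2:", exc)  # "database is locked"
finally:
    scheduler1.rollback()
```

Under normal operation one scheduler is expected to lose this race occasionally and simply retry on its next loop; the report here is that the loss surfaces as a fatal exception instead.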
Issue Analytics
- Created 2 years ago
- Comments: 17 (7 by maintainers)
Top GitHub Comments
We support scheduler HA (running more than one scheduler): https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html?highlight=scheduler ha#running-more-than-one-scheduler — our scheduler runs in Active/Active mode (which means that both schedulers are parsing DAGs at the same time). This is supported on MySQL 8+ and should work (of course there might be some edge cases, but generally we have tested it and it works).
This is of course very different from database HA, which is outside the realm of Airflow and is handled by your deployment. From the very beginning we developed Airflow 2 with the assumption that the database runs at most in Active/Passive mode. The comment in #14788 indicated that someone had a similar problem when running the DB in active/active mode behind a proxy (and there, switching to talk directly to only one physical DB helped). So my assumption was that you have a similar setup.
Also, we’ve seen similar problems with various proxies that provide a kind of poor man’s DB HA, where the proxy had several physical DB clusters behind it. We base our scheduler HA heavily on database locking, and locking is a hard problem to solve in an Active/Active setup.
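A sketch of why locking breaks in that setup, under the (hypothetical) assumption of a proxy that routes the two schedulers to two independent primaries: a lock taken on one is invisible on the other, so mutual exclusion silently disappears. Standard-library SQLite files stand in for the two physical databases:

```python
import os
import sqlite3
import tempfile

# Two *independent* database files stand in for two active/active
# primaries behind a proxy; each "scheduler" happens to be routed to
# a different one. (File names are hypothetical.)
tmp = tempfile.mkdtemp()
primary1 = sqlite3.connect(os.path.join(tmp, "primary1.db"), timeout=0, isolation_level=None)
primary2 = sqlite3.connect(os.path.join(tmp, "primary2.db"), timeout=0, isolation_level=None)

primary1.execute("BEGIN IMMEDIATE")  # scheduler 1 "locks" the critical section
primary2.execute("BEGIN IMMEDIATE")  # scheduler 2 locks... a different database
print("both schedulers are inside the critical section")  # mutual exclusion lost
```

Depending on how the proxy and replication resolve the resulting conflicts, this can surface either as lost locks (both schedulers queue the same work) or as lock errors like the one reported above.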
That leads to the suggestion that this might be a similar case for you. If it is not, and you are 100% sure that you have a single physical DB behind, then the problem needs deeper investigation and will take quite some time to resolve, possibly with some iterations here to find the reason (because we have not seen it in our tests).
So if you are 100% sure you do not have multiple DBs being accessed at the same time (even through a single proxy), then my advice is to switch to Postgres, as it might take quite a lot of time to find the cause (we’ve seen in the past that people sometimes use customized versions of databases with some functionality disabled, for example). Postgres is much more stable and less configurable (MySQL, for example, can have multiple engines with different capabilities), and there might be many other reasons why MySQL (especially a custom-configured one) creates problems.
Unfortunately, we have no capacity in the community to investigate individual users’ cases deeply, so unless you have the time and capacity to investigate it yourself and provide more information, I am afraid it might take quite some time even to reproduce this kind of problem.
Going Postgres is the much more “certain” route, and if you are keen on timing, I’d heartily recommend going that way.
@ashb, something that we need to discuss when you return: it seems (needs confirmation) that some people connect Airflow HA schedulers to a DB in active/active mode, and it causes the locking problem (MySQL in this case, as usual).
I think we might want to either be more explicit about this in Airflow, detect it and inform the user (better), or possibly implement support for Active/Active mode (the best option, but it might not be possible/easy). Happy to have a discussion about it when you are back from holidays 😉