LocalTaskJob Heartbeat Spamming (yields 800MB log) and a DAG that runs for nearly an hour.
Apache Airflow version: 1.10.10
Environment: Centos Linux 7
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Centos Linux 7
- Kernel (e.g. uname -a): cannot disclose
- Install tools: n/a
- Others: n/a
What happened: I am re-posting #9735 (the original did not use the issue template). I recently saw the same problem, which produced an 800MB log file for a single task run.
The message "ERROR - LocalTaskJob heartbeat got an exception" was logged more than 30,000 times, yielding a massive log file.
According to #5589 and #6284 this issue has been fixed. Both fixes were included in 1.10.6, yet the problem still exists in 1.10.10.
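As a stopgap for the log volume only, repeated identical messages can be capped with a standard `logging.Filter`. The sketch below is not how Airflow or the fixes in #5589 and #6284 handle this; the class name `RepeatLimitFilter` and the `max_repeats` parameter are hypothetical, and the logger name in the usage lines is illustrative.

```python
import logging


class RepeatLimitFilter(logging.Filter):
    """Drop a record once the same message has been seen too many times in a row.

    Hypothetical helper for illustration; not part of Airflow.
    """

    def __init__(self, max_repeats: int = 5) -> None:
        super().__init__()
        self.max_repeats = max_repeats
        self._last_message = None
        self._count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if message == self._last_message:
            self._count += 1
        else:
            # A different message resets the streak.
            self._last_message = message
            self._count = 1
        # Returning False tells the logging module to discard the record.
        return self._count <= self.max_repeats


# Usage: attach the filter to whichever logger emits the repeated error.
example_logger = logging.getLogger("repeat_limit_example")
example_logger.addFilter(RepeatLimitFilter(max_repeats=5))
```

This only limits what is written to the log. It does not make the task fail any sooner.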
What you expected to happen:
I would expect that the DAG would fail in a timely manner due to a lack of worker heartbeats.
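To make the expected behaviour concrete, here is a minimal sketch of a heartbeat loop that gives up after a bounded number of consecutive failures instead of logging the same error indefinitely. It is an illustration under assumptions, not Airflow's actual `LocalTaskJob` code: `run_heartbeats`, `HeartbeatGaveUp`, and `max_consecutive_failures` are hypothetical names, and the sleep between beats is omitted for brevity.

```python
import logging
from typing import Callable

log = logging.getLogger("heartbeat_sketch")


class HeartbeatGaveUp(RuntimeError):
    """Raised when too many heartbeats have failed in a row (hypothetical)."""


def run_heartbeats(
    heartbeat: Callable[[], None],
    beats: int,
    max_consecutive_failures: int = 3,
) -> int:
    """Call `heartbeat` up to `beats` times and return the number of successes.

    Raises HeartbeatGaveUp as soon as `max_consecutive_failures` calls fail
    back to back, so the caller can fail the task in a timely manner.
    """
    successes = 0
    consecutive_failures = 0
    for _ in range(beats):
        try:
            heartbeat()
        except Exception as exc:
            consecutive_failures += 1
            log.error("heartbeat failed (%d in a row): %s", consecutive_failures, exc)
            if consecutive_failures >= max_consecutive_failures:
                raise HeartbeatGaveUp(
                    f"{consecutive_failures} consecutive heartbeat failures"
                ) from exc
        else:
            # Any successful heartbeat resets the failure streak.
            successes += 1
            consecutive_failures = 0
    return successes
```

With a threshold of 3, a database outage would produce three error lines and then an exception the caller can turn into a task failure, rather than tens of thousands of identical log entries.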
How to reproduce it:
This appears to occur randomly, presumably while the database is performing poorly. I suspect this could be reproduced by overloading the DB while a DAG is running.
How often does this problem occur?
This problem occurs whenever the database becomes unreachable, which is rare.
The logs pasted below are from the linked issue above, not my own. In my case, the underlying database became unavailable for some time. In the logs below, it appears the DB had too many open connections. I am using MySQL, whereas the referenced logs come from Postgres, which suggests that this issue is independent of the underlying database.
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2285, in _wrap_pool_connect
    return fn()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 363, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 773, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 492, in checkout
    rec = pool._do_get()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 238, in _do_get
    return self._create_connection()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 308, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 437, in init
    self.__connect(first_connect_check=True)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 657, in _connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in exit
    exc_value, with_traceback=exc_tb,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 652, in __connect
    connection = pool._invoke_creator(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 488, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/psycopg2/init.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: ERROR: no more connections allowed (max_client_conn)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/airflow/jobs/base_job.py", line 172, in heartbeat
    session.merge(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2128, in merge
    _resolve_conflict_map=_resolve_conflict_map,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2201, in merge
    merged = self.query(mapper.class).get(key[1])
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 1004, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 1119, in _get_impl
    return db_load_fn(self, primary_key_identity)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 284, in load_on_pk_identity
    return q.one()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3358, in one
    ret = self.one_or_none()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3327, in one_or_none
    ret = list(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3403, in iter
    return self._execute_and_instances(context)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3425, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3440, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3418, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1133, in connection
    execution_options=execution_options,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1139, in _connection_for_bind
    engine, execution_options
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 432, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2251, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2289, in wrap_pool_connect
    e, dialect, self
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1555, in handle_dbapi_exception_noconnection
    sqlalchemy_exception, with_traceback=exc_info[2], from=e
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2285, in _wrap_pool_connect
    return fn()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 363, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 773, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 492, in checkout
    rec = pool._do_get()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 238, in _do_get
    return self._create_connection()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 308, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 437, in init
    self.__connect(first_connect_check=True)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 657, in _connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in exit
    exc_value, with_traceback=exc_tb,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 652, in __connect
    connection = pool._invoke_creator(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 488, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/psycopg2/init.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) ERROR: no more connections allowed (max_client_conn)
Issue Analytics
- Created 3 years ago
- Comments: 9 (3 by maintainers)
Hi, what is the solution to this issue?
ERROR - LocalTaskJob heartbeat got an exception
@xiangqiao123 - please open a new issue or discussion (if you cannot provide a reproducible case). It makes exactly 0 sense to comment on an issue closed more than a year ago, for a completely different version, based on a very vague resemblance to some part of an error you get.
It makes no sense at all and brings you no closer to anyone even wanting to help you.