Fail to download task log if there are Chinese characters in dag_id
Apache Airflow version
main (development)
What happened
If the dag_id of a DAG contains Chinese characters, downloading the log of any task in that DAG leads to an ‘Internal Server Error’ page.
What you expected to happen
Here’s the webserver log related to the bug which standalone mode produced:
webserver | [2022-01-26 18:29:15 +0800] [48511] [ERROR] Error handling request /get_logs_with_metadata?dag_id=%E6%B5%8B%E8%AF%95&task_id=sleep&execution_date=2022-01-25T09%3A23%3A42.145023%2B00%3A00&metadata=null&format=file&try_number=1
webserver | Traceback (most recent call last):
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 136, in handle
webserver |     self.handle_request(listener, req, client, addr)
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 185, in handle_request
webserver |     resp.write(item)
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/http/wsgi.py", line 327, in write
webserver |     self.send_headers()
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/http/wsgi.py", line 322, in send_headers
webserver |     util.write(self.sock, util.to_bytestring(header_str, "latin-1"))
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/util.py", line 565, in to_bytestring
webserver |     return value.encode(encoding)
webserver | UnicodeEncodeError: 'latin-1' codec can't encode characters in position 161-162: ordinal not in range(256)
webserver | 127.0.0.1 - - [26/Jan/2022:18:29:15 +0800] "GET /get_logs_with_metadata?dag_id=%E6%B5%8B%E8%AF%95&task_id=sleep&execution_date=2022-01-25T09%3A23%3A42.145023%2B00%3A00&metadata=null&format=file&try_number=1 HTTP/1.1" 500 0 "-" "-"
webserver | [2022-01-26 18:29:21 +0800] [48508] [ERROR] Error handling request /get_logs_with_metadata?dag_id=%E6%B5%8B%E8%AF%95&task_id=sleep&execution_date=2022-01-25T09%3A23%3A42.145023%2B00%3A00&metadata=null&format=file&try_number=1
webserver | Traceback (most recent call last):
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 136, in handle
webserver |     self.handle_request(listener, req, client, addr)
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 185, in handle_request
webserver |     resp.write(item)
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/http/wsgi.py", line 327, in write
webserver |     self.send_headers()
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/http/wsgi.py", line 322, in send_headers
webserver |     util.write(self.sock, util.to_bytestring(header_str, "latin-1"))
webserver |   File "/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/util.py", line 565, in to_bytestring
webserver |     return value.encode(encoding)
webserver | UnicodeEncodeError: 'latin-1' codec can't encode characters in position 161-162: ordinal not in range(256)
webserver | 127.0.0.1 - - [26/Jan/2022:18:29:21 +0800] "GET /get_logs_with_metadata?dag_id=%E6%B5%8B%E8%AF%95&task_id=sleep&execution_date=2022-01-25T09%3A23%3A42.145023%2B00%3A00&metadata=null&format=file&try_number=1 HTTP/1.1" 500 0 "-" "-"
triggerer | [2022-01-26 18:29:43,927] {triggerer_job.py:250} INFO - 0 triggers currently running
How to reproduce
- I’ve tested Airflow v2.2.0 with the Celery executor, the Airflow dev version in standalone mode, and Airflow v1.10.12 with the Celery executor. The bug exists in all three versions.
- To reproduce, create a DAG whose dag_id contains Chinese characters, such as ‘测试’. After triggering the DAG, try to download the log file of any of its tasks from the tree view or graph view page; you will be redirected to an ‘Internal Server Error’ page.
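The encoding failure itself can be shown without Airflow or gunicorn. This is only a minimal sketch: `to_bytestring` below is a simplified stand-in for gunicorn's helper, and the header name and value are hypothetical, since the log does not show which response header carries the dag_id.

```python
def to_bytestring(value, encoding="utf8"):
    """Simplified stand-in for gunicorn's util.to_bytestring (assumption, not the real code)."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return to_bytestring(text, encoding)
    except UnicodeEncodeError as exc:
        return exc


dag_id = "测试"
# Hypothetical header line; the issue does not say which header contains the dag_id.
header_str = "Content-Disposition: attachment; filename=%s_sleep_1.log\r\n" % dag_id

latin1_result = try_encode(header_str, "latin-1")  # the encoding gunicorn uses for headers
utf8_result = try_encode(header_str, "utf-8")      # the encoding tried in the local patch

print(type(latin1_result).__name__)
print(type(utf8_result).__name__)
```

Any header value containing characters above U+00FF fails the same way, so the problem is not specific to Chinese.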
Operating System
macOS Catalina, CentOS 7
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
No response
Anything else
- Following the error log produced by the webserver, I checked `/opt/anaconda3/envs/airflow_dev/lib/python3.8/site-packages/gunicorn/http/wsgi.py` line 322 and saw `util.write(self.sock, util.to_bytestring(header_str, "latin-1"))`
- After changing `latin-1` to `utf-8`, the bug was fixed. The whole function is shown below; the commented-out line is the one I added.
```python
def send_headers(self):
    if self.headers_sent:
        return
    tosend = self.default_headers()
    tosend.extend(["%s: %s\r\n" % (k, v) for k, v in self.headers])

    header_str = "%s\r\n" % "".join(tosend)
    util.write(self.sock, util.to_bytestring(header_str, "latin-1"))
    # util.write(self.sock, util.to_bytestring(header_str, "utf-8"))
    self.headers_sent = True
```
- However, `gunicorn/http/wsgi.py` is not part of the Airflow code, and I haven’t figured out how to fix this without changing that script. Is there a better way to fix it?
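One possible direction that leaves gunicorn untouched is to keep the header ASCII-only on the application side. The sketch below assumes the offending header is a Content-Disposition filename built from the dag_id, which this issue does not confirm, and percent-encodes it in the `filename*=UTF-8''...` form defined for HTTP header parameters. The helper name `content_disposition` is hypothetical and is not an Airflow or gunicorn API.

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an ASCII-only Content-Disposition value for a possibly non-ASCII filename.

    Hypothetical helper for illustration only.
    """
    # ASCII fallback for clients that ignore filename*: replace anything
    # non-ASCII, non-printable, or unsafe inside a quoted string with "_".
    fallback = "".join(
        ch if ch.isascii() and ch.isprintable() and ch not in '"\\' else "_"
        for ch in filename
    )
    # Percent-encode the UTF-8 bytes of the real name.
    encoded = quote(filename, safe="")
    return "attachment; filename=\"%s\"; filename*=UTF-8''%s" % (fallback, encoded)


value = content_disposition("测试_sleep_1.log")
print(value)

# The value now survives the latin-1 encoding step seen in the traceback.
header_bytes = ("Content-Disposition: %s\r\n" % value).encode("latin-1")
```

Because the percent-encoded form contains only ASCII characters, the `latin-1` encoding step in `send_headers` no longer raises, and browsers that understand `filename*` should still show the original name.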
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (6 by maintainers)
I guess this might change in the future; there is a good discussion in https://github.com/apache/airflow/issues/18010#issuecomment-912820115. The idea of separating the id from the display name in the UI will probably happen in a future release.
It’s not supported currently, but if you want to make all the changes and open a PR to make it possible in the future, I think that would be awesome @ramwin