SFTPHook cannot download large files
Apache Airflow version: 2.0.1 (should apply to earlier and later versions as well)
Environment:
- Cloud provider or hardware configuration: AWS, EC2 c5.xlarge
- OS (e.g. from /etc/os-release): Ubuntu 18.04
What happened:
In airflow.providers.sftp.hooks.sftp.SFTPHook, when we try to download a file larger than 18 MiB, the download hangs indefinitely and never completes.
What you expected to happen:
The download should have completed in seconds but did not. A file smaller than 18 MiB downloads in a few seconds.
This looks like an underlying issue in the paramiko library.
Related reports on paramiko's GitHub tracker and Stack Overflow:
- https://github.com/paramiko/paramiko/issues/926
- https://stackoverflow.com/questions/12486623/paramiko-fails-to-download-large-files-1gb
- https://stackoverflow.com/questions/3459071/paramiko-sftp-hangs-on-get
How to reproduce it:
- Create a file larger than 18 MiB
- Upload it to an SFTP server
- Use the Airflow SFTPHook to download it
- The task runs indefinitely and never finishes
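The steps above can be sketched roughly as follows. This is a minimal sketch, not a verified reproduction: `make_large_file` and `upload_then_download` are hypothetical helper names, and `hook` is assumed to be an already constructed SFTPHook pointing at a test server. Only `store_file` and `retrieve_file` are SFTPHook methods; both take the remote path first and the local path second.

```python
import os


def make_large_file(path, size_mib=20):
    """Write size_mib MiB of random bytes to path and return the size in bytes."""
    chunk = os.urandom(1024 * 1024)  # one 1 MiB block, reused to keep this fast
    with open(path, "wb") as fh:
        for _ in range(size_mib):
            fh.write(chunk)
    return os.path.getsize(path)


def upload_then_download(hook, local_src, remote_path, local_dst):
    """Upload local_src through the hook, then download it again.

    `hook` is assumed to be an SFTPHook (hypothetical setup, not shown here).
    """
    hook.store_file(remote_path, local_src)
    # This is the call reported to hang for files larger than about 18 MiB.
    hook.retrieve_file(remote_path, local_dst)
    return os.path.getsize(local_dst)
```

With the default of 20 MiB the file is just over the reported threshold. Running `upload_then_download` inside a task with an execution timeout makes the hang visible as a timeout rather than a task that never ends.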
Anything else we need to know: After some exploring, I found a solution to the problem and have applied it in my project, but it would be great if the community could dig deeper. The solution is at https://gist.github.com/vznncv/cb454c21d901438cc228916fbe6f070f. The gist is by @vznncv, and credit goes to him for coming up with it.
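For illustration only, here is a minimal sketch of a different, simpler workaround than the linked gist (it does not reproduce the gist's code): read the remote file sequentially in fixed-size chunks instead of going through the client's `get`, which avoids paramiko's prefetch. The function name `download_in_chunks` and the 32 KiB default chunk size are assumptions. `sftp` is assumed to be the object returned by `SFTPHook.get_conn()`, or any client whose `open(path, mode)` returns a readable file object. Whether this avoids the hang on a given server is not verified here, and it will be slower than a pipelined download.

```python
def download_in_chunks(sftp, remote_path, local_path, chunk_size=32768):
    """Copy remote_path to local_path using plain sequential reads.

    Returns the number of bytes written.
    """
    total = 0
    with sftp.open(remote_path, "rb") as remote, open(local_path, "wb") as local:
        while True:
            data = remote.read(chunk_size)  # one request per chunk, no prefetch
            if not data:  # empty result means end of file
                break
            local.write(data)
            total += len(data)
    return total
```

A hypothetical call would look like `download_in_chunks(hook.get_conn(), "/remote/big.bin", "/tmp/big.bin")`, where both paths are placeholders.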
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:13 (13 by maintainers)
Top GitHub Comments
I’d probably caution against introducing twisted. With Python 3.6/3.7+, the built-in asyncio can do most of what twisted does without the need for a large external dependency.
I do not think anything special is needed. Just rewrite the Hook/Operator, starting by replacing paramiko/sftp with the twisted library in the setup.py deps. Then we can release a major version of the SFTP provider with it. That’s pretty much it 😃