SFTPHook cannot download large files


Apache Airflow version: 2.0.1 (Should apply to previous versions and later ones as well)

Environment:

  • Cloud provider or hardware configuration: AWS, EC2 c5.xlarge
  • OS (e.g. from /etc/os-release): ubuntu 18.04

What happened: With airflow.providers.sftp.hooks.sftp.SFTPHook, downloading a file larger than 18 MiB hangs: the transfer keeps running indefinitely and never completes.

What you expected to happen: The download should have completed in seconds, as it does for files smaller than 18 MiB. This looks like an underlying issue in the paramiko library. Related reports on paramiko’s GitHub and Stack Overflow:

  1. https://github.com/paramiko/paramiko/issues/926
  2. https://stackoverflow.com/questions/12486623/paramiko-fails-to-download-large-files-1gb
  3. https://stackoverflow.com/questions/3459071/paramiko-sftp-hangs-on-get

How to reproduce it:

  1. Create a file larger than 18 MiB
  2. Upload it to an SFTP server
  3. Use the Airflow SFTPHook to download it
  4. The task runs forever and never finishes
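The reproduction steps above can be sketched roughly as follows. This is a minimal sketch, not part of the original report: the helper names `make_test_file` and `download_with_hook`, the 20 MiB size, and the file name are all hypothetical, and the download helper assumes the SFTP provider is installed and that a default `sftp_default` connection is configured. Uploading the file to the server (step 2) is left to whatever tool is at hand.

```python
import os
import tempfile

MIB = 1024 * 1024


def make_test_file(path, size_mib=20):
    """Write size_mib MiB of random bytes to path, 1 MiB at a time."""
    with open(path, "wb") as fh:
        for _ in range(size_mib):
            fh.write(os.urandom(MIB))
    return os.path.getsize(path)


def download_with_hook(remote_path, local_path):
    """Download remote_path with SFTPHook (step 3 of the reproduction)."""
    # Imported lazily so the file-creation part runs without Airflow installed.
    from airflow.providers.sftp.hooks.sftp import SFTPHook

    # Assumption: the hook's default connection id points at the test server.
    hook = SFTPHook()
    hook.retrieve_file(remote_path, local_path)


# Step 1: create a file above the reported 18 MiB threshold.
local_file = os.path.join(tempfile.mkdtemp(), "large_test_file.bin")
size = make_test_file(local_file)
print(f"created {local_file} ({size} bytes)")
```

Once the file is on the server, calling `download_with_hook` with its remote path should show the behavior described in step 4 on affected versions.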

Anything else we need to know: After some exploring I found a solution and have applied it in my project, but it would be great if the community could dig deeper. The solution is at https://gist.github.com/vznncv/cb454c21d901438cc228916fbe6f070f. The gist is by @vznncv, and credit goes to him for coming up with it.
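The gist itself is not reproduced here. As a separate, minimal sketch of one generic workaround, a file can be copied sequentially in fixed-size chunks through a file-like object instead of relying on a single bulk download call. The function name `copy_in_chunks` and the 32 KiB chunk size are hypothetical, and whether this avoids the hang in a given setup is an assumption that would need testing. The demo below uses in-memory buffers so it runs without an SFTP server.

```python
import io

CHUNK_SIZE = 32 * 1024  # hypothetical chunk size, tune as needed


def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst one chunk at a time and return the bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of file
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Local demonstration with in-memory buffers.
# With paramiko, src could instead be the file-like object returned by
# SFTPClient.open(remote_path, "rb") (assumption: an open SFTP session).
payload = bytes(100_000)
source = io.BytesIO(payload)
target = io.BytesIO()
copied = copy_in_chunks(source, target)
print(copied)
```

Reading sequentially trades throughput for predictability, since only one read request is outstanding at a time.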

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

2 reactions
ashb commented, Jun 14, 2021

I’d probably caution against introducing twisted – with Python 3.6/3.7+ the built-in asyncio can do most of what twisted does, without the need for a large external dependency.

2 reactions
potiuk commented, Jun 6, 2021

“@potiuk that would be definitely a good place to start. How do you want to go about this? I would definitely want to contribute to this.”

I do not think anything special is needed. Just rewrite the Hook/Operator, starting with replacing paramiko/sftp with the twisted library in the setup.py deps. Then we can release a major version of the SFTP provider with it. That’s pretty much it 😃
