
Download and save a large file as an artifact

See original GitHub issue

One of the steps of my workflow is simply downloading a large data file:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.content

Now, this fails with a MemoryError because req.content tries to read the whole file into memory. requests does have a streaming API via iter_content(), but I don’t think it can be used here because metaflow doesn’t expose a file object to write into. If I try to store the generator object itself as an artifact, that doesn’t work either:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.iter_content(chunk_size=1024)
TypeError: can't pickle generator objects
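The failure above is easy to reproduce without any network access or Metaflow: pickle refuses any generator object, so this minimal sketch (with a plain generator standing in for iter_content()) hits the same TypeError:

```python
import pickle

def chunks():
    # Stand-in for requests' iter_content(): a plain Python generator.
    yield b"data"

gen = chunks()
try:
    pickle.dumps(gen)
except TypeError as err:
    # On Python 3 the message reads "cannot pickle 'generator' object".
    print(err)
```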

Finally, I can’t use req.raw:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.raw
TypeError: cannot serialize '_io.BufferedReader' object

If you somehow exposed the file object we were writing to, I could stream each chunk of the file separately and pickle them:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    pickle.dump(chunk, fp)
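If such a file object were exposed, the chunk-pickling approach would round-trip cleanly: repeated pickle.load calls read the chunks back one at a time until EOFError. A self-contained sketch, using an in-memory buffer in place of the hypothetical fp:

```python
import io
import pickle

buf = io.BytesIO()  # stands in for the file object the framework would expose

# Dump each chunk separately, as in the loop above.
for chunk in (b"ab", b"cd", b"ef"):  # stand-in for iter_content()
    pickle.dump(chunk, buf)

# Read the chunks back until the buffer is exhausted.
buf.seek(0)
chunks = []
while True:
    try:
        chunks.append(pickle.load(buf))
    except EOFError:
        break

assert b"".join(chunks) == b"abcdef"
```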

Or ideally not use pickle at all:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    fp.write(chunk)
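Until artifacts can wrap a file object, one workaround is to stream the download to local disk yourself and keep only a path (or an upload target) as the artifact. A sketch of the writing half, with the requests call left as a comment so the helper stays testable offline (note that requests.get needs stream=True, otherwise the body is pre-loaded into memory before iter_content even runs):

```python
def write_chunks(chunk_iter, path):
    # Write an iterable of byte chunks to disk without ever holding
    # the whole payload in memory.
    with open(path, "wb") as fp:
        for chunk in chunk_iter:
            fp.write(chunk)

# With requests this would be used roughly as:
#   resp = requests.get(url, stream=True, allow_redirects=True)
#   write_chunks(resp.iter_content(chunk_size=1024 * 1024), "/tmp/large_file")
```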

Is exposing the file object, or allowing non-pickle files currently possible? If not, is it on the radar?

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 5

Top GitHub Comments

1 reaction
multimeric commented, Feb 21, 2020

Discussion on Gitter from @tuulos:

@TMiguelT internally at Netflix we rely mostly on in-memory processing. While this might not be feasible on a laptop, it works fine with the @resources decorator which allows you to request large cloud instances (e.g. with AWS Batch).

When a dataset doesn’t fit in a single instance, we shard the data.

also when it comes to handling large datasets as artifacts, we tend to store pointers to (immutable) datasets as artifacts, not the dataset itself. This is what we do e.g. with Hive tables that are often used as datasets
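That "store pointers to immutable datasets, not the dataset itself" pattern can be sketched without any cloud dependency. Here a local directory stands in for a real object store such as S3 or a Hive warehouse (the directory name and helper are hypothetical, purely for illustration); content-addressing by hash makes stored datasets effectively immutable, and the step would keep only the returned path string as its artifact:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

# Hypothetical local stand-in for a real object store (S3, Hive, ...).
DATASTORE = Path(tempfile.gettempdir()) / "datastore-sketch"

def store_dataset(src_path):
    # Content-address the file: the same bytes always map to the same key,
    # so a stored dataset can never be silently mutated in place.
    digest = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
    DATASTORE.mkdir(parents=True, exist_ok=True)
    dest = DATASTORE / digest
    if not dest.exists():
        shutil.copyfile(src_path, dest)
    # A Metaflow step would then store just this string as the artifact:
    #   self.large_file_ref = store_dataset(local_path)
    return str(dest)
```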

we are actively working on improving the data layer (related to Netflix/metaflow#4). It’d be great to hear more about your use case / size of data etc., so we can make sure it’ll be handled smoothly in upcoming releases

0 reactions
multimeric commented, Sep 29, 2021

Great! I guess that isn’t yet stable though? Are there usage examples that involve file storage anywhere?


