
Download and save a large file as an artifact

See original GitHub issue

One of the steps of my workflow is simply downloading a large data file:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.content

Now, this fails with a MemoryError because req.content tries to read the whole file into memory. requests does have a streaming API via iter_content(), but I don’t think it can be used here because metaflow doesn’t expose a file object to write into. If I try to store the generator object itself as an artifact, that doesn’t work either:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.iter_content(chunk_size=1024)
TypeError: can't pickle generator objects
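The failure above is easy to reproduce without any network access or Metaflow: pickle refuses any generator object, so this minimal sketch (with a plain generator standing in for iter_content()) hits the same TypeError:

```python
import pickle

def chunks():
    # Stand-in for requests' iter_content(): a plain Python generator.
    yield b"data"

gen = chunks()
try:
    pickle.dumps(gen)
except TypeError as err:
    # On Python 3 the message reads "cannot pickle 'generator' object".
    print(err)
```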

Finally, I can’t use req.raw:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.raw
TypeError: cannot serialize '_io.BufferedReader' object

If you somehow exposed the file object we were writing to, I could stream each chunk of the file separately and pickle them:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    pickle.dump(chunk, fp)
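If such a file object were exposed, the chunk-pickling approach would round-trip cleanly: repeated pickle.load calls read the chunks back one at a time until EOFError. A self-contained sketch, using an in-memory buffer in place of the hypothetical fp:

```python
import io
import pickle

buf = io.BytesIO()  # stands in for the file object the framework would expose

# Dump each chunk separately, as in the loop above.
for chunk in (b"ab", b"cd", b"ef"):  # stand-in for iter_content()
    pickle.dump(chunk, buf)

# Read the chunks back until the buffer is exhausted.
buf.seek(0)
chunks = []
while True:
    try:
        chunks.append(pickle.load(buf))
    except EOFError:
        break

assert b"".join(chunks) == b"abcdef"
```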

Or ideally not use pickle at all:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    fp.write(chunk)
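Until artifacts can wrap a file object, one workaround is to stream the download to local disk yourself and keep only a path (or an upload target) as the artifact. A sketch of the writing half, with the requests call left as a comment so the helper stays testable offline (note that requests.get needs stream=True, otherwise the body is pre-loaded into memory before iter_content even runs):

```python
def write_chunks(chunk_iter, path):
    # Write an iterable of byte chunks to disk without ever holding
    # the whole payload in memory.
    with open(path, "wb") as fp:
        for chunk in chunk_iter:
            fp.write(chunk)

# With requests this would be used roughly as:
#   resp = requests.get(url, stream=True, allow_redirects=True)
#   write_chunks(resp.iter_content(chunk_size=1024 * 1024), "/tmp/large_file")
```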

Is exposing the file object, or allowing non-pickle files currently possible? If not, is it on the radar?

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 5

Top GitHub Comments

1 reaction
multimeric commented, Feb 21, 2020

Discussion on Gitter from @tuulos:

@TMiguelT internally at Netflix we rely mostly on in-memory processing. While this might not be feasible on a laptop, it works fine with the @resources decorator which allows you to request large cloud instances (e.g. with AWS Batch).

When a dataset doesn’t fit in a single instance, we shard the data.

also when it comes to handling large datasets as artifacts, we tend to store pointers to (immutable) datasets as artifacts, not the dataset itself. This is what we do e.g. with Hive tables that are often used as datasets
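That "store pointers to immutable datasets, not the dataset itself" pattern can be sketched without any cloud dependency. Here a local directory stands in for a real object store such as S3 or a Hive warehouse (the directory name and helper are hypothetical, purely for illustration); content-addressing by hash makes stored datasets effectively immutable, and the step would keep only the returned path string as its artifact:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

# Hypothetical local stand-in for a real object store (S3, Hive, ...).
DATASTORE = Path(tempfile.gettempdir()) / "datastore-sketch"

def store_dataset(src_path):
    # Content-address the file: the same bytes always map to the same key,
    # so a stored dataset can never be silently mutated in place.
    digest = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
    DATASTORE.mkdir(parents=True, exist_ok=True)
    dest = DATASTORE / digest
    if not dest.exists():
        shutil.copyfile(src_path, dest)
    # A Metaflow step would then store just this string as the artifact:
    #   self.large_file_ref = store_dataset(local_path)
    return str(dest)
```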

we are actively working on improving the data layer (related to Netflix/metaflow#4). It’d be great to hear more about your use case / size of data etc., so we can make sure it’ll be handled smoothly in upcoming releases

0 reactions
multimeric commented, Sep 29, 2021

Great! I guess that isn’t yet stable though? Are there usage examples that involve file storage anywhere?


