Getting movie files into dask efficiently

  • dask-image version: 0.2.0
  • Python version: 3.7
  • Operating System: Mac OSX

Description

I’m interested in getting movie files (.mov, .mpeg, .avi; basically anything readable with ffmpeg) into dask in a nice way, i.e. something like dask_image.imread.imread but that can accept these formats.

It is possible to read these formats into Python via ffmpeg using libraries like imageio or pyav, but these tend to return video objects that expose iterators or get-frame methods. I would like a dask array that I can index lazily to get just the frames I need, and that stays highly performant.

Note there has been some discussion around this on an image.sc post I made, including caveats around attempts at full random access when looking at movie files. I am fine with caching of intermediate results to make accessing neighboring frames fast, and I’m fine if making big jumps in the movie is slow, but accessing nearby frames should be fast. I’m interested in using this for interactive movie visualisation using napari, so it is reasonable to expect that most of the time people will be looking at frames in order, but they might want to jump around, and things should be cached nicely in that case too.
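To make the caching requirement concrete, here is a minimal sketch of block-level caching in plain Python. It is an illustration of the idea only, not anything dask-image or imageio provides: `decode_block` is a hypothetical stand-in for a sequential ffmpeg decode, and `BLOCK` and the cache size are arbitrary assumed values.

```python
from functools import lru_cache

BLOCK = 32  # frames decoded together (assumed value, tune per movie)

decode_calls = []  # records which blocks were decoded, for illustration only


def decode_block(block_index):
    """Hypothetical stand-in for decoding one run of consecutive frames."""
    decode_calls.append(block_index)
    start = block_index * BLOCK
    return [f"frame-{i}" for i in range(start, start + BLOCK)]


@lru_cache(maxsize=8)  # keep the 8 most recently used blocks in memory
def cached_block(block_index):
    return decode_block(block_index)


def get_frame(i):
    """Return frame i, decoding its whole block only on a cache miss."""
    block_index, offset = divmod(i, BLOCK)
    return cached_block(block_index)[offset]


# Neighboring frames share a block, so only the first access pays for decoding.
get_frame(0)
get_frame(1)
get_frame(31)
get_frame(32)  # first frame of the next block triggers a second decode
```

With this layout, stepping through frames in order decodes each block once, while a big jump costs one block decode, which matches the access pattern described above.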

What I Did

I made some attempts at this myself modifying the dask_image.imread.imread code - see here

import imageio
import numpy as np
from dask import delayed
import dask.array as da
from dask.cache import Cache

cache = Cache(2e9)  # Leverage two gigabytes of memory
cache.register()

def dask_from_mov(path):
    vid = imageio.get_reader(path, 'ffmpeg')
    # 'size' is (width, height); reverse it and append the 3 color channels
    shape = vid.get_meta_data()['size'][::-1] + (3,)
    lazy_imread = delayed(vid.get_data)
    # One delayed task per frame, stacked into a (frames, height, width, 3) array
    frames = [da.from_delayed(lazy_imread(i), shape=shape, dtype=np.uint8)
              for i in range(vid.count_frames())]
    return da.stack(frames)

There are more code snippets and links to some .mov files in the image.sc post linked to above if people want more detail.

Overall performance of that approach was not very good. I can do some benchmarking etc., but I suspect that what I’m doing is horribly inefficient from a decoding standpoint, and that there might be a lower level of the ffmpeg reader to connect with dask. Curious if anyone here has any experience with this, or ideas?
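One direction that might reduce the per-frame overhead is to create one dask task per block of consecutive frames instead of one per frame. The following is a rough sketch under stated assumptions, not a benchmarked solution: `lazy_video`, `imageio_block_loader` and `fake_loader` are hypothetical helper names, the block size of 32 is arbitrary, and the imageio loader reuses only the `get_reader` and `get_data` calls from the snippet above. The demo at the bottom uses synthetic frames so it runs without a movie file.

```python
import numpy as np
import dask.array as da
from dask import delayed


def lazy_video(load_block, n_frames, frame_shape, block=32, dtype=np.uint8):
    """Build a lazy (n_frames, *frame_shape) array with one task per block.

    load_block(start, stop) is a hypothetical callable that returns an array
    of shape (stop - start, *frame_shape).
    """
    blocks = []
    for start in range(0, n_frames, block):
        stop = min(start + block, n_frames)
        task = delayed(load_block)(start, stop)
        blocks.append(
            da.from_delayed(task, shape=(stop - start, *frame_shape), dtype=dtype)
        )
    return da.concatenate(blocks, axis=0)


def imageio_block_loader(path):
    """Loader that decodes a run of consecutive frames with imageio (untested sketch)."""
    import imageio  # imported lazily so the synthetic demo below runs without it

    def load_block(start, stop):
        reader = imageio.get_reader(path, 'ffmpeg')
        try:
            return np.stack([reader.get_data(i) for i in range(start, stop)])
        finally:
            reader.close()

    return load_block


# Synthetic demo: frame i is filled with the value i.
def fake_loader(start, stop):
    return np.stack(
        [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(start, stop)]
    )


video = lazy_video(fake_loader, n_frames=70, frame_shape=(4, 6, 3), block=32)
frame_40 = video[40].compute()  # only the block holding frame 40 is loaded
```

Opening a reader inside each task avoids sharing one reader object across tasks, at the cost of a seek per block. Whether this is actually faster for a given codec would need the benchmarking mentioned above.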

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

8 reactions
danielballan commented, Mar 13, 2020

Thanks, that’s useful to hear. PIMS and napari have started talking more recently and I expect that to continue, so hopefully we can work together to smooth this out.

4 reactions
sofroniewn commented, Mar 13, 2020

That would be great @danielballan - the next napari release (0.3.0, which should be out in < 2 weeks) will be the first one that supports the addition of reader plugins by @tlambert03. We’ve got the basic machinery merged into master and are now working on a few details and documentation, see https://github.com/napari/napari/pull/1030. At that point I’d love to see both PIMS and dask-image be able to load data into napari via our plugin mechanism (which is hopefully pretty lightweight and not too far from where you are now).
