
dask_image imread performance issue

See original GitHub issue

Dear dask_image community,

I am a new dask_image user. Perhaps, due to my beginner level, I am doing something wrong, but I noticed that reading a collection of images using dask_image is much slower than using single-threaded skimage. I have installed the latest dask_image release available on PyPI (dask_image version 0.4.0). In the following example I am reading 398 images, all with the same dimensions (64x10240, uint16). Taking into account the dimensions and number of images, I would expect dask_image to be slightly slower than single-threaded skimage (due to the small dask overhead involved in opening this many tiny images), but instead dask_image is much slower (around 24x). I then implemented the image-reading function in pure dask, and its performance is much better than the one obtained with dask_image. Below are the benchmark results (all of the following code snippets load the same data successfully):

import glob

import numpy as np
import skimage.io
import dask_image.imread
from dask import delayed
import dask.array as da


Single-threaded-skimage baseline

%%time
all_images = sorted(glob.glob(f"{path_images}/*.tif"))
array_images = np.zeros((len(all_images), 64, 10240), dtype=np.uint16)
for idx, image in enumerate(all_images):
    array_images[idx] = skimage.io.imread(image)

Elapsed time: 510 milliseconds


Using dask_image

%%time
using_dask_image = dask_image.imread.imread(f"{path_images}/*.tif")
array_dask_image = using_dask_image.compute()

Elapsed time: 12.1 seconds


Using pure-dask

%%time
lazy_imread = delayed(skimage.io.imread)  # lazy reader
lazy_arrays = [lazy_imread(image) for image in all_images]
dask_arrays = [
    da.from_delayed(delayed_reader, shape=(64, 10240), dtype=np.uint16)
    for delayed_reader in lazy_arrays
]
using_dask = da.stack(dask_arrays, axis=0).compute()

Elapsed time: 1.09 seconds
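
For reuse, the pure-dask pattern above can be wrapped in a small helper. This is a sketch under the same assumption the snippet makes — every file shares one known shape and dtype; the name `lazy_tif_stack` is mine, and any reader that returns a NumPy array could stand in for `skimage.io.imread`:

```python
import glob

import dask.array as da
import skimage.io
from dask import delayed


def lazy_tif_stack(pattern, shape, dtype):
    """Stack all files matching `pattern` into one lazy dask array.

    Assumes every file has the given `shape` and `dtype`.
    """
    files = sorted(glob.glob(pattern))
    arrays = [
        da.from_delayed(delayed(skimage.io.imread)(f), shape=shape, dtype=dtype)
        for f in files
    ]
    return da.stack(arrays, axis=0)  # shape (n_files, *shape); nothing read yet
```

Calling the helper only builds the task graph; no file is read from disk until `.compute()` is called on the result.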


Using dask-image with synchronous scheduler

%%time
array_dask_image = using_dask_image.compute(scheduler="synchronous")

Elapsed time: 3 seconds


Using dask-image with processes scheduler

%%time
array_dask_image = using_dask_image.compute(scheduler="processes")

Elapsed time: 6.63 seconds


Using dask-image with threads scheduler

%%time
array_dask_image = using_dask_image.compute(scheduler="threads")

Elapsed time: 12 seconds


Environment:

  • Dask-Image version: 0.4.0
  • Dask version: 2021.01.0
  • Python version: 3.8.3
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): pip

Thank you very much for all your help 😉

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 24 (8 by maintainers)

Top GitHub Comments

1 reaction
FirefoxMetzger commented, Mar 13, 2021

One thing I wish we had was a lightweight, minimal dependency way to install just an image reader to get stuff into a dask array. I’m not sure if there’s a really clean way to handle this.

@GenevieveBuckley pip install imageio 😛 Currently it depends on numpy and pillow (which covers common formats like .jpg, .png, .gif), but if pillow is too much, there is the option to install it using pip install --no-dependencies imageio and then install only the plugins you actually want/need.

One can also envision a dask plugin/wrapper to allow loading an image straight into a dask array and avoid a copy. I’m thinking about doing this for pytorch and tensorflow at some point down the line (loading directly into pinned memory), because my current lab does a lot of deep learning with images. __array_interface__ may already make this trivial for dask, because imageio returns a numpy array.
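
The `__array_interface__` point can be checked directly: since imageio hands back a plain NumPy array, `da.from_array` can wrap its buffer without a copy. A small sketch — the `arange` array below is a stand-in for what an `imageio.imread` call would return:

```python
import numpy as np
import dask.array as da

# Stand-in for the NumPy array that imageio.imread would return.
img = np.arange(12, dtype=np.uint8).reshape(3, 4)

# A single chunk covering the whole array wraps the existing buffer;
# no pixel data is copied while building the dask array.
lazy = da.from_array(img, chunks=img.shape)
assert lazy.shape == img.shape and lazy.dtype == img.dtype
```

With smaller `chunks`, the same call instead slices the array into blocks that downstream operations can process in parallel.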

For what it’s worth, I’m collaborating with a group who are working with .mp4 video data so this is relevant for me now too.

This is also something that can be done with imageio 🤩. We maintain an ffmpeg wrapper (pip install imageio[ffmpeg]), which can then read videos frame-by-frame or as an image stack via the familiar API. It will get even better once we have https://github.com/imageio/imageio/pull/574 merged. In the future, I hope to replace this with a wrapper around av, because there is little point in duplicating the effort of cleanly wrapping and shipping ffmpeg.

Thoughts and feature requests are (of course) appreciated.

Does it? I got the impression it was more complicated than that.

@jakirkham It will 👼 as soon as I get https://github.com/imageio/imageio/pull/574 merged and find the time to write the wrapper for skimage.

Yeah we’ve discussed with imageio before if they could provide the shape and dtype without loading the full image […]

Metadata for images is a never-ending story xD I think the reason there is no clear standard for it yet in imageio is that every format has its own set of metadata, so it is non-trivial to find common ground that we can guarantee to provide for all formats. For me specifically, the user-side is a bit of a black box, because I’ve not really seen use-cases yet where people actively consume metadata; then again I’ve only recently joined this particular corner of the internet, so there is a lot I may not know (yet).

1 reaction
jni commented, Mar 12, 2021

I do have opinions here!

(1) skimage.io will eventually become a thin wrapper around imageio.
(2) imageio will start to do smarter things around lazy loading; see https://github.com/imageio/imageio/issues/569, https://github.com/imageio/imageio/pull/574, and the links therein.
(3) Overall there should be a community-wide effort around this; see https://blog.danallan.com/posts/2020-03-07-reader-protocol/

None of this is particularly helpful re dask-image’s present choice, except to say that maybe some/all of the effort in this discussion should go towards those issues rather than towards adding Yet Another way of wrapping wrappers around IO libraries.

Re tifffile, for reading tiffs it always boils down to that (whether you’re using imageio or skimage.io), so if you want to do lazy loading of big tiffs I suggest implementing it on top of tifffile directly — it certainly has that capability, no need for PIMS here.
