dask_image imread performance issue
Dear dask_image community,
I am a new dask_image user. Maybe, due to my beginner level, I am doing something wrong, but I noticed that reading a collection of images using dask_image is much slower than using single-threaded skimage. I have installed the latest dask_image version available on PyPI (dask_image version 0.4.0). In the following example, I am reading 398 images, all of them with the same dimensions (64x10240, uint16). Taking into account the dimensions and number of images, I would expect dask_image to be slightly slower than single-threaded skimage (due to the small dask overhead involved in opening this modest number of "tiny images"), but instead dask_image is much slower (around 24x). I then implemented the image-reading function in pure dask, and its performance is much better than the one obtained with dask_image. Below are the benchmark results (all of the following code snippets load the same data successfully):
import glob
import numpy as np
import skimage.io
import dask_image.imread
from dask import delayed
import dask.array as da
Single-threaded-skimage baseline
%%time
all_images = sorted(glob.glob(f"{path_images}/*.tif"))
array_images = np.zeros((len(all_images), 64, 10240), dtype=np.uint16)
for idx, image in enumerate(all_images):
    array_images[idx] = skimage.io.imread(image)
Elapsed time: 510 milliseconds
Using dask_image
%%time
using_dask_image = dask_image.imread.imread(f"{path_images}/*.tif")
array_dask_image = using_dask_image.compute()
Elapsed time: 12.1 seconds
Using pure-dask
%%time
lazy_imread = delayed(skimage.io.imread) # lazy reader
lazy_arrays = [lazy_imread(image) for image in all_images]
dask_arrays = [
    da.from_delayed(delayed_reader, shape=(64, 10240), dtype=np.uint16)
    for delayed_reader in lazy_arrays
]
using_dask = da.stack(dask_arrays, axis=0).compute()
Elapsed time: 1.09 seconds
Using dask-image with synchronous scheduler
%%time
array_dask_image = using_dask_image.compute(scheduler="synchronous")
Elapsed time: 3 seconds
Using dask-image with processes scheduler
%%time
array_dask_image = using_dask_image.compute(scheduler="processes")
Elapsed time: 6.63 seconds
Using dask-image with threads scheduler
%%time
array_dask_image = using_dask_image.compute(scheduler="threads")
Elapsed time: 12 seconds
Environment:
- Dask-Image version: 0.4.0
- Dask version: 2021.01.0
- Python version: 3.8.3
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): pip
Thank you very much for all your help 😉
Issue Analytics
- Created 3 years ago
- Comments: 24 (8 by maintainers)
Top GitHub Comments
@GenevieveBuckley
pip install imageio
😛 Currently, it depends on numpy and pillow, but if pillow is too much (it covers common formats like .jpg, .png, .gif), there is the option to install it using pip install --no-dependencies imageio and then only install the plugins you actually want/need.
One can also envision a dask plugin/wrapper to allow loading the image straight into a dask array and avoid a copy. I’m thinking about doing this for pytorch and tensorflow at some point down the line (directly load into pinned memory) because my current lab does a lot of deep learning with images.
__array_interface__ may already do this trivially for dask, because imageio returns a numpy array.
This is also something that can be done with imageio 🤩. We maintain an ffmpeg wrapper (pip install imageio[ffmpeg]), which can then read videos frame-by-frame or as an image stack via the familiar API. It will get even better once we have https://github.com/imageio/imageio/pull/574 merged. In the future, I hope to replace this with a wrapper around av, because there is little point in duplicating the effort of cleanly wrapping and shipping ffmpeg.
Thoughts and feature requests are (of course) appreciated.
@jakirkham It will 👼 as soon as I get https://github.com/imageio/imageio/pull/574 merged and find the time to write the wrapper for skimage.
Metadata for images is a never-ending story xD I think the reason there is no clear standard for it yet in imageio is that every format has its own set of metadata, so it is non-trivial to find common ground that we can guarantee to provide for all formats. For me specifically, the user-side is a bit of a black box, because I’ve not really seen use-cases yet where people actively consume metadata; then again I’ve only recently joined this particular corner of the internet, so there is a lot I may not know (yet).
I do have opinions here!
(1) skimage.io will eventually become a thin wrapper around imageio. (2) imageio will start to do smarter things around lazy loading, see https://github.com/imageio/imageio/issues/569, https://github.com/imageio/imageio/pull/574, and the links therein. (3) overall there should be a community-wide effort around this, see https://blog.danallan.com/posts/2020-03-07-reader-protocol/
None of this is particularly helpful re dask-image’s present choice, except to say that maybe some/all of the effort in this discussion should go towards those issues rather than towards adding Yet Another way of wrapping wrappers around IO libraries.
Re tifffile, for reading tiffs it always boils down to tifffile in the end (whether you’re using imageio or skimage.io), so if you want to do lazy loading of big tiffs I suggest implementing it on top of tifffile directly — it certainly has that capability, no need for PIMS here.