Better support for Spark image type

I have a Spark DataFrame with two image columns (read into Spark using spark.read.format("image").load(PATH)) and some columns containing strings and arrays. The image type is a struct with some identifying information plus the image data in binary format (more information on the image type: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/image).

The issue is that I could not find a codec among the petastorm codecs that works for this type of data. I also tried extracting the 'image.data' column to get the binary data and then dropping the 'image' column, but this did not fix the issue either. I used CompressedImageCodec('jpeg') in the Unischema that I used with materialize_dataset(), and the error I get is DecodeFieldError: Decoding field "image_left_data" failed, where image_left_data is the field that contains the binary data.

I then decided to stop using materialize_dataset() and instead write the DataFrame directly to ‘normal’ parquet and load the data using BatchedDataLoader(make_batch_reader()) so it can be used in PyTorch. I still have some issues getting the binary image data into a tensor, but I think those can be ironed out easily.
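For readers who hit the same step, here is a minimal sketch of turning the binary data into an array that PyTorch can consume. It assumes what the Spark image data source documentation describes: the data field holds decoded pixels, stored row-wise in BGR(A) channel order, next to the height, width and nChannels fields. It further assumes 8-bit unsigned pixels. The function name spark_image_to_array is hypothetical and is not part of Spark or petastorm.

```python
import numpy as np


def spark_image_to_array(data: bytes, height: int, width: int, n_channels: int) -> np.ndarray:
    """Convert the raw `data` bytes of a Spark image struct to an HxWxC uint8 array.

    Assumption: 8-bit unsigned pixels stored row-wise in BGR(A) order.
    """
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")

    # Interpret the buffer as unsigned bytes and restore the image shape.
    arr = np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)

    if n_channels == 3:
        arr = arr[..., ::-1]          # BGR -> RGB
    elif n_channels == 4:
        arr = arr[..., [2, 1, 0, 3]]  # BGRA -> RGBA

    # Copy so the result is writable and contiguous (np.frombuffer is read-only,
    # and the channel flip above produces a negatively strided view).
    return arr.copy()


# Tiny 1x2 BGR example: pixel 0 is (B=1, G=2, R=3), pixel 1 is (B=4, G=5, R=6).
raw = bytes([1, 2, 3, 4, 5, 6])
rgb = spark_image_to_array(raw, height=1, width=2, n_channels=3)
print(rgb.shape)  # (1, 2, 3)
```

The returned array can then be passed to torch.from_numpy() and permuted to channel-first layout if the model expects that. This layout would also explain why a JPEG codec cannot decode the field: under the assumption above the bytes are already decoded pixels, not a compressed JPEG stream.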

I prefer not to use make_spark_converter(), as that results in a complete write of the dataset to temp storage every time I run the script. Since writing data is the main cost in my environment, I would rather write out the dataset in (petastorm-)parquet format just once.

More native support for the Spark image type (easy write-out, a specific codec, etc.) would make this task a lot easier for me, and I expect others who work with large image datasets would also benefit a lot from it, so I hope you can provide more information here.

I hope I made my problem and its context clear, but don’t hesitate to ask for more information if you need it! Also, if I am doing something wrong with e.g. codecs or something else, or if I should import the images with a different datatype to begin with, then help is also very welcome 😃.

Top GitHub Comments

selitvin commented, Jun 1, 2021

Thank you for your detailed message on your solution. Until we come up with better documentation, these examples can serve users who are looking for more information.

RobindeGrootNL commented, Jun 1, 2021

I’ll close this issue now, just send me a message if you need more info!