Better support for Spark image type
I have a Spark dataframe with two image columns (read into Spark using spark.read.format("image").load(PATH)) and some columns containing strings and arrays. This image type is a struct with some identifying information and the data in binary format, as per the schema below (more information on the image type here: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/image).
The issue is that I could not find a codec among the petastorm codecs that works for this type of data. I also tried extracting the 'image.data' column to get the data in binary format and then dropping the 'image' column, but this did not fix the issue either. The error I get is DecodeFieldError: Decoding field "image_left_data" failed. That field contains the binary data, and I used CompressedImageCodec('jpeg') for it in the Unischema that I used with materialize_dataset().
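One possible explanation for the decode failure, offered as an assumption rather than something confirmed in this thread: Spark's image data source stores decoded pixel bytes in the data field (row-wise, in OpenCV-compatible order), not a compressed JPEG stream, so a codec that expects JPEG bytes has nothing valid to decode. The minimal sketch below checks whether a byte string starts with the JPEG start-of-image marker. The helper name looks_like_jpeg and the sample byte strings are hypothetical and only serve as illustration.

```python
# JPEG streams begin with the two-byte start-of-image (SOI) marker.
JPEG_SOI = b"\xff\xd8"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the bytes start with the JPEG start-of-image marker."""
    return data[:2] == JPEG_SOI


# Hypothetical sample: a 2x2 image stored as raw BGR pixel bytes, which is
# the kind of content the 'data' field of a Spark image struct holds.
raw_pixels = bytes([0, 0, 255] * 4)

# Hypothetical sample: only the first bytes of a JPEG stream, for illustration.
jpeg_header = b"\xff\xd8\xff\xe0" + b"\x00" * 16

print(looks_like_jpeg(raw_pixels))
print(looks_like_jpeg(jpeg_header))
```

Running such a check on a few values of the extracted binary column would show whether the stored bytes are compressed at all before choosing a codec.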
Then I decided to stop using materialize_dataset() and instead write the dataframe directly into ‘normal’ parquet and load the data using BatchedDataLoader(make_batch_reader()) so it can be used in PyTorch. I still have some issues getting the binary-formatted image into a tensor, but I think that can be ironed out easily.
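For the remaining step of turning the binary image data into an array, here is a minimal sketch. It assumes 8-bit images and that the height, width and nChannels fields of the Spark image struct were kept alongside the data bytes. It also relies on the documented layout of the data field (row-wise, BGR channel order in most cases). The function name spark_image_to_array is hypothetical.

```python
import numpy as np


def spark_image_to_array(height, width, n_channels, data):
    """Reshape raw Spark image bytes into a writable (H, W, C) uint8 array.

    Assumes 8-bit pixels stored row-wise in BGR (or BGRA) channel order.
    """
    arr = np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)
    if n_channels == 3:
        arr = arr[:, :, ::-1]          # BGR -> RGB
    elif n_channels == 4:
        arr = arr[:, :, [2, 1, 0, 3]]  # BGRA -> RGBA
    # np.frombuffer returns a read-only view, so copy to get a writable,
    # C-contiguous array.
    return arr.copy()


# Hypothetical 1x2 image: one blue pixel and one green pixel in BGR order.
sample = bytes([255, 0, 0, 0, 255, 0])
image = spark_image_to_array(1, 2, 3, sample)
print(image.shape)
```

From there, torch.from_numpy(image).permute(2, 0, 1) gives a channels-first tensor, which is the layout most PyTorch vision models expect.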
I prefer not to use make_spark_converter(), as that results in a complete write of the dataset to temporary storage every time I run the script. Since writing data is the main cost in my environment, I would rather write out the dataset in (petastorm-)parquet format just once.
Having more native support for the Spark image type (easy write-out, a specific codec, etc.) would make this task a lot easier for me, and I expect others who work with large image datasets would also benefit a lot from this, so I hope you can provide more information here.
I hope I made my problem and its context clear, but don’t hesitate to ask for more information if you need it! Also, if I am doing something wrong with e.g. codecs, or if I should import the images with a different datatype to begin with, then help is also very welcome 😃.
Issue Analytics
- State:
- Created 2 years ago
- Comments:6
Top GitHub Comments
Thank you for your detailed message on your solution. Until we come up with better documentation, these examples can serve users who are looking for more information.
I’ll close this issue now, just send me a message if you need more info!