Better support for Spark image type

I have a Spark DataFrame with two image columns (read into Spark using spark.read.format("image").load(PATH)), plus some columns containing strings and arrays. The image type is a struct that holds some identifying information and the image data in binary format, as shown in the schema below (more information on the image type: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/image).

[Image: schema of the Spark image struct]

The issue is that I could not find a codec among the petastorm codecs that works for this type of data. I also tried extracting the 'image.data' field into its own column, so that I only have the binary data, and then dropping the 'image' column, but this did not fix the issue either. The error I get is DecodeFieldError: Decoding field "image_left_data" failed, where image_left_data is the column that contains the binary data. I used CompressedImageCodec('jpeg') for that field in the Unischema that I used with materialize_dataset().
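One possible explanation for the DecodeFieldError (an assumption, not something confirmed in this thread): Spark's image data source decodes each file into raw pixel bytes, whereas CompressedImageCodec('jpeg') expects encoded JPEG bytes. A minimal sketch for checking what a binary column actually holds is shown below. The helper names sniff_image_bytes and looks_like_raw_pixels are hypothetical, and the length check assumes 8-bit channels.

```python
# Hypothetical helpers for inspecting one value of a binary image column.
JPEG_MAGIC = b"\xff\xd8\xff"          # start-of-image marker of a JPEG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"      # fixed 8-byte PNG signature


def sniff_image_bytes(data: bytes) -> str:
    """Return 'jpeg' or 'png' if the bytes start with a known signature, else 'unknown'."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return "unknown"


def looks_like_raw_pixels(data: bytes, height: int, width: int, n_channels: int) -> bool:
    """Heuristic: no known file signature and exactly one byte per channel per pixel.

    Assumes 8-bit channels; other pixel depths would need a different size check.
    """
    return (
        sniff_image_bytes(data) == "unknown"
        and len(data) == height * width * n_channels
    )
```

If looks_like_raw_pixels returns True for a row's data, height, width and nChannels fields, a codec that expects compressed input is unlikely to decode that column.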

I then decided to stop using materialize_dataset() and instead write the DataFrame directly to ‘normal’ Parquet and load the data with BatchedDataLoader(make_batch_reader()) so it can be used in PyTorch. I still have some issues getting the binary image data into a tensor, but I think those can be ironed out easily.
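For the binary-to-tensor step, here is a minimal sketch rather than the approach used in this thread. It assumes the data field holds raw 8-bit pixels in row-major, OpenCV-style channel order (BGR or BGRA), as described in the Spark image data source documentation. The function names spark_image_to_array and bgr_to_rgb_chw are hypothetical.

```python
import numpy as np


def spark_image_to_array(height: int, width: int, n_channels: int, data: bytes) -> np.ndarray:
    """Interpret raw image bytes as a (height, width, channels) uint8 array.

    Assumes 8-bit unsigned channels; other pixel depths need another dtype.
    """
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    # frombuffer returns a read-only view over the bytes (no copy yet).
    return np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)


def bgr_to_rgb_chw(arr: np.ndarray) -> np.ndarray:
    """Reorder BGR(A) to RGB(A) and move channels first, as PyTorch models usually expect."""
    n_channels = arr.shape[2]
    if n_channels == 3:
        arr = arr[:, :, ::-1]            # BGR -> RGB
    elif n_channels == 4:
        arr = arr[:, :, [2, 1, 0, 3]]    # BGRA -> RGBA
    # HWC -> CHW; copy() yields a writable, contiguous array.
    return arr.transpose(2, 0, 1).copy()
```

The resulting array could then be handed to torch.from_numpy. With make_batch_reader, one option might be to apply these functions per row inside a TransformSpec function, though that wiring is not shown here.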

I prefer not to use make_spark_converter(), as that results in a complete write of the dataset to temporary storage every time I run the script. Since writing data is the main cost in my environment, I would rather write the dataset out in (petastorm-)Parquet format just once.

More native support for the Spark image type (easy write-out, a specific codec, etc.) would make this task a lot easier for me, and I expect others who work with large image datasets would benefit as well, so I hope you can provide more information here.

I hope I made my problem and its context clear, but don’t hesitate to ask for more information if you need it! Also, if I am doing something wrong with, e.g., codecs, or if I should import the images with a different data type to begin with, help is very welcome 😃.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
selitvin commented, Jun 1, 2021

Thank you for your detailed message on your solution. Until we come up with better documentation, these examples can serve users who are looking for more information.

0 reactions
RobindeGrootNL commented, Jun 1, 2021

I’ll close this issue now; just send me a message if you need more info!
