Reading millions of images from s3
See original GitHub issue

When attempting to read millions of images from S3 (all in a single bucket) with readImages, the command just hangs for several hours. Is this expected? Are there any best practices for using readImages with millions of images in S3?
Issue Analytics
- State:
- Created: 6 years ago
- Comments: 12 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I dug into this more. For posterity, and for anyone who comes across this issue, here's the explanation and code for a workaround that appears to be at least 20x faster.

Internally in readImages, filesToDF calls the Spark context's binaryFiles function. That, in turn, makes a new BinaryFileRDD. Crucially, the comment in that code indicates that when the parallelization happens, all of the processing runs on the driver, just parallelized over the processes that the JVM has. That's why this is so unbelievably slow: the driver is used to collect all of the stats on the millions of files, solely to calculate the number of partitions, and only then is the work farmed out to the workers to actually read the files. The driver collecting stats on the files is the bottleneck.
Instead, I wrote code to do the following. I have a dataframe with the S3 paths. I run a Python function in a map which uses boto3 to grab each file directly from S3 on the worker, decode the image data, and assemble the same type of dataframe as readImages. Here's the code, more or less in its entirety, to read and decode the images and then just write them to Parquet. It ran over 100k images in S3 on 40 nodes in 2.6 minutes, instead of the 50 minutes that the vanilla readImages took.

Hi @mdagost and @thunterdb,
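For readers reconstructing the approach described above, here is a minimal sketch of the boto3-on-the-workers pattern. All of the names (helper functions, column names, the image schema layout) are illustrative assumptions, not the author's original code:

```python
from io import BytesIO

def split_s3_path(path):
    """'s3://bucket/some/key.jpg' -> ('bucket', 'some/key.jpg')."""
    bucket, key = path.replace("s3://", "", 1).split("/", 1)
    return bucket, key

def fetch_and_decode(path):
    """Runs on a Spark worker: download one object with boto3 and decode it."""
    import boto3                      # imported on the worker, not the driver
    from PIL import Image
    from pyspark.sql import Row
    bucket, key = split_s3_path(path)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    img = Image.open(BytesIO(body)).convert("RGB")
    return Row(filePath=path, height=img.height, width=img.width,
               nChannels=3, data=bytearray(img.tobytes()))

def decode_to_parquet(spark, s3_paths, out_path):
    """Map the download/decode over the workers; write the result to Parquet.

    The driver only ever handles the list of path strings, never the file
    contents or per-file stats, which sidesteps the binaryFiles bottleneck.
    """
    paths_df = spark.createDataFrame([(p,) for p in s3_paths], ["path"])
    image_df = paths_df.rdd.map(lambda r: fetch_and_decode(r.path)).toDF()
    image_df.write.parquet(out_path)
```

With this shape, each executor fetches and decodes its own partition of paths; only the path strings pass through the driver.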
I am very interested in the function that @mdagost provided for feeding image data from S3 to train a model, since it can construct image_df as an image data source.
However, with the current version of sparkdl, that code no longer works, because the _decodeImage function is no longer in imageIO (it was there in the pysparkdl 0.2.0 documentation). I have also tried imageArrayToStruct, but it did not work either:

decoded = imageArrayToStruct(bytearray(contents))
…schema = StructType([StructField("filePath", StringType(), False), StructField("image", ImageSchema)])
I received the following error:

File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 406, in __init__
AssertionError: dataType should be DataType
I assume it was due to the way we are getting the image bytearray. I am planning to try the following code later tonight to read the bytearray from an S3 object (a jpg file):
img = image.load_img(BytesIO(obj.get()['Body'].read()), target_size=(224, 224))
So I would like to confirm with @thunterdb that we can currently construct the image_df regardless of where the images are located, right? I am constructing the file paths from object keys that sit in different subfolders of my S3 bucket.
And how can I export the trained weights of an Inception model after transfer learning?
Thank you, Heng
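On the last question, exporting weights after transfer learning: if the fitted model ends up as a Keras model (as in sparkdl's Keras-based examples), Keras's own weight-saving API applies. This is a generic Keras sketch, not sparkdl-specific; the tiny placeholder model stands in for a transfer-learned network, and the filename is arbitrary:

```python
from tensorflow import keras

# Placeholder model standing in for a transfer-learned network.
model = keras.Sequential([keras.layers.Dense(2, activation="softmax")])
model.build(input_shape=(None, 4))

# Save only the weights; the architecture is rebuilt in code on reload.
model.save_weights("transfer.weights.h5")

# Later: rebuild the same architecture and restore the saved weights.
restored = keras.Sequential([keras.layers.Dense(2, activation="softmax")])
restored.build(input_shape=(None, 4))
restored.load_weights("transfer.weights.h5")
```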