
Reading millions of images from s3

See original GitHub issue

When attempting to read millions of images from s3 (all in a single bucket) with readImages, the command just hangs for several hours. Is this expected? Are there any best practices for how to use readImages with millions of images in s3?
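
For context, the failing call looks roughly like this (a minimal sketch; the bucket path is hypothetical and assumes sparkdl 0.2.x):

from sparkdl import readImages

# hangs for hours when the bucket holds millions of objects
image_df = readImages("s3a://my-bucket/")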

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 12 (3 by maintainers)

Top GitHub Comments

6 reactions
mdagost commented, Oct 27, 2017

I dug into this more. For posterity and for anyone who comes across this issue, here’s the explanation and code for a workaround that appears to be at least 20x faster.

Internally in readImages, filesToDF calls the spark context’s binaryFiles function. That, in turn, makes a new BinaryFileRDD. That code has this comment:

// setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
// traversing a large number of directories and files. Parallelize it.

Crucially, when the parallelization happens, it uses

Runtime.getRuntime.availableProcessors().toString

which indicates that all of that processing happens on the driver, parallelized only across the threads available to the driver’s JVM. That’s why this is so unbelievably slow: the driver collects stats on all of the millions of files solely to calculate the number of partitions, and only then is the work farmed out to the workers to actually read the files. The driver collecting stats on the files is the bottleneck.
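
In other words, the slow path amounts to something like the following (a sketch of what readImages triggers via filesToDF; the bucket path is hypothetical):

# binaryFiles has to list and stat every object up front, and that listing
# runs on the driver before any work reaches the executors
raw_rdd = sc.binaryFiles("s3a://my-bucket/")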

Instead, I wrote code to do the following. I have a dataframe with the s3 paths. I run a python function in a map which uses boto3 to directly grab the file from s3 on the worker, decode the image data, and assemble the same type of dataframe as readImages.

Here’s the code, more or less in its entirety, to read and decode the images and then just write them to parquet. It ran over 100k images in s3 on 40 nodes in 2.6 minutes instead of the 50 minutes that the vanilla readImages took.

from sparkdl.image.imageIO import _decodeImage, imageSchema
from pyspark.sql.types import StructType, StructField, StringType, BinaryType
from pyspark.sql.functions import udf

# this function will use boto3 on the workers directly to pull the image
def readFileFromS3(row):
  import boto3
  import os

  s3 = boto3.client('s3')
  
  filePath = row.image_url
  # strip off the starting s3a:// from the bucket
  bucket = os.path.dirname(str(filePath))[6:]
  key = os.path.basename(str(filePath))
  
  response = s3.get_object(Bucket=bucket, Key=key)
  body = response["Body"]
  contents = bytearray(body.read())
  body.close()

  if len(contents):
    return (filePath, bytearray(contents))

# rows_df is a dataframe with a single string column called "image_url" that has the full s3a filePath
# Running rows_df.rdd.take(2) gives the output
# [Row(image_url=u's3a://mybucket/14f89051-26b3-4bd9-88ad-805002e9a7c5'),
# Row(image_url=u's3a://mybucket/a47a9b32-a16e-4d04-bba0-cdc842c06052')]

# farm out our images to the workers with a map
images_rdd = (
  rows_df
  .rdd
  .map(readFileFromS3)
)

# convert our rdd to a dataframe and then
# use a udf to decode the image; the schema comes from sparkdl.image.imageIO
schema = StructType([StructField("filePath", StringType(), False),
                     StructField("fileData", BinaryType(), False)])

decodeImage = udf(_decodeImage, imageSchema)

image_df = (
  images_rdd
  .toDF(schema)
  .select("filePath", decodeImage("fileData").alias("image"))
)

(
  image_df
  .write
  .format("parquet")
  .mode("overwrite")
  .option("compression", "gzip")
  .save("s3://my_bucket/images.parquet")
)
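
For completeness, if you don’t already have the paths from elsewhere, rows_df itself can be built by listing the bucket keys with boto3 and handing the resulting paths to Spark, along these lines (a sketch; the bucket name is hypothetical):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# collect every object key and turn it into the s3a:// path expected above
paths = []
for page in paginator.paginate(Bucket='mybucket'):
  for obj in page.get('Contents', []):
    paths.append(('s3a://mybucket/' + obj['Key'],))

rows_df = spark.createDataFrame(paths, ['image_url'])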

0 reactions
heng2j commented, Oct 1, 2018

Hi @mdagost and @thunterdb

I am very interested in the function that @mdagost provided for feeding image data from S3 to train a model, since the function above can construct image_df as an image data source.

However, with the current version of sparkdl, it seems the code no longer applies because the “_decodeImage” function is no longer in imageIO. It was there in the pysparkdl 0.2.0 documentation.

I have also tried to use imageArrayToStruct, but that did not work either.

decoded = imageArrayToStruct(bytearray(contents))
schema = StructType([StructField("filePath", StringType(), False),
                     StructField("image", ImageSchema)])

I received the following error.

File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 406, in __init__
AssertionError: dataType should be DataType

I assume it was due to the way we are getting the image bytearray. I am planning to try the following code later tonight to read the bytearray from an S3 object (a jpg file).

img = image.load_img(BytesIO(obj.get()['Body'].read()), target_size=(224, 224))
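
If _decodeImage is gone in newer sparkdl releases, one possible workaround is to decode the bytes directly into Spark’s OpenCV-style image struct with PIL and numpy. The sketch below is only an illustration under that assumption; decode_to_image_struct is a made-up helper name, and the field layout mirrors the image schema introduced in Spark 2.3 (origin, height, width, nChannels, mode, data):

import numpy as np
from io import BytesIO
from PIL import Image
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

# struct matching the (origin, height, width, nChannels, mode, data) image layout
image_struct = StructType([
  StructField("origin", StringType(), True),
  StructField("height", IntegerType(), False),
  StructField("width", IntegerType(), False),
  StructField("nChannels", IntegerType(), False),
  StructField("mode", IntegerType(), False),
  StructField("data", BinaryType(), False)])

def decode_to_image_struct(origin, contents):
  # decode raw JPEG/PNG bytes and reorder RGB -> BGR, as the OpenCV-style struct expects
  arr = np.asarray(Image.open(BytesIO(contents)).convert("RGB"), dtype=np.uint8)[:, :, ::-1]
  height, width, n_channels = arr.shape
  # tuple follows image_struct's field order; 16 is OpenCV's CV_8UC3 (8-bit, 3 channels)
  return (origin, height, width, n_channels, 16, bytearray(arr.tobytes()))

A struct built this way could stand in for the decodeImage("fileData") step in the earlier snippet, for example by mapping the (filePath, fileData) pairs through it before calling toDF with a schema of ("filePath", StringType()) and ("image", image_struct).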

So I would like to confirm with @thunterdb that we can currently construct the image_df regardless of where the images are located, right? I am constructing the file paths from object keys that sit in different subfolders of my S3 bucket.

And how can I export the trained weights of an Inception model after transfer learning?

Thank you, Heng
