
Reading millions of images from s3

See original GitHub issue

When attempting to read millions of images from s3 (all in a single bucket) with readImages, the command just hangs for several hours. Is this expected? Are there any best practices for how to use readImages with millions of images in s3?
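
For context, the failing call looks roughly like this (a minimal sketch; the bucket path is hypothetical and assumes sparkdl 0.2.x):

from sparkdl import readImages

# hangs for hours when the bucket holds millions of objects
image_df = readImages("s3a://my-bucket/")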

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 12 (3 by maintainers)

Top GitHub Comments

6 reactions
mdagost commented, Oct 27, 2017

I dug into this more. For posterity and for anyone who comes across this issue, here’s the explanation and code for a workaround that appears to be at least 20x faster.

Internally in readImages, filesToDF calls the spark context’s binaryFiles function. That, in turn, makes a new BinaryFileRDD. That code has this comment:

// setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
// traversing a large number of directories and files. Parallelize it.

Crucially, when the parallelization happens, it uses

Runtime.getRuntime.availableProcessors().toString

which indicates that all of that processing happens on the driver, parallelized only across the threads available to the driver’s JVM. That’s why this is so unbelievably slow: the driver collects stats on all of the millions of files solely to calculate the number of partitions, and only then is the work farmed out to the workers to actually read the files. The driver collecting stats on the files is the bottleneck.
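
In other words, the slow path amounts to something like the following (a sketch of what readImages triggers via filesToDF; the bucket path is hypothetical):

# binaryFiles has to list and stat every object up front, and that listing
# runs on the driver before any work reaches the executors
raw_rdd = sc.binaryFiles("s3a://my-bucket/")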

Instead, I wrote code to do the following. I have a dataframe with the s3 paths. I run a python function in a map which uses boto3 to directly grab the file from s3 on the worker, decode the image data, and assemble the same type of dataframe as readImages.

Here’s the code, more or less in its entirety, to read and decode the images and then just write them to parquet. It ran over 100k images in s3 on 40 nodes in 2.6 minutes instead of the 50 minutes that the vanilla readImages took.

from sparkdl.image.imageIO import _decodeImage, imageSchema
from pyspark.sql.types import StructType, StructField, StringType, BinaryType
from pyspark.sql.functions import udf

# this function will use boto3 on the workers directly to pull the image
def readFileFromS3(row):
  import boto3
  import os

  s3 = boto3.client('s3')
  
  filePath = row.image_url
  # strip off the starting s3a:// from the bucket
  bucket = os.path.dirname(str(filePath))[6:]
  key = os.path.basename(str(filePath))
  
  response = s3.get_object(Bucket=bucket, Key=key)
  body = response["Body"]
  contents = bytearray(body.read())
  body.close()

  if len(contents):
    return (filePath, bytearray(contents))

# rows_df is a dataframe with a single string column called "image_url" that has the full s3a filePath
# Running rows_df.rdd.take(2) gives the output
# [Row(image_url=u's3a://mybucket/14f89051-26b3-4bd9-88ad-805002e9a7c5'),
# Row(image_url=u's3a://mybucket/a47a9b32-a16e-4d04-bba0-cdc842c06052')]

# farm out our images to the workers with a map
images_rdd = (
  rows_df
  .rdd
  .map(readFileFromS3)
)

# convert our rdd to a dataframe and then
# use a udf to decode the image; the schema comes from sparkdl.image.imageIO
schema = StructType([StructField("filePath", StringType(), False),
                     StructField("fileData", BinaryType(), False)])

decodeImage = udf(_decodeImage, imageSchema)

image_df = (
  images_rdd
  .toDF(schema)
  .select("filePath", decodeImage("fileData").alias("image"))
)

(
  image_df
  .write
  .format("parquet")
  .mode("overwrite")
  .option("compression", "gzip")
  .save("s3://my_bucket/images.parquet")
)
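
For completeness, if you don’t already have the paths from elsewhere, rows_df itself can be built by listing the bucket keys with boto3 and handing the resulting paths to Spark, along these lines (a sketch; the bucket name is hypothetical):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# collect every object key and turn it into the s3a:// path expected above
paths = []
for page in paginator.paginate(Bucket='mybucket'):
  for obj in page.get('Contents', []):
    paths.append(('s3a://mybucket/' + obj['Key'],))

rows_df = spark.createDataFrame(paths, ['image_url'])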

0 reactions
heng2j commented, Oct 1, 2018

Hi @mdagost and @thunterdb

I am very interested in the function that @mdagost provided for feeding image data from S3 to train a model, since the function above can construct image_df as an image data source.

However, with the current version of sparkdl, it seems the code no longer applies because the “_decodeImage” function is no longer in imageIO. It was there in the pysparkdl 0.2.0 documentation.

I have also tried to use imageArrayToStruct, but that did not work either.

decoded = imageArrayToStruct(bytearray(contents))
schema = StructType([StructField("filePath", StringType(), False),
                     StructField("image", ImageSchema)])

I received the following error.

File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 406, in __init__
AssertionError: dataType should be DataType

I assume it was due to the way we are getting the image bytearray. I am planning to try the following code later tonight to read the bytearray from an S3 object (a jpg file).

img = image.load_img(BytesIO(obj.get()['Body'].read()), target_size=(224, 224))
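
If _decodeImage is gone in newer sparkdl releases, one possible workaround is to decode the bytes directly into Spark’s OpenCV-style image struct with PIL and numpy. The sketch below is only an illustration under that assumption; decode_to_image_struct is a made-up helper name, and the field layout mirrors the image schema introduced in Spark 2.3 (origin, height, width, nChannels, mode, data):

import numpy as np
from io import BytesIO
from PIL import Image
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

# struct matching the (origin, height, width, nChannels, mode, data) image layout
image_struct = StructType([
  StructField("origin", StringType(), True),
  StructField("height", IntegerType(), False),
  StructField("width", IntegerType(), False),
  StructField("nChannels", IntegerType(), False),
  StructField("mode", IntegerType(), False),
  StructField("data", BinaryType(), False)])

def decode_to_image_struct(origin, contents):
  # decode raw JPEG/PNG bytes and reorder RGB -> BGR, as the OpenCV-style struct expects
  arr = np.asarray(Image.open(BytesIO(contents)).convert("RGB"), dtype=np.uint8)[:, :, ::-1]
  height, width, n_channels = arr.shape
  # tuple follows image_struct's field order; 16 is OpenCV's CV_8UC3 (8-bit, 3 channels)
  return (origin, height, width, n_channels, 16, bytearray(arr.tobytes()))

A struct built this way could stand in for the decodeImage("fileData") step in the earlier snippet, for example by mapping the (filePath, fileData) pairs through it before calling toDF with a schema of ("filePath", StringType()) and ("image", image_struct).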

So I would like to confirm with @thunterdb that we can currently construct the image_df regardless of where the images are located, right? I am constructing the file paths from object keys that sit in different subfolders of my S3 bucket.

And how can I export the trained weights of an Inception model after transfer learning?

Thank you, Heng
