Implement an alternative function to image_dataset_from_directory that is independent of directory structure

See original GitHub issue

System information.

  • TensorFlow version (you are using): 2.9.1
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.

Currently, if you cannot fit the dataset into memory, your options are to use tf.keras.utils.image_dataset_from_directory or tf.keras.preprocessing.image.ImageDataGenerator. Since the latter is deprecated, image_dataset_from_directory appears to be the only viable option going forward.

According to keras.io, in order to use image_dataset_from_directory your directory structure should be something like this:

main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg

This structure is straightforward for most use cases, but not all of them. In my case, I had to train a model with memory layers (similar to LSTM), so my dataset was preprocessed into context batches: the data is segmented into batches, then the order of the batches is shuffled while the order of the images inside each batch stays the same. A simplified sample of the data is shown below (context batch size is 4):

Drowsy/capture_0956_face_0956.jpg
Drowsy/capture_0957_face_0957.jpg
Drowsy/capture_0958_face_0958.jpg
Drowsy/capture_0959_face_0959.jpg
Low_Vigilant/capture_0816_face_0816.jpg
Low_Vigilant/capture_0817_face_0817.jpg
Low_Vigilant/capture_0818_face_0818.jpg
Low_Vigilant/capture_0819_face_0819.jpg
Drowsy/capture_0182_face_0182.jpg
Drowsy/capture_0183_face_0183.jpg
Drowsy/capture_0184_face_0184.jpg
Drowsy/capture_0185_face_0185.jpg
Alert/capture_0493_face_0492.jpg
Alert/capture_0494_face_0493.jpg
Alert/capture_0495_face_0494.jpg
Alert/capture_0496_face_0495.jpg

I had to save the preprocessed relative paths of the images in a text file, along with a second text file holding the labels. Here is a sample of the labels file:

1
1
1
1
0
0
0
0
1
1
1
1
2
2
2
2
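
The preprocessing described above (segment into context batches, shuffle the batch order, keep the order inside each batch) can be sketched in plain Python. This is only an illustration: the function name shuffle_context_batches and its parameters are hypothetical, and it assumes the input lists are already ordered so that consecutive chunks form the intended context batches.

```python
import random


def shuffle_context_batches(paths, labels, context_size, seed=None):
    """Shuffle whole context batches while keeping the order inside each one."""
    if len(paths) != len(labels):
        raise ValueError("paths and labels must have the same length")
    pairs = list(zip(paths, labels))
    # Cut the sequence into consecutive chunks of `context_size` items
    # (the last chunk may be shorter if the length is not a multiple).
    chunks = [pairs[i:i + context_size] for i in range(0, len(pairs), context_size)]
    # Only the chunk order is randomized; items inside a chunk stay in place.
    random.Random(seed).shuffle(chunks)
    flat = [pair for chunk in chunks for pair in chunk]
    return [path for path, _ in flat], [label for _, label in flat]
```

The two returned lists stay aligned, so they can be written one item per line to the paths file and the labels file respectively.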

As mentioned above, image_dataset_from_directory cannot handle such cases. So I am proposing a new function that is independent of the dataset directory structure: the loading order of the images is provided by the end user calling the proposed function.

Will this change the current API? How? No, it will not break backward compatibility. It will simply be a new function added to the keras.utils API.

Who will benefit from this feature?

  • People who have images that need to be loaded to the model in a specific order
  • People with a complex dataset directory structure, such as deeply nested folders.

Contributing

  • Do you want to contribute a PR? (yes/no): yes
  • If yes, please read this page for instructions. Already done ✔️ ✔️
  • Briefly describe your candidate solution (if contributing): My approach would be to create a function (for example called image_dataset_from_paths) that receives an iterator of image paths and an iterator of corresponding labels, and generates a tf.data.Dataset from them. I have already implemented the code for my use case and am happy to make a PR.
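
A minimal sketch of what such a function might look like is given below. It is not the author's implementation and not part of Keras: image_dataset_from_paths is the name proposed in this issue, while read_lines, image_size, and batch_size are assumptions made for illustration. It relies only on existing TensorFlow calls (tf.data.Dataset.from_tensor_slices, tf.io.read_file, tf.io.decode_image, tf.image.resize).

```python
from pathlib import Path


def read_lines(file_path):
    """Return the non-empty lines of a text file, in order."""
    lines = Path(file_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def image_dataset_from_paths(image_paths, labels, image_size=(256, 256), batch_size=4):
    """Build a tf.data.Dataset that yields images in exactly the given order."""
    image_paths, labels = list(image_paths), list(labels)
    if len(image_paths) != len(labels):
        raise ValueError("image_paths and labels must have the same length")

    # Imported here so read_lines stays usable without TensorFlow installed.
    import tensorflow as tf

    def load(path, label):
        raw = tf.io.read_file(path)
        # expand_animations=False keeps the result 3-D so tf.image.resize accepts it.
        image = tf.io.decode_image(raw, channels=3, expand_animations=False)
        return tf.image.resize(image, image_size), label

    dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    # map() preserves element order by default, even with parallel calls.
    dataset = dataset.map(load, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

With the two text files described earlier, usage might look like image_dataset_from_paths(read_lines("paths.txt"), [int(x) for x in read_lines("labels.txt")]), where both file names are placeholders. Setting batch_size to the context batch size keeps each context batch together.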

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

hertschuh commented, Sep 1, 2022 (1 reaction)

@hahouari ,

tf.data.Dataset has a lot of capabilities, from creating a Dataset manually to processing it. You may want to use list_files to find all the files and group_by_window to group by identical people and class.

I recommend you also check out tf.io; it has things like tf.io.decode_image.

Closing this issue now.

hertschuh commented, Aug 31, 2022 (1 reaction)

@hahouari ,

I think what you need already exists. Basically, what you want to do first is call tf.keras.utils.image_dataset_from_directory with batch_size=None and shuffle=False so that everything stays in order. In particular the images of the same class will be together. Then, you want to batch the result and shuffle afterwards (the reverse of what’s typically done). So your code may look like this:

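import tensorflow as tf

batch_size = 4
# batch_size=None and shuffle=False keep the images unbatched and in directory order.
dataset = tf.keras.utils.image_dataset_from_directory('my_path', batch_size=None, shuffle=False)
# Batch first, then shuffle, so whole batches are shuffled rather than single images.
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(10000)
# `model` is assumed to be an already compiled tf.keras.Model.
model.fit(dataset)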

Take a look at the documentation for tf.data.Dataset; there are a lot of transformations you can apply.

The code above may need to be modified to handle the class boundaries correctly. If the number of images per class is not a multiple of the batch size and you don’t want two classes to be mixed in one batch, you’ll need to add something to prevent that.
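
One possible way to prevent mixed batches, sketched here in plain Python rather than with tf.data transformations, is to compute the batch boundaries from the ordered label list before any images are loaded. The helper name class_pure_batches and the drop_remainder flag are hypothetical and chosen for this illustration only.

```python
from itertools import groupby


def class_pure_batches(labels, batch_size, drop_remainder=True):
    """Group indices of an ordered label list into batches that never mix classes."""
    batches = []
    position = 0
    # groupby yields one run per stretch of consecutive identical labels.
    for _, run in groupby(labels):
        run_length = len(list(run))
        indices = list(range(position, position + run_length))
        position += run_length
        for start in range(0, run_length, batch_size):
            batch = indices[start:start + batch_size]
            # A short batch only appears at the end of a run; keep or drop it.
            if len(batch) == batch_size or not drop_remainder:
                batches.append(batch)
    return batches
```

For example, with five images of one class followed by four of another and a batch size of 4, the leftover fifth image is either dropped or kept as its own short batch, but it is never merged with the next class. The resulting index batches could then be used to select file paths before the dataset is built.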
