Kedro asynchronous IO takes too much memory when downloading many files
Description
I am using kedro run --async to download a large number (~200K) of files from S3. This is accomplished by passing a function to Kedro IO using the #744 implementation. Kedro then materializes the dataset on disk, and although it downloads and saves the files correctly, memory usage somehow keeps growing throughout the run.
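For context, a rough sketch of the general pattern of loading many small files from S3 as partitions (the #744-based function wiring is omitted; the bucket, dataset type and loop below are purely illustrative, not taken from my project):

```python
from kedro.io import PartitionedDataSet
from kedro.extras.datasets.pillow import ImageDataSet

# Hypothetical S3 prefix holding ~200K small files; the real project uses a
# custom raster dataset rather than ImageDataSet.
images = PartitionedDataSet(
    path="s3://my-bucket/rasters/",
    dataset=ImageDataSet,
)

# PartitionedDataSet.load() is lazy: it returns a dict mapping each partition id
# to a zero-argument callable that only downloads that one file when invoked.
partitions = images.load()
for partition_id, load_partition in partitions.items():
    data = load_partition()
    # ... write `data` to disk, then drop the reference so it can be collected
```

In principle only one partition needs to be in memory at a time here, which is why the steadily growing memory usage is surprising.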
Context
This bug makes Kedro take up essentially the entire memory of the computer, even after the files are downloaded. It doesn't seem like Kedro needs that much memory if it can simply download each file, save it, and move on.
Steps to Reproduce
This is something I cannot reproduce without a large dataset made up of many small files (image data). If strictly needed, I can look for a public dataset that reproduces it, but for now I'd like help understanding how the memory builds up. If there are any scripts or shell commands that would help diagnose this, I can run them and post the output.
Expected Result
Kedro should be able to use the available threads to download each file and then free that dataset from memory, leaving room to download the remaining files.
Actual Result
The process runs, downloads the many files, and writes them correctly to disk. However, memory usage keeps building up until it reaches 100%, and the process is killed.
Oddly enough, when I run top to see which processes are taking memory, it doesn't show what exactly is using it. The system monitor simply says that Kedro is using 1.4 GB of RAM and shows no other process taking that much memory, even though around 30 GB of memory ends up being used.
Note: the files are held in memory as numpy arrays and saved to disk in a compressed format, which could partly explain why the in-memory size is larger than the on-disk size.
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (pip show kedro or kedro -V): 0.17.4
- Python version used (python -V): 3.7.6
- Operating system and version: Pop!_OS 20.04

Resolution
What actually happened was that, when I developed the raster dataset implementation with rasterio.open, the _save method did not go through fsspec but just used rasterio.open directly, which does not work well with the Kedro IO implementation: the information behind the memory pointer of the rasterio.open result is somehow lost (if I pass the rasterio DatasetReader to the load method, it loses the information when working with many files, i.e. with a PartitionedDataSet). This was probably interfering with garbage collection, but I am not sure.
Either way, passing the reader around was not good object-oriented practice. So instead of returning the rasterio DatasetReader, the load() method now returns the raster in more common formats: the pixels as a numpy array and the metadata as a dict. A raster is therefore a Dict with the keys "raster": np.array and "metadata": dict.
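A minimal sketch of what such a _load can look like, assuming a custom AbstractDataSet built on rasterio and an fsspec filesystem (the class and all names are illustrative, not the actual implementation):

```python
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from rasterio.io import MemoryFile

from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class RasterDataSet(AbstractDataSet):
    """Hypothetical dataset returning {"raster": np.ndarray, "metadata": dict}."""

    def __init__(self, filepath: str):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(protocol)

    def _load(self) -> Dict[str, Any]:
        load_path = get_filepath_str(self._filepath, self._protocol)
        # Read the raw bytes through fsspec, then hand them to rasterio in memory,
        # so only a plain numpy array and a metadata dict leave this method.
        with self._fs.open(load_path, mode="rb") as fs_file:
            with MemoryFile(fs_file.read()) as memfile:
                with memfile.open() as src:
                    return {"raster": src.read(), "metadata": deepcopy(src.meta)}

    # _save is sketched in the block further below.

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, protocol=self._protocol)
```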
For saving, once I added the fsspec layer, it worked properly when saving a large number of files in a PartitionedDataSet.
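A sketch of what such a _save can look like, continuing the illustrative RasterDataSet class above (the raster is written to an in-memory rasterio MemoryFile and then streamed out through the fsspec filesystem, rather than handing a local path to rasterio):

```python
    def _save(self, data: Dict[str, Any]) -> None:
        # Continues the illustrative RasterDataSet class sketched above.
        save_path = get_filepath_str(self._filepath, self._protocol)
        # Build the file in memory with rasterio, then push the bytes through
        # fsspec, so S3, local disk, etc. are all handled by the same code path.
        with MemoryFile() as memfile:
            with memfile.open(**data["metadata"]) as dst:
                dst.write(data["raster"])
            memfile.seek(0)
            with self._fs.open(save_path, mode="wb") as fs_file:
                fs_file.write(memfile.read())
```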
I have solved this issue by making proper use of fsspec in the dataset I have built. Just follow something along the lines of ImageDataSet. It really makes a huge difference when working with large datasets.
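For completeness, a hypothetical way such a dataset could then be plugged into a PartitionedDataSet (the path and suffix are placeholders):

```python
from kedro.io import PartitionedDataSet

# Each partition is loaded and saved by the fsspec-aware dataset above, so only
# numpy arrays and metadata dicts ever cross the dataset boundary.
rasters = PartitionedDataSet(
    path="s3://my-bucket/rasters/",
    dataset=RasterDataSet,
    filename_suffix=".tif",
)
```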