Kedro asynchronous IO takes too much memory when downloading many files
Description
I am using `kedro run --async` to download a large number of files (~200K) from S3. This is done by passing a function to Kedro IO using the #744 implementation. Kedro then starts materializing the dataset on disk, and although it correctly downloads and saves the files, memory usage somehow keeps building up.
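For context, the pattern is roughly the sketch below (not my actual code; the bucket name, key handling and decoding are placeholders): the node returns a dictionary of callables, so each file should only be materialized when the `PartitionedDataSet` saves that partition.

```python
from typing import Callable, Dict, List

import fsspec
import numpy as np


def download_images(keys: List[str]) -> Dict[str, Callable[[], np.ndarray]]:
    """Return one callable per partition; the PartitionedDataSet invokes each
    callable only when it saves that partition (lazy saving)."""

    def _make_loader(key: str) -> Callable[[], np.ndarray]:
        def _load() -> np.ndarray:
            # Placeholder: fetch one object from S3 and decode it into an array.
            with fsspec.open(f"s3://my-bucket/{key}", mode="rb") as f:
                return np.frombuffer(f.read(), dtype=np.uint8)

        return _load

    return {key: _make_loader(key) for key in keys}
```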
Context
This bug makes Kedro consume basically the entire memory of the computer, even after the files are downloaded. It doesn't seem like Kedro should need that much memory if it can just download each file and be done with it.
Steps to Reproduce
This is something I cannot reproduce without a large dataset made up of many small files (image data). If strictly needed, I can look for a public dataset that reproduces the problem, but for now I'd like help understanding how the memory builds up. If there are any scripts or shell commands that would help, I can run them and post the output when asked.
Expected Result
Kedro should be able to use the computer's threads to download each file and then free the dataset from memory, leaving room to download the remaining files.
Actual Result
The process runs and downloads the many files, and the files are correctly written to disk. However, memory usage keeps building up until it reaches 100% and the process is killed.
Oddly enough, when I run `top` to see which processes are taking memory, it doesn't show what exactly is using it. The system monitor simply says that Kedro is using 1.4GB of RAM and shows no other process using that much memory (while around 30GB of memory ends up being consumed).
Note: the files are stored in memory as numpy arrays and saved in a compressed format on disk. This could partly explain why the in-memory size is larger than the on-disk size.
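As a rough illustration of that gap (made-up numbers, not measurements from this dataset):

```python
import numpy as np

# A dummy "image": held in RAM as an uncompressed numpy array.
arr = np.zeros((1024, 1024, 3), dtype=np.uint8)
print(arr.nbytes)  # ~3 MB in memory

# The same array written in a compressed format can be far smaller on disk.
np.savez_compressed("image.npz", raster=arr)
```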
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.17.4
- Python version used (`python -V`): 3.7.6
- Operating system and version: Pop!_OS 20.04
What actually happened was that, when I developed the implementation of raster datasets with `rasterio.open`, the `_save` method did not implement `fsspec` but just used `rasterio.open`, which does not work well with Kedro's IO implementation: the information behind the `rasterio.open` memory pointer somehow gets lost (if I pass the rasterio DatasetReader to the load method, it loses the information when working with many files, i.e. with a `PartitionedDataSet`). This was probably messing with garbage collection, but I am not sure.

Either way, this was not good object-oriented practice. So instead of passing the rasterio DatasetReader around, the `load()` method now returns the raster in more common formats: the pixel data as a numpy array and the metadata as a dict. A raster is therefore a `Dict` with the keys `"raster": np.array` and `"metadata": dict`.

For loading, I did:
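Something along these lines (a minimal sketch rather than the exact code; class and attribute names are illustrative, and `_save` is shown further down):

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import rasterio
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class RasterDataSet(AbstractDataSet):
    """Returns a plain dict instead of a rasterio DatasetReader."""

    def __init__(self, filepath: str):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> Dict[str, Any]:
        load_path = get_filepath_str(self._filepath, self._protocol)
        # Read everything eagerly and close the handles before returning,
        # so no rasterio object (and whatever memory it pins) outlives the call.
        with self._fs.open(load_path, mode="rb") as f:
            with rasterio.open(f) as src:
                raster = src.read()        # numpy array, shape (bands, height, width)
                metadata = dict(src.meta)  # plain dict: driver, dtype, crs, transform, ...
        return {"raster": raster, "metadata": metadata}

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "protocol": self._protocol}
```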
For saving, when I added the `fsspec` layer, it then worked properly when saving a large number of files in a `PartitionedDataSet`.
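The saving side, continuing the sketch above (again illustrative, not the exact code): the raster is written to an in-memory GeoTIFF and the bytes are then streamed out through `fsspec`, so S3 and local paths are handled the same way.

```python
from typing import Any, Dict

from rasterio.io import MemoryFile
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str


class RasterDataSet(AbstractDataSet):  # same class as in the loading sketch
    ...

    def _save(self, data: Dict[str, Any]) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        # Build the file in memory, then push the raw bytes through fsspec;
        # no rasterio object stays alive once the method returns.
        with MemoryFile() as memfile:
            with memfile.open(**data["metadata"]) as dst:
                dst.write(data["raster"])
            memfile.seek(0)
            with self._fs.open(save_path, mode="wb") as f:
                f.write(memfile.read())
```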
I have solved this issue by making proper use of `fsspec` in the dataset I have built. Just follow something along the lines of Kedro's ImageDataSet example; it really makes a huge difference when working with large datasets.