Kedro asynchronous IO takes too much memory when downloading many files
Description
I am using `kedro run --async` to download a large number of files (~200K) from S3. This is done by passing a function to Kedro IO using the #744 implementation. Kedro then starts materializing the dataset on disk, and although it correctly downloads and saves the files, memory usage somehow keeps building up.
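For context, the pattern is roughly the sketch below (not my actual code; the bucket name, key handling and decoding are placeholders): the node returns a dictionary of callables, so each file should only be materialized when the `PartitionedDataSet` saves that partition.

```python
from typing import Callable, Dict, List

import fsspec
import numpy as np


def download_images(keys: List[str]) -> Dict[str, Callable[[], np.ndarray]]:
    """Return one callable per partition; the PartitionedDataSet invokes each
    callable only when it saves that partition (lazy saving)."""

    def _make_loader(key: str) -> Callable[[], np.ndarray]:
        def _load() -> np.ndarray:
            # Placeholder: fetch one object from S3 and decode it into an array.
            with fsspec.open(f"s3://my-bucket/{key}", mode="rb") as f:
                return np.frombuffer(f.read(), dtype=np.uint8)

        return _load

    return {key: _make_loader(key) for key in keys}
```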
Context
This bug makes Kedro consume basically the entire memory of the computer, even after the files are downloaded. It doesn't seem like Kedro should need that much memory if it can just download each file and be done with it.
Steps to Reproduce
This is something I cannot reproduce without a large dataset made up of many small files (image data). If strictly needed, I can look for a public dataset that reproduces the problem, but for now I'd like help understanding how the memory builds up. If there are any scripts or shell commands that would help, I can run them and post the output when asked.
Expected Result
Kedro should be able to use the computer's threads to download each file and then free the dataset from memory, leaving room to download the remaining files.
Actual Result
The process runs and downloads the many files, and the files are correctly written to disk. However, memory usage keeps building up until it reaches 100% and the process is killed.
Oddly enough, when I run `top` to see which processes are taking memory, it doesn't show what exactly is using it. The system monitor simply says that Kedro is using 1.4GB of RAM and shows no other process using that much memory (while around 30GB of memory ends up being consumed).
Note: the files are stored in memory as numpy arrays and saved in a compressed format on disk. This could partly explain why the in-memory size is larger than the on-disk size.
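As a rough illustration of that gap (made-up numbers, not measurements from this dataset):

```python
import numpy as np

# A dummy "image": held in RAM as an uncompressed numpy array.
arr = np.zeros((1024, 1024, 3), dtype=np.uint8)
print(arr.nbytes)  # ~3 MB in memory

# The same array written in a compressed format can be far smaller on disk.
np.savez_compressed("image.npz", raster=arr)
```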
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.17.4
- Python version used (`python -V`): 3.7.6
- Operating system and version: Pop!_OS 20.04
What actually happened was that, when I developed the implementation of raster datasets with `rasterio.open`, the `_save` method did not implement `fsspec` but just used `rasterio.open`, which does not work well with Kedro's IO implementation: the information behind the `rasterio.open` memory pointer somehow gets lost (if I pass the rasterio DatasetReader to the load method, it loses the information when working with many files, i.e. with a `PartitionedDataSet`). This was probably messing with garbage collection, but I am not sure.

Either way, this was not good object-oriented practice. So instead of passing the rasterio DatasetReader around, the `load()` method now returns the raster in more common formats: the pixel data as a numpy array and the metadata as a dict. A raster is therefore a `Dict` with the keys `"raster": np.array` and `"metadata": dict`.

For loading, I did:
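Something along these lines (a minimal sketch rather than the exact code; class and attribute names are illustrative, and `_save` is shown further down):

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import rasterio
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class RasterDataSet(AbstractDataSet):
    """Returns a plain dict instead of a rasterio DatasetReader."""

    def __init__(self, filepath: str):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> Dict[str, Any]:
        load_path = get_filepath_str(self._filepath, self._protocol)
        # Read everything eagerly and close the handles before returning,
        # so no rasterio object (and whatever memory it pins) outlives the call.
        with self._fs.open(load_path, mode="rb") as f:
            with rasterio.open(f) as src:
                raster = src.read()        # numpy array, shape (bands, height, width)
                metadata = dict(src.meta)  # plain dict: driver, dtype, crs, transform, ...
        return {"raster": raster, "metadata": metadata}

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "protocol": self._protocol}
```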
For saving, when I added the `fsspec` layer, it then worked properly when saving a large number of files in a `PartitionedDataSet`.
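The saving side, continuing the sketch above (again illustrative, not the exact code): the raster is written to an in-memory GeoTIFF and the bytes are then streamed out through `fsspec`, so S3 and local paths are handled the same way.

```python
from typing import Any, Dict

from rasterio.io import MemoryFile
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str


class RasterDataSet(AbstractDataSet):  # same class as in the loading sketch
    ...

    def _save(self, data: Dict[str, Any]) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        # Build the file in memory, then push the raw bytes through fsspec;
        # no rasterio object stays alive once the method returns.
        with MemoryFile() as memfile:
            with memfile.open(**data["metadata"]) as dst:
                dst.write(data["raster"])
            memfile.seek(0)
            with self._fs.open(save_path, mode="wb") as f:
                f.write(memfile.read())
```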
I have solved this issue by making proper use of `fsspec` in the dataset I have built. Just follow something along the lines of Kedro's ImageDataSet example; it really makes a huge difference when working with large datasets.