
[KED-2639] Cannot read csv in chunks with pandas


Description

Cannot read a CSV in chunks with the kedro data catalog.

```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```

Context

How has this bug affected you? What were you trying to accomplish?

Steps to Reproduce

```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```

```python
df = catalog.load("train_dataset")
df.get_chunk()
```

```
ValueError: I/O operation on closed file.
```

Inspecting `df` shows it is a `<pandas.io.parsers.TextFileReader at 0x7fde97a82450>`.

Expected Result

I should be able to loop over the reader.
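For reference, this is the behaviour chunked reading gives with plain pandas (no catalog involved); a minimal self-contained sketch:

```python
import pandas as pd

# Write a tiny CSV so the example stands alone
with open("mycsv.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n5,6\n")

# Passing chunksize makes read_csv return a TextFileReader, not a DataFrame
reader = pd.read_csv("mycsv.csv", chunksize=2)

# Looping over the reader yields DataFrames of up to `chunksize` rows each
total_rows = sum(len(chunk) for chunk in reader)
print(total_rows)  # 3 data rows, delivered as chunks of 2 and 1
```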

Actual Result

```
ValueError: I/O operation on closed file.
```



Your Environment

  • Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
  • Python version used (`python -V`): 3.7.5
  • Operating system and version: Ubuntu

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 18 (12 by maintainers)

Top GitHub Comments

noklam commented, Jun 21, 2021 (2 reactions)

For anyone looking for a hotfix: thanks to the dynamic nature of Python, we can fix this without touching the source code.

Alternatively, you can create a custom DataSet that inherits from CSVDataSet and simply overrides the `_load()` method.

```python
import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    # Hand the path straight to pandas so it manages the file handle itself,
    # instead of opening the file in a context manager that closes it on exit
    return pd.read_csv(load_path, **self._load_args)


# Monkey-patch the class: applies to every CSVDataSet in the catalog
CSVDataSet._load = _load
```
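If you prefer the subclass route mentioned above, the catalog entry would then reference your own class via its dotted path (the module path `my_project.extras.datasets.chunked_csv_dataset` and class name `ChunkedCSVDataSet` below are hypothetical; substitute wherever you place the subclass):

```yaml
train_dataset:
  type: my_project.extras.datasets.chunked_csv_dataset.ChunkedCSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```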
AntonyMilneQB commented, Jun 1, 2021 (2 reactions)

I believe the problem here is that the context manager used in catalog.load for a CSV file closes the file: https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157

Since pandas added fsspec support in its API starting with version 1.1.0, we are in the process of converting this code (and others like JSONDataSet) to use `pd.read_*` without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18.

In the meantime, I think you should be able to fix it easily just by removing the context manager to give the following (I tried this out briefly and it seemed to work, but use at your own risk…):

```python
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)
```

Note also that since pandas 1.2, TextFileReader (which is what is returned when specifying chunksize) is now a context manager (see https://github.com/pandas-dev/pandas/pull/38225). It's still iterable, so correct usage would now be:

```python
with dataset_name as chunks:
    for chunk in chunks:
        process(chunk)
```
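To illustrate that pattern with plain pandas (assuming pandas ≥ 1.2, where TextFileReader gained context-manager support), here is a minimal self-contained sketch:

```python
import pandas as pd

# Self-contained: write a small CSV to iterate over
with open("example.csv", "w") as f:
    f.write("x\n" + "\n".join(str(i) for i in range(10)) + "\n")

# Since pandas 1.2, the TextFileReader returned by chunksize= is a
# context manager that closes the underlying file handle on exit
with pd.read_csv("example.csv", chunksize=4) as chunks:
    sizes = [len(chunk) for chunk in chunks]

print(sizes)  # 10 rows in chunks of 4 -> [4, 4, 2]
```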
