[KED-2639] Cannot read csv in chunks with pandas
## Description
Cannot read a CSV in chunks with the Kedro data catalog.
```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```
## Context
How has this bug affected you? What were you trying to accomplish?
## Steps to Reproduce
```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```
```python
df = catalog.load("train_dataset")
df.get_chunk()
# ValueError: I/O operation on closed file.

df
# <pandas.io.parsers.TextFileReader at 0x7fde97a82450>
```
## Expected Result
I should be able to loop over the reader.
## Actual Result
```
ValueError: I/O operation on closed file.
```
## Your Environment
Include as many relevant details about the environment in which you experienced the bug:
* Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
* Python version used (`python -V`): 3.7.5
* Operating system and version: Ubuntu

For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code.
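A minimal sketch of such a monkeypatch, assuming the kedro 0.16.x internals (`_get_load_path`, `_protocol`, `_fs`, `_load_args`) shown in the source linked below; the only change from the built-in `_load` is dropping the `with` block so the file handle stays open for the chunked reader:

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str

def _load_keep_open(self):
    # Mirrors CSVDataSet._load in kedro 0.16.x, minus the ``with`` block
    # that closes the file before a chunked TextFileReader is consumed.
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    fs_file = self._fs.open(load_path, mode="r")
    return pd.read_csv(fs_file, **self._load_args)

# Patch the class once, e.g. at the top of your run script; every
# CSVDataSet the catalog creates will then use the patched loader.
CSVDataSet._load = _load_keep_open
```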
Alternatively, you can create a custom dataset that inherits from `CSVDataSet` and simply overrides the `_load()` method.

I believe the problem here is that the context manager used in `catalog.load` for a CSV file closes the file: https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157

Since pandas added `fsspec` support to its API starting with version 1.1.0, we are in the process of converting this code (and other datasets like `JSONDataSet`) to use `pd.read_*` without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18. In the meantime, I think you should be able to fix it easily just by removing the context manager to give the following (I just tried this out briefly and it seemed to work, but use at your own risk…):
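A sketch of that change written as a subclass, so the built-in dataset stays untouched; the `_load` body is assumed to match the linked 0.16.x source except for the removed `with` block:

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str

class ChunkedCSVDataSet(CSVDataSet):
    """CSVDataSet variant that leaves the file handle open, so
    load_args such as ``chunksize`` return a usable TextFileReader."""

    def _load(self):
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        # No ``with`` block: the reader (or the caller) now owns the
        # handle and closes it once the chunks have been consumed.
        fs_file = self._fs.open(load_path, mode="r")
        return pd.read_csv(fs_file, **self._load_args)
```

It can then be registered in the catalog like any custom dataset (the module path here is made up):

```yaml
train_dataset:
  type: my_project.extras.datasets.ChunkedCSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```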
Note also that since pandas 1.2, `TextFileReader` (which is what is returned when specifying `chunksize`) is now a context manager; see https://github.com/pandas-dev/pandas/pull/38225. It's still iterable, so correct usage would now be:
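Something like the following, where `process` is a placeholder for your own per-chunk logic:

```python
import pandas as pd

# pandas >= 1.2: the reader closes its file when the ``with`` block
# exits, so consume the chunks inside the block.
with pd.read_csv("mycsv.csv", chunksize=50000) as reader:
    for chunk in reader:
        process(chunk)  # placeholder for your own per-chunk logic
```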