read_csv with filehandler and nrows argument

Code Sample, a copy-pastable example if possible

%%file example.csv
1,2
3,4
5,6
7,8
9,10
11,12

import pandas as pd

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2)
    print(f.readline())

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=1, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=3, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

Problem description

Issue https://github.com/pandas-dev/pandas/issues/2071 is probably related. The C parser exhausts the file handle even if nrows is passed.

The Python parser shows unexpected behaviour when nrows=1 or nrows=2 is given.

Expected Output

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-27-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.0-b1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Top GitHub Comments

igiloh commented, Dec 15, 2021

@andydish From what I could gather, the C implementation of read_csv() reads a chunk from the file into a buffer, then parses it and stops once nrows is reached. The user has no control over the chunk size, so the actual file pointer moves forward beyond the nrows-th line.
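To make the buffering effect concrete, here is a toy model of a chunked reader. It is only a sketch of the general idea, not pandas' actual parser code: the function name `read_rows_chunked` and the 16-character chunk size are made up for illustration, and an in-memory `io.StringIO` stands in for the file.

```python
import io


def read_rows_chunked(handle, nrows, chunk_size=16):
    """Toy reader (hypothetical): pull fixed-size chunks, stop after nrows lines."""
    buffer = ""
    rows = []
    while len(rows) < nrows:
        chunk = handle.read(chunk_size)
        if not chunk:
            break  # end of input (a final line without "\n" is ignored here)
        buffer += chunk
        # Split complete lines out of the buffer until enough rows are collected.
        while "\n" in buffer and len(rows) < nrows:
            line, buffer = buffer.split("\n", 1)
            rows.append(line.split(","))
    # Whatever is left in `buffer` was consumed from the handle but never parsed.
    return rows, buffer


source = io.StringIO("1,2\n3,4\n5,6\n7,8\n9,10\n11,12\n")
rows, leftover = read_rows_chunked(source, nrows=2)
position = source.tell()       # already past the two parsed rows
next_line = source.readline()  # continues after the chunk, not after row 2
```

Only two rows are returned, but the handle has advanced by a whole chunk, so the next `readline()` skips the lines that were buffered and discarded. A larger chunk size relative to the file would leave the handle at the end of the file.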

The workaround I ended up doing was:

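import pandas as pd

# fname, len1 and len2 are placeholders for your own file name and block sizes
with open(fname, 'rb') as file:
    def read_length(length):
        # Remember where this block starts
        before = file.tell()
        data = pd.read_csv(file,
                           float_precision='high',
                           nrows=length).values
        # read_csv may have buffered past the rows it returned, so rewind
        # and skip exactly `length` lines by hand
        file.seek(before)
        for i in range(length):
            next(file)
        return data

    first_part = read_length(len1)
    second_part = read_length(len2)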

(not proud of it, but it worked…)
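A variation on the same idea, sketched here under a few assumptions, is to pull exactly n lines off the handle yourself and give the parser only those lines, so that no rewinding is needed. The helper name `take_lines` is hypothetical, the handle is assumed to be opened in text mode, and the approach assumes one CSV row per physical line (no quoted fields containing newlines).

```python
import io


def take_lines(handle, n):
    """Consume up to n lines from a text-mode handle and return them as a buffer.

    Hypothetical helper: the handle is left positioned right after the last
    line taken, because nothing beyond those lines is ever read from it.
    """
    lines = []
    for _ in range(n):
        line = handle.readline()
        if not line:
            break  # fewer than n lines were available
        lines.append(line)
    return io.StringIO("".join(lines))


source = io.StringIO("1,2\n3,4\n5,6\n")
first_two = take_lines(source, 2)
remaining = source.readline()  # the handle continues at the third line
```

The returned buffer is an ordinary file-like object, so it can be passed to `pd.read_csv(first_two, names=['A', 'B'])` in place of the original handle. The cost is that the selected lines are copied into memory once before parsing.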

andydish commented, Dec 15, 2021

@igiloh That is where I was starting to head in my own solution. Thanks for sharing and saving me some time!