read_csv with filehandler and nrows argument

Code Sample, a copy-pastable example if possible

%%file example.csv
1,2
3,4
5,6
7,8
9,10
11,12
import pandas as pd
with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2)
    print(f.readline())
with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=1, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)
7,8

9,10

   A  B
0  1  2
with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)
7,8

9,10

   A  B
0  1  2
1  3  4
with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=3, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)
7,8

9,10

   A  B
0  1  2
1  3  4
2  5  6

Problem description

Issue https://github.com/pandas-dev/pandas/issues/2071 is probably related. The C parser exhausts the file handle even if nrows is passed.

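The Python parser shows unexpected behaviour when nrows=1 or nrows=2 is given: in both cases the next readline() returns 7,8, the same line as with nrows=3, so the handle has already moved past the third line.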

Expected Output

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)
5,6

7,8

   A  B
0  1  2
1  3  4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-27-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.0-b1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 6 years ago
  • Reactions: 4
  • Comments: 17 (12 by maintainers)

Top GitHub Comments

1 reaction
igiloh commented, Dec 15, 2021

@andydish From what I could gather, the C implementation of read_csv() reads a chunk from the file into a buffer, then parses it and stops once nrows is reached. The user has no control over the chunk size, so the actual file pointer moves forward beyond the nrows-th line.

The workaround I ended up doing was:

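import pandas as pd

# fname, len1 and len2 are assumed to be defined by the surrounding code.
with open(fname, 'rb') as file:
    def read_length(length):
        # Remember where this section starts; read_csv may read past it.
        before = file.tell()
        data = pd.read_csv(file,
                           float_precision='high',
                           nrows=length).values
        # Rewind, then step over exactly `length` lines by hand.
        file.seek(before)
        for i in range(length):
            next(file)
        return data

    first_part = read_length(len1)
    second_part = read_length(len2)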

(not proud of it, but it worked…)
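A variation on the same idea, offered only as a rough sketch: instead of letting read_csv touch the real file handle and rewinding afterwards, pull exactly the wanted lines into an in-memory buffer first and parse that. The helper names take_lines and read_rows are made up for this illustration and are not part of pandas. The sketch assumes a text-mode handle, one record per physical line (no quoted fields containing newlines), and no header row, as in the example.csv above; with a header row you would need to take one extra line.

```python
import io


def take_lines(handle, n):
    """Consume at most n lines from a text-mode handle; return them as a StringIO."""
    # readline() is used instead of iterating over the handle so that
    # handle.tell() stays usable afterwards.
    return io.StringIO("".join(handle.readline() for _ in range(n)))


def read_rows(handle, nrows, **kwargs):
    """Parse the next nrows physical lines of handle with pandas.read_csv."""
    # Local import: take_lines itself only needs the standard library.
    import pandas as pd
    return pd.read_csv(take_lines(handle, nrows), **kwargs)
```

Called as read_rows(f, 2, names=['A', 'B']) on the example file, the parser only ever sees the first two lines, so the following f.readline() should return the third line regardless of which engine is used.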

0 reactions
andydish commented, Dec 15, 2021

@igiloh That is where I was starting to head in my own solution. Thanks for sharing and saving me some time!
