
Feature Request: "Skiprows" by a condition or set of conditions

See original GitHub issue

Code Sample, a copy-pastable example if possible

# data file path: foo/bar/data.csv
# contents of the file, shown as a DataFrame:
#   col_1 col_2  ...  col_n
# 0   940    45  ...  0.023
# 1  1040    52  ...  0.430
# 2  1530    45  ...  0.302
# 3   753    43  ...  0.450
# 4   890    32  ...  0.023

import pandas as pd

schema = {
    "col_1": int,
    ...,
    "col_n": float,
}

# Proposed usage (not supported today): pass a condition on column values to
# skiprows instead of a list of row indices.
mask = (col_1 >= 1000)
df = pd.read_csv('foo/bar/data.csv', skiprows=mask, dtype=schema)
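
For contrast, skiprows today accepts an integer, a list of row indices, or a callable that is evaluated against the row index only; it never sees the row's values. A minimal example of the current behaviour, using the same example file as above:

import pandas as pd

# Current API: the callable receives the file line number (0 is the header)
# and returns True for lines that should be skipped.
df_every_other = pd.read_csv(
    'foo/bar/data.csv',
    skiprows=lambda i: i > 0 and i % 2 == 0,  # skip every second data row
)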

Problem description

Pandas read methods currently support skipping rows by index via the skiprows parameter. It would be a good feature if skiprows were extended so that it can take conditional statements, just like many pandas objects do. This may not be useful to everyone, but it would certainly ease the pain of people dealing with many (large) files, and in some situations it could drastically reduce memory usage.

I am opening this issue hoping it will be welcomed on the dev side. I am not opening it because it looked like a fancy request at first glance, but because I think it could improve many people's workflows. The condition could later serve as a schema reference as well, similar to schema validation of a data set.

NOTE: I am uncertain how this could be achieved, or whether it can be done at all; it may simply not fit the pandas module. Please also consider this as brainstorming.
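
As a point of reference, here is a rough two-pass approximation with the existing API (a sketch only, assuming a plain CSV with a single header line and the col_1 >= 1000 condition from the example):

import pandas as pd

# Pass 1: read only the filter column and work out which data rows fail the
# condition; +1 converts the data-row position to a file line number
# (line 0 is the header).
key = pd.read_csv('foo/bar/data.csv', usecols=['col_1'])
rows_to_skip = (key.index[key['col_1'] < 1000] + 1).tolist()

# Pass 2: re-read the file, skipping those line numbers by index.
df = pd.read_csv('foo/bar/data.csv', skiprows=rows_to_skip)

This avoids holding the unwanted rows of the full table in memory, but it scans the file twice, which is part of why a condition applied while reading would be nicer.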

Expected Output

This requires passing column names and dtypes (a schema) before reading the file. I am not sure how the mask for skiprows should be conveyed; it should be similar to pandas.DataFrame conditions and masks, but we could only refer to column names, since there is no pandas.DataFrame object yet at that point.

  col_1 col_2  ...  col_n
0  1040    52  ...  0.430
1  1530    45  ...  0.302
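
Another way to get this expected output today, with peak memory bounded, is chunked reading with the condition applied per chunk. A minimal sketch (the chunk size is arbitrary; the col_1 >= 1000 condition is again the example from above):

import pandas as pd

# Read in chunks and filter each chunk before concatenating, so rows that
# fail the condition never accumulate in memory.
chunks = pd.read_csv('foo/bar/data.csv', chunksize=100_000)
df = pd.concat(
    (chunk[chunk['col_1'] >= 1000] for chunk in chunks),
    ignore_index=True,
)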

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.0
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.12
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.0
lxml.etree : 4.3.4
html5lib : None
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.0

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 20 (15 by maintainers)

Top GitHub Comments

1 reaction
gfyoung commented, Feb 19, 2020

> My first intention and thought was to make things better in terms of performance (reducing memory pressure) and to offer a more user-friendly convention in terms of interaction (one that would make it easier to trim out part of the data).

Those are certainly things you could contribute without any objections! That will also give you a chance to take a look at how we implement skiprows (across engines).

0 reactions
jorisvandenbossche commented, Apr 9, 2020

> @jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?

Yes, this is coming to pyarrow (it will probably not get into the 0.17 release next week, but certainly the one after that).

So I think the way to go here is to work on enabling the pyarrow engine in read_csv, so that a filter can be passed to pyarrow to actually filter while reading (https://github.com/pandas-dev/pandas/pull/31817).
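
For reference, filtering while reading with pyarrow directly might look roughly like the dataset API sketch below. This targets a more recent pyarrow than the 0.17 release mentioned above, and it is not the pandas pyarrow engine itself, just an illustration of a filter expression being applied during the scan:

import pyarrow.dataset as ds

# Sketch: declare the CSV as a dataset and pass a filter expression to the
# scan, so filtered-out rows are never materialised on the pandas side.
dataset = ds.dataset('foo/bar/data.csv', format='csv')
table = dataset.to_table(filter=ds.field('col_1') >= 1000)
df = table.to_pandas()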


