Feature Request: "Skiprows" by a condition or set of conditions

Code Sample, a copy-pastable example if possible

# data file path: foo/bar/data.csv
  col_1 col_2  ...  col_n
0   940    45  ...  0.023
1  1040    52  ...  0.430
2  1530    45  ...  0.302
3   753    43  ...  0.450
4   890    32  ...  0.023

schema={
    "col_1": int,
    ...,
    "col_n": float
}
# Proposed usage (not supported by pandas today): a condition on column values
mask = (col_1 >= 1000)
df = pd.read_csv('foo/bar/data.csv', skiprows=mask, dtype=schema)

Problem description

Pandas read methods currently support skipping rows by index with the parameter skiprows. It would be a useful feature if skiprows were extended so that it could take conditional statements, just as many pandas objects do. I realize this feature may not be useful to everyone, but it would certainly ease the pain of people who deal with many (large) files. I think memory usage would drop drastically in some situations.

I am opening this issue hoping for a welcome from the dev side, not because it looked like a fancy request at first glance, but because I think it may affect many people’s work in a good way. It may later be used as a schema reference as well, similar to schema validation of a data set.

NOTE: I am uncertain how this can be achieved, or even whether it can be done at all. It may not apply to the pandas module. Consider this a brainstorm as well.
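For reference, the index-based behaviour of skiprows described above can be sketched as follows. The in-memory CSV is an assumption adapted from the sample in this issue, with the elided columns dropped. Both forms see only line numbers, never column values, which is the limitation this request is about.

```python
import io

import pandas as pd

# Assumed sample data, adapted from the issue (elided columns left out).
CSV_TEXT = """col_1,col_2,col_n
940,45,0.023
1040,52,0.430
1530,45,0.302
753,43,0.450
890,32,0.023
"""

# A list of 0-based file line numbers to skip (line 0 is the header here).
df_list = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=[1, 4])

# A callable that receives the line number and returns True to skip it.
# It cannot inspect the values on that line.
df_callable = pd.read_csv(
    io.StringIO(CSV_TEXT),
    skiprows=lambda i: i > 0 and i % 2 == 0,
)

print(df_list)
print(df_callable)
```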

Expected Output

This requires passing column names and dtypes (a schema) before reading the file, but I am not sure how to convey the mask for skiprows. It should be similar to pandas.DataFrame conditions and masks, but we may only be able to pass column names, since there is no pandas.DataFrame object yet.

  col_1 col_2  ...  col_n
0  1040    52  ...  0.430
1  1530    45  ...  0.302
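Until something like this exists, a similar result can be approximated with the existing chunksize parameter of read_csv by filtering each chunk as it is read. This is only a sketch of a workaround, not the proposed feature: the helper name read_csv_where, its predicate argument, and the in-memory sample data are assumptions made for illustration. The predicate marks the rows to keep, matching the expected output above.

```python
import io

import pandas as pd

# Assumed sample data, adapted from the issue (elided columns left out).
CSV_TEXT = """col_1,col_2,col_n
940,45,0.023
1040,52,0.430
1530,45,0.302
753,43,0.450
890,32,0.023
"""


def read_csv_where(source, predicate, chunksize=100_000, **kwargs):
    """Hypothetical helper: read a CSV in chunks, keeping rows where predicate is True."""
    chunks = pd.read_csv(source, chunksize=chunksize, **kwargs)
    # Each chunk is an ordinary DataFrame, so a boolean mask can be applied to it.
    kept = [chunk[predicate(chunk)] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


schema = {"col_1": int, "col_n": float}
df = read_csv_where(
    io.StringIO(CSV_TEXT),
    lambda chunk: chunk["col_1"] >= 1000,
    chunksize=2,  # tiny on purpose, so the sample spans several chunks
    dtype=schema,
)
print(df)
```

With this approach, roughly one chunk plus the rows kept so far are held in memory at a time, rather than the whole file. Every row is still parsed, so this mainly helps with memory and not with parsing time.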

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.0
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.12
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.0
lxml.etree : 4.3.4
html5lib : None
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.0

Issue Analytics

Created 4 years ago
Comments: 20 (15 by maintainers)

Top GitHub Comments

1 reaction

gfyoung commented, Feb 19, 2020

> My first intention and thought was to make things better in terms of performance (reducing memory pressure) and to offer a more user-friendly convention in terms of interaction (that would make it easier to trim out part of the data).

Those are certainly things you could contribute without any objections! That will also give you a chance to take a look at how we implement skiprows (across engines).

0 reactions

jorisvandenbossche commented, Apr 9, 2020

> @jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?

Yes, this is coming to pyarrow (it probably will not get into the 0.17 release next week, but certainly the next one).

So I think the way to go here is to work on enabling the pyarrow engine in read_csv, and then a filter can be passed to pyarrow to actually filter while reading (https://github.com/pandas-dev/pandas/pull/31817).