Feature Request: "Skiprows" by a condition or set of conditions
Code Sample, a copy-pastable example if possible
# data file path: foo/bar/data.csv
col_1 col_2 ... col_n
0 940 45 ... 0.023
1 1040 52 ... 0.430
2 1530 45 ... 0.302
3 753 43 ... 0.450
4 890 32 ... 0.023
schema={
"col_1": int,
...,
"col_n": float
}
mask = (col_1 >= 1000)
df = pd.read_csv('foo/bar/data.csv', skiprows=mask, dtype=schema)
Problem description
Pandas read methods currently support skipping rows by index through the skiprows parameter. It would be useful to extend skiprows so that it can also take conditional statements, as many pandas objects do. I anticipate that this feature may not be useful to everyone, but it would certainly ease the pain of people who deal with many (large) files. I think memory usage would drop drastically in some situations.
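For reference, here is a minimal sketch of what skiprows accepts today: a list of 0-based line numbers, or a callable that is evaluated against each line index. The inline CSV text is an assumption made for illustration (comma-separated, and only three of the columns from the sample above), not the real data file.

```python
import io

import pandas as pd

# Toy stand-in for foo/bar/data.csv (assumed comma-separated, three columns only).
CSV_TEXT = """col_1,col_2,col_n
940,45,0.023
1040,52,0.430
1530,45,0.302
753,43,0.450
890,32,0.023
"""

# skiprows as a list of 0-based line numbers in the file; line 0 is the header.
by_list = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=[1, 4, 5])

# skiprows as a callable: it receives the line index, never the row's values,
# so a condition such as col_1 >= 1000 cannot be expressed this way.
by_callable = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=lambda i: i in {1, 4, 5})

print(by_list)
```

In both forms the lines to skip have to be known before the file is parsed, which is the limitation this request is about.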
I am opening this issue hoping it will be welcomed on the dev side. I am not opening it because it looked like a fancy request at first glance, but because I think it could improve many people’s work. It might later serve as a schema reference as well, similar to schema validation of a data set.
NOTE: I am not sure how this can be achieved, or even whether it can be done at all. It may not be applicable to pandas. Please also treat this as brainstorming.
Expected Output
This requires passing the column names and dtypes (the schema) before reading the file, but I am not sure how to convey the mask for skiprows. It should be similar to pandas.DataFrame conditions and masks, but we can only pass column names, since there is no pandas.DataFrame object yet.
col_1 col_2 ... col_n
0 1040 52 ... 0.430
1 1530 45 ... 0.302
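Until something like this exists, one possible workaround is to read the file in chunks and filter each chunk before keeping it. The sketch below is only an illustration under assumptions: read_csv_where is a hypothetical helper name (not a pandas function), the inline CSV is a comma-separated stand-in with three of the sample columns, and the chunk size of 2 is chosen only to make the chunking visible.

```python
import io

import pandas as pd


def read_csv_where(source, predicate, chunksize=2, **read_csv_kwargs):
    """Hypothetical helper: read in chunks, keep rows where predicate(chunk) is True."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        # predicate returns a boolean Series aligned with the chunk's rows.
        kept.append(chunk[predicate(chunk)])
    # Renumber the surviving rows from 0, as in the expected output above.
    return pd.concat(kept, ignore_index=True)


# Toy stand-in for foo/bar/data.csv (assumed comma-separated, three columns only).
CSV_TEXT = """col_1,col_2,col_n
940,45,0.023
1040,52,0.430
1530,45,0.302
753,43,0.450
890,32,0.023
"""

schema = {"col_1": int, "col_2": int, "col_n": float}
df = read_csv_where(
    io.StringIO(CSV_TEXT),
    lambda chunk: chunk["col_1"] >= 1000,
    dtype=schema,
)
print(df)
```

Every row is still parsed, so this does not save parsing time, but only one chunk plus the rows already kept are held in memory at a time.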
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.18.0
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.12
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.0
lxml.etree : 4.3.4
html5lib : None
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.0
Issue Analytics
- Created 4 years ago
- Comments: 20 (15 by maintainers)
Top GitHub Comments
Those are certainly things you could contribute without any objections! That will also give you a chance to look at how we implement skiprows (across engines).
Yes, this is coming to pyarrow (it probably will not get into the 0.17 release next week, but certainly the next).
So I think the way to go here is to work on enabling the pyarrow engine in read_csv, and then a filter can be passed to pyarrow to actually filter while reading (https://github.com/pandas-dev/pandas/pull/31817).