BUG: on_bad_lines=callable does not invoke callable for all bad lines
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
In [29]:
import pandas as pd
pd.__version__
Out [29]:
'1.4.3'
In [30]:
len(open("bad.csv").readlines())
Out [30]:
3
In [31]:
df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'
In [32]:
df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')
In [33]:
len(df1), len(df2)
Out [33]:
(1, 1)
Issue Description
The data file has a header plus two data rows. Line 2 is valid; line 3 is bad.
For df1, I’m setting on_bad_lines='warn', and I see a warning for line 3.
For df2, I’m passing on_bad_lines=print, and I don’t see any output; the bad line is silently skipped.
❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl
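The warning text above matches the message that Python's standard csv module raises in strict mode, so the failure on line 3 can probably be reproduced without pandas. The sketch below assumes that the python engine tokenizes with strict csv parsing and no escape character by default; `try_parse` is a hypothetical helper written only for this illustration.

```python
import csv

# Line 3 of bad.csv, copied verbatim (raw strings keep the backslash as-is).
BAD_LINE = (
    r'chile,"",contacto-corporación-colina,hospital & health care,'
    r'linkedin.com/company/contacto-corporación-colina,colina,'
    r'"contacto \" corporación colina",santiago metropolitan,11-50,'
    r'corporacioncolina.cl'
)


def try_parse(line, **fmt):
    """Hypothetical helper: return (fields, error); exactly one is None."""
    try:
        return next(csv.reader([line], strict=True, **fmt)), None
    except csv.Error as exc:
        return None, str(exc)


# Default dialect: the backslash is an ordinary character, so the quote
# after it closes the field and the following space is a parse error.
strict_fields, strict_err = try_parse(BAD_LINE)
print("default dialect:", strict_err)

# With a backslash escape character the same line splits into 10 fields.
esc_fields, esc_err = try_parse(BAD_LINE, escapechar="\\")
print("with escapechar:", len(esc_fields), "fields, name =", esc_fields[6])
```

If that assumption holds, the line is rejected by the tokenizer rather than for having the wrong number of fields, which would explain why the two on_bad_lines modes treat it differently.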
Expected Behavior
I would expect the bad line to be printed in the second case.
Installed Versions
INSTALLED VERSIONS
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
Issue Analytics
- Created a year ago
- Reactions:1
- Comments:14 (11 by maintainers)
You can cross reference this issue when making a PR and we can leave this issue open to further discuss a broader scope for “bad line”
Okay, I can look at making documentation changes to on_bad_lines. Does this need a separate issue opened and a PR against that? I suppose we can leave this one open for discussion of whether code changes are appropriate down the line.
If we open another issue, we can discuss how we want to describe “bad lines” to reflect what’s happening there.
Otherwise, I propose something along the lines of telling the user that user-defined callables act only on lines with “too many fields”, whereas warn and error are triggered by any CSV parsing error. What do you think?
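To make the two kinds of “bad line” in that proposal concrete, here is a minimal sketch that uses only the standard csv module, not pandas internals. `classify_lines` and its callback signature are hypothetical names invented for this illustration. It parses one physical line at a time, so it assumes that no quoted field contains an embedded newline.

```python
import csv


def classify_lines(lines, on_bad_line):
    """Hypothetical pre-scan: report parse errors and over-long rows separately.

    Calls on_bad_line(lineno, kind, payload) for every bad data line.
    """
    header = next(csv.reader([lines[0]], strict=True))
    for lineno, line in enumerate(lines[1:], start=2):
        try:
            row = next(csv.reader([line], strict=True))
        except csv.Error as exc:
            # The tokenizer rejected the line (e.g. a stray quote).
            on_bad_line(lineno, "parse error", str(exc))
            continue
        if len(row) > len(header):
            # The line tokenized fine but has more fields than the header.
            on_bad_line(lineno, "too many fields", row)


sample = [
    "a,b,c",
    "1,2,3",          # valid
    "1,2,3,4",        # too many fields
    r'1,"x \" y",3',  # stray quote inside a quoted field
]

reports = []
classify_lines(sample, lambda n, kind, payload: reports.append((n, kind, payload)))
for lineno, kind, payload in reports:
    print(lineno, kind, payload)
```

A pre-scan like this could serve as a workaround for surfacing both categories through a single callback while the documentation question is being settled.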