BUG: on_bad_lines=callable does not invoke callable for all bad lines

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [29]:
import pandas as pd
pd.__version__
Out [29]:
'1.4.3'

In [30]:
len(open("bad.csv").readlines())
Out [30]:
3

In [31]:
df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'


In [32]:
df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]:
len(df1), len(df2)
Out [33]:
(1, 1)

Issue Description

The data file (contents shown below) has a header plus two data rows. Line 2 is valid; line 3 is bad.

For df1, I’m setting on_bad_lines='warn', and I see a warning for line 3.

For df2, I’m passing on_bad_lines=print, and nothing is printed: the bad line is silently skipped.

❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl
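The warning text above matches the message that Python's standard csv module raises in strict mode, so the failure on line 3 can be sketched without pandas. This is a minimal sketch, assuming the python engine delegates to csv.reader with strict parsing; parse_strict is a hypothetical helper, not part of pandas or the csv module.

```python
import csv

HEADER = "country,founded,id,industry,linkedin_url,locality,name,region,size,website"
BAD_ROW = (
    'chile,"",contacto-corporación-colina,hospital & health care,'
    'linkedin.com/company/contacto-corporación-colina,colina,'
    '"contacto \\" corporación colina",santiago metropolitan,11-50,'
    'corporacioncolina.cl'
)


def parse_strict(line, **dialect_options):
    """Hypothetical helper: parse one CSV line with the stdlib reader in strict mode."""
    return next(csv.reader([line], strict=True, **dialect_options))


# Default dialect: the backslash is an ordinary character, so the quote after it
# closes the field. The next character is a space rather than a delimiter,
# which strict mode rejects with csv.Error.
try:
    parse_strict(BAD_ROW)
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)
print("strict error:", strict_error)

# Declaring the backslash as the escape character keeps the quote inside the field.
fields = parse_strict(BAD_ROW, escapechar="\\")
expected_width = len(HEADER.split(","))
print(len(fields), "of", expected_width, "fields; name =", fields[6])
```

read_csv also accepts an escapechar argument; whether it suits this file depends on how the file was written.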

Expected Behavior

I would expect the bad line to be printed in the second case.

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Dec 16, 2022

You can cross-reference this issue when making a PR, and we can leave this issue open to further discuss a broader scope for “bad line”.

0 reactions
kostyafarber commented, Dec 15, 2022

Okay I can look at making documentation changes to on_bad_lines. Does this need a separate issue opened and a PR against that, as I suppose we can leave this open for discussions on whether code changes are appropriate down the line?

If we open another issue we can discuss how we want to describe “bad lines” to reflect what’s happening there.

Otherwise, I propose something along the lines of telling the user that user-defined callables act only on lines with “too many fields”, whereas 'warn' and 'error' are triggered by any CSV parsing error. What do you think?
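The “too many fields” case can be sketched as follows. This is a minimal illustration, assuming pandas 1.4.0 or later with engine="python"; the sample data and the record_bad_line name are made up for the example and do not come from the issue.

```python
import io

import pandas as pd

# Hypothetical sample: the third line has three fields under a two-column header.
data = "a,b\n1,2\n3,4,5\n6,7\n"

seen = []


def record_bad_line(bad_line):
    """Collect the offending row (a list of strings); returning None skips it."""
    seen.append(bad_line)
    return None


df = pd.read_csv(io.StringIO(data), on_bad_lines=record_bad_line, engine="python")
print(seen)     # rows that were handed to the callable
print(len(df))  # rows that made it into the DataFrame
```

A row rejected by the CSV parser itself, such as line 3 of bad.csv, reportedly never reaches the callable in 1.4.3, which is the gap this issue describes.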


