Parallel data ingest can fail on malformed data if splitting is unfortunate
See more in PR #2828.
Details: https://github.com/modin-project/modin/issues/2845#issuecomment-1164484092:
I believe I have (accidentally) found the root cause for this issue while debugging some CI failure.
Imagine the scenario:
- We have some `.csv` file which we call `pd.read_csv()` against
- That `.csv` file is a bit incorrect, e.g. some lines in it have too many fields (like in `test_read_csv_error_handling()`)
- If partition split happens in a way that the very first line in a `.csv` chunk is a bad one, pandas would happily parse that chunk assuming that it’s other lines being shorter, not that one being invalid
- Other distributed text logic fails then, because chunks are uneven and their indices could go strange.
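The pandas side of this scenario can be reproduced without any distributed machinery. The snippet below is a minimal sketch, assuming a chunk is parsed with `header=None` so that the column count is inferred from its first line; the sample data and the `try_parse` helper are made up for illustration and are not taken from the Modin test suite.

```python
import io

import pandas as pd

# Same three rows in both inputs; only the position of the
# over-long row (4 fields instead of 3) differs.
bad_line_later = "1,2,3\n4,5,6,7\n8,9,10\n"
bad_line_first = "4,5,6,7\n1,2,3\n8,9,10\n"


def try_parse(text):
    """Hypothetical helper: return the parsed frame, or the parser error."""
    try:
        return pd.read_csv(io.StringIO(text), header=None)
    except pd.errors.ParserError as exc:
        return exc


later = try_parse(bad_line_later)
first = try_parse(bad_line_first)

# Bad line after a good one: pandas rejects the input.
print(type(later).__name__)

# Bad line first: pandas accepts it as a 4-column frame and
# pads the shorter rows with NaN in the last column.
print(first)
```

The same rows either raise or parse silently depending only on which line comes first, which is exactly what a chunk boundary can change.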
I believe this could be relevant for other distributed text file parsers, as we use the same technique almost everywhere.
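To show why this affects any parser that splits a text file and parses the pieces independently, here is a toy model. It is not Modin's actual splitting code: `split_into_chunks` and `inferred_width` are hypothetical helpers, and the rule "column count comes from the chunk's first row" is an assumption standing in for the parser behaviour described above.

```python
import csv
import io


def split_into_chunks(text, n_chunks):
    """Hypothetical helper: split text into groups of whole lines."""
    lines = text.splitlines(keepends=True)
    size = -(-len(lines) // n_chunks)  # ceiling division
    return ["".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def inferred_width(chunk):
    """Toy stand-in for a parser that takes the column count from the first row."""
    first_row = next(csv.reader(io.StringIO(chunk)))
    return len(first_row)


# The third line has one field too many.
data = "1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

chunks = split_into_chunks(data, 2)
widths = [inferred_width(chunk) for chunk in chunks]
print(widths)  # [3, 4]
```

With two chunks the bad line lands at the start of the second chunk, so the two pieces disagree on the number of columns and cannot be concatenated cleanly.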
I have filed https://github.com/pandas-dev/pandas/issues/47490, let’s see what they respond with.