CI/BUG: pyarrow read_csv deadlock
When trying to figure out the Azure timeout issues, a deadlock appeared to be occurring in the parser code, so pyarrow makes sense as the culprit. Tests with unusual input seem to trigger it, for example some of the parse_dates tests, or, for a specific reproducer, this test:
pandas/tests/io/parser/common/test_ints.py::test_outside_int64_uint64_range
I can’t reproduce this on current pyarrow, but Azure uses 0.17.0, with which I can reproduce a deadlock on macOS just by running that test. It doesn’t happen consistently, but when it does, the process deadlocks to the point that I need to SIGKILL it to stop it, which explains why pytest-timeout didn’t catch it.
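Since the hang is intermittent and only SIGKILL stops it, one way to hunt for it is to run the test in a child process and kill that process from the outside after a deadline. The following is a minimal sketch of that idea, not something taken from the pandas CI; the helper names, the 120-second deadline, and the 20 attempts are all assumptions chosen for illustration.

```python
import subprocess
import sys

# The reproducer named in the issue.
TEST_ID = (
    "pandas/tests/io/parser/common/test_ints.py"
    "::test_outside_int64_uint64_range"
)


def run_with_hard_timeout(cmd, timeout_s):
    """Run cmd in a child process and kill it if it exceeds timeout_s seconds.

    Returns (returncode, timed_out). Hypothetical helper for illustration.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # kill() sends SIGKILL on POSIX, so it does not depend on the
        # child cooperating.
        proc.kill()
        proc.wait()
        return proc.returncode, True


def repeat_reproducer(attempts=20, timeout_s=120):
    """Run the test repeatedly; return the attempts that had to be killed."""
    hung = []
    for attempt in range(attempts):
        cmd = [sys.executable, "-m", "pytest", TEST_ID]
        _, timed_out = run_with_hard_timeout(cmd, timeout_s)
        if timed_out:
            hung.append(attempt)
    return hung
```

Calling `repeat_reproducer()` from a pandas checkout would then report which attempts, if any, had to be killed.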
cc @lithomas1 if any thoughts here
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (5 by maintainers)
I’ll try to test this some more. It is also possible that pyarrow is getting stuck because of the TextIOWrapper that we use on our side to force pyarrow to read StringIO objects, which would be a bug on our side.
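One way to check whether the wrapper is involved would be to take it out of the picture and hand pyarrow a plain bytes buffer instead. The helper below is a hypothetical sketch of that experiment (the name `to_binary_handle` and the UTF-8 default are assumptions); it is not how pandas does the conversion internally.

```python
import io


def to_binary_handle(text_buffer, encoding="utf-8"):
    """Copy the full contents of a StringIO into a new BytesIO.

    Hypothetical helper: it trades an extra in-memory copy for a plain
    binary file-like object with no text wrapper involved.
    """
    return io.BytesIO(text_buffer.getvalue().encode(encoding))
```

The resulting buffer can be passed to `pyarrow.csv.read_csv`, which accepts file-like objects. If the hang still occurs with a plain `BytesIO`, the wrapper is a less likely cause.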
Now that our minimum version is 6.0, I believe we shouldn’t hit this issue anymore. IIRC I was experiencing this with pyarrow 2.0 and had skipped those versions in the CI due to the deadlock.
Closing since we haven’t seen this in a while, but we can reopen if this shows up again.