Weird behavior with .read_csv(header=0)
So I was testing both a pandas and a dask solution for an assignment that involved parsing a CSV file, when I found a weird behavior with dask.read_csv().
Let's assume we have the following CSV file (input.csv) with 13 unique rows:
1 city1,date,193    <- This 193 value will be important going forward
2 city2,date,14
(...)
13 city13,date,18
Given this CSV file, we run this script:
import dask.dataframe as dd

df = dd.read_csv("input.csv", header=0, names=["city", "date", "sales"], usecols=["city", "sales"], blocksize=16 * 1024 * 1024)
result = df.groupby("city").sum()
result.compute().to_csv("./output.csv")
This gives us an output CSV file with 2 columns, city and sales, and 12 rows. The CSV has no header info in line 1, and because of header=0 we're "skipping" the first line, so city1 is missing. It was my mistake to put header=0 there, but everything is working as intended here.
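For reference, this is how header=0 combined with names behaves in plain pandas: row 0 is consumed as a header and replaced by the supplied names, so the first data row disappears. A minimal sketch (the inline data is made up to mirror input.csv):

```python
import io
import pandas as pd

data = "city1,date,193\ncity2,date,14\ncity3,date,7\n"

# header=0 + names: row 0 is treated as a header and dropped, so city1 never appears as data.
df = pd.read_csv(io.StringIO(data), header=0, names=["city", "date", "sales"])
print(df["city"].tolist())   # ['city2', 'city3']

# header=None + names: every row is data, city1 is kept.
df2 = pd.read_csv(io.StringIO(data), header=None, names=["city", "date", "sales"])
print(df2["city"].tolist())  # ['city1', 'city2', 'city3']
```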
But groupby().sum() in the second line is not doing much since we only have 13 unique rows. So I replicated those lines 780k times, ending up with a 10,140,000-row file.
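For reproduction, here is a sketch of one way to generate such a file (the 13 base rows are placeholders; only city1's value of 193 matters for the arithmetic below):

```python
# Replicate the 13 base rows N times to build the large input file.
N = 780_000
base_rows = ["city1,date,193"] + [f"city{i},date,{i}" for i in range(2, 14)]
block = "\n".join(base_rows) + "\n"

with open("input.csv", "w", newline="") as f:
    for _ in range(N):
        f.write(block)
```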
This is where things start to get messy.
Remember, city1 has a sales value of 193. Given the new input file, the output file should have a value for city1 of 150,539,807. This is 193 * 779,999 (we have the same rows 780k times, but we're skipping the first line, which is a city1 row).
However, the result for city1 in the output file is 150,542,123. The difference is 2,316 = 12 * 193, so it looks like an additional 12 rows for city1 are being parsed.
What if, instead of 780k times, I repeat the rows half as many times (390k)? The number of additional rows also halves: 6 extra city1 rows are parsed.
Halve again, repeating the rows 195k times, and the value for city1 is 37,634,807, which is exactly 193 * 194,999 because the first row is skipped. So the issue disappears at this point.
But if I replicate the rows 194,999/194,998 times, or go up to 195,001/195,002 times, the first line is not skipped and we also get an additional row in the total count.
The values for the other 12 cities are always, in all cases, what they should be.
Why is this happening? It’s really weird.
If I run the same script with plain pandas, the first line is always skipped and the total for every city is exactly what it should be.
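The pandas equivalent is essentially the following (a sketch; pandas has no blocksize argument, and the output filename here is arbitrary):

```python
import pandas as pd

# Same read_csv arguments as the dask version, minus blocksize.
# Here header=0 drops exactly one city1 row and every total comes out right.
df = pd.read_csv("input.csv", header=0, names=["city", "date", "sales"],
                 usecols=["city", "sales"])
df.groupby("city").sum().to_csv("./output_pandas.csv")
```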
Environment:
- Dask version: 2022.10.0
- Python version: 3.11.0
- Operating System: Windows 10.0.19043
- Install method (conda, pip, source): pip
Top GitHub Comments
The unexpected behavior can all be explained by the `header=0` argument. Since we think (incorrectly) that the 0th row is a header, we prepend a copy of that row to the byte range used for every other partition in the dataset. Partitions 1+ then end up including the copied row as "real" data for some reason. Still trying to figure out what the correct logic is here. I do know that the bug is resolved if we avoid dropping the `header` option when `names` is specified.

Update: Yeah, it looks like we are just trying to avoid using something like `header=1` for partitions >0, because we have the header row prepended to the byte range, and just want "default" header behavior. However, we are failing to handle the `names`/`header=0` case. Therefore, I suggest we do something like:
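To make the mechanism concrete, here is a pandas-only sketch (not dask internals, and not the change being proposed; the chunk contents are invented): when the copied header line is prepended to a later byte range and that chunk is then parsed with `names` but without `header`, the prepended line comes back as an ordinary data row.

```python
import io
import pandas as pd

names = ["city", "date", "sales"]
header_line = "city1,date,193\n"          # row 0 of the file, wrongly treated as a header
chunk = "city5,date,16\ncity6,date,17\n"  # bytes belonging to some later partition

# Prepended "header" + names with header dropped: the city1 line is parsed
# as data, so city1 gains one extra row per partition.
bad = pd.read_csv(io.StringIO(header_line + chunk), names=names, header=None)
print(len(bad))   # 3 rows, including the prepended city1 line

# Keeping header=0 for these partitions would skip the prepended line again.
good = pd.read_csv(io.StringIO(header_line + chunk), names=names, header=0)
print(len(good))  # 2 rows
```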
@ncclementi Yes, only city1, because that's the row that has the "weird" interaction with the `header=0` "targeting" it in line 1. Maybe there's something happening at every block/chunk when using `blocksize`.

So I just checked every block by saving them in separate output files.
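One way to do this is to let dask write one CSV per partition (a sketch; the exact output naming I used may differ):

```python
import dask.dataframe as dd

df = dd.read_csv("input.csv", header=0, names=["city", "date", "sales"],
                 usecols=["city", "sales"], blocksize=16 * 1024 * 1024)

# One output file per partition: block_0.csv, block_1.csv, ...
df.to_csv("block_*.csv")

# Row counts per partition are a quick way to spot the extra rows.
print(df.map_partitions(len).compute())
```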
- N = 780_000
We have 11 blocks as a result. And there’s the problem:
The same behavior seen in part_01.csv appears all the way up to part_10. There we have our 10 additional rows.
The thing is, for N = 780_000 the delta between pandas and dask is 11 rows. So 10 of those rows are the ones being additionally parsed, but there's 1 extra row I don't know where it's coming from.
For N = 200_000 we have just 3 blocks:
For N = 100_000 we have just 1 block, which starts with city2 and ends with city13, so no additional rows are being parsed.
For what it's worth, I guess line 1 is being added to every block/partition as a header except for the first one, and then being parsed (even with `header=0`).

Curious why I didn't have problems with N = 195,000 while I did for N = 195,001/194,999? I didn't check partitions for those values.