Weird behavior with .read_csv(header=0)
So I was testing both a pandas and a dask solution for an assignment that involved parsing a CSV file, when I found a weird behavior with dask.read_csv().
Let's assume we have the following CSV file (input.csv) with 13 unique rows:
1 city1,date,193    <- This 193 value will be important going forward
2 city2,date,14
(...)
13 city13,date,18
Given this CSV file, we run this script:
import dask.dataframe as dd

df = dd.read_csv("input.csv", header=0, names=["city", "date", "sales"], usecols=["city", "sales"], blocksize=16 * 1024 * 1024)
result = df.groupby("city").sum()
result.compute().to_csv("./output.csv")
This gives us an output CSV file with 2 columns, city and sales, and 12 rows. The CSV has no header info in line 1, and because of header=0 we're "skipping" the first line, so city1 is missing. It was my mistake to put header=0 there, but everything is working as intended here.
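For reference, this is how header=0 combined with names behaves in plain pandas: row 0 is consumed as a header and replaced by the supplied names, so the first data row disappears. A minimal sketch (the inline data is made up to mirror input.csv):

```python
import io
import pandas as pd

data = "city1,date,193\ncity2,date,14\ncity3,date,7\n"

# header=0 + names: row 0 is treated as a header and dropped, so city1 never appears as data.
df = pd.read_csv(io.StringIO(data), header=0, names=["city", "date", "sales"])
print(df["city"].tolist())   # ['city2', 'city3']

# header=None + names: every row is data, city1 is kept.
df2 = pd.read_csv(io.StringIO(data), header=None, names=["city", "date", "sales"])
print(df2["city"].tolist())  # ['city1', 'city2', 'city3']
```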
But groupby().sum() in the second line is not doing much since we only have 13 unique rows. So I replicated those lines 780k times, ending up with a 10,140,000-row file.
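For reproduction, here is a sketch of one way to generate such a file (the 13 base rows are placeholders; only city1's value of 193 matters for the arithmetic below):

```python
# Replicate the 13 base rows N times to build the large input file.
N = 780_000
base_rows = ["city1,date,193"] + [f"city{i},date,{i}" for i in range(2, 14)]
block = "\n".join(base_rows) + "\n"

with open("input.csv", "w", newline="") as f:
    for _ in range(N):
        f.write(block)
```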
This is where things start to get messy.
Remember, city1 has a sales value of 193. Given the new input file, the output file should have a value for city1 of 150,539,807. This is 193 * 779,999 (we have the same rows 780k times, but we're skipping the first line, which is a city1 row).
However, the result for city1 in the output file is 150,542,123. The difference is 2,316 = 12 * 193, so it looks like an additional 12 rows for city1 are being parsed.
What if, instead of 780k times, I repeat the rows half as many times (390k)? The number of additional rows also halves: 6 extra city1 rows are parsed.
Halve again, repeating the rows 195k times, and the value for city1 is 37,634,807, which is exactly 193 * 194,999 because the first row is skipped. So the issue disappears at this point.
But if I replicate the rows 194,999/194,998 times, or go up to 195,001/195,002 times, the first line is not skipped and we also get an additional row in the total count.
The values for the other 12 cities are always, in all cases, what they should be.
Why is this happening? It’s really weird.
If I run the same script with plain pandas, the first line is always skipped and the total for every city is exactly what it should be.
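The pandas equivalent is essentially the following (a sketch; pandas has no blocksize argument, and the output filename here is arbitrary):

```python
import pandas as pd

# Same read_csv arguments as the dask version, minus blocksize.
# Here header=0 drops exactly one city1 row and every total comes out right.
df = pd.read_csv("input.csv", header=0, names=["city", "date", "sales"],
                 usecols=["city", "sales"])
df.groupby("city").sum().to_csv("./output_pandas.csv")
```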
Environment:
- Dask version: 2022.10.0
- Python version: 3.11.0
- Operating System: Windows 10.0.19043
- Install method (conda, pip, source): pip
Top GitHub Comments
The unexpected behavior can all be explained by the `header=0` argument. Since we think (incorrectly) that the 0th row is a header, we prepend a copy of that row to the byte range used for every other partition in the dataset. Partitions 1+ then end up including the copied row as "real" data for some reason. Still trying to figure out what the correct logic is here. I do know that the bug is resolved if we avoid dropping the `header` option when `names` is specified.

Update: Yeah, it looks like we are just trying to avoid using something like `header=1` for partitions >0, because we have the header row prepended to the byte range, and just want "default" header behavior. However, we are failing to handle the `names`/`header=0` case. Therefore, I suggest we do something like:
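To make the mechanism concrete, here is a pandas-only sketch (not dask internals, and not the change being proposed; the chunk contents are invented): when the copied header line is prepended to a later byte range and that chunk is then parsed with `names` but without `header`, the prepended line comes back as an ordinary data row.

```python
import io
import pandas as pd

names = ["city", "date", "sales"]
header_line = "city1,date,193\n"          # row 0 of the file, wrongly treated as a header
chunk = "city5,date,16\ncity6,date,17\n"  # bytes belonging to some later partition

# Prepended "header" + names with header dropped: the city1 line is parsed
# as data, so city1 gains one extra row per partition.
bad = pd.read_csv(io.StringIO(header_line + chunk), names=names, header=None)
print(len(bad))   # 3 rows, including the prepended city1 line

# Keeping header=0 for these partitions would skip the prepended line again.
good = pd.read_csv(io.StringIO(header_line + chunk), names=names, header=0)
print(len(good))  # 2 rows
```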
@ncclementi Yes, only city1, because that's the row that has the "weird" interaction with the `header=0` "targeting" it in line 1. Maybe there's something happening at every block/chunk when using `blocksize`.

So I just checked every block by saving them in separate output files.
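One way to do this is to let dask write one CSV per partition (a sketch; the exact output naming I used may differ):

```python
import dask.dataframe as dd

df = dd.read_csv("input.csv", header=0, names=["city", "date", "sales"],
                 usecols=["city", "sales"], blocksize=16 * 1024 * 1024)

# One output file per partition: block_0.csv, block_1.csv, ...
df.to_csv("block_*.csv")

# Row counts per partition are a quick way to spot the extra rows.
print(df.map_partitions(len).compute())
```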
- N = 780_000
We have 11 blocks as a result. And there’s the problem:
The same behavior seen in part_01.csv appears all the way up to part_10. There we have our 10 additional rows.
The thing is, for N = 780_000 the delta between pandas and dask is 11 rows. So 10 of those rows are the ones being additionally parsed, but there's 1 extra row I don't know where it's coming from.
For N = 200_000 we have just 3 blocks:
For N = 100_000 we have just 1 block, which starts with city2 and ends with city13, so no additional rows are being parsed.
For what it's worth, I guess line 1 is being added to every block/partition as a header except for the first one, and then being parsed (even with `header=0`).

Curious why I didn't have problems with N = 195,000 while I did for N = 195,001/194,999? I didn't check partitions for those values.