Contributing a CSV module [RE: dask dataframe `read_csv`]

Following our discussion in dask #8045, I’m seeking to contribute a CSV module to fsspec-reference-maker, and wanted to kick off a feature proposal issue here to clarify some aspects.

Hope this is a decent start, but WIP and comments appreciated (I need to review the role of fsspec in dask more closely).

Firstly, as described in #7, this library is geared towards evenly spaced chunks. The README of this repo gives an example of this with a single variable, or “dimension”, i (the number of chunks), which is used in the example spec along with a fixed length (1000) to produce range(0, i * 1000, 1000).

This of course matches how dask reads CSVs as range(0, size, blocksize), but the whole point of my intervention here is to make them no longer evenly spaced deterministic chunks.
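
As a minimal sketch of what "evenly spaced" means here (the `size` and `blocksize` values are made up for illustration), the offsets depend only on byte positions and never on the bytes found there:

```python
size = 6500        # hypothetical file size in bytes
blocksize = 1000   # hypothetical chunk length in bytes

# Evenly spaced chunk starts, as in range(0, size, blocksize).
offsets = list(range(0, size, blocksize))

# Every chunk has the same length except possibly the last one.
lengths = [min(blocksize, size - start) for start in offsets]

print(offsets)
print(lengths)
```

Recomputing this for the same `size` and `blocksize` always gives the same chunks, which covers both senses of "deterministic" discussed next.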

The word ‘deterministic’ here seems to mean both:

- “based on the byte positions rather than byte content at those positions”, as well as
- “uniquely identifiable, not changing upon recalculation”.

The alternative I am proposing for CSVs to dask only fits the 2nd of these criteria (as it will involve checking bytes, which may result in changes to the offsets, thus chunks may not stay evenly spaced).

Also in #7 there is mention of “callbacks”, and I’m uncertain whether this is the route I should take to achieving this adjustment to the offsets (or something entirely different I should ignore).

I am inclined to copy the conclusions of that issue, that perhaps it is easiest to begin by aiming to produce the explicit ‘version 0’ spec rather than the templated ‘version 1’ until how this works is clearer to me.

As a less significant problem for me, the docs here refer to jinja2 rendered strings, but I don’t see that library as a requirement anywhere here, so I’m not sure how that works (perhaps it is a future feature, I’m noting this library is a recent/future-facing effort).

Here’s my first attempt at an idea of how this would look (filenames vaguely copying the format of the WIT dataset as “base_0_ordinal_count of base_1_cardinal_total”):

The 2nd and 3rd items of each value below reflect offset adjustments that a previous routine has calculated (10, 5, 20, 0, 50, 25). Each adjustment is added to the corresponding evenly spaced offset (1000+10, 2000+5, etc.), and each length is the difference between consecutive adjusted offsets (1010-0 = 1010, 2005-1010 = 995, etc.).
I’m fairly sure that the 3rd item in the gen keys should indicate the length of the chunk that ends at that offset, but please correct me if I’m wrong.

This then gives the spec of a ‘filesystem for partitions within a file’, addressable by filepath plus ‘virtual’ partition index:

{
  "key0": "data",
  "gen_key0": ["/path/to/csv_0_of_1.csv.gz/partition_0", 1010, 1010],
  "gen_key1": ["/path/to/csv_0_of_1.csv.gz/partition_1", 2005, 995],
  "gen_key2": ["/path/to/csv_0_of_1.csv.gz/partition_2", 3020, 1015],
  "gen_key3": ["/path/to/csv_0_of_1.csv.gz/partition_3", 4000, 980],
  "gen_key4": ["/path/to/csv_0_of_1.csv.gz/partition_4", 5050, 1050],
  "gen_key5": ["/path/to/csv_0_of_1.csv.gz/partition_5", 6025, 975]
}
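
The arithmetic behind these values can be reproduced with a short sketch. It assumes the adjustments have already been computed by some earlier routine and that the evenly spaced marks are multiples of 1000; the variable names are only illustrative and are not part of fsspec-reference-maker.

```python
import json

path = "/path/to/csv_0_of_1.csv.gz"
blocksize = 1000
adjustments = [10, 5, 20, 0, 50, 25]  # assumed output of an earlier routine

# Adjusted offsets: each evenly spaced mark plus its adjustment.
offsets = [(i + 1) * blocksize + adj for i, adj in enumerate(adjustments)]

# Lengths: distance from the previous adjusted offset (starting at byte 0).
lengths = [end - start for start, end in zip([0] + offsets[:-1], offsets)]

refs = {"key0": "data"}
for i, (offset, length) in enumerate(zip(offsets, lengths)):
    refs[f"gen_key{i}"] = [f"{path}/partition_{i}", offset, length]

print(json.dumps(refs, indent=2))
```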

- I’m not sure what to put in the ‘key’ entries, so I have removed the ones from the example spec (please let me know if this is inadvisable, and if you have an idea of what should go there instead).
- I presume the one that is currently bearing the bytes b"data" should be storing something important to identify the CSV, but I can’t determine what that is on my first attempt.
  - My understanding is that this will be fed into the OpenFile object as the fs argument, so it should store things relevant to that. Perhaps path? I’m very unsure how this should look though, and suspect that if I guess I’d only end up putting in irrelevant info that will already be passed in.
- For simplicity I’m considering 2 files here, each with 3 offsets (i.e. 4 chunks; the offset starting at 0 is always going to be assumed to be valid: if it’s not, then that’s a corrupt CSV, not the problem I’m seeking to solve here).

As for the matter of identifying the offset adjustments (10, 5, 20, 0, 50, 25), I expect the fastest way to do so is:

- initialise separator_skip_count = separator_skip_offset = 0 at each offset mark (1000, 2000, etc.)
- try pandas.read_csv(nrows=1)
  - catch failure; increment separator_skip_count += 1 if it fails (repeat)
- finally [upon success]
  - break out of the loop
  - use the tell minus the offset to give the ‘offset adjustment’ (assign separator_skip_offset)
    - left as 0 for no adjustment (if separator_skip_count == 0), or a positive integer
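
The loop above might be sketched roughly as follows. This is an illustration under assumptions, not a tested implementation: it swaps the standard-library `csv` module in for `pandas.read_csv(nrows=1)`, works on in-memory bytes rather than a file handle, and the names `row_parses`, `offset_adjustment`, `n_fields` and `window` are all hypothetical. "Success" is approximated as "the candidate position follows a newline and the first record parsed from it has the expected number of fields", which is a heuristic and can be fooled by unusual data.

```python
import csv
import io


def row_parses(chunk: bytes, n_fields: int) -> bool:
    """Heuristic check: does the first CSV record in `chunk` have n_fields fields?"""
    text = chunk.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    try:
        return len(next(reader)) == n_fields
    except (csv.Error, StopIteration):
        return False


def offset_adjustment(data: bytes, offset: int, n_fields: int, window: int = 4096):
    """Return (separator_skip_offset, separator_skip_count) for one offset mark."""
    pos = offset
    separator_skip_count = 0
    while pos < len(data):
        # A genuine row can only start at byte 0 or right after a newline.
        at_line_start = pos == 0 or data[pos - 1 : pos] == b"\n"
        if at_line_start and row_parses(data[pos : pos + window], n_fields):
            return pos - offset, separator_skip_count
        # Otherwise skip past the next newline and try again.
        newline = data.find(b"\n", pos)
        if newline == -1:
            break
        pos = newline + 1
        separator_skip_count += 1
    raise ValueError(f"no valid row start found after offset {offset}")


# Tiny sample with a newline inside a quoted field.
data = b'id,text,n\n1,"hello\nworld",2\n3,plain,4\n'
print(offset_adjustment(data, 15, n_fields=3))
```

In this sample the mark at byte 15 falls inside the quoted field, so the newline at byte 18 is not a row boundary and the sketch skips a second newline before it accepts byte 28 as the start of the next row.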

The separator_skip_count indicating the number of newlines that were skipped after the offset+length to find the ‘genuine’ row-delimiting offset seems redundant to store, but useful while writing/debugging this algorithm.

I say that, but I don’t know: perhaps it’d be inexpensive to recalculate the actual byte offsets from the number of newlines to skip after the offset, rather than store that offset? (Not clear to me yet)

Only the separator_skip_offset needs to be stored: summed with the offset, in the 2nd item of the values (1010, 2005, etc.)

I think at the point that the separator_skip_offset is calculated, the ‘version 1’ spec could be computed, to reduce to the above ‘version 0’ spec, as something like:

{
    "version": 1,
    "templates": {
        "u": "/path/to/csv_0_of_1.csv.gz",
        "f": "partition_{{c}}"
    },
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "{{u}}/{{f(c=i)}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions": {
                "i": {"stop": 5}
            }
        }
    ],
    "refs": {
        "key0": "data"
    }
}

- I may be misunderstanding something by putting the filename within the template rather than as a variable.
- Should gen_key0 be partition0 (etc.)? Or should this stay as gen_key0 to make clear that it’s generated?
- I can’t figure out where the array specifying the separator_skip_offset should go (if I put it in the gen.dimensions key it’ll become a Cartesian product, whereas I want to ‘zip’ it against the i range…).
- Should I change the gen.url key from “url” to something else, since it’s expected to refer to a file path, not a web resource?
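
To make the Cartesian product versus ‘zip’ distinction concrete, here is a plain Python illustration. It does not use the templating machinery of this library, and the three adjustment values are just sample numbers:

```python
from itertools import product

i_values = range(3)
adjustments = [10, 5, 20]  # sample separator_skip_offset values

# Treating both as dimensions yields every combination (3 * 3 = 9 entries).
cartesian = list(product(i_values, adjustments))

# Pairing them positionally yields one adjustment per chunk (3 entries).
zipped = list(zip(i_values, adjustments))

# Only the zipped pairs give the intended adjusted offsets.
offsets = [(i + 1) * 1000 + adj for i, adj in zipped]

print(len(cartesian), zipped, offsets)
```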

Without understanding how to incorporate the offset adjustments into this template, I don’t think I can write the ‘version 1’ spec at this time, but I hope we might be able to figure it out here.

Issue Analytics

Created 2 years ago
Comments: 13 (13 by maintainers)

Top GitHub Comments

martindurant commented, Oct 4, 2021

Note that you probably want to take dask out of the equation too - it might be where you want the files to be processed eventually, but I think you should be able to find valid offsets without it, simplifying the process (at the expense of no parallelism).

martindurant commented, Oct 4, 2021

I would not attempt to solve the compression and parsing issues in one go; I think it would be better to use an uncompressed target at first.
