
"Chunks" and resources split over multiple physical files

See original GitHub issue

(Original title: Support concatenations of CSV files in a Tabular Data Package)

Dealing with large datasets split across multiple CSVs is a common enough problem that we should support a way of specifying concatenation directly in the datapackage.json.

Originally referenced here and here as part of the Fiscal Data Package work, @pwalsh had the suggestion to allow path, data, and url in a resource object to be arrays that a given implementation would transparently concatenate and treat as a single entity. Given that the data property of a resource specifies inline data (i.e. concatenation is irrelevant), I will amend this to a recommendation allowing path and url to be arrays.

@pwalsh’s original post below:

I thought about this more. I think we should definitely support the basic use case of multiple files of the “same” data, and we should do it as part of the core Tabular Data Package spec, on a resource object.

e.g., given:

budget1.csv
amount, date

budget2.csv
amount,date

We would not have 2 distinct resources: rather, we’d allow path, data and url to be arrays:

{
  "name": "My Stuff",
  "resources": [
    {
      "path": ["budget1.csv", "budget2.csv"]
    }
  ]
}

I don’t like having a distinct “concatenations” property: this pattern is a much better representation of the thing we are describing IMHO, and simply extends what already exists.

I agree that this is blurring a line between ETL and metadata.

I think the pros outweigh the cons considering how common this really is out in the wild.

In the various libraries that exist, and that will be further developed, for dealing with Data Packages and the resources therein, it is trivial to extend iteration patterns over files to work on an array of them, or to perform a concatenation first and internally treat the files as one thing (shutil.copyfileobj, or whatever is behind csvstack, for example), as in the sketch below.
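
For illustration, a minimal sketch of that iteration pattern in plain Python (iter_rows is a hypothetical helper, not part of any existing Data Package library):

import csv

# Hypothetical helper: iterate over a resource whose path is an array of
# chunk files, yielding rows as if they were a single file (naive
# concatenation of rows, no header handling).
def iter_rows(paths):
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row

# Usage: treat budget1.csv and budget2.csv as one stream of rows.
# for row in iter_rows(["budget1.csv", "budget2.csv"]):
#     print(row)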

Alternate suggestion from @rgrp for context:

Rather than addressing this in the mapping by having to reference multiple files from each mapped attribute, I propose that instead we have some way to indicate that files should be concatenated, e.g. in a mapping attribute or similar we have:

concatenations: [
  ["budget1.csv", "budget2.csv", ...],
  ["payee1.csv", "payee2.csv"]
]

Thoughts? @pudo @rgrp @jindrichmynarz @pwalsh

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 24 (22 by maintainers)

Top GitHub Comments

2 reactions
rufuspollock commented, Aug 11, 2016

OK I will start a WIP on a draft text - thoughts welcome

### A Resource in Multiple Files ("Chunked Resources")

Usually a resource has a single file associated with it containing the data for that resource.

Sometimes, however, it may be convenient to have a single resource whose data is split across
multiple files -- perhaps the data is large and having it in one file would be inconvenient.

To support this use case we allow the `path` property to be an array of file paths rather
than a single file path.
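
As a minimal sketch of what this would mean for consuming libraries (resource_paths is a hypothetical helper, not from the spec), both the existing single-path form and the proposed array form can be normalized to a list:

# Hypothetical helper: normalize a resource's "path" so that existing
# single-string values and the proposed array values are handled uniformly.
def resource_paths(resource):
    path = resource.get("path")
    if path is None:
        return []
    return [path] if isinstance(path, str) else list(path)

# resource_paths({"path": "budget.csv"})                   -> ["budget.csv"]
# resource_paths({"path": ["budget1.csv", "budget2.csv"]}) -> ["budget1.csv", "budget2.csv"]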

0 reactions
rufuspollock commented, Dec 1, 2016

The one remaining question is what to do with headers:

My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of skipInitialRows or headerRow etc. - see #326
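
As a rough sketch of the difference (plain Python; the parameter name is illustrative, not taken from #326 or the spec), header-aware concatenation would emit the header once and skip the repeated header row in each subsequent chunk, whereas naive concatenation simply yields every row:

import csv

# Illustrative sketch only: "header_row" loosely mirrors the headerRow /
# skipInitialRows idea and is not a spec-defined name.
def iter_tabular_rows(paths, header_row=True):
    header = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if header_row:
                first = next(reader, None)  # header of this chunk
                if header is None and first is not None:
                    header = first
                    yield header            # emit the header only once
            for row in reader:
                yield row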
