"Chunks" and resources split over multiple physical files
See original GitHub issue (Original title: Support concatenations of CSV files in a Tabular Data Package)
Dealing with large datasets split across multiple CSVs is a common enough problem that we should support a way of specifying concatenation directly in the datapackage.json.
Originally referenced here and here as part of the Fiscal Data Package work, @pwalsh had the suggestion to allow path, data, and url in a resource object to be arrays that a given implementation would transparently concatenate and treat as a single entity. Given that the data property of a resource specifies inline data (i.e. concatenation is irrelevant), I will amend this to a recommendation allowing path and url to be arrays.
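As a minimal sketch of what this amendment could mean for a consumer, assuming a resource is a plain dict loaded from datapackage.json and that a bare string stays valid alongside the array form (the helper name resource_sources is hypothetical, not part of any existing library):

```python
def resource_sources(resource):
    """Return the chunk locations of a resource as a list.

    Assumes `path` or `url` holds either a single string or an array of
    strings. `data` is inline, so it is never treated as a list of chunks.
    """
    for key in ("path", "url"):
        if key in resource:
            value = resource[key]
            # Normalise the single-string form to a one-element list.
            return [value] if isinstance(value, str) else list(value)
    return []


print(resource_sources({"path": ["budget1.csv", "budget2.csv"]}))
print(resource_sources({"path": "budget.csv"}))
```

Normalising to a list up front would let the rest of an implementation treat single-file and multi-file resources the same way.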
@pwalsh’s original post below:
I thought about this more. I think we should definitely support the basic use case of multiple files of the “same” data, and we should do it as part of the core Tabular Data Package spec, on a resource object. E.g., given:
budget1.csv
amount,date
budget2.csv
amount,date
We would not have 2 distinct resources: rather, we’d allow path, data and url to be arrays:
{
  "name": "My Stuff",
  "resources": [
    {
      "path": ["budget1.csv", "budget2.csv"]
    }
  ]
}
I don’t like having a distinct “concatenations” property: this pattern is a much better representation of the thing we are describing IMHO, and simply extends what already exists.
I agree that this is blurring a line between ETL and metadata.
I think the pros outweigh the cons considering how common this really is out in the wild.
In the various libs that exist and will be further developed for dealing with DataPackages and resources therein, it is just trivial to extend iteration patterns over files to work on an array of them, or perform some concat first and internally treat the files as one thing (shutil.copyfileobj, or whatever is behind csvstack, for example).
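The "concat first" option can be sketched with the standard library. This is a naive byte-level concatenation that assumes local files. The concat_chunks helper is hypothetical, and the demo file contents are made up for illustration:

```python
import io
import os
import shutil
import tempfile


def concat_chunks(paths, dest):
    """Append each file in `paths`, in order, to the binary file object `dest`."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, dest)


# Demo with two throwaway chunks (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, body in [
        ("budget1.csv", b"amount,date\n10,2015-01-01\n"),
        ("budget2.csv", b"amount,date\n20,2015-02-01\n"),
    ]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(body)
        paths.append(path)

    combined = io.BytesIO()
    concat_chunks(paths, combined)
    print(combined.getvalue().decode())
```

Because the bytes are copied as-is, the header row of the second chunk lands in the middle of the combined output, which is the header question raised further down the thread.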
Alternate suggestion from @rgrp for context:
Rather than addressing this in the mapping by having to reference multiple files from each mapped attribute I propose that instead we have some way to indicate that files should be concatenated e.g. in mapping attribute or similar we have:
concatenations: [
["budget1.csv", "budget2.csv", ...],
["payee1.csv", "payee2.csv"]
]
Thoughts? @pudo @rgrp @jindrichmynarz @pwalsh
Issue Analytics
- Created 8 years ago
- Comments: 24 (22 by maintainers)
Top GitHub Comments
OK I will start a WIP on a draft text - thoughts welcome
The one remaining question is what to do with headers:
--no-header now
My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of skipInitialRows or headerRow etc. - see #326
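To make the two behaviours concrete, here is a rough sketch using Python's csv module. The skip_header flag is a hypothetical stand-in for whatever a property such as skipInitialRows or headerRow might eventually mean. Nothing here reflects agreed spec text, and the sample rows are invented:

```python
import csv
import io


def iter_rows(chunks, skip_header=False):
    """Yield rows from each text file object in `chunks`, in order.

    With skip_header=False this is naive concatenation. With
    skip_header=True the first row of every chunk after the first is
    dropped, on the assumption that each chunk repeats the header.
    """
    for index, chunk in enumerate(chunks):
        reader = csv.reader(chunk)
        if skip_header and index > 0:
            next(reader, None)  # discard the repeated header row, if any
        yield from reader


def demo_chunks():
    # Two in-memory chunks that both start with the same header row.
    return [
        io.StringIO("amount,date\n10,2015-01-01\n"),
        io.StringIO("amount,date\n20,2015-02-01\n"),
    ]


naive = list(iter_rows(demo_chunks()))
tabular = list(iter_rows(demo_chunks(), skip_header=True))
print(naive)
print(tabular)
```

The naive variant keeps every row, including the repeated header, while the second variant keeps only the header of the first chunk.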