
"Chunks" and resources split over multiple physical files

See original GitHub issue

(Original title: Support concatenations of CSV files in a Tabular Data Package)

Dealing with large datasets split across multiple CSVs is a common enough problem that we should support a way of specifying concatenation directly in the datapackage.json.

Originally referenced here and here as part of the Fiscal Data Package work, @pwalsh had the suggestion to allow path, data, and url in a resource object to be arrays that a given implementation would transparently concatenate and treat as a single entity. Given that the data property of a resource specifies inline data (i.e. concatenation is irrelevant), I will amend this to a recommendation allowing path and url to be arrays.

@pwalsh’s original post below:

I thought about this more. I think we should definitely support the basic use case of multiple files of the “same” data, and we should do it as part of the core Tabular Data Package spec, on a resource object.

e.g., given:

budget1.csv
amount, date

budget2.csv
amount,date

We would not have 2 distinct resources: rather, we’d allow path, data and url to be arrays:

{
  "name": "My Stuff",
  "resources": [
    {
      "path": ["budget1.csv", "budget2.csv"]
    }
  ]
}

I don’t like having a distinct “concatenations” property: this pattern is a much better representation of the thing we are describing IMHO, and simply extends what already exists.

I agree that this is blurring a line between ETL and metadata.

I think the pros outweigh the cons considering how common this really is out in the wild.

In the various libraries that exist, and that will be further developed, for dealing with Data Packages and the resources therein, it is trivial to extend iteration patterns over files to work on an array of them, or to perform a concatenation first and internally treat the files as one thing (shutil.copyfileobj, or whatever is behind csvstack, for example), as in the sketch below.
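
For illustration, a minimal sketch of that iteration pattern in plain Python (iter_rows is a hypothetical helper, not part of any existing Data Package library):

import csv

# Hypothetical helper: iterate over a resource whose path is an array of
# chunk files, yielding rows as if they were a single file (naive
# concatenation of rows, no header handling).
def iter_rows(paths):
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row

# Usage: treat budget1.csv and budget2.csv as one stream of rows.
# for row in iter_rows(["budget1.csv", "budget2.csv"]):
#     print(row)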

Alternate suggestion from @rgrp for context:

Rather than addressing this in the mapping by having to reference multiple files from each mapped attribute, I propose that instead we have some way to indicate that files should be concatenated, e.g. in a mapping attribute or similar we have:

concatenations: [
  ["budget1.csv", "budget2.csv", ...],
  ["payee1.csv", "payee2.csv"]
]

Thoughts? @pudo @rgrp @jindrichmynarz @pwalsh

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 24 (22 by maintainers)

Top GitHub Comments

2 reactions
rufuspollock commented, Aug 11, 2016

OK I will start a WIP on a draft text - thoughts welcome

### A Resource in Multiple Files ("Chunked Resources")

Usually a resource has a single file associated with it containing the data for that resource.

Sometimes, however, it may be convenient to have a single resource whose data is split across
multiple files -- perhaps the data is large and having it in one file would be inconvenient.

To support this use case we allow the `path` property to be an array of file paths rather
than a single file path.
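
As a minimal sketch of what this would mean for consuming libraries (resource_paths is a hypothetical helper, not from the spec), both the existing single-path form and the proposed array form can be normalized to a list:

# Hypothetical helper: normalize a resource's "path" so that existing
# single-string values and the proposed array values are handled uniformly.
def resource_paths(resource):
    path = resource.get("path")
    if path is None:
        return []
    return [path] if isinstance(path, str) else list(path)

# resource_paths({"path": "budget.csv"})                   -> ["budget.csv"]
# resource_paths({"path": ["budget1.csv", "budget2.csv"]}) -> ["budget1.csv", "budget2.csv"]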

0 reactions
rufuspollock commented, Dec 1, 2016

The one remaining question is what to do with headers:

My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of skipInitialRows or headerRow etc. - see #326
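
As a rough sketch of the difference (plain Python; the parameter name is illustrative, not taken from #326 or the spec), header-aware concatenation would emit the header once and skip the repeated header row in each subsequent chunk, whereas naive concatenation simply yields every row:

import csv

# Illustrative sketch only: "header_row" loosely mirrors the headerRow /
# skipInitialRows idea and is not a spec-defined name.
def iter_tabular_rows(paths, header_row=True):
    header = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if header_row:
                first = next(reader, None)  # header of this chunk
                if header is None and first is not None:
                    header = first
                    yield header            # emit the header only once
            for row in reader:
                yield row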
