Discussion of Catalogs re Data Packages
Need to think further about this. Removed the material below from the current spec since this is not finalized.
Current Primary Proposal
Make your registry into a (tabular) Data Package. A real-life example is here:
https://github.com/datasets/registry
Here’s the rough structure:
datapackage.json
catalog.csv
catalog.csv is a CSV file with the following structure:
url,name,owner
...
url
: URL to the dataset, usually the URL of the GitHub repository
name
: the name of the dataset as set in its datapackage.json (will usually be the same as the name of the repository)
owner
: the username of the owner of the package. For datasets on GitHub this will be the GitHub username
name and owner are both optional.
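As a minimal sketch of how a tool might consume such a catalog.csv, the following uses Python's standard `csv` module. The function name `read_catalog` and the sample rows are hypothetical, and treating empty `name` or `owner` cells as missing is only one possible reading of "optional".

```python
import csv
import io


def read_catalog(fileobj):
    """Yield one dict per catalog row; only `url` is required."""
    reader = csv.DictReader(fileobj)
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        url = (row.get("url") or "").strip()
        if not url:
            raise ValueError(f"line {line_no}: missing required 'url'")
        yield {
            "url": url,
            # name and owner are optional: empty or absent cells become None
            "name": (row.get("name") or "").strip() or None,
            "owner": (row.get("owner") or "").strip() or None,
        }


# Hypothetical sample data, for illustration only
SAMPLE = """url,name,owner
https://github.com/example-owner/example-dataset,example-dataset,example-owner
https://example.com/data/population,,
"""

entries = list(read_catalog(io.StringIO(SAMPLE)))
for entry in entries:
    print(entry)
```

A consumer could fall back to fetching the dataset's own datapackage.json from `url` whenever `name` or `owner` comes back as `None`.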
# OLD
Options
Option 1
[ { data-package }, { data-package } ]
Option 2
{ dp-id: { data-package }, dp-id: { data-package } }
Option 3
{ dataPackageCatalogVersion: [an integer indicating the version of the spec this corresponds to], dataPackages: [like option 1 or 2], ... }
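To make the difference between the three options concrete, here is a rough Python sketch. It assumes each data package's `name` can serve as the `dp-id`, which the options above do not specify. The helper names `to_option_2` and `to_option_3`, the sample descriptors, and the version value `1` are all hypothetical.

```python
import json

# Option 1: a plain list of data package descriptors (hypothetical samples)
option_1 = [
    {"name": "gdp", "title": "Example GDP data"},
    {"name": "population", "title": "Example population data"},
]


def to_option_2(packages):
    """Key each descriptor by an id; using `name` as the id is an assumption."""
    catalog = {}
    for package in packages:
        dp_id = package["name"]
        if dp_id in catalog:
            raise ValueError(f"duplicate data package id: {dp_id}")
        catalog[dp_id] = package
    return catalog


def to_option_3(packages, version=1):
    """Wrap Option 1 or Option 2 content in a versioned envelope."""
    return {"dataPackageCatalogVersion": version, "dataPackages": packages}


option_2 = to_option_2(option_1)
option_3 = to_option_3(option_2)
print(json.dumps(option_3, indent=2))
```

Option 2 gives direct lookup by id at the cost of requiring unique ids, while Option 3 adds room for catalog-level fields such as the spec version.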
Existing material
Catalogs and Discovery
In order to find Data Packages, tools may make use of a “consolidated” catalog, either online or locally.
A general specification for (online) Data Catalogs can be found at http://spec.datacatalogs.org/.
For local catalogs on disk we suggest locating them at “HOME/.dpm/catalog.json” with the following structure:
{
  version: ...
  datasets: {
    {name}: {
      {version}: {
        metadata: {metadata},
        bundles: [
          {
            url: ...
            type: file, url, ckan, zip, tgz
          }
        ]
      }
    }
  }
}
When Package metadata is added to the catalog a field called bundle is added pointing to a bundle for this item (see below for more on bundles).
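A minimal Python sketch of adding a package to such a local catalog follows. It is illustrative only: `add_to_catalog` is a hypothetical helper, the package metadata is assumed to carry `name` and `version` fields, the top-level `version` value `1` is a placeholder, and the sketch follows the `bundles` list from the structure listing rather than a single `bundle` field. Nothing is written to disk.

```python
import json
from pathlib import Path

# Location suggested in the text; this sketch only builds the catalog in memory
CATALOG_PATH = Path.home() / ".dpm" / "catalog.json"

# Bundle types listed in the structure above
BUNDLE_TYPES = {"file", "url", "ckan", "zip", "tgz"}


def add_to_catalog(catalog, metadata, bundle_url, bundle_type):
    """Register one package version and a bundle that points at its data."""
    if bundle_type not in BUNDLE_TYPES:
        raise ValueError(f"unknown bundle type: {bundle_type}")
    versions = catalog.setdefault("datasets", {}).setdefault(metadata["name"], {})
    entry = versions.setdefault(
        metadata["version"], {"metadata": metadata, "bundles": []}
    )
    entry["bundles"].append({"url": bundle_url, "type": bundle_type})
    return catalog


catalog = {"version": 1, "datasets": {}}  # version value is a placeholder
add_to_catalog(
    catalog,
    {"name": "gdp", "version": "0.1.0"},  # hypothetical package metadata
    "https://example.com/gdp-0.1.0.zip",  # hypothetical bundle location
    "zip",
)
print(json.dumps(catalog, indent=2))
```

Persisting the result would be a matter of writing the same JSON to `CATALOG_PATH`, after creating the `.dpm` directory if needed.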
Issue Analytics
- Created 10 years ago
- Comments: 20 (17 by maintainers)
Top GitHub Comments
@rufuspollock sure: So, for a “mixed” package registry:
Or, for a ubiquitous package registry where profile constrains the valid resources:
I think all fields could be hypothetically optional except "path", as it can be used to pull the rest.
Seems like some other values might be desirable, such as:
So in the case where the dataset is NOT hosted on GitHub, perhaps the owner value could be better scoped by something like