Discussion of Catalogs re Data Packages

Need to think further about this. Removed the material below from the current spec since this is not finalized.

Current Primary Proposal

Make your registry into a (tabular) Data Package. A real-life example is here:

https://github.com/datasets/registry

Here’s the rough structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of its GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually the same as the name of the repository)
  • owner: the username of the owner of the package. For datasets on GitHub this is the GitHub username

name and owner are both optional.
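
A minimal sketch of how a tool might read such a catalog.csv, assuming the three columns described above and treating only url as required. The function name `read_catalog` and the sample rows are hypothetical and not part of the proposal.

```python
import csv
import io


def read_catalog(text):
    """Parse catalog.csv text into a list of dicts with url, name and owner."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        url = (row.get("url") or "").strip()
        if not url:
            raise ValueError("catalog row is missing the required 'url' value")
        entries.append({
            "url": url,
            # name and owner are optional, so empty cells become None
            "name": (row.get("name") or "").strip() or None,
            "owner": (row.get("owner") or "").strip() or None,
        })
    return entries


# Hypothetical sample rows, for illustration only
sample = (
    "url,name,owner\n"
    "https://example.org/repos/gdp,gdp,example-user\n"
    "https://example.org/repos/population,,\n"
)
catalog = read_catalog(sample)
```

A consumer would then fetch each entry's datapackage.json from the listed url to fill in any missing name or owner.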


# OLD

Options

Option 1

[ 
   { data-package },
   { data-package }
]

Option 2

{ 
   dp-id: { data-package },
   dp-id: { data-package }
}

Option 3

 {
    dataPackageCatalogVersion: [an integer indicating version of the spec this corresponds to]
    dataPackages: 
      like option 1 or 2 ...
    ...
 }
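
To compare the three options, here is a rough sketch of a reader that accepts any of them and returns the same list of (id, data package) pairs. It assumes that in Option 1 each package's own name can serve as its id; `normalize_catalog` is a hypothetical helper, not something defined by the spec.

```python
def normalize_catalog(catalog):
    """Return a list of (dp_id, data_package) pairs for any of the three shapes."""
    # Option 3: a wrapper object holding a spec version and the packages
    if isinstance(catalog, dict) and "dataPackages" in catalog:
        catalog = catalog["dataPackages"]
    # Option 1: a plain list; assume each package's "name" is usable as its id
    if isinstance(catalog, list):
        return [(pkg.get("name"), pkg) for pkg in catalog]
    # Option 2: an object keyed by data package id
    if isinstance(catalog, dict):
        return list(catalog.items())
    raise TypeError("unrecognized catalog shape")


option_1 = [{"name": "a"}, {"name": "b"}]
option_2 = {"a": {"name": "a"}, "b": {"name": "b"}}
option_3 = {"dataPackageCatalogVersion": 1, "dataPackages": option_1}
```

One consequence of this sketch: an Option 2 catalog containing a package whose id is literally dataPackages would be misread as Option 3, which is one argument for the explicit wrapper.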

Existing material

Catalogs and Discovery

In order to find Data Packages, tools may make use of a “consolidated” catalog, either online or locally.

A general specification for (online) Data Catalogs can be found at http://spec.datacatalogs.org/.

For local catalogs on disk we suggest locating the file at “HOME/.dpm/catalog.json” and giving it the following structure:

 {
    version: ...
    datasets: {
      {name}: {
        {version}: {
          metadata: {metadata},
          bundles: [
            {
              url: ...
              type: file, url, ckan, zip, tgz
            }
          ]
        }
      }
    }
 }

When Package metadata is added to the catalog, a field called bundle is added, pointing to a bundle for this item (see below for more on bundles).
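
A small sketch of how a tool could look up bundles in such a local catalog once it has been loaded. It assumes one reading of the structure above (datasets keyed by name, then by version, each holding metadata and a list of bundles); `find_bundles` and the sample values are hypothetical.

```python
def find_bundles(catalog, name, version):
    """Return the bundle list for one dataset version, or [] if it is absent."""
    versions = catalog.get("datasets", {}).get(name, {})
    return versions.get(version, {}).get("bundles", [])


# Hypothetical catalog content, shaped like the structure sketched above
local_catalog = {
    "version": 1,
    "datasets": {
        "gdp": {
            "1.0": {
                "metadata": {"name": "gdp"},
                "bundles": [{"url": "https://example.org/gdp.zip", "type": "zip"}],
            }
        }
    },
}

gdp_bundles = find_bundles(local_catalog, "gdp", "1.0")
```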

Issue Analytics

  • State: closed
  • Created 10 years ago
  • Comments: 20 (17 by maintainers)

Top GitHub Comments

1 reaction
micimize commented, Apr 25, 2019

@rufuspollock sure: So, for a “mixed” package registry:

{
  "profile": "data-package-catalog",
  "name": "climate-change-packages",
  "resources": [
    {
      // this would probably actually be a custom profile,
      // like "aq-deployment-data-package"
      "profile": "json-data-package",
      "name": "beacon-network-description",
      "path": "http://beacon.berkeley.edu/hypothetical_deployment_description.json"
    },
    {
      "profile": "tabular-data-package",
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "profile": "tabular-data-package",
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

Or, for a ubiquitous package registry where the profile constrains the valid resources:

{
  "profile": "tabular-data-package-catalog",
  "name": "datahub-climate-change-packages",
  "resources": [
    {
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

I think all fields could be hypothetically optional except "path", as it can be used to pull the rest.
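
As a rough illustration of the "only path is required" idea, the sketch below checks a catalog descriptor shaped like the two examples above. Both helpers are hypothetical, and the rule that a profile ending in "-catalog" implies the resource profile is an assumption made for this sketch, not something the thread settles.

```python
def catalog_problems(descriptor):
    """List problems found in a catalog descriptor; an empty list means none."""
    problems = []
    for index, resource in enumerate(descriptor.get("resources", [])):
        path = resource.get("path")
        if not isinstance(path, str) or not path:
            problems.append(f"resource {index}: missing required 'path'")
    return problems


def effective_profile(descriptor, resource):
    """Guess the profile of a resource from the catalog's own profile."""
    catalog_profile = descriptor.get("profile", "")
    # Assumption: "<x>-catalog" constrains every resource to profile "<x>",
    # except the generic "data-package-catalog", which allows mixed profiles.
    if catalog_profile != "data-package-catalog" and catalog_profile.endswith("-catalog"):
        return catalog_profile[: -len("-catalog")]
    return resource.get("profile")


tabular_catalog = {
    "profile": "tabular-data-package-catalog",
    "name": "datahub-climate-change-packages",
    "resources": [
        {"path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"},
        {"name": "co2-fossil-global"},  # path deliberately left out
    ],
}
problems = catalog_problems(tabular_catalog)
```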

1 reaction
bnvk commented, Sep 21, 2015

Seems like some other values might be desirable, such as:

url, name, owner, keywords, last_updated
...

So in the case where the dataset is NOT hosted on GitHub, perhaps the owner value could be better scoped by something like:

@github-username
name@domain.com
https://domain.com