name and id as identifiers for Data Packages

Currently Data Packages must have a name attribute but do not have an id attribute.

There has been debate about both the semantics (e.g. uniqueness) of the name field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.

Proposal

Two identifier fields:

  • name: SHOULD be present (and certainly required for installation etc). Name is human meaningful and is designed to support both resolution (protocol to be determined) and easy use by humans e.g. in data dependencies
    • (?) Have this as a MUST?
  • id: MAY be present. If present MUST be globally unique. Propose it is a UUID (a 128-bit value, 36 characters in its canonical string form) or similar.
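
As a rough illustration of the id proposal above, here is a minimal Python sketch that generates a random (version 4) UUID and places it next to a name in a descriptor. The helper name new_package_id and the descriptor layout are assumptions made for this example; the proposal itself only says "uuid or similar".

```python
import uuid


def new_package_id() -> str:
    """Return a random (version 4) UUID in its canonical string form.

    Hypothetical helper: the proposal does not prescribe a UUID version.
    """
    return str(uuid.uuid4())


package_id = new_package_id()

# Illustrative descriptor only; "finance-vix" is borrowed from the naming
# examples in this issue.
descriptor = {"name": "finance-vix", "id": package_id}
```

The canonical string form is 32 hexadecimal digits plus 4 hyphens, which is where the figure 36 comes from; the underlying value is 128 bits.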

What is the structure of name?

name may only contain lower-case alphanumeric characters plus "_", "-" and ".", with "/" as a separator (should we allow other URL-compatible values, e.g. ":"?)

Option 1 - 3 part

Name has the following structure:

[registry/[owner-or-namespace/]]local-name

The primary Data Package registry (assuming there is one) will have the special registry name dp

local-name MUST NOT contain a ‘/’

# single-part - for resolution one would anticipate these implicitly become
# `{primary-registry}/core/{name}`
finance-vix

# 2 part: `registry/local-name`
# Propose that namespace MUST
# either come from a designated central data package registry if / when we have one e.g. `core/gdp`
# OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
datahub.io/xyz
data.gov.uk/xyz

# 3 part:
doi/{doi}   # {doi} usually has /
github.com/rgrp/court-decisions-gb
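
To make Option 1 concrete, here is a minimal Python sketch of how a tool might split a name into its parts. Everything named here (parse_name, PackageName, the regular expression) is an assumption made for illustration, not part of any agreed spec. It accepts one to three "/"-separated parts, each limited to lower-case alphanumerics plus "_", "-" and ".".

```python
import re
from typing import NamedTuple, Optional

# Assumption: every "/"-separated part uses only lower-case alphanumerics
# plus "_", "-" and ".", as proposed above.
_PART = re.compile(r"[a-z0-9_.-]+")


class PackageName(NamedTuple):
    registry: Optional[str]
    namespace: Optional[str]
    local_name: str


def parse_name(name: str) -> PackageName:
    """Split `[registry/[owner-or-namespace/]]local-name` into its parts."""
    parts = name.split("/")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 parts, got {len(parts)}: {name!r}")
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid characters in part {part!r} of {name!r}")
    # The last part is always the local name, so it never contains "/".
    registry = parts[0] if len(parts) >= 2 else None
    namespace = parts[1] if len(parts) == 3 else None
    return PackageName(registry, namespace, parts[-1])
```

For example, parse_name("github.com/rgrp/court-decisions-gb") yields registry "github.com", namespace "rgrp" and local name "court-decisions-gb". Note that this sketch rejects names with more than three parts, so a DOI containing extra slashes would need separate handling; the proposal leaves that case open.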

Asides

  1. I did think about having an initial “scheme” value e.g. dp/core/abc or www/data.gov.uk/xyz but felt we were starting to reinvent the URL wheel a bit too much …
  2. One option I thought about was keeping name single-valued and having id support the multipart option.
  3. What about just using DOI? Ans: DOI requires a relatively complex registration process in order to be able to issue DOIs. We want anyone to be able to create data packages.
  4. Why not just use URIs / URLs? That is an option and we should think more about it. Main disadvantages are:
    • They are somewhat cumbersome
    • Are liable to breakage e.g. if a registry simply moves URL … (but that may create problems with the above too?)
    • Do not translate well to local installation
    • Implicitly creates relation between name and URL resolution – what happens if you don’t control any url space?

Use Cases

Why does having an identifier matter? What is it used for? At the moment the use cases are not very clear.

Note also @amercader comment: “As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one.” – though my question is why do you want to decide if it is the same?

Context

  • Check out Zooko’s Triangle. For names it is hard to have more than two of:
    • meaningful (for humans)
    • decentralized
    • secure / non-colliding

Aims for name:

  • be human-usable and usable in dependencies
  • make non-collision possible and likely, but not guaranteed
  • be partially distributed

Content-based naming / addressing

One attractive approach to naming that is both secure and decentralized is content-based naming based on hashes. The basic idea is you name content via the (e.g. sha1) hash of the content.

This is attractive and clever but does have 2 drawbacks:

  • The name changes if the content changes (this could be a feature rather than a bug)
  • The name is an opaque long string
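
A minimal Python sketch of the idea, using the standard hashlib module. The function name content_name and the sample CSV bytes are made up for illustration, and the sketch hashes a single byte string; how to hash a multi-file package is not addressed here.

```python
import hashlib


def content_name(content: bytes) -> str:
    """Name a piece of content by the hex SHA-1 digest of its bytes."""
    return hashlib.sha1(content).hexdigest()


# Made-up sample data: two versions that differ by a single character.
original = content_name(b"date,value\n2016-01-01,1\n")
revised = content_name(b"date,value\n2016-01-01,2\n")
```

The same bytes always produce the same name, while the one-character change gives an unrelated name. Both drawbacks listed above are visible: the name tracks the content, and it is an opaque 40-character hex string.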

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 25 (14 by maintainers)

Top GitHub Comments

joehand commented, Nov 11, 2016

I’m starting to swing towards what I think is the view of @rgrp being that the spec does not need a unique identifier as part of it: it is a platform-specific concern (federated or otherwise).

This seems like a good compromise. Global uniqueness will be hard to guarantee and only meaningful within the datapackage space. Allowing the id to be a platform-specific unique id will make it easier to use datapackages in those platforms. This would allow ids such as:

"id" : "https://doi.org/10.5281/zenodo.166271"
"id" : "dat://f677bd23661a1d5871e40092268d197c73f213f6b8aefebe01709647cfde9528/"

These IDs will be resolvable within the specific platform but also meaningful when viewing the datapackage outside that platform. It is clear what subspace these IDs come from and where they are guaranteed to be unique.
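
One way a consumer could tell which subspace an id comes from is to look at its URL scheme and authority. The following Python sketch uses the standard urllib.parse module; the helper name id_subspace is hypothetical, and the sketch assumes ids are URL-shaped like the two examples above.

```python
from typing import Tuple
from urllib.parse import urlparse


def id_subspace(identifier: str) -> Tuple[str, str]:
    """Return (scheme, authority) for a URL-shaped id.

    Hypothetical helper: it only reports where the id claims to come from
    and does not resolve or verify anything.
    """
    parsed = urlparse(identifier)
    return parsed.scheme, parsed.netloc


doi_space = id_subspace("https://doi.org/10.5281/zenodo.166271")
dat_space = id_subspace(
    "dat://f677bd23661a1d5871e40092268d197c73f213f6b8aefebe01709647cfde9528/"
)
```

Here doi_space has scheme "https" with authority "doi.org", and dat_space has scheme "dat" with the hash as its authority, so a tool could dispatch to a platform-specific resolver based on that pair.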

rufuspollock commented, Feb 5, 2017

AGREED: will do as separate PR:

  • id field which MUST be unique (e.g. uuid, doi etc)
  • name field is MAY and can be anything you like within reason …