name and id as identifiers for Data Packages

Currently Data Packages must have a name attribute but do not have an id attribute.

There has been debate about both the semantics (e.g. uniqueness) of the name field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.

Proposal

Two identifier fields:

name: SHOULD be present (and is certainly required for installation etc). The name is human-meaningful and is designed to support both resolution (protocol to be determined) and easy use by humans, e.g. in data dependencies
- (?) Have this as a MUST?
id: MAY be present. If present, it MUST be globally unique. Propose it is a UUID (36 characters in its canonical string form) or similar.
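
As a rough illustration of the id proposal above, here is a minimal Python sketch that generates a random UUID with the standard library and checks whether a string is a UUID in canonical form. The helper names `new_package_id` and `is_uuid` are hypothetical and not part of any Data Package tooling.

```python
import uuid


def new_package_id() -> str:
    """Return a random (version 4) UUID in its canonical string form."""
    return str(uuid.uuid4())


def is_uuid(value: str) -> bool:
    """Return True if value is a UUID written in canonical hyphenated form."""
    try:
        # uuid.UUID accepts several spellings, so compare against the
        # canonical form to reject the looser ones.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


package_id = new_package_id()
```

A random UUID is unique only with very high probability, not by guarantee, and it carries no human meaning, which is why the proposal keeps name alongside it.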

What is the structure of `name`?

name may only contain lower-case alphanumeric characters plus `_`, `-` and `.`, with `/` as a separator (open question: should we allow other URL-compatible characters, e.g. `:`?)
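
A minimal sketch of that character rule as a regular expression, assuming ASCII lower-case letters and digits and assuming that `:` stays disallowed (the question above is still open). `is_valid_name` is a hypothetical helper, not an existing API.

```python
import re

# Assumption: a segment is one or more of a-z, 0-9, "_", "." or "-";
# segments are joined by single "/" separators.
_SEGMENT = r"[a-z0-9_.-]+"
NAME_RE = re.compile(rf"{_SEGMENT}(?:/{_SEGMENT})*")


def is_valid_name(name: str) -> bool:
    """Return True if name uses only the proposed character set."""
    return NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_name("data.gov.uk/xyz")` is true, while `is_valid_name("Finance-VIX")` and `is_valid_name("dp:core")` are false under these assumptions.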

Option 1 - 3 part

Name has the following structure:

[registry/[owner-or-namespace/]]local-name

The primary Data Package registry (assuming there is one) will have the special registry name dp

local-name MUST NOT contain a ‘/’

# single-part - for resolution one would anticipate these implicitly become
# `{primary-registry}/core/{name}`
finance-vix

#2 part: `registry/local-name`
# Propose that namespace MUST
# either come from a designated central data package registry if / when we have one e.g. `core/gdp`
# OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
datahub.io/xyz
data.gov.uk/xyz

#3 part:
doi/{doi}   # {doi} usually has /
github.com/rgrp/court-decisions-gb
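
The one-, two- and three-part forms above could be split as in the following sketch. It is only an illustration of Option 1: `PackageName` and `parse_name` are hypothetical names, the resolution of single-part names to a default registry is left out, and names with more than three segments are rejected because the proposal does not say how to read them.

```python
from typing import NamedTuple, Optional


class PackageName(NamedTuple):
    registry: Optional[str]
    owner: Optional[str]
    local_name: str


def parse_name(name: str) -> PackageName:
    """Split a name of the form [registry/[owner-or-namespace/]]local-name."""
    parts = name.split("/")
    if not all(parts):
        raise ValueError(f"empty segment in name: {name!r}")
    if len(parts) == 1:
        return PackageName(None, None, parts[0])
    if len(parts) == 2:
        return PackageName(parts[0], None, parts[1])
    if len(parts) == 3:
        return PackageName(parts[0], parts[1], parts[2])
    raise ValueError(f"more than three segments in name: {name!r}")
```

Note that a `doi/{doi}` name whose DOI contains a single `/` would be read here as registry `doi`, then the DOI prefix as owner, then the DOI suffix as local-name.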

Asides

I did think about having an initial “scheme” value e.g. dp/core/abc or www/data.gov.uk/xyz but felt we were starting to reinvent the URL wheel a bit too much …
One option I thought about was keeping name single-valued and having id support the multipart option.
What about just using DOI? Ans: DOI requires a relatively complex registration process in order to be able to issue DOIs. We want anyone to be able to create data packages
Why not just use URIs / URLs? That is an option and we should think more about it. Main disadvantages are:
- They are somewhat cumbersome
- Are liable to breakage e.g. if a registry simply moves URL … (but that may create problems for the above too?)
- Do not translate well to local installation
- Implicitly creates relation between name and URL resolution – what happens if you don’t control any url space?

Use Cases

Why does having an identifier matter? What is it used for? At the moment the use cases are not very clear.

To refer to a Data Package from elsewhere, e.g. in the dataDependencies field
- could we just use urls? Probably: see http://dataprotocols.org/data-package-identifier/
To use in tooling e.g. dpm install {data-package-name}
- could we just use urls or some other more complex identifier? Probably - see http://dataprotocols.org/data-package-identifier/
Discovery: using an identifier we can locate a data package [in a registry]
- Question: Which registry - or more generally what is the “resolution” protocol. See http://dataprotocols.org/data-package-identifier/ for more on this
To support storing and management in a catalog or registry
- Online e.g. CKAN
- Or locally e.g. .datapackages or similar - a local store or cache

Note also @amercader comment: “As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one.” – though my question is why do you want to decide if it is the same?

Context

Check out Zooko’s Triangle. For names it is hard to have more than 2 of:
- meaningful (for humans)
- decentralized
- secure / non-colliding

Aims for name:

be human-usable and usable in dependencies
make non-collision possible and likely, but not guaranteed
be partially distributed

Content-based naming / addressing

One attractive approach to naming that is both secure and decentralized is content-based naming based on hashes. The basic idea is you name content via the (e.g. sha1) hash of the content.

This is attractive and clever but does have 2 drawbacks:

The name changes if the content changes (this could be a feature rather than a bug)
The name is an opaque long string
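
Both drawbacks are easy to see in a small Python sketch using the standard `hashlib` module. The function name `content_name` and the sample bytes are illustrative only, and how a multi-file Data Package would be serialised before hashing is not specified here.

```python
import hashlib


def content_name(content: bytes) -> str:
    """Name content by the hex SHA-1 digest of its bytes."""
    return hashlib.sha1(content).hexdigest()


# Illustrative sample bytes: a one-character change yields a different name.
original = content_name(b"a,b\n1,2\n")
revised = content_name(b"a,b\n1,3\n")
```

Each name is a 40-character hexadecimal string, and `original` and `revised` differ even though the content changed by a single character.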

Issue Analytics

Created 8 years ago
Comments: 25 (14 by maintainers)

Top GitHub Comments

joehand commented, Nov 11, 2016

I’m starting to swing towards what I think is the view of @rgrp being that the spec does not need a unique identifier as part of it: it is a platform-specific concern (federated or otherwise).

This seems like a good compromise. Global uniqueness will be hard to guarantee and only meaningful within the datapackage space. Allowing the id to be a platform-specific unique id will make it easier to use datapackages in those platforms. This would allow ids such as:

"id" : "https://doi.org/10.5281/zenodo.166271"

"id" : "dat://f677bd23661a1d5871e40092268d197c73f213f6b8aefebe01709647cfde9528/"

These IDs will be resolvable within the specific platform but also meaningful when viewing the datapackage outside that platform. It is clear what subspace these IDs come from and where they are guaranteed to be unique.
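
One way a tool might work out which subspace an id comes from is to look at its URL scheme and host. The following is a sketch under that assumption, using the standard `urllib.parse` module; `id_subspace` is a hypothetical helper, and treating the host as the subspace for http(s) ids is a choice made here, not something the discussion settles.

```python
from urllib.parse import urlparse


def id_subspace(package_id: str) -> str:
    """Guess the platform subspace an id belongs to.

    Assumption: http(s) ids are scoped by host, other ids by URL scheme.
    """
    parsed = urlparse(package_id)
    if not parsed.scheme:
        return "unknown"
    if parsed.scheme in ("http", "https"):
        return parsed.netloc
    return parsed.scheme
```

With the two ids above, this returns `doi.org` for the Zenodo DOI and `dat` for the dat link; an id with no scheme, such as a bare name, comes back as `unknown`.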

rufuspollock commented, Feb 5, 2017

AGREED: will do as separate PR:

id field, which MUST be unique (e.g. uuid, doi etc)
name field is optional (MAY) and can be anything you like within reason …
