name and id as identifiers for Data Packages
Currently Data Packages must have a `name` attribute but do not have an `id` attribute.
There has been debate about both the semantics (e.g. uniqueness) of the name
field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.
Proposal
Two identifier fields:
- `name`: SHOULD be present (and is certainly required for installation etc). The name is human-meaningful and is designed to support both resolution (protocol to be determined) and easy use by humans, e.g. in data dependencies. (Open question: should this be a MUST?)
- `id`: MAY be present. If present, it MUST be globally unique. Propose it is a UUID (128 bits, 36 characters in its canonical string form) or similar.
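As a rough illustration of the two fields side by side, the sketch below builds a minimal descriptor. It assumes a random (version 4) UUID is an acceptable `id`, which the proposal does not settle; the name `finance-vix` is borrowed from the examples in this issue.

```python
import json
import uuid

# Hypothetical descriptor carrying both identifier fields.
descriptor = {
    "name": "finance-vix",     # human-meaningful name
    "id": str(uuid.uuid4()),   # assumed id scheme: random (version 4) UUID
}

print(json.dumps(descriptor, indent=2))
# The canonical string form of a UUID is 36 characters (32 hex digits + 4 hyphens).
print(len(descriptor["id"]))
```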
What is the structure of `name`?
`name` may only contain lowercase alphanumeric characters plus `_`, `-` and `.`, with `/` as a separator (open question: should we allow other URL-compatible values, e.g. `:`?)
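One possible reading of this character rule, sketched as a validator. The regular expression and the function name `is_valid_name` are illustrative assumptions, not part of the proposal; in particular, `:` is rejected here only because the question of allowing it is still open.

```python
import re

# Assumed reading: each "/"-separated part is one or more of
# lowercase letters, digits, "_", "-" and "."; empty parts are rejected.
NAME_PART = r"[a-z0-9_.-]+"
NAME_RE = re.compile(rf"{NAME_PART}(?:/{NAME_PART})*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` uses only the allowed characters and separators."""
    return NAME_RE.fullmatch(name) is not None


for candidate in ["finance-vix", "data.gov.uk/xyz", "Finance-VIX", "a//b"]:
    print(candidate, is_valid_name(candidate))
```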
Option 1 - 3 part
Name has the following structure:
[registry/[owner-or-namespace/]]local-name
The primary Data Package registry (assuming there is one) will have the special registry name `dp`.
`local-name` MUST NOT contain a `/`.
# single-part - for resolution one would anticipate these implicitly become
# `{primary-registry}/core/{name}`
finance-vix
#2 part: `registry/local-name`
# Propose that namespace MUST
# either come from a designated central data package registry if / when we have one e.g. `core/gdp`
# OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
datahub.io/xyz
data.gov.uk/xyz
#3 part:
doi/{doi} # {doi} usually has /
github.com/rgrp/court-decisions-gb
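The structure above could be parsed roughly as follows. This is a sketch under stated assumptions: a two-part name is read as `registry/local-name`, single-part names default to the `dp` registry and `core` namespace as the comments suggest, and names with more than three parts (for example a DOI containing a `/`) are rejected rather than guessed at. `PackageName` and `parse_name` are hypothetical names.

```python
from typing import NamedTuple, Optional

PRIMARY_REGISTRY = "dp"     # special registry name from the proposal
DEFAULT_NAMESPACE = "core"  # implied namespace for single-part names


class PackageName(NamedTuple):
    registry: str
    namespace: Optional[str]
    local_name: str


def parse_name(name: str) -> PackageName:
    """Split `[registry/[owner-or-namespace/]]local-name` into its parts."""
    parts = name.split("/")
    if any(part == "" for part in parts):
        raise ValueError(f"empty component in name: {name!r}")
    if len(parts) == 1:
        return PackageName(PRIMARY_REGISTRY, DEFAULT_NAMESPACE, parts[0])
    if len(parts) == 2:
        # Assumption: the first component is the registry, with no namespace.
        return PackageName(parts[0], None, parts[1])
    if len(parts) == 3:
        return PackageName(parts[0], parts[1], parts[2])
    raise ValueError(f"too many components in name: {name!r}")


print(parse_name("finance-vix"))
print(parse_name("datahub.io/xyz"))
print(parse_name("github.com/rgrp/court-decisions-gb"))
```

Because `local-name` must not contain a `/`, the last component is always the local name; how a DOI with an embedded `/` should fit this scheme is left open here, as it is in the proposal.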
Asides
- I did think about having an initial "scheme" value, e.g. `dp/core/abc` or `www/data.gov.uk/xyz`, but felt we were starting to reinvent the URL wheel a bit too much …
- One option I thought about was keeping `name` single-valued and having `id` support the multipart option.
- What about just using DOI? Answer: DOI requires a relatively complex registration process in order to be able to issue DOIs. We want anyone to be able to create data packages.
- Why not just use URIs / URLs? That is an option and we should think more about it. Main disadvantages are:
  - They are somewhat cumbersome
  - They are liable to breakage, e.g. if a registry simply moves URL … (but that may create problems with the above too?)
  - They do not translate well to local installation
  - They implicitly create a relation between name and URL resolution: what happens if you don't control any URL space?
Use Cases
Why does having an identifier matter? What is it used for? At the moment the use cases are not very clear.
- To refer to a package, e.g. in the dependencies (`dataDependencies`) field
  - Could we just use URLs? Probably: see http://dataprotocols.org/data-package-identifier/
- To use in tooling, e.g. `dpm install {data-package-name}`
  - Could we just use URLs or some other more complex identifier? Probably: see http://dataprotocols.org/data-package-identifier/
- Discovery: using an identifier we can locate a data package [in a registry]
  - Question: which registry, or more generally what is the "resolution" protocol? See http://dataprotocols.org/data-package-identifier/ for more on this
- To support storing and management in a catalog or registry
  - Online, e.g. CKAN
  - Or locally, e.g. `.datapackages` or similar: a local store or cache
Note also @amercader comment: “As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one.” – though my question is why do you want to decide if it is the same?
Context
- Check out Zooko’s Triangle. For names, it is hard to have more than 2 of:
- meaningful (for humans)
- decentralized
- secure / non-colliding
Aims for `name`:
- be human-usable and usable in dependencies
- make non-collision possible and likely, but not guaranteed
- be partially distributed
Content-based naming / addressing
One attractive approach to naming that is both secure and decentralized is content-based naming based on hashes. The basic idea is you name content via the (e.g. sha1) hash of the content.
This is attractive and clever but does have 2 drawbacks:
- The name changes if the content changes (this could be a feature rather than a bug)
- The name is an opaque long string
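A minimal sketch of the idea, assuming the "content" is the raw bytes of a file or descriptor and using SHA-1 as in the example above (`content_name` is a hypothetical helper):

```python
import hashlib


def content_name(content: bytes) -> str:
    """Name content by the hex digest of its SHA-1 hash."""
    return hashlib.sha1(content).hexdigest()


original = b"date,vix\n2014-01-02,14.23\n"
edited = b"date,vix\n2014-01-02,14.24\n"

print(content_name(original))
print(content_name(edited))
```

The two printed names differ even though only one character of the content changed, and each is a 40-character hex string, which illustrates both drawbacks listed above.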
Top GitHub Comments
This seems like a good compromise. Global uniqueness will be hard to guarantee and only meaningful within the datapackage space. Allowing the `id` to be a platform-specific unique id will make it easier to use datapackages in those platforms. This would allow ids such as:
These IDs will be resolvable within the specific platform but also meaningful when viewing the datapackage outside that platform. It is clear what subspace these IDs come from and where they are guaranteed to be unique.
AGREED: will do as separate PR:
- `id` field which MUST be unique (e.g. uuid, doi etc)
- `name` field is MAY and can be anything you like within reason …