Release curated data on the Web
Issue #269 made it apparent that keeping the consistency guarantees provided by the NPM packages NPM-only had two limitations:
- it forces consumers to go through NPM, which is at best awkward for non-NPM-based consumers
- it limits the impact of the curation to whatever gets packaged (e.g. the raw IDL files for the WebIDL package): there is no easy way to access the generated JSON files that derive from it
An idea @tidoust and I discussed was to use w3c.github.io/webref to publish the curated view of the data.
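To make the benefit for non-NPM consumers concrete, here is a minimal sketch of what direct access could look like once the curated data is published there; the exact file layout (e.g. an `ed/idlparsed/dom.json` file) is an assumption for illustration, not a decided structure.

```js
// Hypothetical example: fetch one of the generated JSON files directly from
// the published curated data, without going through NPM.
// The path "ed/idlparsed/dom.json" is an assumption for illustration only.
// Works in browsers and in Node.js >= 18 (global fetch), in an ES module.
const url = "https://w3c.github.io/webref/ed/idlparsed/dom.json";
const response = await fetch(url);
if (!response.ok) {
  throw new Error(`Could not retrieve curated data: ${response.status}`);
}
const parsedIdl = await response.json();
console.log(Object.keys(parsedIdl));
```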
This would require:
- moving the anomaly report out of webref (which we have discussed doing for quite some time in any case)
- adapting the release workflows to make publication on the `gh-pages` branch another of their outcomes (a rough sketch of that step follows this list)
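As a very rough illustration of that second point, the publication step could look something like the sketch below. It assumes the release workflow ends up with a `curated/` directory holding the data to publish and has push rights to the repository; this is not the actual webref release code, only an illustration.

```js
// Hypothetical publication step (Node.js): copy the curated data into a
// gh-pages worktree and push it. Folder names, branch handling and the commit
// message are assumptions for illustration, not the actual webref workflow.
import { execSync } from "node:child_process";

const run = (cmd, opts = {}) => execSync(cmd, { stdio: "inherit", ...opts });

function publishCuratedData(curatedDir = "curated") {
  run("git fetch origin gh-pages");
  // Check out gh-pages in a separate worktree so the main checkout is untouched.
  run("git worktree add -B gh-pages ../gh-pages-publish origin/gh-pages");
  run(`cp -R ${curatedDir}/. ../gh-pages-publish/`);
  // A real workflow would skip the commit when nothing changed.
  run('git add -A && git commit -m "Publish curated data" && git push origin gh-pages', {
    cwd: "../gh-pages-publish",
  });
  run("git worktree remove ../gh-pages-publish");
}

publishCuratedData();
```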
One question that needs more thinking is how to manage versioning.

I’m closing the issue as the curation job is now in place and seems to work. I wouldn’t be too surprised if additional tweaks turn out to be needed but let’s handle them separately.
The `@webref/idl@latest` ref will point to the right commit on the `curated` branch as soon as a new version of the `@webref/idl` package gets released (currently blocked by non-distinguishable IDL in Autoplay Policy Detection).

Same thing for the `@webref/css@latest` ref (there was a remaining bug in the release script when `@webref/css@3.0.4` was released, so tags could not be added).

The `@webref/elements@latest` ref… does not exist yet; @dontcallmedom to create it and have it point to `@webref/elements@1.0.4`.

Jotting down notes for discussion on a possible plan.
Suggested plan
In short:
- `main` branch continues to hold the raw data and all the code (same as today)
- `curated` branch gets created to contain the curated data. That new branch is published under https://w3c.github.io/webref/. Curation means applying patches to the raw data and re-generating the `idlnames`, `idlnamesparsed`, and `idlparsed` folders (see the sketch after this list).
- The `curated` branch would actually contain two curated views of the data:
  - `ed` view that contains data for all the specs crawled under the `ed` folder
  - `browser` view that only contains data for specs identified as browser specs (all specs in Webref are browser specs for now, but the goal is to relax that requirement soonish, e.g. to extend the xref database to other types of specs).

  We cannot maintain only one view for both situations because there is no easy way to filter out specs from the `idlnames` and `idlnamesparsed` folders. Data will be duplicated across views as a result, but so be it. It does not seem useful to maintain a view for the `tr` crawl at this stage, although that could be considered later on.
- NPM packages will be released from the data in the `curated` branch. Existing NPM packages will typically be released from the curated data in the `browser` view. When an NPM package gets released, a new version tag is created for the corresponding commit on the `curated` branch. I propose not to introduce a global curated data version for now.
- To avoid “growing” the size of the repo, an alternative approach would be to create one separate repo for the curated data, and another one for the browser-only view of the curated data. If some projects are planning to clone the repos, this would allow them to only get the data they need. Is that needed?
- Anything else?
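A minimal sketch of what the curation and view-building steps could look like, to make the plan above concrete. The `patches/` layout, the structure of `index.json`, the per-spec `categories` filter and the folder names are all assumptions for illustration; none of them are settled, and this is not the actual webref tooling.

```js
// Hypothetical curation sketch (not the actual webref tooling).
// Assumptions: raw data lives in "ed/", patches live in "patches/*.patch",
// and the crawl index exposes a per-spec "categories" array.
import { execSync } from "node:child_process";
import { cpSync, readdirSync, readFileSync } from "node:fs";

function curate(rawDir = "ed", curatedDir = "curated/ed") {
  // Curation = copy of the raw data + patches + regenerated derived folders.
  cpSync(rawDir, curatedDir, { recursive: true });
  for (const patch of readdirSync("patches").filter(f => f.endsWith(".patch"))) {
    execSync(`patch -p1 -d ${curatedDir} < patches/${patch}`, { stdio: "inherit" });
  }
  // Re-generate idlnames, idlnamesparsed and idlparsed from the patched data,
  // using whatever generation code webref already has (omitted here).
}

function listBrowserSpecs(curatedDir = "curated/ed") {
  // The "browser" view would only keep specs identified as browser specs.
  // An index.json with a "results" array and a "categories" field is assumed.
  const index = JSON.parse(readFileSync(`${curatedDir}/index.json`, "utf8"));
  return index.results.filter(spec => (spec.categories || []).includes("browser"));
}

curate();
console.log(`${listBrowserSpecs().length} specs would end up in the browser view`);
```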