Intake Integration
Intake is a “lightweight package for finding, investigating, loading and disseminating data.” It would be nice to figure out how the JupyterLab data registry could integrate with this package.
Catalogs
Having JupyterLab be aware of Intake’s “Data catalogs” is probably a good place to start. They “provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries.”
Local
For example, if you have a catalog on disk in a catalog.yaml file, we might want to be able to see the datasets it defines in the data registry. This is similar to how, currently, if you have a .ipynb file, you can view the datasets in its cell outputs. To do this, we would have to parse its YAML format in JavaScript and map the different entries to URLs.
For example, this catalog.yml file:
metadata:
  version: 1
sources:
  example:
    description: test
    driver: random
    args: {}
  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'
  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'
might map to a number of nested URLs:
./dataset.yml#/sources/example
./dataset.yml#/sources/entry1_full
./dataset.yml#/sources/entry1_part
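The mapping from catalog entries to nested URLs could be sketched roughly as follows. This is only an illustration in Python of logic that would need porting to the client side; it assumes the YAML has already been parsed into a plain dictionary, and the function names `entry_urls` and `escape_pointer_token` are hypothetical. Escaping entry names in the style of JSON Pointer (RFC 6901) is also an assumption, not something the data registry is known to require.

```python
from urllib.parse import quote


def escape_pointer_token(name: str) -> str:
    """Escape a key for use in a JSON-Pointer-style fragment (assumed scheme)."""
    # "~" is escaped first so that the "~1" produced for "/" is not re-escaped.
    escaped = name.replace("~", "~0").replace("/", "~1")
    return quote(escaped, safe="~")


def entry_urls(catalog_url: str, catalog: dict) -> dict:
    """Map each source name in a parsed catalog to a nested URL."""
    sources = catalog.get("sources") or {}
    return {
        name: f"{catalog_url}#/sources/{escape_pointer_token(name)}"
        for name in sources
    }


# Parsed form of the catalog above, trimmed to the keys used here.
catalog = {
    "metadata": {"version": 1},
    "sources": {
        "example": {"driver": "random", "args": {}},
        "entry1_full": {
            "driver": "csv",
            "args": {"urlpath": "{{ CATALOG_DIR }}/entry1_*.csv"},
        },
        "entry1_part": {
            "driver": "csv",
            "args": {"urlpath": "{{ CATALOG_DIR }}/entry1_{{ part }}.csv"},
        },
    },
}

# "./dataset.yml" is the catalog path used in the list above.
for url in entry_urls("./dataset.yml", catalog).values():
    print(url)
```

Keeping the catalog path as a parameter means the same function would work whether the catalog is addressed by a relative path or by a full URL.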
The entries that point to CSV files would in turn map to further nested URLs. For example, dataset.yml#/sources/entry1_part would point to:
./entry1_latest.csv
./entry1_stable.csv
This basically requires re-implementing the logic of all the drivers so that they can work client-side.
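To give a feel for what re-implementing even the CSV driver's path handling involves, here is a minimal sketch in Python, under several assumptions. The regular expression is a stand-in for a real template engine and only handles plain `{{ name }}` placeholders. The helper names (`render`, `candidate_paths`, `expand_glob`) are hypothetical, and the file listing passed to `expand_glob` is assumed to come from somewhere else, such as a contents API.

```python
import re
from fnmatch import fnmatchcase
from itertools import product

# Matches simple placeholders such as "{{ part }}" or "{{CATALOG_DIR}}".
TEMPLATE_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Substitute {{ name }} placeholders; unknown names are left untouched."""
    def replace(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return TEMPLATE_VAR.sub(replace, template)


def candidate_paths(entry: dict, catalog_dir: str) -> list:
    """Render an entry's urlpath once per combination of allowed parameter values."""
    template = entry["args"]["urlpath"]
    params = entry.get("parameters") or {}
    names = list(params)
    # Fall back to the default when a parameter lists no allowed values.
    choices = [params[n].get("allowed") or [params[n]["default"]] for n in names]
    paths = []
    for combo in product(*choices):
        values = dict(zip(names, combo), CATALOG_DIR=catalog_dir)
        paths.append(render(template, values))
    return paths


def expand_glob(pattern: str, known_files: list) -> list:
    """Resolve a glob pattern against a file listing obtained elsewhere."""
    return sorted(f for f in known_files if fnmatchcase(f, pattern))


entry1_part = {
    "parameters": {
        "part": {"type": "str", "default": "stable", "allowed": ["latest", "stable"]}
    },
    "driver": "csv",
    "args": {"urlpath": "{{ CATALOG_DIR }}/entry1_{{ part }}.csv"},
}
entry1_full = {"driver": "csv", "args": {"urlpath": "{{ CATALOG_DIR }}/entry1_*.csv"}}

print(candidate_paths(entry1_part, "."))
# ['./entry1_latest.csv', './entry1_stable.csv']

listing = ["./entry1_latest.csv", "./entry1_stable.csv", "./other.csv"]
print(expand_glob(candidate_paths(entry1_full, ".")[0], listing))
# ['./entry1_latest.csv', './entry1_stable.csv']
```

Even this small case needs templating, parameter enumeration, and glob matching, and every other driver would bring its own argument conventions on top.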
Remote
We could also support loading a remote Intake data catalog. If you loaded a URL like intake://catalog1:5000 in the data registry, you would want to be able to see the datasets available. Here, the proxy mode might be useful:
“Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data over the network to the client. The client does not need any special drivers to read the data, and can read data from files and data servers that it cannot access, as long as the catalog server has the required access.”
If we implement a client API for this server protocol, we can let the server handle all the data parsing and just expose the results it returns to the user. We would have to look at its specification in more depth.
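A first step for such a client would be turning an intake:// URL into something it can make requests against. The sketch below is in Python and rests on assumptions that would need checking against the server specification: that intake:// maps onto plain HTTP, that 5000 is a sensible default port (it is simply the port in the example URL), and that the server exposes a listing endpoint. The path `v1/info` in the usage line is a placeholder for whatever that endpoint turns out to be, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 5000  # assumption: taken from the example URL, not from the spec


def http_base_url(intake_url: str) -> str:
    """Translate intake://host[:port] into an HTTP base URL (assumed mapping)."""
    parts = urlsplit(intake_url)
    if parts.scheme != "intake":
        raise ValueError(f"not an intake:// URL: {intake_url!r}")
    if not parts.hostname:
        raise ValueError(f"missing host in {intake_url!r}")
    port = parts.port or DEFAULT_PORT
    return f"http://{parts.hostname}:{port}/"


def endpoint(intake_url: str, path: str) -> str:
    """Join a server-relative path onto the translated base URL."""
    return http_base_url(intake_url) + path.lstrip("/")


print(http_base_url("intake://catalog1:5000"))
# http://catalog1:5000/

# "v1/info" is a placeholder path, to be replaced once the spec is confirmed.
print(endpoint("intake://catalog1:5000", "v1/info"))
# http://catalog1:5000/v1/info
```

The actual requests and response decoding are left out on purpose, since they depend on the wire format the server uses.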
Top GitHub Comments
I have no preference where this lives. On jupyterlab or other related org or in Intake, all are fine.
@saulshanabrook - this dropped off the table at some point. Are you still interested?