Integration with Github as a data portal

Overview

An important step for Frictionless Framework is to provide the ability to read and write packages from different data portals (CKAN/GitHub/Zenodo/etc.) so that users can publish and access their packages easily through a straightforward API. This issue covers the GitHub integration. The implementation is already prototyped in the v5 branch.

Specs

Read package

Read package from a repository that has a datapackage.json/yaml:

package = Package("https://github.com/datasets/population")
package = Package.from_github(...) # alias

Read package from a repository without a datapackage.json/yaml. We probably need to filter files and add only CSV/XLS(X) files to the package; GithubControl should make this filter configurable. We need to map as much as possible of the metadata provided for the GitHub repository:

package = Package("https://github.com/frictionlessdata/repository-demo")
package = Package("<link>", control=portals.GithubControl(formats=['csv']))
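
The file filtering described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the framework's implementation: the helper name select_data_files and the DEFAULT_FORMATS tuple are hypothetical, and the only thing taken from the issue is the idea of keeping CSV/XLS(X) files by default with a configurable list of formats.

```python
from pathlib import PurePosixPath

# Hypothetical default; the issue only says "CSV/XLS(X)".
DEFAULT_FORMATS = ("csv", "xls", "xlsx")


def select_data_files(paths, formats=DEFAULT_FORMATS):
    """Keep only the paths whose extension is in `formats` (case-insensitive)."""
    allowed = {fmt.lower().lstrip(".") for fmt in formats}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lower().lstrip(".") in allowed
    ]


# Example repository listing (made-up file names)
repo_files = ["README.md", "data/cities.csv", "data/Budget.XLSX", "scripts/build.py"]
selected = select_data_files(repo_files)
csv_only = select_data_files(repo_files, formats=["csv"])
```

Here selected keeps both the CSV and the XLSX file, while csv_only mirrors the formats=['csv'] setting from the example above and keeps only the CSV file.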

Write package

Publish a package on GitHub (for now, only if the repo doesn’t exist). We also need to provide a way to store credentials in ENV/etc. We need to map as much as possible of the metadata provided by Package:

package.to_github(user="<user>", repo="<repo>", api_key="<api_key>")
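
One way the "store credentials in ENV" requirement might work is to fall back to an environment variable when no api_key is passed explicitly. The sketch below is an assumption, not the framework's behaviour: the variable name GITHUB_ACCESS_TOKEN and the helper resolve_api_key are hypothetical, since the issue does not name either.

```python
import os

# Hypothetical variable name; the issue does not specify one.
ENV_VAR = "GITHUB_ACCESS_TOKEN"


def resolve_api_key(api_key=None, environ=None):
    """Return the explicit key if given, otherwise the one from the environment."""
    environ = os.environ if environ is None else environ
    key = api_key or environ.get(ENV_VAR)
    if not key:
        raise ValueError(f"No GitHub API key: pass api_key or set {ENV_VAR}")
    return key
```

An explicit argument takes precedence over the environment, so a script can override the shared credential for a single call.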

Read catalog

Read catalog from a GitHub search. Design search options such as limit and offset (pagination).

catalog = Catalog(control=portals.GithubControl(search="<frictionless>"))
for package in catalog.packages:
    print(package.name)
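
The limit/offset options mentioned above could be layered over a paged search roughly like this. It is a sketch under assumptions: iter_search_results, fetch_page, and page_size are hypothetical names, and the fetch function here is a local stand-in rather than a call to the GitHub API.

```python
def iter_search_results(fetch_page, *, limit=None, offset=0, page_size=30):
    """Yield results starting at `offset`, at most `limit` of them.

    `fetch_page(start, size)` is a hypothetical callable that returns a list
    of up to `size` results beginning at index `start`.
    """
    yielded = 0
    position = offset
    while limit is None or yielded < limit:
        # Never request more than what is still needed
        size = page_size if limit is None else min(page_size, limit - yielded)
        page = fetch_page(position, size)
        if not page:
            return
        for item in page:
            yield item
            yielded += 1
        position += len(page)
        if len(page) < size:
            # A short page means the results are exhausted
            return


# Local stand-in for a remote search: seven made-up package names
names = [f"package-{i}" for i in range(7)]


def fake_fetch(start, size):
    return names[start:start + size]


window = list(iter_search_results(fake_fetch, limit=3, offset=2, page_size=2))
tail = list(iter_search_results(fake_fetch, offset=5, page_size=4))
```

Keeping the paging logic separate from the fetch function also makes it easy to test without network access.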

Plan

  • prototype the functionality based on the functional requirements
  • get feedback from @roll on the implementation
  • finish the implementation
  • design the testing approach (probably using pytest.vcr for reading, but how to test writing?)
  • write a great deal of tests to be sure that the integration works correctly
  • write a comprehensive tutorial - https://framework.frictionlessdata.io/docs/tutorials/tutorials-overview (new section Portals Tutorials)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 3
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

shashigharti commented, Aug 12, 2022 (2 reactions)

Thanks. I have implemented to_github and from_github. I will add publish also.

roll commented, Aug 12, 2022 (1 reaction)

Something like this:

def publish(self, target: Optional[Any] = None, *, control: Optional[Control] = None) -> None:
    # Ask the system for a manager able to handle this target
    manager = system.create_manager(target, control=control)
    if manager:
        package = manager.write_package(self)
    # raise: unsupported target

Then we need to use it in tests for writing/publishing
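
One possible way to test writing without touching GitHub is to replace the manager with a mock and check that publish delegates to it. The sketch below is only an illustration of that idea: PackageStub is a hypothetical stand-in that takes the manager factory as an argument (playing the role of system.create_manager in the snippet above), and the ValueError for an unsupported target is an assumed choice, not the framework's actual exception.

```python
from unittest import mock


class PackageStub:
    """Hypothetical stand-in for Package, holding only the publish logic."""

    def __init__(self, create_manager):
        # Injected factory plays the role of system.create_manager
        self._create_manager = create_manager

    def publish(self, target=None, *, control=None):
        manager = self._create_manager(target, control=control)
        if manager is None:
            # Assumed error type for an unsupported target
            raise ValueError(f"unsupported target: {target!r}")
        return manager.write_package(self)


def check_publish_delegates_to_manager():
    target = "https://github.com/<user>/<repo>"
    manager = mock.Mock()
    manager.write_package.return_value = target
    factory = mock.Mock(return_value=manager)
    package = PackageStub(factory)
    result = package.publish(target, control=None)
    # The factory receives the target, and the manager receives the package
    factory.assert_called_once_with(target, control=None)
    manager.write_package.assert_called_once_with(package)
    return result
```

This only verifies the delegation; checking the actual requests sent to GitHub would still need recorded cassettes or a sandbox repository.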
