SodaCL metadata/table discovery

Generating and pushing table information, such as table names, columns, and database types, will be performed by soda-core.

The original proposition looks like this:

profiling basic:
  tables:
    - SODATEST_%
    - include SODATEST_%
    - exclude SODATEST_%
  schema: enabled

We are, however, thinking about another top-level name along these lines (based on a more explicit proposal from @janet-can):

discover tables:

I assume the rest of the controls are going to stay the same, meaning that users should configure column profiling via this canonical SodaCL spec:

discover tables:
  tables:
    - SODATEST_%
    - include SODATEST_%
    - exclude SODATEST_%
  schema: enabled

@tombaeyens can you confirm this makes sense language-wise? Also, can you explain the intention behind the schema key? What does it control?
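
To make the `tables:` list above more concrete, here is a minimal Python sketch of how such include/exclude entries could be resolved against a list of table names. It is not soda-core's implementation. The function names `like_to_regex` and `select_tables` are hypothetical, and the semantics are assumptions: `%` is treated as a SQL LIKE-style "any characters" wildcard, `_` is treated literally, a bare entry counts as an include, and excludes win over includes.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    # Assumption: '%' matches any run of characters (as in SQL LIKE);
    # every other character, including '_', is matched literally.
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts))


def select_tables(all_tables, entries):
    # Split the entries into include and exclude patterns.
    # Assumption: an entry without a keyword is an include.
    includes, excludes = [], []
    for entry in entries:
        words = entry.split(maxsplit=1)
        if len(words) == 2 and words[0] == "exclude":
            excludes.append(like_to_regex(words[1]))
        elif len(words) == 2 and words[0] == "include":
            includes.append(like_to_regex(words[1]))
        else:
            includes.append(like_to_regex(entry))

    # Assumption: a table is selected when it matches at least one
    # include pattern and no exclude pattern.
    return [
        table
        for table in all_tables
        if any(rx.fullmatch(table) for rx in includes)
        and not any(rx.fullmatch(table) for rx in excludes)
    ]


tables = ["SODATEST_CUSTOMERS", "SODATEST_ORDERS", "PROD_ORDERS"]
print(select_tables(tables, ["SODATEST_%", "exclude SODATEST_ORD%"]))
# prints ['SODATEST_CUSTOMERS']
```

Under these assumed semantics, the example list in the proposal (the same pattern as a bare entry, an include, and an exclude) would select nothing, which suggests the three entries are meant to illustrate the available syntax and not to be used together.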

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
tombaeyens commented, Apr 6, 2022

Agreed. Blessing given for naming the SodaCL top-level entry discover tables:

0 reactions
tombaeyens commented, Apr 19, 2022

@mathissedestrooper @bastienboutonnet collecting table samples is indeed also done at the table level, just like table discovery. But I think it’s important that the sets of tables are distinct for those 2 things. I’ll explain:

Querying a data source for all its tables and ensuring that all tables are available in the Soda Cloud UI is something that you typically want for all tables. There is no real performance problem.

Capturing samples for all tables is much more demanding in terms of compute and storage requirements. So that is something that you want to limit to a specific subset of tables. At least that is my guess.

wdyt?
