SodaCL metadata/table discovery
Generating and pushing table information such as name, columns, and database types will be performed by soda-core.
The original proposal looks like this:
profiling basic:
  tables:
    - SODATEST_%
    - include SODATEST_%
    - exclude SODATEST_%
  schema: enabled
We are, however, thinking about another top-level name, based on a more explicit proposal from @janet-can, along the lines of:
discover tables:
I assume the rest of the controls will stay the same, meaning that users would configure this via the following canonical SodaCL spec:
discover tables:
  tables:
    - SODATEST_%
    - include SODATEST_%
    - exclude SODATEST_%
  schema: enabled
@tombaeyens can you confirm this makes sense language-wise? Also, can you explain the intention behind the schema key? What does it control?
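For illustration, here is a minimal Python sketch of how a `tables` list like the one above might be resolved against the table names in a data source. The semantics are assumptions, not confirmed soda-core behavior: a bare pattern is treated like `include`, `%` matches any run of characters (as in SQL `LIKE`), matching is case-insensitive, and an exclude always wins over an include. The name `select_tables` is hypothetical.

```python
import re


def _like_to_regex(pattern: str) -> re.Pattern:
    # Translate a LIKE-style pattern ('%' = any run of characters) into an
    # anchored, case-insensitive regex. Unlike SQL LIKE, '_' is kept literal
    # here (an assumption made for simplicity).
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


def select_tables(table_names, rules):
    """Hypothetical helper: apply include/exclude rules to table names.

    Assumptions (not confirmed by soda-core): a bare pattern means
    'include', and an exclude always overrides an include.
    """
    includes, excludes = [], []
    for rule in rules:
        action, _, pattern = rule.strip().partition(" ")
        if action == "exclude" and pattern:
            excludes.append(_like_to_regex(pattern.strip()))
        elif action == "include" and pattern:
            includes.append(_like_to_regex(pattern.strip()))
        else:
            # Bare pattern such as 'SODATEST_%': treated as an include.
            includes.append(_like_to_regex(rule.strip()))
    return [
        name
        for name in table_names
        if any(r.match(name) for r in includes)
        and not any(r.match(name) for r in excludes)
    ]


tables = ["SODATEST_CUSTOMERS", "SODATEST_ORDERS_TMP", "OTHER"]
selected = select_tables(tables, ["SODATEST_%", "exclude %_TMP"])
```

With these assumed rules, `selected` contains only `SODATEST_CUSTOMERS`: the `_TMP` table is included by the first rule but then excluded by the second.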
Issue Analytics
- State:
- Created a year ago
- Comments:6 (6 by maintainers)
Agreed. Blessing given for naming the SodaCL top-level entry
discover tables:
@mathissedestrooper @bastienboutonnet collecting table samples is indeed also done at the table level, just like table discovery. But I think it's important that the sets of tables for those two things are distinct. I'll explain:
Querying a data source for all its tables and ensuring that all tables are available in the Soda Cloud UI is something you typically want for all tables. There is no real performance problem.
Capturing samples for all tables is much more demanding in terms of compute and storage. So that is something you want to limit to a specific subset of tables. At least that is my guess.
wdyt?
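To make the cost difference concrete, here is a rough Python sketch that uses an in-memory SQLite database as a stand-in for a real data source. Everything in it is an assumption for illustration: the function names `discover_tables` and `sample_tables`, the table names, and the row limit are hypothetical and not part of soda-core. Discovery reads only catalog metadata for every table, while sampling reads actual rows and is therefore restricted to an explicit subset.

```python
import sqlite3


def discover_tables(conn):
    # Metadata-only: reads the catalog, never the table rows themselves.
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    discovered = {}
    for name in names:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        discovered[name] = [(col[1], col[2]) for col in cols]
    return discovered


def sample_tables(conn, table_names, limit=100):
    # Reads real rows, so it is only run for an explicit subset of tables.
    return {
        name: conn.execute(f'SELECT * FROM "{name}" LIMIT ?', (limit,)).fetchall()
        for name in table_names
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SODATEST_CUSTOMERS (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE SODATEST_ORDERS (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO SODATEST_CUSTOMERS VALUES (?, ?)", [(1, "a"), (2, "b")]
)

all_tables = discover_tables(conn)  # every table, metadata only
samples = sample_tables(conn, ["SODATEST_CUSTOMERS"], limit=1)  # subset, rows
```

Here `all_tables` describes both tables with their column names and declared types, while `samples` holds rows for the one table that was explicitly requested.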