Perform aggregation in the database to speed up WhiteRabbit

Forgive me if this idea has already been considered or if I am misunderstanding how WhiteRabbit works. WhiteRabbit takes a long time to run, and I think one reason is that it downloads a lot of data and then processes the downloaded sample (e.g. counts unique values). Would it be possible to let the database handle the processing and download only the result, so that we could take advantage of in-database optimizations (e.g. fast counting of unique values in a single column)?

So instead of running SELECT * FROM {table} SAMPLE({percentage}); and then processing the values of each column in Java, the WhiteRabbit scan would use a set of queries like select {fieldname}, count({fieldname}) from {table} group by {fieldname} sample({percentage});

This isn’t a well-formed alternative but it is meant to communicate the idea of doing the aggregation in the database rather than downloading a large sample of the data as a way to speed up the database scan. Calculations like MIN, MAX, and MEAN would be handled in the database as well.
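To make the idea a bit more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module with an in-memory database. This is not WhiteRabbit code: the table `person`, its columns, and the rows are hypothetical, and SQLite has no SAMPLE clause, so sampling is left out. It only illustrates that a GROUP BY in the database returns the same value counts as downloading the rows and counting them client-side, and that MIN, MAX, and the mean can be computed in the same way.

```python
import sqlite3
from collections import Counter

# Hypothetical toy table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (gender TEXT, year_of_birth INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("F", 1980), ("M", 1975), ("F", 1990), ("F", 1980), ("M", 2001)],
)

# Approach described in the issue: download the rows, count in the client.
rows = conn.execute("SELECT * FROM person").fetchall()
client_counts = Counter(row[0] for row in rows)

# Proposed approach: let the database aggregate, download only the result.
db_counts = dict(
    conn.execute("SELECT gender, COUNT(*) FROM person GROUP BY gender").fetchall()
)

# Summary statistics can be computed in the database as well.
min_year, max_year, mean_year = conn.execute(
    "SELECT MIN(year_of_birth), MAX(year_of_birth), AVG(year_of_birth) FROM person"
).fetchone()

print(db_counts, min_year, max_year, mean_year)
```

With the aggregated query, the amount of data transferred scales with the number of distinct values rather than the number of rows, which is where the speed-up would be expected to come from.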

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
MaximMoinat commented, Feb 4, 2022

Agreed, this would be a nice enhancement. It does require a considerable amount of refactoring and thorough testing. We should first assemble a small team with access to all database systems supported by WR. @ablack3 Could you take the lead to form such a team?

"What is the CSV input data? Is this for the ETL unit test functionality? If so I think it would still be worth speeding up the scan report since the scan report and ETL unit tests can be used independently."

CSV is simply one of the formats a dataset can come in, either exported from a database or from Excel (yes, some databases like registries are kept entirely in Excel).

0 reactions
ablack3 commented, Jun 17, 2022

Recording our plan:

I will work with @mgabetta to implement a new version of processDatabaseTable, tentatively named processColumnOrientedDatabaseTable, that will have the same inputs and outputs as the original but do the aggregation in the database.

Concretely - it will iterate over the columns of the table, and perform the query select {column_name}, count(*) as n from {table} group by {column_name}
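The per-column loop could look roughly like the following Python sketch. It is only an illustration of the query pattern, not the planned processColumnOrientedDatabaseTable (which would live in WhiteRabbit's Java code); the helper names `quote_identifier` and `column_value_counts`, the table `visit`, and its data are all hypothetical, and SQLite stands in for a real source database.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def column_value_counts(conn: sqlite3.Connection, table: str) -> dict:
    """Return {column_name: {value: n}} using one GROUP BY query per column."""
    tbl = quote_identifier(table)
    # Fetch zero rows just to read the column names from the cursor metadata.
    cursor = conn.execute(f"SELECT * FROM {tbl} LIMIT 0")
    columns = [description[0] for description in cursor.description]

    result = {}
    for column in columns:
        col = quote_identifier(column)
        query = f"SELECT {col}, COUNT(*) AS n FROM {tbl} GROUP BY {col}"
        result[column] = dict(conn.execute(query).fetchall())
    return result


# Hypothetical example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (visit_type TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [("inpatient", "A"), ("outpatient", "A"), ("outpatient", None)],
)
counts = column_value_counts(conn, "visit")
print(counts)
```

Note that count(*) counts the rows in the NULL group as well, whereas count({column_name}) would report 0 for that group, because COUNT over a column skips NULL values. That makes count(*) the safer choice if the scan report should include how often a field is empty.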
