Perform aggregation in the database to speed up WhiteRabbit

Forgive me if this idea has already been considered or if I am misunderstanding how WhiteRabbit works. WhiteRabbit takes a long time to run, and I think one reason is that it downloads a lot of data and then processes the downloaded sample on the client (e.g. counts unique values). Would it be possible to let the database handle the processing and download only the result, so that we could take advantage of in-database optimizations (e.g. fast counting of unique values in a single column)?

So instead of running SELECT * FROM {table} SAMPLE({percentage}); and then processing the values of each column in Java, the WhiteRabbit scan would use a set of queries like select {fieldname}, count({fieldname}) from {table} group by {fieldname} sample({percentage});

This isn’t a well-formed alternative, but it is meant to communicate the idea: do the aggregation in the database, rather than downloading a large sample of the data, as a way to speed up the database scan. Calculations like MIN, MAX, and MEAN would be handled in the database as well.
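To make the idea a bit more concrete, here is a minimal Java sketch of how such summary queries could be generated, one per column. It is only an illustration under stated assumptions: the class name ColumnSummaryQueries, the method names, the result aliases, and the example table and columns are all hypothetical and not part of WhiteRabbit. It also leaves out sampling, since the sampling syntax differs between database systems.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of WhiteRabbit: builds queries that let the
// database compute the summary statistics instead of the client.
public class ColumnSummaryQueries {

    // Builds a query that returns a single row with MIN, MAX and AVG for one column.
    // Identifiers are interpolated as-is; real code would need dialect-specific
    // quoting and validation, and would only apply AVG to numeric columns.
    public static String summaryQuery(String table, String column) {
        return String.format(
            "SELECT MIN(%2$s) AS min_value, MAX(%2$s) AS max_value, AVG(%2$s) AS mean_value FROM %1$s",
            table, column);
    }

    // One summary query per column of the table.
    public static List<String> summaryQueries(String table, List<String> columns) {
        return columns.stream()
                .map(column -> summaryQuery(table, column))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example table and column names are assumptions for illustration only.
        summaryQueries("person", List.of("year_of_birth", "month_of_birth"))
                .forEach(System.out::println);
    }
}
```

With this approach only one row per column travels over the network, instead of every sampled value.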

Issue Analytics

Created 2 years ago
Comments: 6

Top GitHub Comments

MaximMoinat commented, Feb 4, 2022

Agreed, this would be a nice enhancement. It does require a considerable amount of refactoring and thorough testing. We should first assemble a small team with access to all database systems supported by WR. @ablack3 Could you take the lead to form such a team?

What is the CSV input data? Is this for the ETL unit test functionality? If so I think it would still be worth speeding up the scan report since the scan report and ETL unit tests can be used independently.

CSV is simply one of the formats a dataset can come in. Either exported from a database or from Excel (yes, some databases like registries are kept entirely in Excel).

ablack3 commented, Jun 17, 2022

Recording our plan:

I will work with @mgabetta to implement a new version of processDatabaseTable, tentatively named processColumnOrientedDatabaseTable, that will have the same inputs and outputs as the original but do the aggregation in the database.

Concretely, it will iterate over the columns of the table and, for each column, perform the query select {column_name}, count(*) as n from {table} group by {column_name}
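The per-column loop described above could look roughly like the following JDBC sketch. This is not the actual processColumnOrientedDatabaseTable, whose signature is not shown in this thread. The class ColumnOrientedScan, the methods frequencyQuery and scanTable, and the nested-map return type are hypothetical names chosen for illustration, and the caller is assumed to supply an open java.sql.Connection and the list of column names.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not WhiteRabbit code: value frequencies are computed
// by the database, one GROUP BY query per column.
public class ColumnOrientedScan {

    // Builds the frequency query from the plan above for a single column.
    // Identifiers are concatenated as-is; real code would quote them per dialect.
    public static String frequencyQuery(String table, String column) {
        return "SELECT " + column + ", COUNT(*) AS n FROM " + table + " GROUP BY " + column;
    }

    // Runs one query per column and returns column -> (value -> count).
    public static Map<String, Map<String, Long>> scanTable(
            Connection connection, String table, List<String> columns) throws SQLException {
        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        for (String column : columns) {
            Map<String, Long> counts = new LinkedHashMap<>();
            try (Statement statement = connection.createStatement();
                 ResultSet rows = statement.executeQuery(frequencyQuery(table, column))) {
                while (rows.next()) {
                    // getString returns null for SQL NULL, which is kept as its own key.
                    counts.put(rows.getString(1), rows.getLong(2));
                }
            }
            result.put(column, counts);
        }
        return result;
    }
}
```

Each query returns one row per distinct value, so the amount of data downloaded depends on the column's cardinality rather than on the number of rows in the table. For columns with mostly unique values (IDs, timestamps) the savings would be small, so a real implementation might cap or sample those.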