Perform aggregation in the database to speed up WhiteRabbit
Forgive me if this idea has already been considered or if I am misunderstanding how WhiteRabbit works. WhiteRabbit takes a long time to run, and I think one reason is that it downloads a lot of data and then processes the downloaded sample locally (e.g. counts unique values). Would it be possible to let the database handle the processing and download only the result, so that we could take advantage of in-database optimizations (e.g. fast counting of unique values in a single column)?
So instead of
SELECT * FROM {table} SAMPLE({percentage});
and then processing the values of each column in Java, the WhiteRabbit scan would use a set of queries like
select {fieldname}, count({fieldname}) from {table} group by {fieldname} sample({percentage});
This isn’t a well-formed alternative but it is meant to communicate the idea of doing the aggregation in the database rather than downloading a large sample of the data as a way to speed up the database scan. Calculations like MIN, MAX, and MEAN would be handled in the database as well.
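To make the idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not WhiteRabbit code (WhiteRabbit is written in Java), and the table `person` and its columns are made up for illustration. It compares counting values on the client after downloading every row with letting the database do the GROUP BY, and shows MIN, MAX, and the mean (AVG in SQL) computed in the database. Sampling is left out because its syntax differs between database systems. One detail worth noting: `COUNT(*)` is used rather than `COUNT({fieldname})`, because `COUNT` on a column skips NULLs and would report 0 for the NULL group.

```python
import sqlite3
from collections import Counter

# Hypothetical example table; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (gender TEXT, year_of_birth INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("F", 1980), ("M", 1975), ("F", 1990), (None, 1962), ("F", 1980)],
)

# Client-side approach: transfer every row, then count the values locally.
rows = conn.execute("SELECT gender FROM person").fetchall()
client_counts = Counter(value for (value,) in rows)

# In-database approach: only one row per distinct value is transferred.
db_counts = dict(
    conn.execute("SELECT gender, COUNT(*) FROM person GROUP BY gender").fetchall()
)

# Summary statistics can be computed by the database in a single query.
min_year, max_year, mean_year = conn.execute(
    "SELECT MIN(year_of_birth), MAX(year_of_birth), AVG(year_of_birth) FROM person"
).fetchone()

print(db_counts)
print(min_year, max_year, mean_year)
```

Both approaches give the same frequencies here; the difference is how much data crosses the network, which matters most for large tables with few distinct values per column.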
Issue Analytics
- Created 2 years ago
- Comments: 6
Top GitHub Comments
Agreed, this would be a nice enhancement. It does require a considerable amount of refactoring and thorough testing. We should first assemble a small team with access to all database systems supported by WR. @ablack3 Could you take the lead to form such a team?
Recording our plan:
I will work with @mgabetta to implement a new version of processDatabaseTable, tentatively named processColumnOrientedDatabaseTable, that will have the same inputs and outputs as the original but do the aggregation in the database.
Concretely, it will iterate over the columns of the table and, for each column, perform the query
select {column_name}, count(*) as n from {table} group by {column_name}
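The plan above could be sketched roughly as follows. This is an illustration in Python with sqlite3, not the planned Java implementation: the function name `scan_table_column_oriented`, the helper `quote_identifier`, and the `visit` table are all hypothetical. Listing columns via `PRAGMA table_info` is SQLite-specific, and other databases would need their own metadata lookup. The `ORDER BY n DESC` is an addition of mine so that the most frequent values come first.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table or column name; identifiers cannot be bound as parameters."""
    return '"' + name.replace('"', '""') + '"'


def scan_table_column_oriented(conn, table):
    """Return {column_name: [(value, n), ...]} using one GROUP BY query per column."""
    tbl = quote_identifier(table)
    # SQLite-specific column listing; row[1] is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({tbl})")]
    result = {}
    for column in columns:
        col = quote_identifier(column)
        query = (
            f"SELECT {col}, COUNT(*) AS n FROM {tbl} "
            f"GROUP BY {col} ORDER BY n DESC"
        )
        result[column] = conn.execute(query).fetchall()
    return result


# Usage with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (visit_type TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [("inpatient", "A"), ("outpatient", "A"), ("outpatient", "B"), ("outpatient", "A")],
)
frequencies = scan_table_column_oriented(conn, "visit")
print(frequencies)
```

One trade-off of this design is that it issues one query per column, so the table is read once per column; in exchange, only the distinct values and their counts are transferred.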