HBase behavior during joins with shc
The documentation mentions that a where clause on the row key, as opposed to a regular column, will avoid a full scan. My question is: what is the behavior when two HBase tables are joined together in a DataFrame? Is there a way to avoid a full scan in that situation?
I ask because I can see a large shuffle phase (over 4 TB) which suggests full scans are going on vs. using the rowkeys.
Some specifics:
- table_a is an HBase table that has a rowkey which is a sha1 – we’ll call this sha1_1
- table_b is an HBase table whose rowkey is a string composed of two different sha1s, so the row key is <sha1_2>-<sha1_3>
- table_c is from Postgres, and I want to join to it using sha1_1 from table_a
Each table is in a DataFrame, and a view has been created on it. The join then looks something like this:
select *
from table_c as tc
join table_b as tb on (tc.sha1_2 = tb.sha1_2)
join table_a as ta on (tb.sha1_1 = ta.rowkey)
So, in the first join, I essentially want to do an HBase prefix scan using sha1_2, so it shouldn’t need to do a sequential scan. I’m using an explicit column at the moment for equality, but I could specify a regex against the rowkey conceptually.
In the second join, I want to use the rowkey directly.
The catalog for both tables defines the rowkey as a string, which is a sha1 for table_a and a combination of <sha1_2>-<sha1_3> for table_b.
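To make the prefix-scan idea concrete: for a row key laid out as <sha1_2>-<sha1_3>, a prefix scan is equivalent to a range scan whose start row is the prefix and whose exclusive stop row is the prefix with its last byte incremented. The sketch below only computes those bounds in plain Python; the function name `prefix_scan_bounds` and the sample hash are assumptions for illustration, and nothing here is an SHC API or a claim about what SHC does during a join.

```python
import hashlib
from typing import Optional, Tuple


def prefix_scan_bounds(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every row key that starts with prefix.

    stop_row is exclusive; None means "scan to the end of the table".
    """
    if not prefix:
        return b"", None
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return prefix, None
    # Increment the last remaining byte to get the smallest key past the prefix.
    stop = stripped[:-1] + bytes([stripped[-1] + 1])
    return prefix, stop


# Hypothetical sha1_2 value, generated only so the example is runnable.
sha1_2 = hashlib.sha1(b"example").hexdigest()
start, stop = prefix_scan_bounds((sha1_2 + "-").encode("ascii"))
print(start, stop)
```

Because the separator `-` is followed by `.` in byte order, the stop row here is simply the sha1 followed by a dot, and every <sha1_2>-<sha1_3> key for that sha1_2 sorts between the two bounds.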
Is there a way to do this? Or will joining always force a sequential scan?
My application is such that I am only looking at a portion of table_b (which I am filtering out beforehand when constructing that DataFrame), so when joining to table_c, I would really like to ensure that a sequential scan is avoided.
The join of table_c to table_b is probably less consequential, since I’m pre-filtering it, but I am still curious as to how scan behavior works in terms of joins.
Thanks, Ken
Issue Analytics
- Created 6 years ago
- Comments:5 (2 by maintainers)
@weiqingy I attended one of your sessions (https://www.youtube.com/watch?v=MDWgPK6XfEo) and was expecting push down of JOIN queries to the data source (HBase). Is it implemented in SHC? I am seeing no push down of the JOIN filter to the HBase table. I’ve enabled all CBO properties in Spark but no luck.
spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.enabled=true
spark.sql.cbo.joinReorder.dp.star.filter=true
spark.sql.cbo.starSchemaDetection=true
spark.sql.crossJoin.enabled=true
spark.sql.optimizer.metadataOnly=true
Please note I am using the latest version (v1.1.1-2.1) of SHC in my testing.
Is there any more information on this? I agree with @khampson that this seems like an important use case. It would be nice to be able to pass an RDD of keys to a bulk get operation, rather than scanning the whole HBase table.
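As a rough illustration of the bulk-get idea, the sketch below splits a collection of row keys into batches and hands each batch to a lookup function, so only the requested rows are touched. It is a minimal sketch under stated assumptions: `multi_get` is a hypothetical callable standing in for whatever batch-get call an HBase client exposes, the in-memory `fake_table` exists only to make the example runnable, and none of this is an SHC feature.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def lookup_by_keys(
    keys: Iterable[str],
    multi_get: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 1000,
) -> Dict[str, dict]:
    """Fetch rows for the given keys in batches instead of scanning the table."""
    rows: Dict[str, dict] = {}
    for chunk in batched(keys, batch_size):
        # multi_get is assumed to return only the keys that exist.
        rows.update(multi_get(chunk))
    return rows


# In-memory stand-in for an HBase table (hypothetical data).
fake_table = {f"row{i}": {"cf:col": i} for i in range(10)}


def fake_multi_get(chunk: List[str]) -> Dict[str, dict]:
    return {k: fake_table[k] for k in chunk if k in fake_table}


rows = lookup_by_keys(["row1", "row7", "missing"], fake_multi_get, batch_size=2)
print(rows)
```

In a Spark job the same pattern would typically run per partition of the key RDD, with each partition issuing its own batched gets; whether that beats a scan depends on how many keys are requested relative to the table size.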