
HBase behavior during joins with shc


The documentation mentions that a where clause on the row key, as opposed to a regular column, avoids a full scan. My question is: what is the behavior when two HBase tables are joined together in a DataFrame? Is there a way to avoid a full scan in that situation?

I ask because I can see a large shuffle phase (over 4 TB), which suggests full scans are happening rather than rowkey lookups.

Some specifics:

table_a is an HBase table that has a rowkey which is a sha1 – we’ll call this sha1_1
table_b is an HBase table whose rowkey is a string composed of two different sha1s, so the row key is <sha1_2>-<sha1_3>
table_c is from Postgres, and I want to join to it using sha1_1 from table_a

Each table is in a DataFrame, and a view has been created on it. The join then looks something like this:

select *
from table_c as tc
join table_b as tb on (tc.sha1_2 = tb.sha1_2)
join table_a as ta on (tb.sha1_1 = ta.rowkey)

So, in the first join, I essentially want to do an HBase prefix scan using sha1_2, so it shouldn’t need to do a sequential scan. I’m using an explicit column at the moment for equality, but I could specify a regex against the rowkey conceptually.
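To make the prefix scan idea concrete: an HBase scan takes a start row and an exclusive stop row, so a prefix such as <sha1_2>- maps to a bounded row range. The following is a minimal sketch of that mapping only. The function name `prefix_row_range` is hypothetical, and whether SHC turns a join condition into such a range is not something this sketch establishes.

```python
from typing import Tuple


def prefix_row_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every row key that starts with prefix.

    stop_row is exclusive. An empty stop_row means "scan to the end of the table".
    """
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        # The prefix was empty or all 0xFF: there is no finite upper bound.
        return prefix, b""
    # Incrementing the last remaining byte gives the smallest key
    # that sorts after every key beginning with the prefix.
    stop[-1] += 1
    return prefix, bytes(stop)


# Hypothetical shortened value standing in for a real <sha1_2> plus the "-" separator.
start_row, stop_row = prefix_row_range(b"ab12-")
```

Including the "-" separator in the prefix keeps the range from matching keys whose first component merely begins with the same characters.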

In the second join, I want to use the rowkey directly.

The catalog for both tables defines the rowkey as a string, which is a sha1 for table_a and a combination of <sha1_2>-<sha1_3> for table_b.

Is there a way to do this? Or will joining always force a sequential scan?

My application only looks at a portion of table_b (which I filter beforehand when constructing that DataFrame), so when joining to table_c, I would really like to ensure that a sequential scan is avoided.

The join of table_c to table_b is probably less consequential, since I’m pre-filtering it, but I am still curious as to how scan behavior works in terms of joins.

Thanks, Ken


Top GitHub Comments


awadheshji commented, Jul 3, 2018

@weiqingy I attended one of your sessions (https://www.youtube.com/watch?v=MDWgPK6XfEo) and was expecting JOIN queries to be pushed down to the data source (HBase). Is that implemented in SHC? I am seeing no push down of the JOIN filter on the HBase table. I’ve enabled all the CBO properties in Spark, but no luck.

spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.enabled=true
spark.sql.cbo.joinReorder.dp.star.filter=true
spark.sql.cbo.starSchemaDetection=true
spark.sql.crossJoin.enabled=true
spark.sql.optimizer.metadataOnly=true

Please note I am using the latest version (v1.1.1-2.1) of SHC in my testing.


john-drews commented, Oct 4, 2017

Is there any more information on this? I agree with @khampson that this seems like an important use case. It would be nice to be able to pass an RDD of keys to a bulk get operation, rather than scanning the whole HBase table.
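The bulk get idea can be sketched roughly as follows: group the join keys into batches and fetch each batch by key instead of scanning the table. This is only an illustration of the shape of such an operation. The names `batches` and `bulk_lookup` are hypothetical, and `multi_get` is a placeholder for whatever client call performs a multi-row get; nothing here is SHC's API.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batches(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` keys."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def bulk_lookup(
    keys: Iterable[str],
    multi_get: Callable[[List[str]], Dict[str, object]],
    batch_size: int = 1000,
) -> Dict[str, object]:
    """Fetch rows for `keys` one batch at a time.

    `multi_get` is a hypothetical callable that returns the rows found
    for a list of row keys; keys with no matching row are simply absent.
    """
    rows: Dict[str, object] = {}
    for batch in batches(keys, batch_size):
        rows.update(multi_get(batch))
    return rows


# Stand-in for an HBase table, used only to exercise the sketch.
fake_table = {"a": 1, "b": 2, "c": 3}


def fake_multi_get(batch: List[str]) -> Dict[str, object]:
    return {k: fake_table[k] for k in batch if k in fake_table}


found = bulk_lookup(["a", "x", "c"], fake_multi_get, batch_size=2)
```

The batch size of 1000 is an arbitrary default; a real value would depend on row size and the client's limits.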
