
SHC with Spark Structured Streaming

See original GitHub issue

Hi,

I have a Spark Structured Streaming application in which I'd like to write streaming data to HBase using SHC. It reads data from a location where new CSV files are continuously being created. The catalog I defined works when writing a non-streaming DataFrame with identical data into HBase. The key components of my streaming application are a DataStreamReader and a DataStreamWriter.

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Stream new CSV files as they appear in the input directory
val inputDataStream = spark
      .readStream
      .option("sep", ",")
      .schema(schema)
      .csv("/path/to/data/*.csv")

// Write the stream to HBase through SHC
inputDataStream
      .writeStream
      .outputMode("append")
      .options(
        Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "2"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .start
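
For completeness, the catalog referenced above is defined in the standard SHC JSON format; the sketch below uses placeholder table and column names rather than the real ones from my application:

// Placeholder SHC catalog: maps DataFrame columns to an HBase table
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"events"},
    |  "rowkey":"key",
    |  "columns":{
    |    "id":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "value":{"cf":"d", "col":"value", "type":"string"},
    |    "ts":{"cf":"d", "col":"ts", "type":"string"}
    |  }
    |}""".stripMargin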

When running the application, I get the following error:

Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.execution.datasources.hbase does not support streamed writing
	at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:285)
	at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
	at my.package.SHCStreamingApplication$.main(SHCStreamingApplication.scala:153)
	at my.package.SHCStreamingApplication.main(SHCStreamingApplication.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Does anyone know a solution or workaround that would still allow using SHC to write structured streaming data to HBase? Thanks in advance!
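
The exception is thrown because the SHC data source only implements Spark's batch write path, not the streaming Sink API. On Spark 2.4 or later (newer than the version this issue was filed against), one possible workaround is foreachBatch, which hands each micro-batch to the regular SHC batch writer; a rough sketch, with a placeholder checkpoint path:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Sketch only: foreachBatch requires Spark 2.4+
inputDataStream
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", "/path/to/checkpoint")  // placeholder path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch arrives as a regular (non-streaming) DataFrame,
    // so the ordinary SHC batch writer can be used here
    batchDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
        HBaseTableCatalog.newTable -> "2"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .start()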

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 35

Top GitHub Comments

2 reactions
sutugin commented, Mar 20, 2018

Excellent, glad to help!!!

2 reactions
sutugin commented, Mar 23, 2018

You can write your own custom sink provider, inheriting from StreamSinkProvider. This is my implementation:

package HBase

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase._

class HBaseSink(options: Map[String, String]) extends Sink with Logging {
  // The "hbasecat" option carries the HBaseTableCatalog.tableCatalog JSON string
  private val hBaseCatalog = options.get("hbasecat").map(_.toString).getOrElse("")

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    // Re-create the micro-batch as a regular (non-streaming) DataFrame
    // so that the SHC batch writer can be used
    val df = data.sparkSession.createDataFrame(data.rdd, data.schema)
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> hBaseCatalog,
        HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}

class HBaseSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def createSink(
                  sqlContext: SQLContext,
                  parameters: Map[String, String],
                  partitionColumns: Seq[String],
                  outputMode: OutputMode): Sink = {
    new HBaseSink(parameters)
  }

  def shortName(): String = "hbase"
}

This is an example of how to use it:

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

inputDF.
   writeStream.
   queryName("hbase writer").
   format("HBase.HBaseSinkProvider").
   option("checkpointLocation", checkPointProdPath).
   option("hbasecat", catalog).
   outputMode(OutputMode.Update()).
   trigger(Trigger.ProcessingTime(30.seconds)).
   start
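
Since format() is given the fully qualified class name (HBase.HBaseSinkProvider), the shortName() registration is not actually exercised here. If you want to be able to write format("hbase") instead, Spark resolves DataSourceRegister short names through Java's ServiceLoader, so the provider class would need to be listed in a resource file named META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (typically under src/main/resources), containing the single line:

HBase.HBaseSinkProvider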
