
Dataproc Spark job that creates connection to BigTable instance never successfully terminates

See original GitHub issue

Hi,

I’ve tried the Scala Dataproc Bigtable WordCount sample and noticed that the Dataproc Spark job never terminates successfully. In the end I simplified the code, and it looks like a program that merely creates a connection never returns after being submitted to the Dataproc cluster with the gcloud command-line tool. All command-line parameters were taken from the example. Note that the full example does all the work, but the problem is the same: it never terminates successfully. Here’s my simplified example:


package com.example.bigtable.spark.wordcount

import java.util.concurrent.TimeUnit

import com.google.bigtable.repackaged.com.google.cloud.bigtable.config.BigtableOptions
import com.google.cloud.bigtable.hbase.{BigtableConfiguration, BigtableOptionsFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HColumnDescriptor, HConstants, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

/**
  * Basic WordCount sample of using Cloud Dataproc (managed Apache Spark)
  * to write to Cloud Bigtable.
  */
object WordCount {
  val ColumnFamily = "cf"
  val ColumnFamilyBytes = Bytes.toBytes(ColumnFamily)
  val ColumnNameBytes = Bytes.toBytes("Count")

  def createConnection(ProjectId: String, InstanceID: String): Connection = {
    BigtableConfiguration.connect(ProjectId, InstanceID)
  }

  private def setBatchConfigOptions(config: Configuration) = {
    config.set(BigtableOptionsFactory.BIGTABLE_USE_CACHED_DATA_CHANNEL_POOL, "true")

    // Dataflow should use a different endpoint for data operations than online traffic.
    config.set(BigtableOptionsFactory.BIGTABLE_HOST_KEY, BigtableOptions.BIGTABLE_BATCH_DATA_HOST_DEFAULT)

    config.set(BigtableOptionsFactory.INITIAL_ELAPSED_BACKOFF_MILLIS_KEY, String.valueOf(TimeUnit.SECONDS.toMillis(5)))

    config.set(BigtableOptionsFactory.MAX_ELAPSED_BACKOFF_MILLIS_KEY, String.valueOf(TimeUnit.MINUTES.toMillis(5)))

    // This setting can potentially decrease performance for large scale writes. However, this
    // setting prevents problems that occur when streaming Sources, such as PubSub, are used.
    // To override this behavior, call:
    //    Builder.withConfiguration(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY,
    //                              BigtableOptions.BIGTABLE_ASYNC_MUTATOR_COUNT_DEFAULT);
    config.set(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY, "0")
  }

  def runner(projectId: String,
             instanceId: String,
             tableName: String,
             fileName: String,
             sc: SparkContext) = {
    val createTableConnection = createConnection(projectId, instanceId)
    createTableConnection.close()
  }

  def main(args: Array[String]): Unit = {
    val ProjectId = args(0)
    val InstanceID = args(1)
    val WordCountTableName = args(2)
    val File = args(3)
    val sc = new SparkContext()
    runner(ProjectId, InstanceID, WordCountTableName, File, sc)
  }
}
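
In case it helps anyone debugging a similar hang: a Spark driver JVM only exits once every non-daemon thread has finished, so if main returns but the job never terminates, something (here, apparently the Bigtable connection) is still holding a live non-daemon thread. The snippet below is purely a diagnostic sketch and not part of the original sample; the ThreadDiagnostics object name is made up for illustration. Calling it as the last line of main should show whether any of the client's threads survive close().

import scala.collection.JavaConverters._

object ThreadDiagnostics {
  // Print every live, non-daemon thread; these are the threads that can
  // keep the driver JVM (and therefore the Dataproc job) from exiting.
  def dumpNonDaemonThreads(): Unit = {
    Thread.getAllStackTraces.keySet.asScala
      .filter(t => t.isAlive && !t.isDaemon)
      .foreach(t => println(s"still alive (non-daemon): ${t.getName}"))
  }
}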

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
santhh commented, Apr 18, 2019

@sduskis seems like updating the Bigtable HBase connector to 1.7.0 did the trick. I also upgraded Spark to 2.4.0. I will create a PR after testing is completed. Pom file in the repo
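
For reference, here is roughly what that bump looks like, written as sbt coordinates purely for illustration; the sample’s actual build is the Maven pom linked above, and the exact bigtable-hbase artifact name the sample depends on is an assumption here.

// Assumed coordinates: Bigtable HBase connector 1.7.0 and Spark 2.4.0,
// matching the versions mentioned in the comment above.
libraryDependencies ++= Seq(
  "com.google.cloud.bigtable" %  "bigtable-hbase-1.x-hadoop" % "1.7.0",
  "org.apache.spark"          %% "spark-core"                % "2.4.0" % Provided
)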

0 reactions
sduskis commented, Apr 30, 2019

@santhh, thanks for the fix.

Read more comments on GitHub >

Top Results From Across the Web

Run a Cloud Bigtable Spark job on Dataproc
Prerequisites · Create the Dataproc cluster · Upload the file to Cloud Storage · Submit the Wordcount job · Verify · Cleaning up...

Google Dataproc Jobs Never Cancel, Stop, or Terminate
Jobs keep running. In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it...

Professional Data Engineer on Google Cloud Platform Exam ...
The Cloud Bigtable cluster has too many nodes. There are issues with the network connection. Check the answer and show the description.

Google Professional Cloud Data Engineer Practice Exam
Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. E. Create a Hadoop cluster on Google Compute Engine that uses ...

Hbase to Bigtable : Spark Job failing - Google Groups
Now I am running same Spark - Java/Scala job in Data proc. ... Please let me know without code change How I make...
