
Dataproc Spark job that creates connection to BigTable instance never successfully terminates

See original GitHub issue

Hi,

I’ve tried the Scala Dataproc Bigtable WordCount sample and noticed that the Dataproc Spark job never terminates successfully. In the end I simplified the code, and it looks like a program that merely creates a connection never returns after being submitted to the Dataproc cluster with the gcloud command-line tool. All command-line parameters were taken from the example. Note that the full example does all the work, but the problem is the same: it never terminates successfully. Here’s my simplified example:


package com.example.bigtable.spark.wordcount

import java.util.concurrent.TimeUnit

import com.google.bigtable.repackaged.com.google.cloud.bigtable.config.BigtableOptions
import com.google.cloud.bigtable.hbase.{BigtableConfiguration, BigtableOptionsFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HColumnDescriptor, HConstants, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

/**
  * Basic WordCount sample of using Cloud Dataproc (managed Apache Spark)
  * to write to Cloud Bigtable.
  */
object WordCount {
  val ColumnFamily = "cf"
  val ColumnFamilyBytes = Bytes.toBytes(ColumnFamily)
  val ColumnNameBytes = Bytes.toBytes("Count")

  def createConnection(ProjectId: String, InstanceID: String): Connection = {
    BigtableConfiguration.connect(ProjectId, InstanceID)
  }

  private def setBatchConfigOptions(config: Configuration) = {
    config.set(BigtableOptionsFactory.BIGTABLE_USE_CACHED_DATA_CHANNEL_POOL, "true")

    // Dataflow should use a different endpoint for data operations than online traffic.
    config.set(BigtableOptionsFactory.BIGTABLE_HOST_KEY, BigtableOptions.BIGTABLE_BATCH_DATA_HOST_DEFAULT)

    config.set(BigtableOptionsFactory.INITIAL_ELAPSED_BACKOFF_MILLIS_KEY, String.valueOf(TimeUnit.SECONDS.toMillis(5)))

    config.set(BigtableOptionsFactory.MAX_ELAPSED_BACKOFF_MILLIS_KEY, String.valueOf(TimeUnit.MINUTES.toMillis(5)))

    // This setting can potentially decrease performance for large scale writes. However, this
    // setting prevents problems that occur when streaming Sources, such as PubSub, are used.
    // To override this behavior, call:
    //    Builder.withConfiguration(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY,
    //                              BigtableOptions.BIGTABLE_ASYNC_MUTATOR_COUNT_DEFAULT);
    config.set(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY, "0")
  }

  def runner(projectId: String,
             instanceId: String,
             tableName: String,
             fileName: String,
             sc: SparkContext) = {
    val createTableConnection = createConnection(projectId, instanceId)
    createTableConnection.close()
  }

  def main(args: Array[String]): Unit = {
    val ProjectId = args(0)
    val InstanceID = args(1)
    val WordCountTableName = args(2)
    val File = args(3)
    val sc = new SparkContext()
    runner(ProjectId, InstanceID, WordCountTableName, File, sc)
  }
}
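
In case it helps anyone debugging a similar hang: a Spark driver JVM only exits once every non-daemon thread has finished, so if main returns but the job never terminates, something (here, apparently the Bigtable connection) is still holding a live non-daemon thread. The snippet below is purely a diagnostic sketch and not part of the original sample; the ThreadDiagnostics object name is made up for illustration. Calling it as the last line of main should show whether any of the client's threads survive close().

import scala.collection.JavaConverters._

object ThreadDiagnostics {
  // Print every live, non-daemon thread; these are the threads that can
  // keep the driver JVM (and therefore the Dataproc job) from exiting.
  def dumpNonDaemonThreads(): Unit = {
    Thread.getAllStackTraces.keySet.asScala
      .filter(t => t.isAlive && !t.isDaemon)
      .foreach(t => println(s"still alive (non-daemon): ${t.getName}"))
  }
}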

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
santhh commented, Apr 18, 2019

@sduskis seems like updating the Bigtable HBase connector to 1.7.0 did the trick. I also upgraded Spark to 2.4.0. I will create a PR after testing is completed. Pom file in the repo
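
For reference, here is roughly what that bump looks like, written as sbt coordinates purely for illustration; the sample’s actual build is the Maven pom linked above, and the exact bigtable-hbase artifact name the sample depends on is an assumption here.

// Assumed coordinates: Bigtable HBase connector 1.7.0 and Spark 2.4.0,
// matching the versions mentioned in the comment above.
libraryDependencies ++= Seq(
  "com.google.cloud.bigtable" %  "bigtable-hbase-1.x-hadoop" % "1.7.0",
  "org.apache.spark"          %% "spark-core"                % "2.4.0" % Provided
)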

0 reactions
sduskis commented, Apr 30, 2019

@santhh, thanks for the fix.

Read more comments on GitHub >

Top Results From Across the Web

Run a Cloud Bigtable Spark job on Dataproc
Prerequisites · Create the Dataproc cluster · Upload the file to Cloud Storage · Submit the Wordcount job · Verify · Cleaning up...

Google Dataproc Jobs Never Cancel, Stop, or Terminate
Jobs keep running. In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it...

Professional Data Engineer on Google Cloud Platform Exam ...
The Cloud Bigtable cluster has too many nodes. There are issues with the network connection. Check the answer and show the description.

Google Professional Cloud Data Engineer Practice Exam
Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. E. Create a Hadoop cluster on Google Compute Engine that uses ...

Hbase to Bigtable : Spark Job failing - Google Groups
Now I am running same Spark - Java/Scala job in Data proc. ... Please let me know without code change How I make...
