Dataproc Spark job that creates connection to BigTable instance never successfully terminates
Hi,
I’ve tried the Scala Dataproc Bigtable WordCount sample and noticed that the Dataproc Spark job never terminates successfully. I eventually simplified the code, and it looks like any program that creates a Bigtable connection never returns successfully after being submitted to the Dataproc cluster with the gcloud command-line tool. All command-line parameters are taken from the example. Note that the full sample does all of its work; the problem is the same either way - the job never terminates successfully. Here is my simplified example:
package com.example.bigtable.spark.wordcount

import java.util.concurrent.TimeUnit

import com.google.bigtable.repackaged.com.google.cloud.bigtable.config.BigtableOptions
import com.google.cloud.bigtable.hbase.{BigtableConfiguration, BigtableOptionsFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HColumnDescriptor, HConstants, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

/**
 * Basic WordCount sample of using Cloud Dataproc (managed Apache Spark)
 * to write to Cloud Bigtable.
 */
object WordCount {

  val ColumnFamily = "cf"
  val ColumnFamilyBytes = Bytes.toBytes(ColumnFamily)
  val ColumnNameBytes = Bytes.toBytes("Count")

  def createConnection(ProjectId: String, InstanceID: String): Connection = {
    BigtableConfiguration.connect(ProjectId, InstanceID)
  }

  private def setBatchConfigOptions(config: Configuration) = {
    config.set(BigtableOptionsFactory.BIGTABLE_USE_CACHED_DATA_CHANNEL_POOL, "true")
    // Dataflow should use a different endpoint for data operations than online traffic.
    config.set(BigtableOptionsFactory.BIGTABLE_HOST_KEY,
      BigtableOptions.BIGTABLE_BATCH_DATA_HOST_DEFAULT)
    config.set(BigtableOptionsFactory.INITIAL_ELAPSED_BACKOFF_MILLIS_KEY,
      String.valueOf(TimeUnit.SECONDS.toMillis(5)))
    config.set(BigtableOptionsFactory.MAX_ELAPSED_BACKOFF_MILLIS_KEY,
      String.valueOf(TimeUnit.MINUTES.toMillis(5)))
    // This setting can potentially decrease performance for large scale writes. However, this
    // setting prevents problems that occur when streaming Sources, such as PubSub, are used.
    // To override this behavior, call:
    // Builder.withConfiguration(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY,
    //   BigtableOptions.BIGTABLE_ASYNC_MUTATOR_COUNT_DEFAULT);
    config.set(BigtableOptionsFactory.BIGTABLE_ASYNC_MUTATOR_COUNT_KEY, "0")
  }

  def runner(projectId: String,
             instanceId: String,
             tableName: String,
             fileName: String,
             sc: SparkContext) = {
    val createTableConnection = createConnection(projectId, instanceId)
    createTableConnection.close()
  }

  def main(args: Array[String]) {
    val ProjectId = args(0)
    val InstanceID = args(1)
    val WordCountTableName = args(2)
    val File = args(3)
    val sc = new SparkContext()
    runner(ProjectId, InstanceID, WordCountTableName, File, sc)
  }
}
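For context, runner closes the connection before it returns, so the job code itself should not be what keeps the driver alive. Below is a minimal, purely illustrative sketch of an even more defensive main (my own sketch, not code from the sample) that puts the cleanup in try/finally and stops the SparkContext explicitly. If the JVM still does not exit after both calls complete, the remaining non-daemon threads almost certainly belong to the Bigtable client library. Connection.close() and SparkContext.stop() are the only APIs it relies on.

  // Hypothetical, more defensive variant of main for isolating the hang.
  def main(args: Array[String]): Unit = {
    val Array(projectId, instanceId, tableName, fileName) = args
    val sc = new SparkContext()
    val connection = createConnection(projectId, instanceId)
    try {
      // ... table creation and the word count itself would go here ...
    } finally {
      connection.close() // release the Bigtable client's channels and thread pools
      sc.stop()          // shut the Spark driver down cleanly
    }
  }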
@sduskis it seems like updating the Bigtable HBase connector to 1.7.0 did the trick. I also upgraded Spark to 2.4.0. I will create a PR once testing is complete. Pom File in the Repo
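For anyone hitting the same hang, the change referenced above would look roughly like the following pom.xml excerpt. The exact artifact coordinates are an assumption based on the Dataproc Bigtable samples; the pom file in the repo is authoritative.

  <!-- Hypothetical pom.xml excerpt: bump the Bigtable HBase connector to 1.7.0.
       The artifactId is assumed; verify against the repo's actual pom. -->
  <dependency>
    <groupId>com.google.cloud.bigtable</groupId>
    <artifactId>bigtable-hbase-1.x-hadoop</artifactId>
    <version>1.7.0</version>
  </dependency>

The Spark dependency would be bumped to 2.4.0 in the same file.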
@santhh, thanks for the fix.