Auto-scaling performance issues when compiling Google-provided template with Maven

I am using the GCS_Text_to_BigQuery batch pipeline template to launch a Dataflow task, using the latest compilation available on GCS (gs://dataflow-templates/latest/GCS_Text_to_BigQuery), and the performance is as expected (10-15 minutes to process 600M records on an n1-standard-16 machine).

However, when I test the pipeline with the same code compiled by me using Maven, the results are very different from those of the compiled template provided by Google on GCS: auto-scaling does not seem to work properly (the number of workers grows very little during the task and the process takes much longer; in addition, CPU utilization on the workers is very low).

I compiled the template with the following command as indicated in your README file (compilation is successful):

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.TextIOToBigQuery \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<gcp-project-id> \
--stagingLocation=gs://<gcs-bucket>/staging \
--tempLocation=gs://<gcs-bucket>/temp \
--templateLocation=gs://<gcs-bucket>/templates/text-io-to-bq.json \
--runner=DataflowRunner"

I am launching the task using the Python API, with exactly the same parameters whether I use the Google-provided template on GCS or my own:

from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3', cache_discovery=False)

# Template compiled by Google, hosted on GCS (performance OK)
# dataflow_template = 'gs://dataflow-templates/latest/GCS_Text_to_BigQuery'

# The same template compiled by me (poor performance)
dataflow_template = 'gs://<gcs-bucket>/templates/text-io-to-bq.json'

parameters = {
    'javascriptTextTransformFunctionName': '<udf-function>',
    'JSONPath': 'gs://<gcs-bucket>/resources/schemas/<bq-schema-file>',
    'javascriptTextTransformGcsPath': 'gs://<gcs-bucket>/resources/UDF/<udf-file>',
    'inputFilePattern': 'gs://<gcs-bucket>/data/*',
    'outputTable': '<gcp-project>:<bq-dataset>.<bq-table>',
    'bigQueryLoadingTemporaryDirectory': 'gs://<gcs-bucket>/bq_temp_location/'
}

environment = {
    'tempLocation': 'gs://<gcs-bucket>/temp',
    'machineType': 'n1-standard-16'
}

request = dataflow.projects().locations().templates().launch(
    projectId='<gcp-project-id>',
    gcsPath=dataflow_template,
    location='<location>',
    body={
        'jobName': '<job-name>',
        'parameters': parameters,
        'environment': environment
    }
)

response = request.execute()

Any ideas on what could be the cause of this difference? Any support from your side would be really appreciated. Thanks in advance.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
sabhyankar commented, Jun 15, 2020

"userAgent" : "Apache_Beam_SDK_for_Java/2.20.0(JDK_11_environment)",

Hi @AAB87 - Can you recompile with JDK 8 and see if the behavior changes? 😃

0 reactions
AAB87 commented, Jun 15, 2020

Hi @sabhyankar, now it works! Sorry, I did not realize that that specific Java version was required for the compilation :S :S

Thanks a lot for your help 😃
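
For anyone hitting the same symptom, the userAgent value quoted in the thread is what gave the JDK away. The following is only a minimal sketch, assuming the userAgent string always carries a suffix of the form "(JDK_<n>_environment)" as in the value quoted above; the function name is hypothetical, and builds made with other JDKs may format the string differently or omit the suffix entirely.

```python
import re


def jdk_major_from_user_agent(user_agent):
    """Return the JDK major version embedded in a Beam userAgent string.

    Assumes the "(JDK_<n>_environment)" suffix seen in this thread;
    returns None when no such suffix is present.
    """
    match = re.search(r"\(JDK_(\d+)_environment\)", user_agent)
    return int(match.group(1)) if match else None


# Value copied from the maintainer's comment in this thread.
user_agent = "Apache_Beam_SDK_for_Java/2.20.0(JDK_11_environment)"
print(jdk_major_from_user_agent(user_agent))  # prints 11
```

A result other than 8 for a self-compiled template would be a hint to recompile with JDK 8, which is the change that resolved this issue.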
