Auto-scaling performance issues when compiling Google-provided template with Maven
I am using the GCS_Text_to_BigQuery batch pipeline template to launch a Dataflow job. With the latest compiled template available on GCS (gs://dataflow-templates/latest/GCS_Text_to_BigQuery), the performance is as expected: 10-15 minutes to process 600M records on an n1-standard-16 machine.
However, when I test the pipeline with the same code compiled by myself using Maven, the results are very different from those obtained with the Google-provided template on GCS: auto-scaling does not seem to work properly. The number of workers increases very slowly during the job, the job takes much longer, and the workers' CPU utilization stays very low.
I compiled the template with the following command, as indicated in your README file (the compilation succeeds):
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.TextIOToBigQuery \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<gcp-project-id> \
--stagingLocation=gs://<gcs-bucket>/staging \
--tempLocation=gs://<gcs-bucket>/temp \
--templateLocation=gs://<gcs-bucket>/templates/text-io-to-bq.json \
--runner=DataflowRunner"
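Before launching, it is worth confirming which JDK Maven resolves for this staging step, since (as the resolution further down in this thread shows) the Java version used to compile the template turned out to matter. A minimal check, assuming Maven and the JDK are on the PATH:
# Print the JDK that Maven picks up for the staging command above.
# Per the resolution below, it should report Java 1.8.x (JDK 8).
mvn -version
java -version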
I am launching the job using the Python API, with exactly the same parameters whether I use the Google-provided template on GCS or my own:
from googleapiclient.discovery import build
dataflow = build('dataflow', 'v1b3', cache_discovery=False)
# Template compiled by Google and provided on GCS (performance OK)
# dataflow_template = 'gs://dataflow-templates/latest/GCS_Text_to_BigQuery'
# The same template compiled by me with Maven (performance not OK)
dataflow_template = 'gs://<gcs-bucket>/templates/text-io-to-bq.json'
parameters = {
    'javascriptTextTransformFunctionName': '<udf-function>',
    'JSONPath': 'gs://<gcs-bucket>/resources/schemas/<bq-schema-file>',
    'javascriptTextTransformGcsPath': 'gs://<gcs-bucket>/resources/UDF/<udf-file>',
    'inputFilePattern': 'gs://<gcs-bucket>/data/*',
    'outputTable': '<gcp-project>:<bq-dataset>.<bq-table>',
    'bigQueryLoadingTemporaryDirectory': 'gs://<gcs-bucket>/bq_temp_location/'
}
environment = {
    'tempLocation': 'gs://<gcs-bucket>/temp',
    'machineType': 'n1-standard-16'
}
request = dataflow.projects().locations().templates().launch(
    projectId='<gcp-project-id>',
    gcsPath=dataflow_template,
    location='<location>',
    body={
        'jobName': '<job-name>',
        'parameters': parameters,
        'environment': environment
    }
)
response = request.execute()
Any ideas on what could be the cause of this difference? Any support from your side would be really appreciated. Thanks in advance.
Hi @AAB87 - Can you recompile with JDK 8 and see if the behavior changes? 😃
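For anyone hitting the same issue, a minimal sketch of switching to a JDK 8 toolchain before re-staging the template (the install path is illustrative and system-dependent):
# Point Maven at a JDK 8 installation; adjust the path to the local JDK 8 install.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
# Verify the switch, then re-run the mvn compile exec:java staging command above.
mvn -version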
Hi @sabhyankar, now it works! Sorry, I did not realize that this specific Java version was required for the compilation :S :S
Thanks a lot for your help 😃