ERROR HoodieDeltaStreamer: Got error running delta sync once.


I’m running the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and hit an error whenever I run the suggested command a second or subsequent time.

Have DMS generate the raw .parquet files in S3. Use HoodieDeltaStreamer to process the raw .parquet files:

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --master yarn --deploy-mode client \
  hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders

The first run works perfectly. However, when I try to keep “refreshing” the data with a scheduled job, I get the following error:

ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
	at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

  • Hudi version: 0.5.2 (incubating)
  • Spark version: 2.4.4
  • Hive version: 3.1.2
  • Storage (HDFS/S3/GCS…): S3
  • Running on Docker? (yes/no): No

Thank you.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

bvaradar commented, Aug 12, 2020 (1 reaction)

@jcunhafonte: @bschell confirmed it works in master. Can you try using master or wait for 0.6? (The release should happen in a week's time.)

bvaradar commented, Jul 14, 2020 (1 reaction)

@jcunhafonte: This could happen when there are no more files to ingest while running in non-continuous mode. I have opened a JIRA to get it fixed in 0.6.0: https://issues.apache.org/jira/browse/HUDI-1091. With no input data, automatic schema resolution won't be possible. In continuous mode, we do cache the previous schema registry instance to handle this case. Can you try that?
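A minimal sketch of what trying continuous mode might look like: the command from the report with DeltaStreamer's `--continuous` flag added, so the job keeps polling the source in a loop instead of being relaunched by a scheduler. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences added for illustration and are not part of Hudi. The source root is written here with `s3://`. Whether this avoids the error on 0.5.2 is untested; it only follows the suggestion in the comment above.

```shell
#!/bin/sh
# Sketch only: "run" and DRY_RUN are illustrative helpers, not Hudi features.
set -eu

# Print the command by default; set DRY_RUN=0 to actually submit the job.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Same arguments as in the report, plus --continuous so DeltaStreamer stays up
# and syncs repeatedly rather than exiting after a single sync.
run spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --master yarn --deploy-mode client \
  hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --continuous \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders
```

Running the script as-is only prints the assembled command for review; `DRY_RUN=0` submits it. Because a continuous job does not exit on its own, it would replace the scheduled job rather than be launched by it.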


