ERROR HoodieDeltaStreamer: Got error running delta sync once.
I’m running the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR 5.30.0 and hit an error whenever I run the suggested command a second or subsequent time.
Have DMS generate the raw .parquet files in S3. Use HoodieDeltaStreamer to process the raw .parquet files:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--packages org.apache.spark:spark-avro_2.12:2.4.4 \
--master yarn --deploy-mode client \
hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field dms_timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders
The first run works perfectly. However, when I keep “refreshing” the data from a scheduled job, I get the following error:
ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
- Hudi version: 0.5.2 (incubating)
- Spark version: 2.4.4
- Hive version: 3.1.2
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): No
Thank you.
Issue Analytics
- Created 3 years ago
- Comments: 9 (5 by maintainers)
Top GitHub Comments
@jcunhafonte: @bschell confirmed it works in master. Can you try using master, or wait for 0.6? (The release should happen in a week's time.)
@jcunhafonte: This can happen in non-continuous mode when there are no more files to ingest. I have opened a JIRA to get it fixed in 0.6.0: https://issues.apache.org/jira/browse/HUDI-1091. With no input data, automatic schema resolution isn't possible. In continuous mode, we cache the previous schema registry instance to handle this case. Can you try that?
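For anyone who wants to try the continuous-mode suggestion, here is a minimal sketch. It reuses the arguments from the report and appends DeltaStreamer's `--continuous` flag. The `run_deltastreamer` wrapper and the `DRY_RUN` variable are hypothetical names used only for illustration, the source root is assumed to use the `s3://` scheme, and this has not been verified against 0.5.2 on EMR.

```shell
#!/bin/sh
# Sketch only: the spark-submit arguments from the report, plus --continuous.
# run_deltastreamer and DRY_RUN are illustrative names, not part of Hudi.
set -eu

# Runs the launcher passed as "$@" (for example `spark-submit`, or
# `echo spark-submit` for a dry run) with the DeltaStreamer arguments appended.
run_deltastreamer() {
  "$@" \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    --packages org.apache.spark:spark-avro_2.12:2.4.4 \
    --master yarn --deploy-mode client \
    hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
    --table-type COPY_ON_WRITE \
    --source-ordering-field dms_timestamp \
    --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
    --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
    --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
    --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
    --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders \
    --continuous
}

# Dry run by default: print the command instead of submitting it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  run_deltastreamer echo spark-submit
else
  run_deltastreamer spark-submit
fi
```

Running the script with `DRY_RUN=0` submits the job. In continuous mode the job keeps running and polls the source itself, so it would take the place of the scheduled job rather than being launched by it.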