
[SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS


I am attempting to create a hudi table using a parquet file on S3. The motivation for this approach is based on this Hudi blog: https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi

As a first test of using DeltaStreamer to ingest a full initial batch load, I used the parquet files from an AWS blog, located at s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/ https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/

At first I used the spark shell on EMR to load the data into a dataframe and view it, which worked with no issues (screenshot omitted).
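
For reference, a minimal sketch of that sanity check (the exact session isn't shown in the original; the spark-shell invocation and inspection calls are assumed):

spark-shell <<'EOF'
// Load the sample ELB parquet data and take a quick look
val df = spark.read.parquet("s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/")
df.printSchema()
df.show(5)
EOF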

I then attempted to use Hudi DeltaStreamer as per my understanding of the documentation; however, I ran into a couple of issues.

Steps to reproduce the behavior:

  1. Ran the following:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
/usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
  --source-ordering-field request_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test --target-table hudi_aws_test \
--hoodie-conf hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP

Stacktrace:

Exception in thread "main" java.io.IOException: Could not load key generator class org.apache.hudi.keygen.SimpleKeyGenerator
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:190)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:552)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:129)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:99)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:464)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class 
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:98)
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:92)
	... 17 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:87)
	... 19 more
Caused by: java.lang.IllegalArgumentException: Property hoodie.datasource.write.partitionpath.field not found
	at org.apache.hudi.common.config.TypedProperties.checkKey(TypedProperties.java:42)
	at org.apache.hudi.common.config.TypedProperties.getString(TypedProperties.java:47)
	at org.apache.hudi.keygen.SimpleKeyGenerator.<init>(SimpleKeyGenerator.java:36)
	... 24 more
  2. I understand that for a timestamp-based partition field it is recommended to use a CustomKeyGenerator, so I ran the following:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
/usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
  --source-ordering-field request_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test --target-table hudi_aws_test \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator,hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP

This gives rise to a different error:

Exception in thread "main" java.io.IOException: Could not load key generator class org.apache.hudi.keygen.CustomKeyGenerator,hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:190)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:552)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:129)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:99)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:464)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to load class
	at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:56)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:87)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:98)
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:92)
	... 17 more
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.keygen.CustomKeyGenerator,hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:53)
	... 20 more

Expected behavior

I've clearly specified the partition path field via hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP, yet this consistently fails for me, even on other parquet files. I assumed the problem might be that it needs to be added in the dfs-source.properties file, so I added the following to that file:

include=base.properties
hoodie.datasource.write.recordkey.field=request_timestamp
hoodie.datasource.write.partitionpath.field=request_timestamp

However, that didn't fix anything. I also passed the location of the file under --props, but DeltaStreamer couldn't find it, even though I am able to display the contents of the file in the terminal using cat.
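
One plausible cause for the --props lookup failure (an assumption on my part, not confirmed in this thread): DeltaStreamer resolves the properties path through the Hadoop FileSystem API, so on EMR a bare local path may be resolved against HDFS rather than the local disk. A scheme-qualified URI, shown here with hypothetical paths, may work:

--props file:///home/hadoop/dfs-source.properties
# or keep the file alongside the data in S3:
--props s3://mysqlcdc-stream-prod/hudi_tryout/dfs-source.properties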

Suspecting the choice of key generator was the issue, I tried several others, including Custom, Complex, and TimeBased. However, the class could not be loaded for any of them.

Please let me know if I am doing anything wrong here.

Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4.7-amzn-0 (Scala 2.11.12)

  • Hive version :

  • Hadoop version : Hadoop 2.10.1-amzn-0

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Feb 6, 2021

Looks like you are required to set --schemaprovider-class. This blog covers the multi-table transformer with an example; it might be of help to you.
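
As an illustration, a rough sketch of supplying a schema provider via the file-based implementation (the .avsc paths are hypothetical, the other flags mirror the command from the issue, and the blog linked in the original comment is not reproduced here):

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field request_timestamp \
  --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test --target-table hudi_aws_test \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://mysqlcdc-stream-prod/schemas/elb.avsc \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://mysqlcdc-stream-prod/schemas/elb.avsc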

@pratyakshsharma: Can you please follow up on this ticket?

1 reaction
bvaradar commented, Jan 6, 2021

I think the command-line parameters are not passed correctly; each property needs its own --hoodie-conf flag (note that the ClassNotFoundException above shows the entire comma-joined string being treated as the class name):

--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
--hoodie-conf hoodie.datasource.write.recordkey.field=request_timestamp:TIMESTAMP
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1
--hoodie-conf hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP
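
Put together with the rest of the original invocation, a corrected command would look roughly like this (a sketch based on the correction above, keeping the user's original record key field; not verified on a cluster):

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
  /usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
  --source-ordering-field request_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test --target-table hudi_aws_test \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=request_timestamp \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP

Note that TIMESTAMP-based partitioning with CustomKeyGenerator typically also requires the hoodie.deltastreamer.keygen.timebased.* settings (timestamp type and output date format), which are not covered in this thread.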

On a related note, your record key and partition path are the same field. This is OK if you are testing out a sample dataset, but it won't scale in the real world, as you would end up with one record per directory.
