
[SUPPORT] HoodieDeltaStreamer not inferring schema from JsonDFSSource

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have successfully used HoodieDeltaStreamer in an AWS Glue ETL job with the ParquetDFSSource source class. However, when using JsonDFSSource, I am forced to specify a schema provider class.

2022-06-07 17:39:22,292 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
	at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:55)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:64)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:425)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:290)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
	at GlueApp$.main(logistics_hoodiedeltastreamer:49)
	at GlueApp.main(logistics_hoodiedeltastreamer)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

So I opted to use the FilebasedSchemaProvider class; however, the job fails with:

2022-06-07 18:09:01,368 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
	at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:129)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:612)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:142)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:114)
	at GlueApp$.main(logistics_hoodiedeltastreamer:50)
	at GlueApp.main(logistics_hoodiedeltastreamer)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:100)
	at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:127)
	... 15 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
	... 17 more
Caused by: org.apache.avro.SchemaParseException: Illegal character in: ...
	at org.apache.avro.Schema.validateName(Schema.java:1151)
	at org.apache.avro.Schema.access$200(Schema.java:81)
	at org.apache.avro.Schema$Name.<init>(Schema.java:489)
	at org.apache.avro.Schema$Names.get(Schema.java:1111)
	at org.apache.avro.Schema.parse(Schema.java:1263)
	at org.apache.avro.Schema$Parser.parse(Schema.java:1032)
	at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
	at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:59)
	... 22 more
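The root cause of this second failure is Avro's name validation (Schema.validateName): record and field names must start with a letter or underscore and contain only letters, digits, and underscores. JSON keys with characters like `-` or `.` therefore cannot be used verbatim in an .avsc file. A small self-contained check mirroring that rule (plain Python, not Hudi code) shows which names get rejected:

```python
import re

# Avro's validateName rule: first character must be a letter or underscore,
# remaining characters may be letters, digits, or underscores.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    return bool(AVRO_NAME.match(name))

# Typical offenders when schemas are written from raw JSON keys:
print(is_valid_avro_name("order_id"))     # valid
print(is_valid_avro_name("order-id"))     # invalid: '-' is illegal
print(is_valid_avro_name("2022_sales"))   # invalid: cannot start with a digit
print(is_valid_avro_name("items.count"))  # invalid: '.' is illegal
```

Renaming such fields in the .avsc (e.g. `order-id` to `order_id`) is the usual fix for `SchemaParseException: Illegal character`.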

To Reproduce

Steps to reproduce the behavior:

  1. Build hudi package and upload to S3
  2. Create AWS Glue Job and set --extra-jars as per the documentation
  3. Write a wrapper in Scala to call the HoodieDeltaStreamer constructor and pass the necessary arguments
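For context, the Glue wrapper in step 3 is roughly equivalent to the following standalone spark-submit invocation (bucket names, table name, and ordering field below are illustrative placeholders, not values from the actual job):

```shell
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  s3://my-bucket/jars/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --target-base-path s3://my-bucket/hudi/my_table \
  --target-table my_table \
  --props s3://my-bucket/config/my_table.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```

Omitting `--schemaprovider-class` with JsonDFSSource produces the first exception above.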

Expected behavior

Initially, I expected HoodieDeltaStreamer to work with schemaProviderClassName=null and infer the schema from the source data. I understand that the schema is explicitly embedded in Parquet files, which is why the ParquetDFSSource source class works without a schema provider.

After that, I expected the FilebasedSchemaProvider to work as it should. I am not sure whether I am missing anything in my setup. I did define hoodie.deltastreamer.schemaprovider.source.schema.file and hoodie.deltastreamer.schemaprovider.target.schema.file in my properties file.
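For anyone comparing setups, this is the shape of the schema-provider section of the properties file (the bucket and file paths are illustrative placeholders):

```properties
# FilebasedSchemaProvider reads Avro (.avsc) schema files from these paths.
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-bucket/schemas/source.avsc
# The target schema may point at the same file when source and target match.
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://my-bucket/schemas/source.avsc
```

Note that the .avsc files themselves must be valid Avro; otherwise FilebasedSchemaProvider fails at construction time with the SchemaParseException shown in the stack trace above.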

Environment Description

  • Hudi version : 0.10.1

  • Spark version : 3.1.1

  • Hive version : N/A

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

N/A

Stacktrace

See above

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Jun 8, 2022

Yes, internally Hudi uses sparkContext.textFile(pathStr) to read JSON files, and hence we need the schema for now. If you are interested, we can file a JIRA for adding a new source where we read it via spark.read.json, so that it automatically infers the schema.
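The distinction matters because textFile yields raw strings with no structure, whereas a JSON reader can sample parsed records and merge their observed field types into a schema. A toy illustration of that sampling idea (plain Python; this is not Hudi or Spark code):

```python
import json

def infer_schema(json_lines):
    """Merge observed field types across a sample of JSON records."""
    type_names = {bool: "boolean", int: "long", float: "double", str: "string"}
    observed = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            observed.setdefault(field, set()).add(type_names.get(type(value), "string"))
    # Fields seen with a single type keep it; conflicts fall back to string.
    return {f: (ts.pop() if len(ts) == 1 else "string") for f, ts in observed.items()}

sample = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "qty": 3.5}']
print(infer_schema(sample))  # {'id': 'long', 'name': 'string', 'qty': 'double'}
```

A source built on spark.read.json would get this inference (plus nullability and nested types) for free, which is what the proposed JIRA would add.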

0 reactions
atmabdalla commented, Jun 8, 2022

@nsivabalan I am curious though: why can HoodieDeltaStreamer not infer the schema from JsonDFSSource?

From the HoodieDeltaStreamer documentation:

By default, Spark will infer the schema of the source and use that inferred schema when writing to a table. If you need to explicitly define the schema you can use one of the following Schema Providers below.
