
[SUPPORT] HoodieDeltaStreamer not inferring schema from JsonDFSSource

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have successfully used HoodieDeltaStreamer in an AWS Glue ETL job with the ParquetDFSSource source class. However, when using JsonDFSSource, I am forced to specify a schema provider class.

2022-06-07 17:39:22,292 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
	at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:55)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:64)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:425)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:290)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
	at GlueApp$.main(logistics_hoodiedeltastreamer:49)
	at GlueApp.main(logistics_hoodiedeltastreamer)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

So I opted to use the FilebasedSchemaProvider class; however, the job fails with:

2022-06-07 18:09:01,368 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
	at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:129)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:612)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:142)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:114)
	at GlueApp$.main(logistics_hoodiedeltastreamer:50)
	at GlueApp.main(logistics_hoodiedeltastreamer)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:100)
	at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:127)
	... 15 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
	... 17 more
Caused by: org.apache.avro.SchemaParseException: Illegal character in: ...
	at org.apache.avro.Schema.validateName(Schema.java:1151)
	at org.apache.avro.Schema.access$200(Schema.java:81)
	at org.apache.avro.Schema$Name.<init>(Schema.java:489)
	at org.apache.avro.Schema$Names.get(Schema.java:1111)
	at org.apache.avro.Schema.parse(Schema.java:1263)
	at org.apache.avro.Schema$Parser.parse(Schema.java:1032)
	at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
	at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:59)
	... 22 more
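The root cause of this second failure is Avro's name validation (Schema.validateName): record and field names must start with a letter or underscore and contain only letters, digits, and underscores. JSON keys with characters like `-` or `.` therefore cannot be used verbatim in an .avsc file. A small self-contained check mirroring that rule (plain Python, not Hudi code) shows which names get rejected:

```python
import re

# Avro's validateName rule: first character must be a letter or underscore,
# remaining characters may be letters, digits, or underscores.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    return bool(AVRO_NAME.match(name))

# Typical offenders when schemas are written from raw JSON keys:
print(is_valid_avro_name("order_id"))     # valid
print(is_valid_avro_name("order-id"))     # invalid: '-' is illegal
print(is_valid_avro_name("2022_sales"))   # invalid: cannot start with a digit
print(is_valid_avro_name("items.count"))  # invalid: '.' is illegal
```

Renaming such fields in the .avsc (e.g. `order-id` to `order_id`) is the usual fix for `SchemaParseException: Illegal character`.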

To Reproduce

Steps to reproduce the behavior:

  1. Build hudi package and upload to S3
  2. Create AWS Glue Job and set --extra-jars as per the documentation
  3. Write a wrapper in Scala to call the HoodieDeltaStreamer constructor and pass the necessary arguments
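For context, the Glue wrapper in step 3 is roughly equivalent to the following standalone spark-submit invocation (bucket names, table name, and ordering field below are illustrative placeholders, not values from the actual job):

```shell
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  s3://my-bucket/jars/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --target-base-path s3://my-bucket/hudi/my_table \
  --target-table my_table \
  --props s3://my-bucket/config/my_table.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```

Omitting `--schemaprovider-class` with JsonDFSSource produces the first exception above.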

Expected behavior

Initially, I expected HoodieDeltaStreamer to work with schemaProviderClassName=null and infer the schema from the source data. I understand that the schema is explicitly embedded in Parquet files, which is why the ParquetDFSSource source class works without a schema provider.

After that, I expected the FilebasedSchemaProvider to work as it should. I am not sure whether I am missing anything in my setup. I did define hoodie.deltastreamer.schemaprovider.source.schema.file and hoodie.deltastreamer.schemaprovider.target.schema.file in my properties file.
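For anyone comparing setups, this is the shape of the schema-provider section of the properties file (the bucket and file paths are illustrative placeholders):

```properties
# FilebasedSchemaProvider reads Avro (.avsc) schema files from these paths.
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-bucket/schemas/source.avsc
# The target schema may point at the same file when source and target match.
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://my-bucket/schemas/source.avsc
```

Note that the .avsc files themselves must be valid Avro; otherwise FilebasedSchemaProvider fails at construction time with the SchemaParseException shown in the stack trace above.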

Environment Description

  • Hudi version : 0.10.1

  • Spark version : 3.1.1

  • Hive version : N/A

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

N/A

Stacktrace

See above

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Jun 8, 2022

Yes, internally Hudi uses sparkContext.textFile(pathStr) to read JSON files, and hence we need the schema for now. If you are interested, we can file a JIRA for adding a new source where we read it via spark.read.json, so that it automatically infers the schema.
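The distinction matters because textFile yields raw strings with no structure, whereas a JSON reader can sample parsed records and merge their observed field types into a schema. A toy illustration of that sampling idea (plain Python; this is not Hudi or Spark code):

```python
import json

def infer_schema(json_lines):
    """Merge observed field types across a sample of JSON records."""
    type_names = {bool: "boolean", int: "long", float: "double", str: "string"}
    observed = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            observed.setdefault(field, set()).add(type_names.get(type(value), "string"))
    # Fields seen with a single type keep it; conflicts fall back to string.
    return {f: (ts.pop() if len(ts) == 1 else "string") for f, ts in observed.items()}

sample = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "qty": 3.5}']
print(infer_schema(sample))  # {'id': 'long', 'name': 'string', 'qty': 'double'}
```

A source built on spark.read.json would get this inference (plus nullability and nested types) for free, which is what the proposed JIRA would add.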

0 reactions
atmabdalla commented, Jun 8, 2022

@nsivabalan I am curious though: why can HoodieDeltaStreamer not infer the schema from JsonDFSSource?

From the HoodieDeltaStreamer documentation:

By default, Spark will infer the schema of the source and use that inferred schema when writing to a table. If you need to explicitly define the schema you can use one of the following Schema Providers below.
