[SUPPORT] HoodieDeltaStreamer not inferring schema from JsonDFSSource
Describe the problem you faced
I have been successful in using HoodieDeltaStreamer on AWS Glue ETL with the source class ParquetDFSSource. However, when dealing with JsonDFSSource, I am forced to specify a schema provider class.
2022-06-07 17:39:22,292 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:55)
at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:64)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:425)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:290)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
at GlueApp$.main(logistics_hoodiedeltastreamer:49)
at GlueApp.main(logistics_hoodiedeltastreamer)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
So I opted to use the FilebasedSchemaProvider class; however, my job fails with:
2022-06-07 18:09:01,368 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:129)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:612)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:142)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:114)
at GlueApp$.main(logistics_hoodiedeltastreamer:50)
at GlueApp.main(logistics_hoodiedeltastreamer)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:100)
at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:127)
... 15 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
... 17 more
Caused by: org.apache.avro.SchemaParseException: Illegal character in: ...
at org.apache.avro.Schema.validateName(Schema.java:1151)
at org.apache.avro.Schema.access$200(Schema.java:81)
at org.apache.avro.Schema$Name.<init>(Schema.java:489)
at org.apache.avro.Schema$Names.get(Schema.java:1111)
at org.apache.avro.Schema.parse(Schema.java:1263)
at org.apache.avro.Schema$Parser.parse(Schema.java:1032)
at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:59)
... 22 more
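The innermost cause above is Avro rejecting a name while FilebasedSchemaProvider parses the schema file. The trace elides which character was rejected, so the following is only a minimal sketch for locating candidates: it walks a schema loaded as JSON and reports record and field names that do not match the naming rule in the Avro specification (a letter or underscore first, then letters, digits, or underscores). The function name find_invalid_names and the sample schema are hypothetical, and the Java library's own check may differ in detail.

```python
import json
import re

# Naming rule from the Avro specification: [A-Za-z_][A-Za-z0-9_]*
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def find_invalid_names(schema, path="$"):
    """Return (path, name) pairs for names that break the Avro naming rule.

    Hypothetical helper for illustration; it is not part of Hudi or Avro.
    """
    bad = []
    if isinstance(schema, list):  # a union: check every branch
        for i, branch in enumerate(schema):
            bad += find_invalid_names(branch, f"{path}[{i}]")
    elif isinstance(schema, dict):
        name = schema.get("name")
        if isinstance(name, str):
            # A full name may be dot-separated; every component must be valid.
            if not all(AVRO_NAME.fullmatch(part) for part in name.split(".")):
                bad.append((path, name))
        for field in schema.get("fields", []):
            field_name = field.get("name")
            field_path = f"{path}.{field_name}"
            if not isinstance(field_name, str) or not AVRO_NAME.fullmatch(field_name):
                bad.append((field_path, field_name))
            bad += find_invalid_names(field.get("type"), field_path)
        for key in ("items", "values"):  # array items and map values
            if key in schema:
                bad += find_invalid_names(schema[key], f"{path}.{key}")
        if isinstance(schema.get("type"), (dict, list)):  # nested type definition
            bad += find_invalid_names(schema["type"], path)
    return bad


# Made-up schema with two deliberately invalid names (a hyphen and a space).
example = json.loads("""
{"type": "record", "name": "logistics-event", "fields": [
  {"name": "order id", "type": "string"},
  {"name": "ts", "type": "long"}
]}
""")
problems = find_invalid_names(example)
for where, name in problems:
    print(f"invalid Avro name at {where}: {name!r}")
```

Running this against the contents of the file referenced by the source and target schema properties would show whether a hyphen, space, or similar character in a name is a plausible trigger for the exception.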
To Reproduce
Steps to reproduce the behavior:
- Build hudi package and upload to S3
- Create AWS Glue Job and set --extra-jars as per the documentation
- Write a wrapper in Scala to call the HoodieDeltaStreamer constructor and pass the necessary arguments
Expected behavior
Initially I expected HoodieDeltaStreamer to work with schemaProviderClassName=null and to infer the schema based on the source data. I understand that the schema is explicitly defined in parquet files, hence the ParquetDFSSource source class working without the schemaProvider.
After that, I expected the FilebasedSchemaProvider to work as it should. Not sure if I am missing anything in my setup. I did define hoodie.deltastreamer.schemaprovider.source.schema.file and hoodie.deltastreamer.schemaprovider.target.schema.file in my properties file.
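For context on the Parquet versus JSON difference: a Parquet file embeds its schema, while newline-delimited JSON has none, so any inference has to scan the records and collect the types it sees. The sketch below is a toy illustration of that idea using only the standard library. It is not how Hudi or Spark infers schemas; the function name infer_fields and the sample records are hypothetical, and each line is assumed to hold one JSON object.

```python
import json

# Map Python types produced by json.loads to Avro-like type names.
TYPE_NAMES = {
    str: "string",
    bool: "boolean",
    int: "long",
    float: "double",
    type(None): "null",
    dict: "record",
    list: "array",
}


def infer_fields(json_lines):
    """Collect, per top-level field, the set of type names seen across records.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    seen = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen.setdefault(key, set()).add(TYPE_NAMES[type(value)])
    return seen


# Made-up records: "name" is sometimes null, "weight" is sometimes absent.
lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": null, "weight": 1.5}',
]
fields = infer_fields(lines)
for field, types in sorted(fields.items()):
    print(field, sorted(types))
```

Even this small case shows why inference is a design decision and not a free default: a field can be missing or null in some records, so the result depends on how many records are scanned.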
Environment Description
- Hudi version : 10.0.1
- Spark version : 3.1.1
- Hive version : N/A
- Hadoop version : N/A
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional context
N/A
Stacktrace
See above
Top GitHub Comments
Yes. Internally, Hudi uses sparkContext.textFile(pathStr) to read JSON files, and hence we need the schema for now. If you are interested, we can file a JIRA for adding a new source that reads the files as spark.json, so that it automatically infers the schema.

@nsivabalan I am curious though: why can HoodieDeltaStreamer not infer the schema from JsonDFSSource? From the HoodieDeltaStreamer documentation: