Hive Sync Error when creating a table with partition
I have a bunch of data that I am writing to S3 and doing a Hive sync during the write process. The write to S3 succeeds, but the Hive sync fails.
Hudi version - 0.5.0-SNAPSHOT (from the master branch)
Hive version - 2.3.2-amzn-2
Spark version - 2.4.3
Sample DataFrame
| gender|comments| title| cc| ip_address|last_name| id| birthdate| salary| registration_dttm| country| email|first_name| key | timestamp | date |
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| Female| 1E+02| Internal Auditor|6759521864920116| 1.197.201.2| Jordan| 1| 3/8/1971| 49756.53|2016-02-03T07:55:29Z| Indonesia| ajordan0@com.com| Amanda|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Male| | Accountant IV| |218.111.175.34| Freeman| 2| 1/16/1968|150280.17|2016-02-03T17:04:03Z| Canada| afreeman1@is.gd| Albert|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Structural Engineer|6767119071901597| 7.161.136.94| Morgan| 3| 2/1/1960|144972.51|2016-02-03T01:09:31Z| Russia|emorgan2@altervis...| Evelyn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| |Senior Cost Accou...|3576031598965625| 140.35.109.83| Riley| 4| 4/8/1997| 90263.05|2016-02-03T12:36:21Z| China| driley3@gmpg.org| Denise|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| | | |5602256255204850|169.113.235.40| Burns| 5| | |2016-02-03T05:05:31Z| South Africa|cburns4@miitbeian...| Carlos|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
|Transgender| | Account Executive|3583136326049310|195.131.81.179| White| 6| 2/25/1983| 69227.11|2016-02-03T07:22:34Z| Indonesia| kwhite5@google.com| Kathryn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Unknown| |Senior Financial ...|3582641366974690|232.234.81.197| Holmes| 7|12/18/1987| 14247.62|2016-02-03T08:33:08Z| Portugal|sholmes6@foxnews.com| Samuel|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Secret| | Web Developer IV| | 91.235.51.73| Howell| 8| 3/1/1962|186469.43|2016-02-03T06:47:06Z|Bosnia and Herzeg...| hhowell7@eepurl.com| Harry|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Obvious| 1E+02|Software Test Eng...| | 132.31.53.61| Foster| 9| 3/27/1992|231067.84|2016-02-03T03:52:53Z| South Korea| jfoster8@yelp.com| Jose|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Health Coach IV|3574254110301671|143.28.251.245| Stewart| 10| 1/28/1997| 27234.28|2016-02-03T18:29:47Z| Nigeria|estewart9@opensou...| Emily|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
Spark Scala Code
cleanedDF
.write.format("org.apache.hudi")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
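// `date` is used both as the Hudi partition path field here and as the Hive sync partition field below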
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, catalogName)
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "date")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
.option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
.option("path", basePath)
.mode(SaveMode.Append)
.save()
Error
837579 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
837580 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HiveSyncTool - Table hudi_test is not found. Creating it
837594 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837602 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837698 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420] terminated with error
org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:467)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:265)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:465)
... 53 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
... 27 more
As far as I can tell, PARTITIONED BY in a CREATE TABLE statement is valid HiveQL, so the ParseException appears to be about the date column name itself rather than about declaring the partition at CREATE time.
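One way to confirm this is to replay the generated DDL over Hive JDBC, the same path HoodieHiveClient.updateHiveSQL takes. The sketch below is illustrative, not a verbatim repro: the connection URL is a placeholder, the table is a stripped-down stand-in for the generated statement, and it assumes the hive-jdbc driver is on the classpath. Consistent with the comments further down, only the backtick-escaped variant compiles.

import java.sql.DriverManager

// Illustrative repro over Hive JDBC; <hive-host> is a placeholder.
val conn = DriverManager.getConnection("jdbc:hive2://<hive-host>:10000/default")
val stmt = conn.createStatement()

// Fails with the same ParseException: the unquoted reserved word `date`
// is not accepted as a column name by Hive's parser.
// stmt.execute("CREATE EXTERNAL TABLE repro (id bigint) PARTITIONED BY (date string) STORED AS PARQUET")

// Succeeds: the partition column is backtick-escaped.
stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS repro (id bigint) PARTITIONED BY (`date` string) STORED AS PARQUET")

stmt.close()
conn.close()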
Top GitHub Comments
doh. Thanks for chasing this down @firecast. Opened HUDI-244 to fix this. I think the other fields are escaped correctly.
It looks like the issue is with the partition name __xxx_date__. When I tried the SQL hoodie was generating, Hive gave the following error: FAILED: ParseException line 1:503 cannot recognize input near '__xx_date__' 'string' ')' in column specification. So I escaped the partition column manually in the SQL and it worked. It looks like HUDI does not escape the column name in the PARTITIONED BY clause.
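Until HUDI-244 lands, one possible interim workaround is to avoid the reserved word altogether by renaming the partition column before the write. This is only a sketch, reusing cleanedDF and the options from the snippet above; partition_date is an illustrative name, not something Hudi requires.

// Rename the reserved-word column so the generated DDL no longer trips Hive's parser.
val safeDF = cleanedDF.withColumnRenamed("date", "partition_date")

safeDF
.write.format("org.apache.hudi")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition_date")
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partition_date")
// ...all other options unchanged from the original snippet...
.mode(SaveMode.Append)
.save()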