Hive Sync Error when creating a table with partition
I have a bunch of data that I am writing to S3 and doing a Hive sync during the write process. The write to S3 succeeds, but the Hive sync fails.
Hudi version - 0.5.0-SNAPSHOT (from the master branch)
Hive version - 2.3.2-amzn-2
Spark version - 2.4.3
Sample DataFrame
| gender|comments| title| cc| ip_address|last_name| id| birthdate| salary| registration_dttm| country| email|first_name| key | timestamp | date |
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| Female| 1E+02| Internal Auditor|6759521864920116| 1.197.201.2| Jordan| 1| 3/8/1971| 49756.53|2016-02-03T07:55:29Z| Indonesia| ajordan0@com.com| Amanda|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Male| | Accountant IV| |218.111.175.34| Freeman| 2| 1/16/1968|150280.17|2016-02-03T17:04:03Z| Canada| afreeman1@is.gd| Albert|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Structural Engineer|6767119071901597| 7.161.136.94| Morgan| 3| 2/1/1960|144972.51|2016-02-03T01:09:31Z| Russia|emorgan2@altervis...| Evelyn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| |Senior Cost Accou...|3576031598965625| 140.35.109.83| Riley| 4| 4/8/1997| 90263.05|2016-02-03T12:36:21Z| China| driley3@gmpg.org| Denise|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| | | |5602256255204850|169.113.235.40| Burns| 5| | |2016-02-03T05:05:31Z| South Africa|cburns4@miitbeian...| Carlos|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
|Transgender| | Account Executive|3583136326049310|195.131.81.179| White| 6| 2/25/1983| 69227.11|2016-02-03T07:22:34Z| Indonesia| kwhite5@google.com| Kathryn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Unknown| |Senior Financial ...|3582641366974690|232.234.81.197| Holmes| 7|12/18/1987| 14247.62|2016-02-03T08:33:08Z| Portugal|sholmes6@foxnews.com| Samuel|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Secret| | Web Developer IV| | 91.235.51.73| Howell| 8| 3/1/1962|186469.43|2016-02-03T06:47:06Z|Bosnia and Herzeg...| hhowell7@eepurl.com| Harry|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Obvious| 1E+02|Software Test Eng...| | 132.31.53.61| Foster| 9| 3/27/1992|231067.84|2016-02-03T03:52:53Z| South Korea| jfoster8@yelp.com| Jose|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Health Coach IV|3574254110301671|143.28.251.245| Stewart| 10| 1/28/1997| 27234.28|2016-02-03T18:29:47Z| Nigeria|estewart9@opensou...| Emily|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
Spark Scala Code
cleanedDF
.write.format("org.apache.hudi")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
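// `date` is used both as the Hudi partition path field here and as the Hive sync partition field below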
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, catalogName)
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "date")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
.option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
.option("path", basePath)
.mode(SaveMode.Append)
.save()
Error
837579 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
837580 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HiveSyncTool - Table hudi_test is not found. Creating it
837594 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837602 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837698 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420] terminated with error
org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:467)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:265)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:465)
... 53 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
... 27 more
As far as I can tell, PARTITIONED BY in a CREATE TABLE statement is valid HiveQL, so the ParseException appears to be about the date column name itself rather than about declaring the partition at CREATE time.
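One way to confirm this is to replay the generated DDL over Hive JDBC, the same path HoodieHiveClient.updateHiveSQL takes. The sketch below is illustrative, not a verbatim repro: the connection URL is a placeholder, the table is a stripped-down stand-in for the generated statement, and it assumes the hive-jdbc driver is on the classpath. Consistent with the comments further down, only the backtick-escaped variant compiles.

import java.sql.DriverManager

// Illustrative repro over Hive JDBC; <hive-host> is a placeholder.
val conn = DriverManager.getConnection("jdbc:hive2://<hive-host>:10000/default")
val stmt = conn.createStatement()

// Fails with the same ParseException: the unquoted reserved word `date`
// is not accepted as a column name by Hive's parser.
// stmt.execute("CREATE EXTERNAL TABLE repro (id bigint) PARTITIONED BY (date string) STORED AS PARQUET")

// Succeeds: the partition column is backtick-escaped.
stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS repro (id bigint) PARTITIONED BY (`date` string) STORED AS PARQUET")

stmt.close()
conn.close()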
Top GitHub Comments
doh. Thanks for chasing this down @firecast. Opened HUDI-244 to fix this. I think the other fields are escaped correctly.
It looks like the issue is with the partition name __xxx_date__. When I tried the SQL hoodie was generating, Hive gave the following error: FAILED: ParseException line 1:503 cannot recognize input near '__xx_date__' 'string' ')' in column specification. So I escaped the partition column manually in the SQL and it worked. It looks like HUDI does not escape the column name in the PARTITIONED BY clause.
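Until HUDI-244 lands, one possible interim workaround is to avoid the reserved word altogether by renaming the partition column before the write. This is only a sketch, reusing cleanedDF and the options from the snippet above; partition_date is an illustrative name, not something Hudi requires.

// Rename the reserved-word column so the generated DDL no longer trips Hive's parser.
val safeDF = cleanedDF.withColumnRenamed("date", "partition_date")

safeDF
.write.format("org.apache.hudi")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition_date")
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partition_date")
// ...all other options unchanged from the original snippet...
.mode(SaveMode.Append)
.save()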