
[SUPPORT] SaveMode.Append fails on renamed hudi tables

See original GitHub issue

Describe the problem you faced

Hello team, we recently upgraded from emr-5.30.2 to emr-5.31.1 and noticed failures in our pipelines that do incremental appends to Hudi tables.

Issue: SaveMode.Append throws an exception and fails on renamed Hudi tables; this affects Hudi 0.6 and above.

To Reproduce

Steps to reproduce the behavior (a Scala sketch of these steps follows the list):

  1. Create a Hudi table at an S3 path
  2. Rename the table using spark.sql(s"ALTER TABLE $oldTableName RENAME TO $newTableName")
  3. Use Spark df.write with mode("append") to save into newTableName
  4. An exception is thrown
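
A minimal, hypothetical sketch of these steps in Scala, assuming a Hive-enabled SparkSession named spark and a DataFrame df whose schema matches the table; the bucket, path, and table names are illustrative, not taken from the original report:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val basePath     = "s3://my-bucket/hudi/events"  // hypothetical S3 table path
val oldTableName = "events_v1"                   // hypothetical table names
val newTableName = "events"

// Initial write creates the table; .hoodie/hoodie.properties records oldTableName.
df.write.format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, oldTableName)
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, oldTableName)
  .mode(SaveMode.Overwrite)
  .save(basePath)

// The rename only updates the Hive metastore; hoodie.properties is left untouched.
spark.sql(s"ALTER TABLE $oldTableName RENAME TO $newTableName")

// Incremental append under the new name now fails with
// "hoodie table with name <old_table_name> already exists at <table-path>".
df.write.format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, newTableName)
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, newTableName)
  .mode(SaveMode.Append)
  .save(basePath)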

Expected behavior

SaveMode.Append works on renamed tables when the new table name is supplied via DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> $newTableName.

Environment Description (EMR-5.31.1)

  • Hudi version : Hudi 0.6

  • Spark version : 2.4.6

  • Hive version : 2.3.7

  • Hadoop version : 2.10.0

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : No

Additional context

Related code: HoodieSparkSqlWriter.scala#L295

HoodieTableConfig.tableName is set from the .hoodie/hoodie.properties file. When the table is renamed with Spark SQL, that file is not updated, so HoodieSparkSqlWriter still compares the old table name recorded in HoodieTableConfig against the new table name and rejects the write.
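
As a rough illustration of the failure mode, here is a simplified sketch in Scala, not the actual Hudi source; the method signature and variable names are assumptions:

import org.apache.spark.sql.SaveMode
import org.apache.hudi.exception.HoodieException

// Simplified sketch of the check described above: on Append, the table name
// persisted in .hoodie/hoodie.properties at creation time is compared with the
// name the writer was invoked with. Since ALTER TABLE ... RENAME never rewrites
// hoodie.properties, the two names diverge and the write is aborted.
def handleSaveModes(mode: SaveMode, tablePath: String,
                    existingTableName: String,   // read from hoodie.properties
                    requestedTableName: String): Unit = {
  if (mode == SaveMode.Append && existingTableName != requestedTableName) {
    throw new HoodieException(
      s"hoodie table with name $existingTableName already exists at $tablePath")
  }
}

In other words, after the rename hoodie.properties still contains hoodie.table.name=<old_table_name>, which is why the exception below reports the old name.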

Stacktrace

org.apache.hudi.exception.HoodieException: hoodie table with name <old_table_name> already exists at s3://<table-path>
  at org.apache.hudi.HoodieSparkSqlWriter$.handleSaveModes(HoodieSparkSqlWriter.scala:297)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:109)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
  at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
  at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
  at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
YannByron commented, Jan 26, 2022

@ranjitha-shenoy I also guess so. Maybe this is a bug in Hudi 0.6, and we can’t patch a bugfix for this old version.

0 reactions
ranjitha-shenoy commented, Jan 25, 2022

@YannByron I have not been able to test it with Hudi 0.10, but I believe the table name check introduced in HoodieSparkSqlWriter.scala#L295 is what started the issue of not being able to append to renamed tables.

Top Results From Across the Web

Writing Data | Apache Hudi
Note: After the initial creation of a table, this value must stay consistent when writing to (updating) the table using the Spark SaveMode.Append...

Schema Evolution - Apache Hudi
Schema evolution allows users to easily change the current schema of a Hudi table to adapt to the data that is changing over...

FAQs | Apache Hudi
When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/ ...

All Configurations | Apache Hudi
Comma separated list of file paths to read within a Hudi table. ... .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath);

FAQs - Apache Hudi
Your current job is rewriting entire table/partition to deal with updates, ... Even for append-only data streams, Hudi supports key based ...
