
[SUPPORT][HELP] SparkSQL can not read the latest change data without execute "refresh table xxx"


SparkSQL cannot read the latest changed data without executing "REFRESH TABLE xxx" after the data has been written through the Spark DataSource API.

To Reproduce

Steps to reproduce the behavior:

  1. run spark-shell and import org.apache.spark.sql.SaveMode
  2. create the tables like this:
spark.sql(
    s"""|CREATE TABLE IF NOT EXISTS `malawi`.`hudi_0_12_1_spark_test` (
        |     `id` INT
        |    ,`name` STRING
        |    ,`age` INT
        |    ,`sync_time` TIMESTAMP
        |) USING HUDI
        |TBLPROPERTIES (
        |     type = 'mor'
        |    ,primaryKey = 'id'
        |    ,preCombineField = 'sync_time'
        |    ,`hoodie.bucket.index.hash.field` = ''
        |    ,`hoodie.datasource.write.hive_style_partitioning` = 'false'
        |    ,`hoodie.table.keygenerator.class`='org.apache.hudi.keygen.ComplexKeyGenerator'
        |)
        |COMMENT 'hudi_0.12.1_test'""".stripMargin
)

spark.sql(
    s"""|create table `malawi`.`hudi_0_12_1_spark_test_rt`
        |using hudi
        |options(`hoodie.query.as.ro.table` = 'false')
        |location 'hdfs:/xxx/malawi/hudi_0_12_1_spark_test';
        |""".stripMargin
)
  3. make test data
var dfData = spark.sql(
    s"""|select 1 as id,'name1' as name, 18 as age, now() as sync_time 
        | union all 
        |select 2 as id,'name2' as name, 22 as age, now() as sync_time 
        | union all 
        |select 3 as id,'name3' as name, 23 as age, now() as sync_time
        |""".stripMargin
)

var dfData2 = spark.sql(
    s"""|select 4 as id,'name1' as name, 18 as age, now() as sync_time
        |""".stripMargin
)
  4. make hudi datasource options
var hoodieProp = Map(
    "hoodie.table.name" -> "hudi_0_12_1_spark_test",
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> "id",
    "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.partitionpath.field" -> "",
    "hoodie.datasource.write.precombine.field" -> "sync_time",
    "hoodie.metadata.enable" -> "true",
    "hoodie.upsert.shuffle.parallelism" -> "10",
    "hoodie.embed.timeline.server" -> "false"
)
  5. write data the first time
dfData.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
  6. query in spark sql
spark.sql(s"""|select *
              | from (
              |         select 'ori' as flag,a.* from `malawi`.`hudi_0_12_1_spark_test` a
              |         union all
              |         select '_rt' as flag,b.* from `malawi`.`hudi_0_12_1_spark_test_rt` b
              |      ) t
              |order by t.id asc, t.flag asc""".stripMargin
).show(false)

[screenshot: query results after the first write]

  7. write data a second time
dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
  8. repeat step 6; the row with id = 4 should now be queryable in the _rt table, but it is not:
[screenshot: query results, id = 4 missing]

  9. refresh the table (a programmatic alternative is sketched after these steps)

spark.sql("REFRESH TABLE `malawi`.`hudi_0_12_1_spark_test`")
  10. repeat step 6

[screenshot: query results after REFRESH TABLE]
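
If issuing REFRESH TABLE by hand after every DataSource write is inconvenient, the same workaround can be sketched programmatically (assuming the same spark-shell session as above, with dfData2 and hoodieProp still defined) using Spark's catalog API to invalidate the cached relation:

// sketch: write through the DataSource path as in step 7, then refresh the
// catalog entry so that the next SQL query picks up the new commit;
// spark.catalog.refreshTable is the programmatic equivalent of REFRESH TABLE
dfData2.write.format("org.apache.hudi")
  .options(hoodieProp)
  .mode(SaveMode.Append)
  .save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")

spark.catalog.refreshTable("malawi.hudi_0_12_1_spark_test")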

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 3.1.3

  • Hive version : 3.1.1

  • Hadoop version : 3.1.0

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: open
  • Created 10 months ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
JoshuaZhuCN commented, Dec 14, 2022

@alexeykudinkin i don’t understand what “write into the table by its id” means. Does it just mean using SQL such as insert into / update / delete from db.table to write data?

Correct. You can do the same from Spark DS.
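
For reference, a minimal sketch of such a SQL-path write in the same session (the id/name values below are illustrative, not from the original report; per the discussion here, rows written through Spark SQL are visible to subsequent queries without a refresh):

spark.sql(
    s"""|INSERT INTO `malawi`.`hudi_0_12_1_spark_test`
        |SELECT 5 AS id, 'name5' AS name, 30 AS age, now() AS sync_time
        |""".stripMargin
)

// the new row should be readable immediately, without REFRESH TABLE
spark.sql("select * from `malawi`.`hudi_0_12_1_spark_test` where id = 5").show(false)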

@alexeykudinkin I think the query engine should not restrict how data may be written in order for it to be queryable. Even for tables created with Spark SQL, the query engine should be able to read new data regardless of whether it was written through the Spark DataSource API, Spark SQL, the Java client, Flink SQL, or the Flink streaming API, without requiring users to perform extra operations for each write path.

@alexeykudinkin At present, the problem I encounter is not only that data written through the Spark DataSource cannot be read afterwards, but also that data written by Flink (using hive sync) cannot be read by Spark SQL. In other words, a SparkSQL query cannot immediately read new data written in any way other than SQL. Therefore, I think this is a problem that needs to be solved.

Interesting. Can you please create another issue specifically for this one as this hardly could be related?

I’ll verify it again.

0 reactions
JoshuaZhuCN commented, Dec 14, 2022

@alexeykudinkin At present, the problem I encounter is not only that data written through the Spark DataSource cannot be read afterwards, but also that data written by Flink (using hive sync) cannot be read by Spark SQL. In other words, a SparkSQL query cannot immediately read new data written in any way other than SQL. Therefore, I think this is a problem that needs to be solved.

Interesting. Can you please create another issue specifically for this one as this hardly could be related?

@alexeykudinkin @danny0405 here https://github.com/apache/hudi/issues/7452
