[SUPPORT][HELP] SparkSQL cannot read the latest change data without executing "refresh table xxx"
SparkSQL cannot read the latest change data without executing `refresh table xxx` after the data is written in datasource mode.
To Reproduce
Steps to reproduce the behavior:
1. Run `spark-shell` and import the required class:

```scala
import org.apache.spark.sql.SaveMode
```
2. Create the tables:

```scala
spark.sql(
  s"""|CREATE TABLE IF NOT EXISTS `malawi`.`hudi_0_12_1_spark_test` (
      |  `id` INT
      |  ,`name` STRING
      |  ,`age` INT
      |  ,`sync_time` TIMESTAMP
      |) USING HUDI
      |TBLPROPERTIES (
      |  type = 'mor'
      |  ,primaryKey = 'id'
      |  ,preCombineField = 'sync_time'
      |  ,`hoodie.bucket.index.hash.field` = ''
      |  ,`hoodie.datasource.write.hive_style_partitioning` = 'false'
      |  ,`hoodie.table.keygenerator.class` = 'org.apache.hudi.keygen.ComplexKeyGenerator'
      |)
      |COMMENT 'hudi_0.12.1_test'""".stripMargin
)

// Create a second table over the same location that exposes the
// real-time (snapshot) view of the MOR table.
spark.sql(
  s"""|create table `malawi`.`hudi_0_12_1_spark_test_rt`
      |using hudi
      |options(`hoodie.query.as.ro.table` = 'false')
      |location 'hdfs://xxx/malawi/hudi_0_12_1_spark_test'
      |""".stripMargin
)
```
3. Make test data:

```scala
val dfData = spark.sql(
  s"""|select 1 as id, 'name1' as name, 18 as age, now() as sync_time
      | union all
      |select 2 as id, 'name2' as name, 22 as age, now() as sync_time
      | union all
      |select 3 as id, 'name3' as name, 23 as age, now() as sync_time
      |""".stripMargin
)

val dfData2 = spark.sql(
  s"""|select 4 as id, 'name1' as name, 18 as age, now() as sync_time
      |""".stripMargin
)
```
4. Build the Hudi datasource options:

```scala
val hoodieProp = Map(
  "hoodie.table.name" -> "hudi_0_12_1_spark_test",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.partitionpath.field" -> "",
  "hoodie.datasource.write.precombine.field" -> "sync_time",
  "hoodie.metadata.enable" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  "hoodie.embed.timeline.server" -> "false"
)
```
5. Write the data for the first time:

```scala
dfData.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
```
6. Query in Spark SQL:

```scala
spark.sql(
  s"""|select *
      | from (
      |  select 'ori' as flag, a.* from `malawi`.`hudi_0_12_1_spark_test` a
      |  union all
      |  select '_rt' as flag, b.* from `malawi`.`hudi_0_12_1_spark_test_rt` b
      | ) t
      |order by t.id asc, t.flag asc""".stripMargin
).show(false)
```
7. Write the data a second time:

```scala
dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
```
8. Repeat step 6. The new row (id = 4) should be returned from the `_rt` table, but it is not; a minimal check is sketched below.
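A minimal check of the failure (a sketch, assuming the table names from the repro above):

```scala
// Sketch: before REFRESH TABLE, the new commit is not visible through the
// catalog table; this targeted query comes back empty.
spark.sql("select * from `malawi`.`hudi_0_12_1_spark_test_rt` where id = 4").show(false)
// expected: the row written in step 7; observed: an empty result set
```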
9. Refresh the table:

```scala
spark.sql("REFRESH TABLE `malawi`.`hudi_0_12_1_spark_test`")
```
10. Repeat step 6. After the refresh, the new row (id = 4) is returned.
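As a workaround (a sketch under the assumptions of the repro above, not a fix for the underlying issue), the refresh can be issued programmatically right after each datasource write, so subsequent Spark SQL queries see the new commit:

```scala
// Hypothetical workaround: invalidate Spark's cached relations for both
// catalog tables immediately after writing through the datasource API.
dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append)
  .save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
spark.catalog.refreshTable("malawi.hudi_0_12_1_spark_test")
spark.catalog.refreshTable("malawi.hudi_0_12_1_spark_test_rt")
```

`spark.catalog.refreshTable` is the programmatic equivalent of the `REFRESH TABLE` statement in step 9.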
Environment Description

- Hudi version: 0.12.1
- Spark version: 3.1.3
- Hive version: 3.1.1
- Hadoop version: 3.1.0
- Storage (HDFS/S3/GCS…): HDFS
- Running on Docker? (yes/no): no
Top GitHub Comments
@alexeykudinkin I think the query engine should not constrain how data must be written before it can be queried. Even for tables created through Spark SQL, the query engine should be able to read new data regardless of whether it was written via the Spark datasource API, Spark SQL, the Java client, Flink SQL, or the Flink streaming API, without requiring users to perform extra operations for each write path.
I’ll verify it again.
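For comparison, a path-based read (a minimal sketch assuming the same base path as in the repro; not taken from the original thread) builds a fresh relation on each `load` and is expected to reflect the latest commit without a refresh:

```scala
// Sketch: loading by path bypasses the session catalog's cached relation,
// so this snapshot read should include the row written in step 7.
val fresh = spark.read.format("org.apache.hudi")
  .load("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
fresh.filter("id = 4").show(false)
```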
@alexeykudinkin @danny0405 here https://github.com/apache/hudi/issues/7452