[SUPPORT][HELP] SparkSQL cannot read the latest change data without executing "refresh table xxx"
SparkSQL cannot read the latest change data without executing `refresh table xxx` after the data is written in datasource mode.
To Reproduce
Steps to reproduce the behavior:
1. Run `spark-shell` and import the required class:

```scala
import org.apache.spark.sql.SaveMode
```
2. Create the tables:

```scala
spark.sql(
  s"""|CREATE TABLE IF NOT EXISTS `malawi`.`hudi_0_12_1_spark_test` (
      |  `id` INT
      |  ,`name` STRING
      |  ,`age` INT
      |  ,`sync_time` TIMESTAMP
      |) USING HUDI
      |TBLPROPERTIES (
      |  type = 'mor'
      |  ,primaryKey = 'id'
      |  ,preCombineField = 'sync_time'
      |  ,`hoodie.bucket.index.hash.field` = ''
      |  ,`hoodie.datasource.write.hive_style_partitioning` = 'false'
      |  ,`hoodie.table.keygenerator.class` = 'org.apache.hudi.keygen.ComplexKeyGenerator'
      |)
      |COMMENT 'hudi_0.12.1_test'""".stripMargin
)

// Create a second table over the same location that exposes the
// real-time (snapshot) view of the MOR table.
spark.sql(
  s"""|create table `malawi`.`hudi_0_12_1_spark_test_rt`
      |using hudi
      |options(`hoodie.query.as.ro.table` = 'false')
      |location 'hdfs://xxx/malawi/hudi_0_12_1_spark_test'
      |""".stripMargin
)
```
3. Make test data:

```scala
val dfData = spark.sql(
  s"""|select 1 as id, 'name1' as name, 18 as age, now() as sync_time
      | union all
      |select 2 as id, 'name2' as name, 22 as age, now() as sync_time
      | union all
      |select 3 as id, 'name3' as name, 23 as age, now() as sync_time
      |""".stripMargin
)

val dfData2 = spark.sql(
  s"""|select 4 as id, 'name1' as name, 18 as age, now() as sync_time
      |""".stripMargin
)
```
4. Build the Hudi datasource options:

```scala
val hoodieProp = Map(
  "hoodie.table.name" -> "hudi_0_12_1_spark_test",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.partitionpath.field" -> "",
  "hoodie.datasource.write.precombine.field" -> "sync_time",
  "hoodie.metadata.enable" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  "hoodie.embed.timeline.server" -> "false"
)
```
5. Write the data for the first time:

```scala
dfData.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
```
6. Query in Spark SQL:

```scala
spark.sql(
  s"""|select *
      | from (
      |  select 'ori' as flag, a.* from `malawi`.`hudi_0_12_1_spark_test` a
      |  union all
      |  select '_rt' as flag, b.* from `malawi`.`hudi_0_12_1_spark_test_rt` b
      | ) t
      |order by t.id asc, t.flag asc""".stripMargin
).show(false)
```
7. Write the data a second time:

```scala
dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
```
8. Repeat step 6. The new row (id = 4) should be returned from the `_rt` table, but it is not; a minimal check is sketched below.
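A minimal check of the failure (a sketch, assuming the table names from the repro above):

```scala
// Sketch: before REFRESH TABLE, the new commit is not visible through the
// catalog table; this targeted query comes back empty.
spark.sql("select * from `malawi`.`hudi_0_12_1_spark_test_rt` where id = 4").show(false)
// expected: the row written in step 7; observed: an empty result set
```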
9. Refresh the table:

```scala
spark.sql("REFRESH TABLE `malawi`.`hudi_0_12_1_spark_test`")
```
10. Repeat step 6. After the refresh, the new row (id = 4) is returned.
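As a workaround (a sketch under the assumptions of the repro above, not a fix for the underlying issue), the refresh can be issued programmatically right after each datasource write, so subsequent Spark SQL queries see the new commit:

```scala
// Hypothetical workaround: invalidate Spark's cached relations for both
// catalog tables immediately after writing through the datasource API.
dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append)
  .save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
spark.catalog.refreshTable("malawi.hudi_0_12_1_spark_test")
spark.catalog.refreshTable("malawi.hudi_0_12_1_spark_test_rt")
```

`spark.catalog.refreshTable` is the programmatic equivalent of the `REFRESH TABLE` statement in step 9.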
Environment Description

- Hudi version: 0.12.1
- Spark version: 3.1.3
- Hive version: 3.1.1
- Hadoop version: 3.1.0
- Storage (HDFS/S3/GCS…): HDFS
- Running on Docker? (yes/no): no
Top GitHub Comments
@alexeykudinkin I think the query engine should not constrain how data must be written before it can be queried. Even for tables created through Spark SQL, the query engine should be able to read new data regardless of whether it was written via the Spark datasource API, Spark SQL, the Java client, Flink SQL, or the Flink streaming API, without requiring users to perform extra operations for each write path.
I’ll verify it again.
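For comparison, a path-based read (a minimal sketch assuming the same base path as in the repro; not taken from the original thread) builds a fresh relation on each `load` and is expected to reflect the latest commit without a refresh:

```scala
// Sketch: loading by path bypasses the session catalog's cached relation,
// so this snapshot read should include the row written in step 7.
val fresh = spark.read.format("org.apache.hudi")
  .load("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
fresh.filter("id = 4").show(false)
```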
@alexeykudinkin @danny0405 here https://github.com/apache/hudi/issues/7452