Data misplaced when reading a table whose field order differs from the Spark schema
See original GitHub issue
I am trying to create a Dataset from a BigQuery table. The table has the same fields as the case class, but not in the same order. When creating the Dataset, columns get mapped to the wrong fields.
Given this table:
When loading the dataset:
case class NestedClass(
int3: Int,
int1: Int,
int2: Int)
case class ClassA(str: String, l1: NestedClass)
val schema = Encoders.product[ClassA].schema
val ds2 = spark.read
.schema(schema)
.option("table", "customers_sale.test_table")
.format("com.google.cloud.spark.bigquery")
.load()
.as[ClassA]
ds2.map(_.l1).show(false)
NB: notice that NestedClass has the same fields as the table, but in a different order: (int3, int1, int2) instead of (int1, int2, int3).
We got this:
+----+----+----+
|int1|int2|int3|
+----+----+----+
|2 |3 |1 |
+----+----+----+
We expected to get this:
+----+----+----+
|int1|int2|int3|
+----+----+----+
|1 |2 |3 |
+----+----+----+
=> The connector does not use field names to assign values; it correlates each field's position in the case class with the same position in the table.
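As a hedged workaround sketch until the fix lands: if we let the connector infer the table's own schema (no explicit .schema(...)), the column names are correct, and we can then reorder the nested struct by name to match the case class before calling .as[ClassA]. The table name and case classes below are the ones from the repro above; the approach itself is an assumption, not a confirmed connector feature.

```scala
import org.apache.spark.sql.functions.struct

// Read with the table's native schema so names are trustworthy,
// then rebuild the nested struct in the order the encoder expects.
val fixed = spark.read
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .select(
    $"str",
    // NestedClass declares (int3, int1, int2), so select in that order by name
    struct($"l1.int3", $"l1.int1", $"l1.int2").as("l1"))
  .as[ClassA]

fixed.map(_.l1).show(false)
```

Selecting by name makes the positional mapping irrelevant, because the struct we hand to the encoder is already in the declared field order.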
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:5 (2 by maintainers)
Fixed by PR #391
We indeed have an issue with nested structs.
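The fix amounts to matching fields by name rather than by position when converting a table record into a row for the user-supplied schema. This is an illustrative sketch of that idea, not the actual connector code; the helper name and signature are hypothetical.

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: reorder a record's values so they line up with the
// target (user-supplied) schema by looking up each target field's name in
// the source (table) schema, instead of assuming positions agree.
def reorderByName(values: Seq[Any], source: StructType, target: StructType): Seq[Any] = {
  val byName = source.fieldNames.zip(values).toMap
  // Fails fast if the target schema names a field the table does not have.
  target.fieldNames.toSeq.map { name =>
    byName.getOrElse(name, sys.error(s"Field '$name' not found in table schema"))
  }
}
```

Applied recursively to nested structs, this would map (int1, int2, int3) table values onto a (int3, int1, int2) case class correctly.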