Data misplaced when reading a table that does not have the same field positions as the spark schema

I am trying to create a Dataset from a BigQuery table. The table has the same fields as the case class, but not in the same order. When the Dataset is created, columns are mapped to the wrong fields.

Given this table: [image]

When loading the dataset:


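import org.apache.spark.sql.Encoders
import spark.implicits._ // assumes an existing SparkSession named `spark`

// Field order deliberately differs from the table's (int1, int2, int3)
case class NestedClass(
  int3: Int,
  int1: Int,
  int2: Int)

case class ClassA(str: String, l1: NestedClass)

// Derive the Spark schema from the case class
val schema = Encoders.product[ClassA].schema

val ds2 = spark.read
  .schema(schema)
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .as[ClassA]

// Show only the nested struct
ds2.map(_.l1).show(false)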

NB: NestedClass has the same fields as the table but in a different order: (int3, int1, int2) instead of (int1, int2, int3).

We got this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|2   |3   |1   |
+----+----+----+

We expected to get this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

=> The connector does not assign values by field name; it matches each field's position in the case class to the same position in the table.
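
To make the difference concrete, here is a minimal plain-Scala sketch of positional versus name-based mapping. It is only an illustration of the behaviour described above, not the connector's actual code; `FieldMapping`, `byPosition` and `byName` are hypothetical names, and the row values are the ones from the example table.

```scala
// Hypothetical illustration only; this is not the connector's implementation.
object FieldMapping {
  // Column names in the order the table stores them
  val tableFields: Seq[String] = Seq("int1", "int2", "int3")
  // Field names in the order the case class declares them
  val schemaFields: Seq[String] = Seq("int3", "int1", "int2")
  // One row of values, in table order
  val row: Seq[Int] = Seq(1, 2, 3)

  // Positional mapping: the i-th schema field receives the i-th value of the row
  def byPosition(schema: Seq[String], values: Seq[Int]): Map[String, Int] =
    schema.zip(values).toMap

  // Name-based mapping: find each schema field's index in the table first
  def byName(schema: Seq[String], table: Seq[String], values: Seq[Int]): Map[String, Int] =
    schema.map(field => field -> values(table.indexOf(field))).toMap
}
```

With these inputs, `byPosition` yields int1=2, int2=3, int3=1, which is the misplaced row reported above, while `byName` yields int1=1, int2=2, int3=3, which is the expected row.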

Top GitHub Comments

1 reaction
davidrabinowitz commented, May 13, 2021

Fixed by PR #391

0 reactions
LaurentValdenaire commented, Apr 27, 2021

We indeed have an issue with nested structs
