
spark-snowflake writing a dataframe ignores option error_on_column_count_mismatch=false

See original GitHub issue

Environment

  • python 2.7
  • spark 2.2.0
  • snowflake-jdbc-3.4.2.jar
  • spark-snowflake_2.11-2.2.8.jar
  • emr 5.11.1

Setup

Create a Snowflake table that has an auto-incrementing id column, using the file format option error_on_column_count_mismatch=false:

create table my_db.my_schema.test_table (
    id integer autoincrement primary key,
    user_id integer
)
stage_file_format = (TYPE=PARQUET, ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE);

Create a dataframe that only has user_id:

from pyspark.sql import Row

data = [Row(user_id=x) for x in [99, 98, 97]]
df = spark.createDataFrame(data)
df.write.format('jdbc')...
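
For reference, a complete write call with the dedicated spark-snowflake connector (rather than plain JDBC) might look like the sketch below, reusing df from above. All connection values are placeholder assumptions, not values from the issue:

# All connection values below are placeholder assumptions.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "my_db",
    "sfSchema": "my_schema",
    "sfWarehouse": "my_wh",
}

(df.write
    .format("net.snowflake.spark.snowflake")  # the connector's source name
    .options(**sf_options)
    .option("dbtable", "test_table")
    .mode("append")
    .save())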

Issue

Writing a dataframe that has fewer columns than the destination table raises a net.snowflake.client.jdbc.SnowflakeSQLException suggesting the file format option error_on_column_count_mismatch=false to ignore the error. When following that suggestion, the same exception occurs. The same operation works on MySQL and Redshift, but not on Snowflake.

Expected behavior

Assuming the setup above, calling a Spark JDBC write on a dataframe with column user_id should write the data into the destination table that has columns id and user_id, incrementing the autoincrement id column while populating user_id with data from the dataframe. (Confirmed this works using snowflake-sqlalchemy and Snowflake SQL.)

Now assume a setup where a Snowflake table has columns user_id, name, and zip_code, and the dataframe has columns user_id and name. Calling a Spark JDBC write should populate the user_id and name columns in the destination table while leaving zip_code NULL.

Temporary workarounds

Although not ideal, we use the following workarounds for both behaviors, which may be useful for other readers/users.

For behavior 1, we write the dataframe with column user_id into a temporary table that only has user_id, and then perform an insert-from-select statement using sqlalchemy.
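
A minimal sketch of this first workaround, assuming a snowflake-sqlalchemy engine and reusing the sf_options dict from the sketch above; the connection string and staging table name are illustrative, not from the issue:

from sqlalchemy import create_engine, text

# Illustrative connection string; account and credentials are placeholders.
engine = create_engine("snowflake://my_user:my_password@myaccount/my_db/my_schema")

# 1. Write the dataframe into a staging table whose columns match exactly.
df.write.format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "test_table_staging") \
    .mode("overwrite") \
    .save()

# 2. Insert-from-select into the real table; Snowflake populates the
#    autoincrement id column on its own.
with engine.begin() as conn:
    conn.execute(text(
        "insert into my_db.my_schema.test_table (user_id) "
        "select user_id from my_db.my_schema.test_table_staging"
    ))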

For behavior 2, we generate the missing columns in the dataframe using df.withColumn('zip_code', F.lit(None)) and then successfully write the dataframe into the destination table, filling NULL for the columns that were originally missing.
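
A sketch of the second workaround; the cast type ('string' here) and the destination table name are assumptions, since lit(None) alone produces an untyped NullType column that some writers reject:

from pyspark.sql import functions as F

# Add the missing column as NULL, cast to the target column's type
# (the type here is an assumption for illustration).
df_full = df.withColumn("zip_code", F.lit(None).cast("string"))

# 'users' is a hypothetical destination table with user_id, name, zip_code.
df_full.write.format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "users") \
    .mode("append") \
    .save()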

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
Ortizer commented, Jun 13, 2018

@etduwx Is there a way to do that automatically without specifying every column in the mapping? I'm having a similar issue on a table with 85 columns, the last of which is an autoincrementing key, and it would be preferable not to have to add all the columns to the Hive view creation script twice.

Thanks!

2 reactions
etduwx commented, May 30, 2018

Hi @tchoedak,

We recommend that the column-mapping option be used here. As copied from the release notes:

  • Support for column-mapping. Columns may be written out-of-order, or to an arbitrary set of equal-quantity, type-compatible columns from a Dataframe to a Snowflake table. Example:

    df.write.format(SNOWFLAKE_SOURCE_NAME)
      .options(connectorOptionsNoTable)
      .option("dbtable", dbtable)
      .option("columnmap", Map("one" -> "sf_col2", "two" -> "sf_col1").toString())
      .mode(SaveMode.Append)
      .save()

This should allow a one-to-one mapping of Spark dataframe columns to Snowflake target table columns.
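
In PySpark, the same option takes the mapping as a plain string in the Scala Map(...) form shown above. A hedged sketch, with column names assumed, that also builds the mapping automatically from df.columns, which may help for wide tables like the 85-column case asked about above:

# The columnmap value is a string in Scala's Map(...) toString form.
columnmap = "Map(user_id -> USER_ID)"  # column names are assumptions

# For wide tables, one way to avoid listing every column by hand is to
# map each dataframe column to an identically named Snowflake column,
# leaving the autoincrement column unmapped:
columnmap = "Map({})".format(
    ", ".join("{0} -> {0}".format(c) for c in df.columns)
)

df.write.format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "test_table") \
    .option("columnmap", columnmap) \
    .mode("append") \
    .save()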

Read more comments on GitHub

Top Results From Across the Web

Can we use File format option in Spark Connector
Is it possible to pass a FileFormat option when using the Spark Connector for Snowflake? ... No, you can't pass false for ERROR_ON_COLUMN_COUNT_MISMATCH...

Spark to Snowflake, column number mismatch
SnowflakeSQLException : Number of columns in file (37) does not match ... option error_on_column_count_mismatch=false to ignore this error.

Snowflake Spark Connector with Examples
Snowflake Introduction; Apache Spark; Snowflake Spark Connector; Snowflake maven dependency; Create snowflake Database & Table; Snowflake connection options ...

How do I slice up dataframes into smaller chunks and write to ...
Option 1: pass in Snowflake Python Connector function pd_writer. from snowflake.connector.pandas_tools import pd_writer # Specify that the ...

Snowflake Spark Integration: A Comprehensive Guide 101
It provides its users with an option for storing their data in the Cloud. Snowflake ... How to Write Spark DataFrame into Snowflake...
