spark-snowflake writing a dataframe ignores option error_on_column_count_mismatch=false
Environment
- python 2.7
- spark 2.2.0
- snowflake-jdbc-3.4.2.jar
- spark-snowflake_2.11-2.2.8.jar
- emr 5.11.1
Setup
create a snowflake table that has an auto-incrementing id column, using the file format option error_on_column_count_mismatch=false
create table my_db.my_schema.test_table (
    id integer autoincrement primary key,
    user_id integer
)
stage_file_format = (TYPE=PARQUET, ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE);
create a dataframe that only has user_id
from pyspark.sql import Row
data = [Row(user_id=x) for x in [99, 98, 97]]
df = spark.createDataFrame(data)
df.write.format('jdbc')...
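For completeness, the elided write call looks roughly like the following. This is a minimal sketch, not the exact code from the issue: the connection values are placeholders, and since the environment above lists the spark-snowflake connector jar, the sketch assumes the connector's usual source name and sfOptions/dbtable parameters rather than plain jdbc.
# Minimal sketch of the write; connection values below are placeholders, not from the issue.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "my_db",
    "sfSchema": "my_schema",
    "sfWarehouse": "<warehouse>",
}
(df.write
    .format("net.snowflake.spark.snowflake")  # spark-snowflake connector source name
    .options(**sf_options)
    .option("dbtable", "test_table")
    .mode("append")
    .save())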
Issue
Writing a dataframe that has fewer columns than the destination table raises a net.snowflake.client.jdbc.SnowflakeSQLException and suggests using the file format option error_on_column_count_mismatch=false to ignore this error. When following that suggestion, the same exception occurs. This operation works on MySQL and Redshift, but not on Snowflake.
Expected behavior
Given the setup above, calling a spark jdbc write on a dataframe with column user_id should write the data into the destination table that has columns id, user_id, incrementing the autoincrement id column while populating the user_id column with data from the dataframe. (Confirmed this works using snowflake-sqlalchemy and snowflake SQL.)
Assume a setup where you have columns user_id, name, zip_code on a snowflake table and a dataframe with columns user_id, name. Calling a spark jdbc write should populate the user_id, name columns in the destination table, while leaving zip_code NULL.
Temporary workarounds
Although not ideal, we use the following workarounds for both behaviors; this may be useful for other readers/users.
For behavior 1, we write the dataframe with column user_id into a temporary table that only has user_id, and then perform an insert-from-select statement using sqlalchemy (see the sketch below).
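A rough sketch of that workaround follows. The staging table name and connection URL are hypothetical, and the engine is assumed to come from snowflake-sqlalchemy:
# Sketch of workaround 1: Spark writes df (user_id only) into a staging table,
# then an insert-from-select moves the rows so the autoincrement id fills itself.
# The staging table name and connection string are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("snowflake://<user>:<password>@<account>/my_db/my_schema")
with engine.connect() as conn:
    conn.execute(text(
        "insert into my_db.my_schema.test_table (user_id) "
        "select user_id from my_db.my_schema.test_table_stage"
    ))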
For behavior 2, we generate the missing columns in the dataframe using df.withColumn('zip_code', F.lit(None)) and then successfully write the dataframe into the destination table, filling NULL for the columns that were originally missing (see the sketch below).
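A minimal sketch of that workaround, assuming zip_code is a string column in the target table; casting the null literal to the target column's type avoids leaving an untyped NullType column in the schema:
from pyspark.sql import functions as F

# Sketch of workaround 2: add the missing column as a typed NULL before writing.
# The cast to 'string' assumes zip_code is a string column in the destination table.
df_full = df.withColumn("zip_code", F.lit(None).cast("string"))
# df_full is then written to the destination table exactly as in the setup above.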
Top GitHub Comments
@etduwx Is there a way to do that automatically without specifying every column in the mapping? I'm having a similar issue on a table with 85 columns, the last of which is an autoincrementing key, and it would be preferable not to have to add all the columns to the hive view creation script twice.
Thanks!
Hi @tchoedak,
We recommend using the column-mapping option here. As copied from the release notes:
This should allow a one-to-one mapping of Spark dataframe columns to Snowflake target table columns.
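For readers who land here: the option being recommended is presumably the connector's columnmap parameter (an assumption based on the description above, not stated verbatim in the comment). A minimal sketch, reusing the placeholder connection options from the earlier sketch and the example table's columns:
# Sketch of the column-mapping approach, assuming the spark-snowflake "columnmap" option.
# Only user_id is mapped; the autoincrement id column is left to populate itself.
(df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)                      # placeholder connection options as above
    .option("dbtable", "test_table")
    .option("columnmap", "Map(user_id -> USER_ID)")
    .mode("append")
    .save())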