Connection caching does not work.
When connecting to Snowflake using the Spark connector, connection caching does not occur.
This makes the connector completely unusable if MFA/Duo Push is enforced on Snowflake: the following piece of code creates three connections to Snowflake and therefore requires three MFA authorizations!
```scala
import net.snowflake.spark.snowflake.Parameters
import org.apache.spark.sql.DataFrame

val options = Map(
  Parameters.PARAM_SF_DATABASE  -> "db",
  Parameters.PARAM_SF_SCHEMA    -> "schema",
  Parameters.PARAM_SF_QUERY     -> """SELECT * FROM table""",
  Parameters.PARAM_SF_URL       -> fqdn,
  Parameters.PARAM_SF_USER      -> username,
  Parameters.PARAM_SF_PASSWORD  -> password,
  Parameters.PARAM_SF_ROLE      -> role,
  Parameters.PARAM_SF_WAREHOUSE -> wh,
  "ALLOW_CLIENT_MFA_CACHING"    -> "true",
  "CLIENT_SESSION_KEEP_ALIVE"   -> "true"
)

val df: DataFrame = spark.read
  .format("snowflake")
  .options(options)
  .load()

df.count()
```
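For context: Snowflake's MFA token caching requires the account-level ALLOW_CLIENT_MFA_CACHING parameter (enabled via ALTER ACCOUNT) plus an authenticator of `username_password_mfa` on the connection, so passing `ALLOW_CLIENT_MFA_CACHING` as a session option, as above, is not enough on its own. Below is only a sketch of what that configuration might look like through the connector, assuming the account parameter is enabled and that the connector forwards `sfAuthenticator` to the JDBC driver (not verified for 2.10.1-spark_3.0):

```scala
// Sketch only: assumes an account admin has run
//   ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE;
// and that the connector passes this authenticator through to the JDBC driver.
val mfaCachingOptions = options ++ Map(
  "sfAuthenticator" -> "username_password_mfa" // ask the driver to cache and reuse the MFA token
)

val dfMfaCached: DataFrame = spark.read
  .format("snowflake")
  .options(mfaCachingOptions)
  .load()
```

Even if the token is cached, this only reduces MFA prompts; it does not make the connector reuse connections.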
While stepping through the code with breakpoints, I found that none of JDBCWrapper, DriverManager, SnowflakeDriver, SnowflakeConnectionV1, or DefaultSFConnectionHandler caches or reuses connections.
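This matches plain JDBC behavior: `java.sql.DriverManager` hands out a new physical connection on every call, so each call in the sketch below opens a separate Snowflake session (and, with Duo, a separate push). The sketch reuses the placeholder values (`fqdn`, `username`, `password`) from the snippet above and assumes `fqdn` is the account host name:

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties

// Each getConnection call opens a brand-new Snowflake session; neither
// DriverManager nor the Snowflake JDBC driver reuses an existing one.
val props = new Properties()
props.put("user", username)
props.put("password", password)

val jdbcUrl = s"jdbc:snowflake://$fqdn"

val c1: Connection = DriverManager.getConnection(jdbcUrl, props) // MFA prompt #1
val c2: Connection = DriverManager.getConnection(jdbcUrl, props) // MFA prompt #2
```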
Versions:
- spark: 3.0.1
- scala: 2.12
- spark-snowflake: 2.10.1-spark_3.0
- snowflake-jdbc: 3.13.14
So, unless I missed some piece of documentation showing how to cache connections, I cannot use the Spark connector for Snowflake in my environment.
I thought about subclassing DefaultSource and providing a different JDBCWrapper than DefaultJDBCWrapper in the constructor. However, DefaultJDBCWrapper is hardcoded in multiple places and is also a private class.
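For illustration only, the kind of reuse such a wrapper would need is roughly a cache keyed by the connection parameters, so that identical option maps share one physical connection. This is a minimal sketch with hypothetical names (`CachedConnections`, `connectionFor`); it is not a hook the connector exposes, precisely because DefaultJDBCWrapper is private:

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties
import scala.collection.concurrent.TrieMap

// Minimal sketch: reuse one JDBC connection per distinct parameter map.
// Hypothetical helper, not part of spark-snowflake.
object CachedConnections {
  private val cache = TrieMap.empty[Map[String, String], Connection]

  def connectionFor(params: Map[String, String]): Connection = synchronized {
    cache.get(params).filter(c => !c.isClosed) match {
      case Some(existing) => existing                     // reuse: no new MFA prompt
      case None =>
        val props = new Properties()
        params.foreach { case (k, v) => props.put(k, v) }
        val url = s"jdbc:snowflake://${params("sfURL")}"  // assumes the host is under "sfURL"
        val fresh = DriverManager.getConnection(url, props)
        cache.put(params, fresh)
        fresh
    }
  }
}
```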
Top GitHub Comments
The goal of sending data to Snowflake is to be able to join/filter remotely. Let's say I have a table A from which I select some rows based on a complex condition and bring them back to the Spark cluster for other processing. Then I send the result to Snowflake and inner join table B on it.
In my setup, table A has hundreds of millions of rows, which are processed and then filtered down to a few million (it is not possible to do that filtering purely in SQL). I upload the result to Snowflake and join it with table B, which contains billions of rows. If I let Spark do the join locally, Snowflake has to materialize and transfer billions of rows just to discard them once on the Spark cluster. Table B is a view over some other data, and I'd rather let Snowflake do the filtering efficiently.
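A minimal sketch of this write-then-join pattern with the connector, using hypothetical table and column names (`TMP_FILTERED_A`, `B`, `id`) and the same kind of options map as in the snippet above:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch, not the exact job: filteredA is the few-million-row result computed on Spark.
def joinRemotely(filteredA: DataFrame, sfOptions: Map[String, String]): DataFrame = {
  // 1. Upload the filtered rows to a Snowflake table.
  filteredA.write
    .format("snowflake")
    .options(sfOptions)
    .option("dbtable", "TMP_FILTERED_A")
    .mode(SaveMode.Overwrite)
    .save()

  // 2. Let Snowflake perform the join against the huge view B,
  //    so only the joined result is transferred back to Spark.
  spark.read
    .format("snowflake")
    .options(sfOptions)
    .option("query", "SELECT b.* FROM B b JOIN TMP_FILTERED_A a ON a.id = b.id")
    .load()
}
```

Each of the write and read steps opens its own connection(s), which is why the lack of connection caching multiplies the MFA prompts in this workflow.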
Ok, that makes sense. Thanks for clarifying @michellemay.