Connection caching does not work.
When connecting to Snowflake using the Spark connector, connection caching does not occur.
This makes the connector completely unusable if MFA/Duo Push is enforced on Snowflake: the following piece of code creates three connections to Snowflake and therefore requires three MFA authorizations!
```scala
import net.snowflake.spark.snowflake.Parameters
import org.apache.spark.sql.DataFrame

val options = Map(
  Parameters.PARAM_SF_DATABASE  -> "db",
  Parameters.PARAM_SF_SCHEMA    -> "schema",
  Parameters.PARAM_SF_QUERY     -> """SELECT * FROM table""",
  Parameters.PARAM_SF_URL       -> fqdn,
  Parameters.PARAM_SF_USER      -> username,
  Parameters.PARAM_SF_PASSWORD  -> password,
  Parameters.PARAM_SF_ROLE      -> role,
  Parameters.PARAM_SF_WAREHOUSE -> wh,
  "ALLOW_CLIENT_MFA_CACHING"    -> "true",
  "CLIENT_SESSION_KEEP_ALIVE"   -> "true"
)

val df: DataFrame = spark.read
  .format("snowflake")
  .options(options)
  .load()

df.count()
```
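For context: Snowflake's MFA token caching requires the account-level ALLOW_CLIENT_MFA_CACHING parameter (enabled via ALTER ACCOUNT) plus an authenticator of `username_password_mfa` on the connection, so passing `ALLOW_CLIENT_MFA_CACHING` as a session option, as above, is not enough on its own. Below is only a sketch of what that configuration might look like through the connector, assuming the account parameter is enabled and that the connector forwards `sfAuthenticator` to the JDBC driver (not verified for 2.10.1-spark_3.0):

```scala
// Sketch only: assumes an account admin has run
//   ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE;
// and that the connector passes this authenticator through to the JDBC driver.
val mfaCachingOptions = options ++ Map(
  "sfAuthenticator" -> "username_password_mfa" // ask the driver to cache and reuse the MFA token
)

val dfMfaCached: DataFrame = spark.read
  .format("snowflake")
  .options(mfaCachingOptions)
  .load()
```

Even if the token is cached, this only reduces MFA prompts; it does not make the connector reuse connections.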
While stepping through the code with breakpoints, I found that none of JDBCWrapper, DriverManager, SnowflakeDriver, SnowflakeConnectionV1, or DefaultSFConnectionHandler caches or reuses connections.
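This matches plain JDBC behavior: `java.sql.DriverManager` hands out a new physical connection on every call, so each call in the sketch below opens a separate Snowflake session (and, with Duo, a separate push). The sketch reuses the placeholder values (`fqdn`, `username`, `password`) from the snippet above and assumes `fqdn` is the account host name:

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties

// Each getConnection call opens a brand-new Snowflake session; neither
// DriverManager nor the Snowflake JDBC driver reuses an existing one.
val props = new Properties()
props.put("user", username)
props.put("password", password)

val jdbcUrl = s"jdbc:snowflake://$fqdn"

val c1: Connection = DriverManager.getConnection(jdbcUrl, props) // MFA prompt #1
val c2: Connection = DriverManager.getConnection(jdbcUrl, props) // MFA prompt #2
```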
Versions:
- spark: 3.0.1
- scala: 2.12
- spark-snowflake: 2.10.1-spark_3.0
- snowflake-jdbc: 3.13.14
So, unless I missed some piece of documentation showing how to cache connections, I cannot use the Spark connector for Snowflake in my environment.
I thought about subclassing DefaultSource and providing a different JDBCWrapper than DefaultJDBCWrapper in the constructor. However, DefaultJDBCWrapper is hardcoded in multiple places and is also a private class.
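For illustration only, the kind of reuse such a wrapper would need is roughly a cache keyed by the connection parameters, so that identical option maps share one physical connection. This is a minimal sketch with hypothetical names (`CachedConnections`, `connectionFor`); it is not a hook the connector exposes, precisely because DefaultJDBCWrapper is private:

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties
import scala.collection.concurrent.TrieMap

// Minimal sketch: reuse one JDBC connection per distinct parameter map.
// Hypothetical helper, not part of spark-snowflake.
object CachedConnections {
  private val cache = TrieMap.empty[Map[String, String], Connection]

  def connectionFor(params: Map[String, String]): Connection = synchronized {
    cache.get(params).filter(c => !c.isClosed) match {
      case Some(existing) => existing                     // reuse: no new MFA prompt
      case None =>
        val props = new Properties()
        params.foreach { case (k, v) => props.put(k, v) }
        val url = s"jdbc:snowflake://${params("sfURL")}"  // assumes the host is under "sfURL"
        val fresh = DriverManager.getConnection(url, props)
        cache.put(params, fresh)
        fresh
    }
  }
}
```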
Top GitHub Comments
The goal of sending data to Snowflake is to be able to join/filter remotely. Let's say I have a table A from which I select some rows based on a complex condition and bring them back to the Spark cluster for other processing. Then I send the result to Snowflake and inner join table B on it.
In my setup, table A has hundreds of millions of rows, which are processed and then filtered down to a few million (it is not possible to do that filtering purely in SQL). I upload the result to Snowflake and join it with table B, which contains billions of rows. If I let Spark do the join locally, Snowflake has to materialize and transfer billions of rows just to discard them once on the Spark cluster. Table B is a view over some other data, and I'd rather let Snowflake do the filtering efficiently.
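A minimal sketch of this write-then-join pattern with the connector, using hypothetical table and column names (`TMP_FILTERED_A`, `B`, `id`) and the same kind of options map as in the snippet above:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch, not the exact job: filteredA is the few-million-row result computed on Spark.
def joinRemotely(filteredA: DataFrame, sfOptions: Map[String, String]): DataFrame = {
  // 1. Upload the filtered rows to a Snowflake table.
  filteredA.write
    .format("snowflake")
    .options(sfOptions)
    .option("dbtable", "TMP_FILTERED_A")
    .mode(SaveMode.Overwrite)
    .save()

  // 2. Let Snowflake perform the join against the huge view B,
  //    so only the joined result is transferred back to Spark.
  spark.read
    .format("snowflake")
    .options(sfOptions)
    .option("query", "SELECT b.* FROM B b JOIN TMP_FILTERED_A a ON a.id = b.id")
    .load()
}
```

Each of the write and read steps opens its own connection(s), which is why the lack of connection caching multiplies the MFA prompts in this workflow.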
Ok, that makes sense. Thanks for clarifying @michellemay.