
Connection caching does not work.

See original GitHub issue

When connecting to Snowflake using the Spark connector, connection caching does not occur.

It is completely unusable if MFA/Duo Push is used on Snowflake. This piece of code creates three connections to Snowflake and therefore requires three MFA authorizations!

    // Imports assumed for this snippet:
    import net.snowflake.spark.snowflake.Parameters
    import org.apache.spark.sql.DataFrame

    val options = Map(
      Parameters.PARAM_SF_DATABASE -> "db",
      Parameters.PARAM_SF_SCHEMA -> "schema",
      Parameters.PARAM_SF_QUERY -> """SELECT * FROM table""",
      Parameters.PARAM_SF_URL -> fqdn,
      Parameters.PARAM_SF_USER -> username,
      Parameters.PARAM_SF_PASSWORD -> password,
      Parameters.PARAM_SF_ROLE -> role,
      Parameters.PARAM_SF_WAREHOUSE -> wh,
      "ALLOW_CLIENT_MFA_CACHING" -> "true",
      "CLIENT_SESSION_KEEP_ALIVE" -> "true"
    )

    val df: DataFrame = spark.read
      .format("snowflake")
      .options(options)
      .load()

    df.count()

While stepping through the code with breakpoints, I found that none of JDBCWrapper, DriverManager, SnowflakeDriver, SnowflakeConnectionV1, or DefaultSFConnectionHandler caches or reuses connections.

Versions:

  • spark: 3.0.1
  • scala: 2.12
  • spark-snowflake: 2.10.1-spark_3.0
  • snowflake-jdbc: 3.13.14

So, unless I missed some piece of documentation showing how to cache connections, I cannot use the Spark connector for Snowflake in my environment.
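
For comparison, here is what driver-level MFA token caching is supposed to look like with plain JDBC. This is a minimal sketch, assuming an admin has enabled the account-level ALLOW_CLIENT_MFA_CACHING parameter and connecting with the USERNAME_PASSWORD_MFA authenticator; the placeholder URL and credentials are illustrative:

    import java.sql.DriverManager
    import java.util.Properties

    val props = new Properties()
    props.put("user", "<username>")
    props.put("password", "<password>")
    // Opt in to MFA token caching at the driver level.
    props.put("authenticator", "username_password_mfa")

    val url = "jdbc:snowflake://<account>.snowflakecomputing.com"

    // The first connection should trigger a single Duo push...
    val c1 = DriverManager.getConnection(url, props)
    c1.close()

    // ...and the second should reuse the cached MFA token, with no new push.
    val c2 = DriverManager.getConnection(url, props)
    c2.close()

Even with the token cached, each Spark read still opens fresh connections; the token cache would only remove the repeated Duo prompts, not the connection overhead.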

I thought about subclassing DefaultSource and providing a different JDBCWrapper than DefaultJDBCWrapper in the constructor. However, DefaultJDBCWrapper is hardcoded in multiple places and is also a private class.
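
For illustration, the sort of wrapper I had in mind would memoize connections by their parameter map so that repeated reads and writes reuse one session. This is only a sketch with hypothetical names, and as noted above the real wrapper types are private, so it cannot simply be plugged in:

    import java.sql.Connection
    import scala.collection.concurrent.TrieMap

    class CachingJDBCWrapper {
      // Cache keyed by the full option map; real code would also have to
      // validate connections and evict closed or stale ones.
      private val cache = TrieMap.empty[Map[String, String], Connection]

      def getConnector(params: Map[String, String])(open: => Connection): Connection =
        cache.getOrElseUpdate(params, open)
    }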

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Reactions: 1
  • Comments: 14

Top GitHub Comments

1 reaction
michellemay commented, Dec 14, 2022

The goal of sending data to Snowflake is to be able to join/filter remotely. Let's say I have a table A from which I can select some rows based on a complex condition and bring them back to the Spark cluster for other processing. Then, I send the result to Snowflake and inner join table B on it.

In my setup, we have hundreds of millions of rows in table A, which are processed and then filtered down to a few million. (It's not possible to do that filtering purely in SQL.) I upload the result to Snowflake and join it with table B, which contains billions of rows. If I let Spark do the join locally, Snowflake has to materialize and transfer billions of rows just to have them discarded once on the Spark cluster. Table B is a view over some other data, and I'd rather let Snowflake do the filtering efficiently; a sketch of the round trip follows.
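
As a sketch of that round trip (assuming an option map sfOptions like the one in the issue; TMP_FILTERED_A, the join key id, and complexSparkSideFilter are all hypothetical names):

    // Pull candidate rows of A with a pushdown query, then apply the
    // Spark-side filtering that cannot be expressed in SQL.
    val filteredA = spark.read
      .format("snowflake")
      .options(sfOptions + ("query" -> "SELECT * FROM A WHERE <coarse predicate>"))
      .load()
      .transform(complexSparkSideFilter)

    // Send the few million surviving rows back to Snowflake...
    filteredA.write
      .format("snowflake")
      .options(sfOptions + ("dbtable" -> "TMP_FILTERED_A"))
      .mode("overwrite")
      .save()

    // ...and let Snowflake perform the join against the billion-row view B.
    val joined = spark.read
      .format("snowflake")
      .options(sfOptions + ("query" ->
        "SELECT b.* FROM B b JOIN TMP_FILTERED_A t ON b.id = t.id"))
      .load()

Each of those steps opens its own connections, which is exactly where the lack of connection caching, and the repeated MFA prompts, hurts.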

0 reactions
sfc-gh-wfateem commented, Dec 14, 2022

Ok, that makes sense. Thanks for clarifying @michellemay.

Read more comments on GitHub >

