Caching Tables in SparkCatalog via CachingCatalog by default leads to stale data

I’ve been experiencing a few issues with refreshing table metadata in Iceberg. I think caching in Iceberg is a bit flawed: if we use Spark 3 via SparkCatalog with cache-enabled=true - which is the default and wraps the Iceberg catalog in a CachingCatalog (see the configuration sketch after this list) - those tables will pretty much stay stale until:

  1. They are evicted, or
  2. There’s a commit, which will trigger a refresh.
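
For concreteness, here is a minimal sketch in Scala of how such a SparkCatalog is typically wired up and where the flag lives; the catalog name "demo" and the Hive metastore type are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    // Minimal Iceberg SparkCatalog setup for Spark 3; names are illustrative.
    val spark = SparkSession.builder()
      .appName("iceberg-caching-demo")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hive")
      // cache-enabled defaults to true, which wraps the catalog in CachingCatalog;
      // setting it to false makes every table load fetch fresh metadata.
      .config("spark.sql.catalog.demo.cache-enabled", "false")
      .getOrCreate()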

Given this, it’s a bit dangerous to have long-lived TableOperations objects, e.g. multiple long-lived Spark sessions reading the same table while it gets modified.

I don’t think TableOperations objects are cache-friendly unless we accept stale results across sessions, e.g. with cache-enabled=true we get the following behavior (sketched in code after the list):

  1. First session reads table1.
  2. Second session reads table1. <-- up to this point, both sessions see the same data
  3. First session commits to table1.
  4. Second session reads table1. <-- this read is stale due to caching; the changes from step 3 are not reflected
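
A rough sketch of that interleaving, assuming two separate long-lived Spark applications (sparkA and sparkB, both configured as above with cache-enabled=true) and an existing table demo.db.table1; all names are illustrative.

    // Session A loads table1; its CachingCatalog keeps the loaded Table object.
    val countA = sparkA.table("demo.db.table1").count()

    // Session B loads table1; at this point both sessions see the same snapshot.
    val countB1 = sparkB.table("demo.db.table1").count()

    // Session A commits new rows (schema and values are illustrative).
    sparkA.sql("INSERT INTO demo.db.table1 VALUES (1, 'x')")

    // Session B reads again: its CachingCatalog still hands back the cached
    // Table, so this count does not include session A's commit until the entry
    // is evicted or session B itself commits to the table.
    val countB2 = sparkB.table("demo.db.table1").count()  // stale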

In order for this flow to work as it does with Hive tables - with both sessions seeing up-to-date data - we can’t use caching right now. Skipping the check for the current metadata location does save client calls, but I think TableOperations should check whether the metadata location has changed and refresh when it has (rough sketch below); with that we could cache the objects and still be correct about data freshness.
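
The check I have in mind looks roughly like this (a hypothetical sketch, not an actual implementation; the latestMetadataLocation parameter stands in for a cheap catalog lookup of the current metadata pointer).

    import org.apache.iceberg.TableOperations

    // Refresh only when the catalog's current metadata pointer differs from the
    // metadata file this TableOperations instance was loaded from.
    def refreshIfStale(ops: TableOperations, latestMetadataLocation: () => String): Unit = {
      val loadedFrom = ops.current().metadataFileLocation()
      if (loadedFrom != latestMetadataLocation()) {
        ops.refresh()  // re-read the table metadata from the new location
      }
    }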

Caching is enabled by default in SparkCatalog (Spark 3). For now, I think the default should be false, especially since it can currently lead to data inconsistency.

What do you think @rdblue @aokolnychyi ?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
edgarRd commented, Mar 11, 2021

I think caching tables in a session indefinitely (until they are garbage collected or a change happens) while returning stale results is unintuitive, and too surprising to be enabled by default. Multiple people have brought up this behavior in the past - at least @aokolnychyi and @pvary - which signals how unintuitive returning stale results by default is. It’s also inconsistent with how Hive and Presto handle Iceberg tables, and with how Spark handles queries against non-Iceberg tables.

I’m not arguing for removing caching altogether - as @pvary mentioned, it can be a useful feature in some cases - but rather that the default should be the most intuitive and consistent behavior. If a use case needs to pin the state of a table at some point in time, the cache can be enabled explicitly instead of relying on it implicitly by default. In fact, Spark already provides such a construct with https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html - shouldn’t that be more in line with Spark’s expected handling of cached tables?
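
For example (reusing the catalog and table names assumed earlier), a job that really wants a pinned view can opt in explicitly:

    // Explicitly cache the table's current contents in Spark's own cache,
    // then release it once the consistent view is no longer needed.
    spark.sql("CACHE TABLE demo.db.table1")
    // ... run the queries that need a stable view of the data ...
    spark.sql("UNCACHE TABLE demo.db.table1")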

“My only worry about setting it as false is breaking self joins.”

If a given environment needs caching by default and depends on having it enabled, that can be set in spark-defaults.conf with cache-enabled=true regardless of the library default - it’s good practice anyway to avoid depending on defaults (see the sketch below). I think this is better than the other way around, where users are expected to know they have to refresh their tables to see whether anything has changed. If the default were changed, we’d definitely need to call it out in the docs/release notes. On the other hand, we could recommend that users always set cache-enabled=false to avoid depending on defaults and get fresh state, but as mentioned before, that seems less intuitive if you don’t rely on or even know about this config.
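
Concretely, pinning it in spark-defaults.conf would look something like this (the catalog name "demo" and the Hive type are assumptions carried over from the earlier sketch):

    # Pin the catalog's caching behavior explicitly instead of relying on the
    # library default.
    spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.demo.type            hive
    spark.sql.catalog.demo.cache-enabled   true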

“Based on this discussion my feeling is that the best solution would be to create a metadata cache around TableMetadataParser.read(FileIO io, InputFile file) where the cache key is the file.location().”

I agree, this would solve caching as a way to save resources. However, it does not address the self-join concern mentioned before, since self joins rely on both sides reading the same snapshot.
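
For reference, a rough sketch of that suggestion (using a Caffeine cache, which Iceberg already depends on; the cache sizing is arbitrary). Since a metadata file is immutable once written, an entry keyed by its location never goes stale - only the pointer to the current metadata file moves.

    import com.github.benmanes.caffeine.cache.Caffeine
    import org.apache.iceberg.{TableMetadata, TableMetadataParser}
    import org.apache.iceberg.io.{FileIO, InputFile}

    // Hypothetical cache around TableMetadataParser.read, keyed by the metadata
    // file location, so repeated loads of the same metadata file are served
    // from memory instead of re-reading and re-parsing the file.
    object MetadataFileCache {
      private val cache = Caffeine.newBuilder()
        .maximumSize(1000)
        .build[String, TableMetadata]()

      def read(io: FileIO, file: InputFile): TableMetadata =
        cache.get(file.location(), _ => TableMetadataParser.read(io, file))
    }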

I think @Parth-Brahmbhatt mentioned that there’s a refresh stored procedure; however, I think that goes in the wrong direction for supporting caching by default, i.e. users would need to know that tables are cached by default, which is problematic when the behavior is inconsistent with other compute engines and table formats. Instead, I think it’s preferable to cache explicitly (either with https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html or a stored procedure); that keeps the default behavior intuitive and consistent with other compute engines and with non-Iceberg tables in Spark.

0 reactions
nastra commented, Oct 6, 2021

I have to agree that the default caching behavior is unintuitive and rather surprising to users (as can be seen in https://github.com/projectnessie/nessie/issues/2165). Using the NessieCatalog with Spark, one can have a table X on branch dev, and the CachingCatalog will eventually cache this table. Users can also refer to this table as X@dev, and the NessieCatalog properly splits that name into the table name X and the branch name dev via TableReference. However, before that happens we reach CachingCatalog#loadTable(..), where X@dev is treated as a separate table, meaning that an update to table X leaves an outdated X@dev entry behind, and users referring to it will always see stale data (illustrated below).
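
A sketch of how that looks from the Spark SQL side (the catalog name "nessie", the database name, and the use of backticks to quote the @-reference are assumptions for illustration):

    // Both statements resolve to the same table on branch "dev", but the
    // CachingCatalog keys its cache by the raw identifier, so they become two
    // separate cache entries.
    spark.sql("SELECT count(*) FROM nessie.db.X").show()
    spark.sql("SELECT count(*) FROM nessie.db.`X@dev`").show()

    // After a commit to nessie.db.X, the cached `X@dev` entry is left outdated,
    // so readers using that reference keep seeing the old snapshot until the
    // entry is evicted.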

Unfortunately we don’t have much control over the cache invalidation to handle the described use case, so I was rather thinking that catalogs should have a way to control their default caching behavior. For the NessieCatalog, I think it makes more sense to disable caching by default.

@aokolnychyi @rdblue could you take a look at my proposal in https://github.com/apache/iceberg/pull/3230?

