Caching Tables in SparkCatalog via CachingCatalog by default leads to stale data
I’ve been experiencing a few issues with refreshing table metadata in Iceberg. I think caching in Iceberg is a bit flawed in the sense that if we use Spark 3 via `SparkCatalog` with `cache-enabled=true` - which is the default and wraps the Iceberg catalog in a `CachingCatalog` - those tables will pretty much stay stale until:
- they are evicted, or
- there’s a commit, which triggers a refresh.
With this, it’s a bit dangerous to have long-lived `TableOperations` objects, e.g. multiple long-lived Spark sessions reading the same table while it gets modified. I don’t think `TableOperations` objects are cache friendly unless we accept stale results across sessions. For example, with `cache-enabled=true` we get the following behavior (sketched in code below):
1. First session reads `table1`.
2. Second session reads `table1`. <-- up to this point, both sessions see the same data
3. First session commits to `table1`.
4. Second session reads `table1`. <-- this read is stale due to caching; the changes from step 3 are not reflected
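A minimal sketch of that flow, assuming an Iceberg Hadoop catalog named `my_catalog` and an existing `db.table1` with a single `int` column (all names are placeholders). In practice the two "sessions" would be separate long-lived Spark applications; `newSession()` is used here only to approximate two independent session states, each with its own catalog instance and cache:

```scala
import org.apache.spark.sql.SparkSession

// cache-enabled defaults to true, so each session wraps the Iceberg catalog in a CachingCatalog.
val session1 = SparkSession.builder()
  .appName("stale-read-sketch")
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hadoop")
  .config("spark.sql.catalog.my_catalog.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// A second session with its own session state, i.e. its own catalog instance and cache.
val session2 = session1.newSession()

session1.sql("SELECT count(*) FROM my_catalog.db.table1").show() // 1. first session reads
session2.sql("SELECT count(*) FROM my_catalog.db.table1").show() // 2. second session reads, same data

session1.sql("INSERT INTO my_catalog.db.table1 VALUES (42)")     // 3. first session commits (refreshing its own cached table)

session2.sql("SELECT count(*) FROM my_catalog.db.table1").show() // 4. stale: second session still sees the old snapshot
```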
For this flow to work as it does with Hive tables, and to show up-to-date data in both sessions, we can’t use caching right now. While skipping the check for the latest metadata location saves client calls, I think we should have `TableOperations` refresh its metadata whenever the metadata location changes; with that, we could still cache the objects and keep data freshness correct.
Caching is enabled by default in `SparkCatalog` (Spark 3). For now, I think the default should be `false`, especially since it can currently lead to data inconsistency.
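In the meantime, the cache can be turned off explicitly per catalog; a minimal sketch, assuming a Hive-backed catalog named `my_catalog` (the same property can also go in `spark-defaults.conf` or be passed via `--conf`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hive")
  // Skip the CachingCatalog wrapper so every read resolves the latest metadata location.
  .config("spark.sql.catalog.my_catalog.cache-enabled", "false")
  .getOrCreate()
```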
What do you think @rdblue @aokolnychyi?
I think caching tables indefinitely within a session (until they are garbage collected or a change happens), returning stale results, is unintuitive and too inconsistent to be enabled by default. Multiple people have brought up this behavior in the past - at least @aokolnychyi and @pvary - signaling how non-obvious returning stale results by default is. It’s also inconsistent with how Hive and Presto handle Iceberg tables, and with how Spark handles queries against non-Iceberg tables.
I’m not arguing for removing caching altogether - as @pvary mentioned, it can be a feature in some cases - but that the default should be the most intuitive and consistent behavior. If there are use cases that need to fix the state of a table at some point, the cache can be used explicitly rather than implicitly by default. In fact, Spark provides exactly such a construct with https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html - shouldn’t that be more in line with Spark’s expected handling of cached tables?
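As a sketch of what that explicit opt-in could look like (names are placeholders, this uses Spark’s own `CACHE TABLE` rather than anything Iceberg-specific, and it assumes a `SparkSession` named `spark` configured as above):

```scala
// Explicitly pin the table's current data in Spark's cache instead of relying
// on implicit catalog-level metadata caching; drop it to go back to fresh reads.
spark.sql("CACHE TABLE my_catalog.db.table1")
spark.sql("SELECT count(*) FROM my_catalog.db.table1").show()
spark.sql("UNCACHE TABLE my_catalog.db.table1")
```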
If a given environment needs caching enabled by default and depends on it, that can be set in `spark-defaults.conf` with `cache-enabled=true` regardless of the default value - it’s good practice to avoid depending on defaults anyway. I think this is better than the other way around, i.e. expecting users to know they have to refresh their table to see whether it has changed. If the default were changed, we’d definitely need to call it out in the docs/release notes. On the other hand, we could recommend users always set `cache-enabled=false` to avoid depending on defaults and get fresh state, but as mentioned before, that seems less intuitive if you don’t know about or rely on this config.
I agree, this would solve caching for saving resources. However, it does not address the self-join concerns mentioned before, since those rely on looking at the same snapshot.
I think @Parth-Brahmbhatt mentioned that there’s a `refresh` stored procedure; however, I think that goes in the wrong direction to support caching by default, i.e. users would need to know that tables are cached by default, which is problematic if the behavior is inconsistent with other compute engines or table formats. Instead, I think it’s preferable to cache explicitly (either with https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html or a stored procedure); this makes the default behavior intuitive and consistent with other compute engines and with other non-Iceberg tables in Spark.
I have to agree that the default caching behavior is unintuitive and rather surprising to users (as can be seen in https://github.com/projectnessie/nessie/issues/2165). Using the `NessieCatalog` with Spark, one can have a table `X` on branch `dev`, and the `CachingCatalog` will eventually cache this table. Users can also refer to this table as `X@dev`, and the `NessieCatalog` properly splits this name into the table name `X` and the branch name `dev` via `TableReference`. However, before that happens we reach `CachingCatalog#loadTable(..)`, where `X@dev` is treated as a separate table, meaning that an update to table `X` will leave an outdated `X@dev` entry, and users referring to that name will always see stale data (see the sketch below).
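Roughly, the situation looks like this (the catalog name `nessie` and namespace `db` are placeholders, the exact quoting of the `X@dev` reference in Spark SQL is an assumption, and a configured `SparkSession` named `spark` is assumed):

```scala
// The CachingCatalog keys its cache on the raw table identifier, so "X" and
// "X@dev" become two independent cache entries for the same underlying table.
spark.sql("SELECT * FROM nessie.db.X").show()        // cached under db.X
spark.sql("SELECT * FROM nessie.db.`X@dev`").show()  // cached separately under db.`X@dev`

// A commit through db.X refreshes that entry, but nothing ever invalidates
// the db.`X@dev` entry, so readers using the @dev form keep seeing stale data.
spark.sql("INSERT INTO nessie.db.X VALUES (1)")
spark.sql("SELECT * FROM nessie.db.`X@dev`").show()  // stale
```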
Unfortunately, there’s not much control we have over the cache invalidation procedure to handle the described use case, so I was rather thinking that catalogs should have a way to control their default caching behavior. For the `NessieCatalog`, I think it makes more sense to disable caching by default.
@aokolnychyi @rdblue could you take a look at my proposal in https://github.com/apache/iceberg/pull/3230?