[BUG] Double file scan with stats skipping
Describe the problem
File stats skipping is causing two filesForScan operations.
Steps to reproduce
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.range(5).select(F.struct(F.lit('test').alias('test'), F.col('id').cast("string").alias('id')).alias('nested'))
df.write.format('delta').save('test')
spark.read.format('delta').load('test').filter('nested.id = "2"').count()
Observed results
There are two scan stages, caused by hitting this line of code. Example output:
Prepared scan does not match actual filters. Reselecting files to query.
Prepared: ExpressionSet((nested#457.id = 2), isnotnull(nested#457))
Actual: ExpressionSet(isnotnull(nested#457), (nested#457.id = 2))
Expected results
The expression sets are the same so it shouldn’t trigger a second file scan.
Further details
Environment information
- Delta Lake version: 1.2.0
- Spark version: 3.2.1
- Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- No. I cannot contribute a bug fix at this time.
Issue Analytics
- State:
- Created a year ago
- Comments:11 (10 by maintainers)
Ok, figured some things out. There seem to be two issues when using nested columns.
- prepared filters are pre-schema pruning, and actual filters are after. This can make the ordinal value and child schema of GetStructField different when comparing expressions.
- The actual seems to include the optional name of GetStructField, while prepared doesn’t. (Update: this is removed in canonicalization, so it’s just a schema pruning issue I think.)
Sample code to reproduce:
And with some extra logging where I output the contents of GetStructField:
Maybe just transform GetStructField to some UnresolvedAttribute-like expression before comparing the two sets?
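To make the idea concrete, here is a rough pure-Python toy model of the comparison problem. It is not Spark or Delta code: the classes Attribute, StructFieldAccess, NamePath, EqualTo and Literal, and the helpers normalize and normalize_all, are all hypothetical stand-ins invented for illustration. The ordinals assume the struct from the repro above (fields test and id) with test pruned away after schema pruning.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for a top-level column reference."""
    name: str


@dataclass(frozen=True)
class Literal:
    """Toy stand-in for a constant value."""
    value: str


@dataclass(frozen=True)
class StructFieldAccess:
    """Toy stand-in for a struct field access addressed by position."""
    child: object
    ordinal: int                   # position of the field in child_fields
    child_fields: Tuple[str, ...]  # field names of the child's struct type

    @property
    def field_name(self) -> str:
        return self.child_fields[self.ordinal]


@dataclass(frozen=True)
class NamePath:
    """Schema-independent reference such as ('nested', 'id')."""
    parts: Tuple[str, ...]


@dataclass(frozen=True)
class EqualTo:
    left: object
    right: object


def normalize(expr):
    """Rewrite positional field accesses into name paths."""
    if isinstance(expr, Attribute):
        return NamePath((expr.name,))
    if isinstance(expr, StructFieldAccess):
        child = normalize(expr.child)
        if isinstance(child, NamePath):
            return NamePath(child.parts + (expr.field_name,))
        return expr  # not a plain column path; leave it unchanged
    if isinstance(expr, EqualTo):
        return EqualTo(normalize(expr.left), normalize(expr.right))
    return expr


def normalize_all(filters):
    return frozenset(normalize(f) for f in filters)


# Assumed pre-pruning shape: nested is struct<test, id>, so id sits at ordinal 1.
prepared = frozenset({
    EqualTo(StructFieldAccess(Attribute("nested"), 1, ("test", "id")), Literal("2")),
})
# Assumed post-pruning shape: nested is struct<id>, so id sits at ordinal 0.
actual = frozenset({
    EqualTo(StructFieldAccess(Attribute("nested"), 0, ("id",)), Literal("2")),
})

print(prepared == actual)                                # False
print(normalize_all(prepared) == normalize_all(actual))  # True
```

In this model the two filter sets describe the same predicate but compare unequal, because the field access carries a position and a child schema that pruning changes. Rewriting both sides to name paths before comparing makes them equal. Whether the real fix should take this shape in Delta is an open question.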