[BUG] Double file scan with stats skipping

Bug

Describe the problem

With stats-based file skipping, a single query triggers two filesForScan operations.

Steps to reproduce

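from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Assumes the session is already configured with the Delta Lake package and extensions
spark = SparkSession.builder.getOrCreate()

# Write a small Delta table with a single struct column: nested<test: string, id: string>
df = spark.range(5).select(F.struct(F.lit('test').alias('test'), F.col('id').cast("string").alias('id')).alias('nested'))
df.write.format('delta').save('test')

# Filter on a nested field; this query produces the two scan stages
spark.read.format('delta').load('test').filter('nested.id = "2"').count()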

Observed results

There are two scan stages, caused by hitting this line of code. Example output:

Prepared scan does not match actual filters. Reselecting files to query.
Prepared: ExpressionSet((nested#457.id = 2), isnotnull(nested#457))
Actual: ExpressionSet(isnotnull(nested#457), (nested#457.id = 2))

Expected results

The two expression sets contain the same filters, so this should not trigger a second file scan.

Further details

Environment information

  • Delta Lake version: 1.2.0
  • Spark version: 3.2.1
  • Scala version: 2.12

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

Kimahriman commented, Apr 18, 2022

OK, I figured some things out. There seem to be two issues when using nested columns.

  1. Prepared filters are computed before schema pruning, and actual filters after. This can make the ordinal value and child schema of GetStructField differ when the expressions are compared.
  2. The actual seems to include the optional name of GetStructField, while the prepared doesn't. However, this is removed in canonicalization, so I think it's just a schema pruning issue.
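
To illustrate the first point, here is a minimal pure-Python sketch, not Spark code. `StructFieldAccess` and `access` are hypothetical stand-ins that model only the attribute reference, ordinal, child schema, and optional name of a struct field access; the two schemas come from the repro in the issue. It assumes equality ignores the name (mirroring the note that the name is removed in canonicalization) but does compare the ordinal and child schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class StructFieldAccess:
    """Hypothetical model of a struct field access such as nested.id."""
    attribute: str                 # e.g. "nested#457"
    ordinal: int                   # position of the field in the child schema
    child_fields: Tuple[str, ...]  # field names of the child struct
    # Assumption: the name is dropped before comparison, so exclude it from equality.
    name: Optional[str] = field(default=None, compare=False)


def access(attribute: str, child_fields: Sequence[str], field_name: str,
           keep_name: bool) -> StructFieldAccess:
    """Resolve field_name against the given child schema."""
    return StructFieldAccess(
        attribute,
        list(child_fields).index(field_name),
        tuple(child_fields),
        field_name if keep_name else None,
    )


# Resolved against the full schema: struct<test, id>
before_pruning = access("nested#457", ["test", "id"], "id", keep_name=True)
# Resolved against the pruned schema: struct<id>
after_pruning = access("nested#457", ["id"], "id", keep_name=False)

print(before_pruning.ordinal, after_pruning.ordinal)  # 1 0
print(before_pruning == after_pruning)                # False
```

Both objects refer to the same logical field and would print identically as `nested#457.id`, yet they compare unequal because the ordinal and child schema differ.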

Sample code to reproduce:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(5).select(F.struct(F.lit('test').alias('test'), F.col('id').cast("string").alias('id')).alias('nested'))
df.write.format('delta').save('test')

spark.read.format('delta').load('test').filter('nested.id = "2"').count()

And with some extra logging where I output the contents of GetStructField:

22/04/17 08:52:09 INFO PreparedDeltaFileIndex: 
Prepared scan does not match actual filters. Reselecting files to query.
Prepared: ExpressionSet((nested#457.id = 2), isnotnull(nested#457))
Actual: ExpressionSet(isnotnull(nested#457), (nested#457.id = 2))
         
22/04/17 08:52:09 INFO PreparedDeltaFileIndex: Prepared:
22/04/17 08:52:09 INFO PreparedDeltaFileIndex: nested#457, 1, Some(id)
22/04/17 08:52:09 INFO PreparedDeltaFileIndex: Actual:
22/04/17 08:52:09 INFO PreparedDeltaFileIndex: nested#457, 0, None

Kimahriman commented, Apr 20, 2022

Maybe just transform GetStructField to some UnresolvedAttribute-like expression before comparing the two sets?
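
One way to read this suggestion, sketched in plain Python rather than Catalyst: before comparing, replace each struct field access with a name-based path resolved from its own child schema, so the ordinal no longer matters. All names here (`FieldAccess`, `to_name_path`, `same_filters`) are hypothetical and are not Spark or Delta APIs; an actual fix would have to operate on Catalyst expressions.

```python
from typing import Iterable, NamedTuple, Tuple


class FieldAccess(NamedTuple):
    """Hypothetical model of a struct field access (not a Spark class)."""
    attribute: str                 # e.g. "nested#457"
    ordinal: int                   # position within child_fields
    child_fields: Tuple[str, ...]  # field names of the child struct


def to_name_path(expr: FieldAccess) -> Tuple[str, str]:
    """Replace the ordinal with the field name it points to."""
    return (expr.attribute, expr.child_fields[expr.ordinal])


def same_filters(prepared: Iterable[FieldAccess],
                 actual: Iterable[FieldAccess]) -> bool:
    """Compare two filter sets by name path instead of by ordinal."""
    return set(map(to_name_path, prepared)) == set(map(to_name_path, actual))


# Values modeled on the logged output: ordinal 1 before pruning, 0 after.
prepared = [FieldAccess("nested#457", 1, ("test", "id"))]
actual = [FieldAccess("nested#457", 0, ("id",))]

print(prepared == actual)              # False: ordinals and schemas differ
print(same_filters(prepared, actual))  # True: both resolve to ("nested#457", "id")
```

Under this model the two sets match once ordinals are resolved to names, so a second file selection would not be needed.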
