S3 access via spark shows "Path does not exist" after update
I’ve been using Spark to read data from S3 for a while, and it always worked with the following (pyspark) configuration:
spark.read \
    .option('pathGlobFilter', '*.json') \
    .option('mode', 'FAILFAST') \
    .format('json') \
    .load('s3://moto-bucket-79be3fc05b/raw-data/prefix/foo=bar/partition=baz/service=subscriptions')
Contents of the moto-bucket-79be3fc05b bucket:
raw-data/prefix/foo=bar/partition=baz/service=subscriptions/subscriptions.json
Before I upgraded to the latest moto, Spark was able to find all files inside the “directory”, but now it fails with:
pyspark.sql.utils.AnalysisException: Path does not exist: s3://moto-bucket-79be3fc05b/raw-data/prefix/foo=bar/partition=baz/service=subscriptions
I never updated Spark; it is pinned to pyspark==3.1.1, as required by AWS Glue.
I updated moto to v4.0.9 from v2.2.9 (I was forced to upgrade because some other requirement needed flask>2.2).
Unfortunately, I don’t know exactly which AWS HTTP request Spark sends to S3 to list a path, but in my Python code I verify that the path exists before passing it to Spark, so I know for sure the files are in the correct location. I tried adding a trailing / to the path passed to the reader, but nothing changed.
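For reference, a pre-check like the one described above could look roughly like the sketch below. The helper name prefix_exists is hypothetical, and the commented usage assumes a boto3 S3 client pointed at a local moto server (the endpoint URL is an assumption). S3 has no real directories, so the check only asks whether any key starts with the prefix; it passing does not guarantee that the requests Spark itself sends will succeed.

```python
def prefix_exists(s3_client, bucket, prefix):
    """Return True if at least one object key starts with `prefix` + '/'.

    `s3_client` is expected to behave like a boto3 S3 client.
    """
    # Treat the prefix as a "directory": make sure it ends with a slash
    # so that 'service=sub' does not match 'service=subscriptions/...'.
    if not prefix.endswith('/'):
        prefix += '/'
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get('KeyCount', 0) > 0


# Usage (assumes boto3 is installed and moto is served at this address):
# import boto3
# s3 = boto3.client('s3', endpoint_url='http://localhost:5000')
# prefix_exists(
#     s3,
#     'moto-bucket-79be3fc05b',
#     'raw-data/prefix/foo=bar/partition=baz/service=subscriptions',
# )
```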
If I pass the full file path (with subscriptions.json at the end), Spark does read it. That narrows things down a bit: I’m certain Spark is using the correct bucket and is really talking to moto.
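Since full file paths still work, one possible stopgap is to list the keys in Python and hand Spark explicit paths instead of the “directory”. This is only a sketch, not a fix for the underlying problem: list_json_paths is a hypothetical helper, and it assumes a boto3-style S3 client that can reach the same moto endpoint Spark uses.

```python
def list_json_paths(s3_client, bucket, prefix, scheme='s3'):
    """Return full URIs for every *.json object under `prefix`.

    `s3_client` is expected to behave like a boto3 S3 client.
    """
    if not prefix.endswith('/'):
        prefix += '/'
    paginator = s3_client.get_paginator('list_objects_v2')
    paths = []
    # Paginate so prefixes with more than 1000 keys are fully listed.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.json'):
                paths.append(f"{scheme}://{bucket}/{obj['Key']}")
    return paths


# Usage (assumes an existing SparkSession `spark` and boto3 client `s3`):
# paths = list_json_paths(
#     s3,
#     'moto-bucket-79be3fc05b',
#     'raw-data/prefix/foo=bar/partition=baz/service=subscriptions',
# )
# df = spark.read.option('mode', 'FAILFAST').format('json').load(paths)
```

With explicit paths the pathGlobFilter option is no longer needed, because the filtering on the .json suffix already happens in Python.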
FYI @jbvsmo - 4.0.11 has just been released.
Happy to hear that @jbvsmo! There is no fixed date yet, but I’m expecting we’ll cut a release in the first week of December.