gcsfs==0.6.1|0.6.2 'walk()' method breaking dask dataframe

What happened: While walking the root of a parquet folder initially created by pyspark, the fs.walk method returns an empty string '' in the files list.

('path/to/parquet/folder',
 ['Year=2019', 'Year=2020'],
 ['', '_SUCCESS'])

This behavior breaks dask.dataframe.read_parquet('gs://...') in several cases (let me know if you want the error output). Investigating those failures is how I tracked the problem down to fs.walk.

What you expected to happen:

The correct output should be

('path/to/parquet/folder',
 ['Year=2019', 'Year=2020'],
 ['_SUCCESS'])
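
For code that calls fs.walk directly, the difference between the two outputs above can be bridged with a small filter. This is only a sketch: drop_empty_names is a hypothetical helper (not part of gcsfs or fsspec), and it assumes the stray '' entry is the only defect in the walk output. The tuple from this report stands in for a live fs.walk() call.

```python
from typing import Iterable, Iterator, List, Tuple

# Shape of one walk() result: (root, directory names, file names).
WalkEntry = Tuple[str, List[str], List[str]]


def drop_empty_names(walk_results: Iterable[WalkEntry]) -> Iterator[WalkEntry]:
    """Yield (root, dirs, files) tuples with empty-string names removed.

    Hypothetical helper for illustration; not part of gcsfs or fsspec.
    """
    for root, dirs, files in walk_results:
        yield root, [d for d in dirs if d], [f for f in files if f]


# Tuple copied from the report, standing in for a live fs.walk() call.
reported = [
    ("path/to/parquet/folder", ["Year=2019", "Year=2020"], ["", "_SUCCESS"]),
]
cleaned = list(drop_empty_names(reported))
```

Against a real bucket, the same helper would wrap gcsfs.GCSFileSystem().walk('gs://path/to/parquet/folder/'). It does not change what dask.dataframe.read_parquet sees internally, so it is not a fix for the read_parquet failures.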

Minimal Complete Verifiable Example:

import gcsfs
next(gcsfs.GCSFileSystem().walk('gs://path/to/parquet/folder/'))

Anything else we need to know?:

Reverting to gcsfs==0.6.0 seems to solve the problem. As far as I have tested, it occurs with versions 0.6.1 and 0.6.2.
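
A guard along these lines could flag an environment that has one of the versions named above installed. It is a sketch only: is_reported_affected and AFFECTED_GCSFS_VERSIONS are hypothetical names, and the version set reflects only what was tested in this report, not a confirmed list of affected releases.

```python
# Versions this report found affected; 0.6.0 was reported to work.
AFFECTED_GCSFS_VERSIONS = {"0.6.1", "0.6.2"}


def is_reported_affected(version: str) -> bool:
    """Return True if `version` is one of the releases named in this report."""
    return version.strip() in AFFECTED_GCSFS_VERSIONS


# Example with literal version strings (no gcsfs install needed):
flag_062 = is_reported_affected("0.6.2")
flag_060 = is_reported_affected("0.6.0")
```

In a real environment the argument would come from the installed package, for example gcsfs.__version__, assuming the installed release exposes that attribute.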

Environment:

  • Dask version: 2.21.0
  • GCSFS version: 0.6.2
  • Python version: 3.7.6
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jul 28, 2020

Can you check on gcsfs master, please?

0 reactions
martindurant commented, Feb 1, 2021

@rjurney, please open a new issue with the specific case that you are seeing.
