load_dataset('cnn_dailymail', '3.0.0') gives a 'Not a directory' error

from datasets import load_dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

Stack trace:

---------------------------------------------------------------------------

NotADirectoryError                        Traceback (most recent call last)

<ipython-input-6-2e06a8332652> in <module>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset('cnn_dailymail', '3.0.0')

5 frames

/usr/local/lib/python3.6/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, save_infos, script_version, **config_kwargs)
    608         download_config=download_config,
    609         download_mode=download_mode,
--> 610         ignore_verifications=ignore_verifications,
    611     )
    612 

/usr/local/lib/python3.6/dist-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)
    513                     if not downloaded_from_gcs:
    514                         self._download_and_prepare(
--> 515                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    516                         )
    517                     # Sync info

/usr/local/lib/python3.6/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    568         split_dict = SplitDict(dataset_name=self.name)
    569         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 570         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    571 
    572         # Checksums verification

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _split_generators(self, dl_manager)
    252     def _split_generators(self, dl_manager):
    253         dl_paths = dl_manager.download_and_extract(_DL_URLS)
--> 254         train_files = _subset_filenames(dl_paths, datasets.Split.TRAIN)
    255         # Generate shared vocabulary
    256 

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _subset_filenames(dl_paths, split)
    153     else:
    154         logging.fatal("Unsupported split: %s", split)
--> 155     cnn = _find_files(dl_paths, "cnn", urls)
    156     dm = _find_files(dl_paths, "dm", urls)
    157     return cnn + dm

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
    132     else:
    133         logging.fatal("Unsupported publisher: %s", publisher)
--> 134     files = sorted(os.listdir(top_dir))
    135 
    136     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

I ran the code on Google Colab.
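The failing path ends in cnn/stories, but os.listdir raises NotADirectoryError whenever any component of the path is a regular file rather than a directory. A minimal sketch for finding which component that is, assuming a POSIX-style filesystem such as the one on Colab. The helper name first_non_directory is hypothetical and not part of the datasets library:

```python
import os
from typing import Optional


def first_non_directory(path: str) -> Optional[str]:
    """Return the deepest existing component of `path` if it is not a directory.

    Returns None when the deepest existing component is a directory
    (that is, the path is merely missing, not blocked by a file).
    """
    current = os.path.abspath(path)
    while True:
        if os.path.exists(current):
            # The first component that exists decides the outcome.
            return None if os.path.isdir(current) else current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```

Calling it with the path from the error message shows whether the hash-named entry under downloads (or one of its children) is a plain file where an extracted directory was expected.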

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

10 reactions
codeislife99 commented, Feb 16, 2022

Has anyone solved this? I still get this error.

4 reactions
davidshinn commented, Feb 20, 2022

    133         logging.fatal("Unsupported publisher: %s", publisher)
--> 134     files = sorted(os.listdir(top_dir))
    135 
    136     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

Can someone please take a look?

Two short-term workarounds:

  1. Use this line instead: dataset = load_dataset('ccdv/cnn_dailymail', '3.0.0'). Someone in a related issue mentioned this copy of the data source, and it just works.
  2. Use the same data source, but edit the URLs. Instead of the Google Drive quota problem mentioned in #996, I was getting the “can’t scan this file for viruses” prompt, so the HTML page for that prompt gets downloaded instead of the files. You can get around this as follows:
    1. Look at the traceback to find where cnn_dailymail.py is on your computer.
    2. Edit the cnn_stories and dm_stories URLs by appending &confirm=t to the end of each. They should be around line 67.
    3. You may have to remove the confirmation HTML files from your download directory (~/.cache/huggingface/datasets/downloads for me) so that they don’t get in the way of the new download attempts.
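Steps 2 and 3 of the second workaround can be sketched in Python. This is a rough illustration, not the maintainers' fix: the helper names add_confirm_param and find_html_downloads are hypothetical, the example URL is a placeholder rather than the real one in cnn_dailymail.py, and the HTML check is a simple heuristic that assumes the confirmation page starts with an HTML doctype or an <html> tag.

```python
import os
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_confirm_param(url: str) -> str:
    """Append confirm=t to the query string unless a confirm parameter already exists."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "confirm" for key, _ in query):
        query.append(("confirm", "t"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def find_html_downloads(download_dir: str) -> List[str]:
    """List regular files in download_dir whose first bytes look like an HTML page."""
    suspects = []
    for name in sorted(os.listdir(download_dir)):
        path = os.path.join(download_dir, name)
        if not os.path.isfile(path):
            continue  # extracted archives are directories; skip them
        with open(path, "rb") as handle:
            head = handle.read(512).lstrip().lower()
        if head.startswith((b"<!doctype html", b"<html")):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Placeholder URL, used only to show the shape of the edit.
    print(add_confirm_param("https://example.com/uc?export=download&id=FILE_ID"))
    downloads = os.path.expanduser("~/.cache/huggingface/datasets/downloads")
    if os.path.isdir(downloads):
        for path in find_html_downloads(downloads):
            print("looks like HTML, consider deleting:", path)
```

The sketch only lists suspect files and leaves deletion to you, since the cache directory may hold other downloads you want to keep.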

Either method works for me. I would’ve made a PR, but I’m not sure whether the maintainers want to switch to the new ccdv/cnn_dailymail source.
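The first workaround can also be wrapped as a fallback that tries the original name and then the mirror. A minimal sketch, assuming the datasets library is installed and that ccdv/cnn_dailymail remains available on the Hub. load_with_fallback is a hypothetical helper, and its loader parameter exists only so the logic can be exercised without a network connection:

```python
from typing import Any, Callable, Optional, Sequence


def load_with_fallback(
    names: Sequence[str] = ("cnn_dailymail", "ccdv/cnn_dailymail"),
    config: str = "3.0.0",
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Try each dataset name in order, moving on when NotADirectoryError is raised."""
    if loader is None:
        # Imported lazily so the helper can be tested without the library.
        from datasets import load_dataset as loader
    last_error = None
    for name in names:
        try:
            return loader(name, config)
        except NotADirectoryError as error:
            last_error = error  # remember the failure and try the next source
    if last_error is None:
        raise ValueError("names must not be empty")
    raise last_error
```

Only NotADirectoryError triggers the fallback, so unrelated failures such as a missing network connection still surface immediately.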
