load_dataset('cnn_dailymail', '3.0.0') gives a 'Not a directory' error

from datasets import load_dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

Stack trace:

---------------------------------------------------------------------------

NotADirectoryError                        Traceback (most recent call last)

<ipython-input-6-2e06a8332652> in <module>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset('cnn_dailymail', '3.0.0')

5 frames

/usr/local/lib/python3.6/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, save_infos, script_version, **config_kwargs)
    608         download_config=download_config,
    609         download_mode=download_mode,
--> 610         ignore_verifications=ignore_verifications,
    611     )
    612 

/usr/local/lib/python3.6/dist-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)
    513                     if not downloaded_from_gcs:
    514                         self._download_and_prepare(
--> 515                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    516                         )
    517                     # Sync info

/usr/local/lib/python3.6/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    568         split_dict = SplitDict(dataset_name=self.name)
    569         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 570         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    571 
    572         # Checksums verification

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _split_generators(self, dl_manager)
    252     def _split_generators(self, dl_manager):
    253         dl_paths = dl_manager.download_and_extract(_DL_URLS)
--> 254         train_files = _subset_filenames(dl_paths, datasets.Split.TRAIN)
    255         # Generate shared vocabulary
    256 

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _subset_filenames(dl_paths, split)
    153     else:
    154         logging.fatal("Unsupported split: %s", split)
--> 155     cnn = _find_files(dl_paths, "cnn", urls)
    156     dm = _find_files(dl_paths, "dm", urls)
    157     return cnn + dm

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
    132     else:
    133         logging.fatal("Unsupported publisher: %s", publisher)
--> 134     files = sorted(os.listdir(top_dir))
    135 
    136     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

I ran the code on Google Colab.

Top GitHub Comments

codeislife99 commented, Feb 16, 2022

Has anyone solved this? I still get this error.

davidshinn commented, Feb 20, 2022

    133         logging.fatal("Unsupported publisher: %s", publisher)
--> 134     files = sorted(os.listdir(top_dir))
    135 
    136     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

Can someone please take a look?

Two short-term workarounds:

Use this line instead: dataset = load_dataset('ccdv/cnn_dailymail', '3.0.0'). In a related issue, someone mentioned this other copy of the data source, which just works.
Use the same data source, but edit the URLs. Instead of the Google Drive quota problems mentioned in #996, I was getting the "can't scan this file for viruses" problem, which results in that confirmation HTML page getting downloaded instead of the files. You can get around this as follows:
1. Look at the traceback to find out where cnn_dailymail.py is on your computer.
2. Edit the cnn_stories and dm_stories URLs by appending &confirm=t to them. This should be around line 67.
3. You may have to remove those confirmation HTML files from your download directory (~/.cache/huggingface/datasets/downloads for me) so that they don't get in the way of the new download attempts.
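
Steps 2 and 3 of the second workaround could be sketched roughly like this. It is only an illustration under a few assumptions: the helper names add_confirm_param and find_html_downloads are made up for this example and are not part of the datasets library, the URL uses a placeholder FILE_ID rather than the real one from cnn_dailymail.py, and a cached file is treated as a stray confirmation page if its content starts with HTML markup. The sketch only lists suspect files, so you can check them before deleting anything.

```python
from pathlib import Path


def add_confirm_param(url: str) -> str:
    """Append confirm=t to a download URL unless a confirm parameter is already present."""
    if "confirm=" in url:
        return url
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}confirm=t"


def find_html_downloads(download_dir):
    """Return files in download_dir whose content starts like an HTML page.

    Assumption: a cached "archive" that begins with HTML markup is the
    confirmation page that was saved instead of the real file.
    """
    suspects = []
    for path in sorted(Path(download_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(512).lstrip().lower()
        if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Placeholder URL: FILE_ID stands in for the real id in cnn_dailymail.py.
    print(add_confirm_param("https://drive.google.com/uc?export=download&id=FILE_ID"))

    # Download directory quoted in the comment above; adjust it for your machine.
    downloads = Path.home() / ".cache" / "huggingface" / "datasets" / "downloads"
    if downloads.is_dir():
        for suspect in find_html_downloads(downloads):
            print("Looks like an HTML page, consider deleting:", suspect)
```

After removing the listed files and editing the URLs, rerunning load_dataset('cnn_dailymail', '3.0.0') should trigger a fresh download attempt.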