FileNotFoundError while downloading wikipedia dataset for any language

Describe the bug

Hi, I am trying to download the wikipedia dataset using
load_dataset("wikipedia", language="aa", date="20220401", split="train", beam_runner='DirectRunner'). However, I end up getting a FileNotFoundError. I get this error for every language I try to download.

Steps to reproduce the bug

from datasets import load_dataset
load_dataset("wikipedia", language="aa", date="20220401", split="train",beam_runner='DirectRunner')

Expected results

to load the dataset

Actual results

I am pasting the error trace here:

Downloading builder script: 35.9kB [00:00, ?B/s]
Downloading metadata: 30.4kB [00:00, 1.94MB/s]
Using custom data configuration 20220401.aa-date=20220401,language=aa
Downloading and preparing dataset wikipedia/20220401.aa to C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559…
Downloading data: 100%|████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 712kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.82s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Downloading data: 100%|███████████████████████████████████████████████████████████| 35.6k/35.6k [00:00<00:00, 84.3kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.93s/it]
Traceback (most recent call last):
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
    self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
    return FileBasedSinkWriter(self, writer_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
    self.temp_handle = self.sink.open(temp_shard_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
    self._file_handle = super().open(temp_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
    temp_path, self.mime_type, self.compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
    return filesystem.create(path, mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
    return self._path_open(path, 'wb', mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
    raw_file = io.open(path, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:/abc/temp.py", line 32, in <module>
    beam_runner='DirectRunner')
  File "G:\Python3.7\lib\site-packages\datasets\load.py", line 1751, in load_dataset
    use_auth_token=use_auth_token,
  File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 1394, in _download_and_prepare
    pipeline_results = pipeline.run()
  File "G:\Python3.7\lib\site-packages\apache_beam\pipeline.py", line 574, in run
    return self.runner.run_pipeline(self, self._options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 131, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 201, in run_pipeline
    options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 212, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 443, in run_stages
    runner_execution_context, bundle_context_manager, bundle_input)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 776, in _execute_bundle
    bundle_manager))
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1000, in _run_bundle
    data_input, data_output, input_timers, expected_timer_output)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1309, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\worker_handlers.py", line 380, in push
    response = self.worker.do_instruction(request)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 598, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 635, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 1004, in process_bundle
    element.data)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 227, in process_encoded
    self.output(decoded_value)
  File "apache_beam\runners\worker\operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam\runners\worker\operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam\runners\worker\operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 324, in apache_beam.runners.worker.operations.GeneralPurposeConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 905, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
    self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
    return FileBasedSinkWriter(self, writer_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
    self.temp_handle = self.sink.open(temp_shard_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
    self._file_handle = super().open(temp_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
    temp_path, self.mime_type, self.compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
    return filesystem.create(path, mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
    return self._path_open(path, 'wb', mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
    raw_file = io.open(path, mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train' [while running 'train/Save to parquet/Write/WriteImpl/WriteBundles']
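
One thing that may be worth checking, although nothing in this thread confirms it as the cause: the temporary file path in the error is very long, and Windows has a classic path limit (MAX_PATH) of 260 characters unless long-path support is enabled. The sketch below only measures the path copied from the error message against that limit. The helper name `exceeds_max_path` is made up for illustration and is not part of datasets or apache_beam.

```python
# Classic Windows path limit (MAX_PATH). It applies unless long-path
# support has been enabled on the machine.
WINDOWS_MAX_PATH = 260

# Path copied from the FileNotFoundError message in the report above.
failing_path = (
    r"C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia"
    r"\20220401.aa-date=20220401,language=aa\2.0.0"
    r"\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete"
    r"\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e"
    r"\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train"
)


def exceeds_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Hypothetical helper: True if the path length reaches the limit.

    MAX_PATH counts the terminating NUL character, so a path of `limit`
    characters or more does not fit.
    """
    return len(path) >= limit


print("path length:", len(failing_path))
print("exceeds MAX_PATH:", exceeds_max_path(failing_path))
```

If the length does turn out to matter, pointing `load_dataset` at a shorter directory through its `cache_dir` argument, or enabling long paths in Windows, might be worth a try. Treat this as a guess to test, not a confirmed fix.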

Environment info

Python: 3.7.6
Windows 10 Pro
datasets: 2.4.0
apache_beam: 2.41.0
mwparserfromhell: 0.6.4

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
albertvillanova commented, Aug 31, 2022

I see, sorry, I misread your issue.

We are investigating this.

0 reactions
RandyAndy-byte commented, Dec 4, 2022

I am able to start downloading the dataset when I try the recent dumps for 20221201. But obviously, those are the big wiki dumps, and I need the smaller preloaded version.

I am now getting an error: the files show up in my cache, but a FileNotFoundError is raised at the end of the download for some reason. The cache directory datasets\wikipedia\date.bn\ had something in it, but when the error came up its contents disappeared.

It is easy to test with the language "bn" because the number of files is low.

dataset = load_dataset('wikipedia', date="20221201", language="bn", split='train', beam_runner='DirectRunner')
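
To pin down what "had something in it, then disappeared" means in practice, it can help to record the cache directory's contents before and after the failing call. The following is a minimal sketch using only the standard library. The helper names `snapshot_tree` and `diff_snapshots` are hypothetical, and `ASSUMED_CACHE_DIR` assumes the default cache location that appears in the traceback above (`.cache\huggingface\datasets` under the user's home directory).

```python
from pathlib import Path

# Assumption: the default datasets cache location, as seen in the traceback.
ASSUMED_CACHE_DIR = Path.home() / ".cache" / "huggingface" / "datasets" / "wikipedia"


def snapshot_tree(root):
    """Return sorted relative paths of everything under root.

    Returns an empty list if root does not exist.
    """
    root = Path(root)
    if not root.exists():
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


def diff_snapshots(before, after):
    """Report which entries were added or removed between two snapshots."""
    before_set, after_set = set(before), set(after)
    return {
        "added": sorted(after_set - before_set),
        "removed": sorted(before_set - after_set),
    }
```

One way to use it: call `snapshot_tree(ASSUMED_CACHE_DIR)` while the download is still running and again after the error, then pass both results to `diff_snapshots`. The "removed" list shows exactly which files and directories vanished, which is useful detail to attach to the issue.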
