FileNotFoundError while downloading wikipedia dataset for any language
Describe the bug
Hi, I am trying to download the wikipedia dataset using
load_dataset("wikipedia", language="aa", date="20220401", split="train", beam_runner='DirectRunner'). However, I end up getting a FileNotFoundError. I get this error for every language I try to download.
Environment:
Steps to reproduce the bug
from datasets import load_dataset
load_dataset("wikipedia", language="aa", date="20220401", split="train",beam_runner='DirectRunner')
Expected results
to load the dataset
Actual results
I am pasting the error trace here:
Downloading builder script: 35.9kB [00:00, ?B/s]
Downloading metadata: 30.4kB [00:00, 1.94MB/s]
Using custom data configuration 20220401.aa-date=20220401,language=aa
Downloading and preparing dataset wikipedia/20220401.aa to C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559…
Downloading data: 100%|████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 712kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.82s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Downloading data: 100%|███████████████████████████████████████████████████████████| 35.6k/35.6k [00:00<00:00, 84.3kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.93s/it]
Traceback (most recent call last):
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
    self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
    return FileBasedSinkWriter(self, writer_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
    self.temp_handle = self.sink.open(temp_shard_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
    self._file_handle = super().open(temp_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
    temp_path, self.mime_type, self.compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
    return filesystem.create(path, mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
    return self._path_open(path, 'wb', mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
    raw_file = io.open(path, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "G:/abc/temp.py", line 32, in <module>
    beam_runner='DirectRunner')
  File "G:\Python3.7\lib\site-packages\datasets\load.py", line 1751, in load_dataset
    use_auth_token=use_auth_token,
  File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 1394, in _download_and_prepare
    pipeline_results = pipeline.run()
  File "G:\Python3.7\lib\site-packages\apache_beam\pipeline.py", line 574, in run
    return self.runner.run_pipeline(self, self._options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 131, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 201, in run_pipeline
    options)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 212, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 443, in run_stages
    runner_execution_context, bundle_context_manager, bundle_input)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 776, in _execute_bundle
    bundle_manager))
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1000, in _run_bundle
    data_input, data_output, input_timers, expected_timer_output)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1309, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\worker_handlers.py", line 380, in push
    response = self.worker.do_instruction(request)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 598, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 635, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 1004, in process_bundle
    element.data)
  File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 227, in process_encoded
    self.output(decoded_value)
  File "apache_beam\runners\worker\operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam\runners\worker\operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam\runners\worker\operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 324, in apache_beam.runners.worker.operations.GeneralPurposeConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 905, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
    self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
    return FileBasedSinkWriter(self, writer_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
    self.temp_handle = self.sink.open(temp_shard_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
    self._file_handle = super().open(temp_path)
  File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
    temp_path, self.mime_type, self.compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
    return filesystem.create(path, mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
    return self._path_open(path, 'wb', mime_type, compression_type)
  File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
    raw_file = io.open(path, mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train' [while running 'train/Save to parquet/Write/WriteImpl/WriteBundles']
Environment info
- Python: 3.7.6
- OS: Windows 10 Pro
- datasets: 2.4.0
- apache_beam: 2.41.0
- mwparserfromhell: 0.6.4
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
I see, sorry, I misread your issue.
We are investigating this.
I am able to start downloading the dataset when trying anything with the recent dumps for 20221201. But obviously, those are the big wiki dumps and I need the smaller preloaded version.
Now the files show up in my cache during the download, but the run still ends with a FileNotFoundError for some reason. The cache directory datasets\wikipedia\date.bn\ had something in it, but it disappeared when the error came up.
It is easy to test with the language "bn" because the number of files is low.
dataset = load_dataset('wikipedia', date="20221201", language="bn", split='train', beam_runner='DirectRunner')
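To follow up on the observation that the cache directory disappears, here is a minimal standard-library sketch that reports how much of a failing path still exists on disk, plus the length of the full path. The helper names `deepest_existing_dir` and `describe_path` are made up for illustration and are not part of datasets or apache_beam. Reporting the length is only a guess at something worth checking; nothing in this thread establishes the cause of the error.

```python
import os


def deepest_existing_dir(path):
    """Walk up from `path` until an existing directory is found.

    Returns None if no ancestor exists (e.g. a drive that is not mounted).
    """
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            return None
        current = parent
    return current


def describe_path(path):
    """Collect a few facts about a path that failed to open.

    Hypothetical helper: the path length is included only as an
    assumption about what might be relevant on Windows.
    """
    full = os.path.abspath(path)
    return {
        "length": len(full),
        "exists": os.path.exists(full),
        "deepest_existing_dir": deepest_existing_dir(full),
    }


if __name__ == "__main__":
    # Paste the path from the FileNotFoundError message here (raw string
    # so the backslashes are kept as-is).
    failing = r"C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia"
    for key, value in describe_path(failing).items():
        print(f"{key}: {value}")
```

Running this right after the error, with the full path from the message, shows whether the `.incomplete` directory or the `beam-temp-...` directory was the first component missing.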