VCTK missing txt file for 'p315'
🐛 Bug
The 'p315' text is missing from the VCTK Corpus. This leads to a 'FileNotFoundError' when accessing './VCTK-Corpus/txt/p315'.
Please note that while text files containing transcripts of the speech are provided for 109 of the 110 recordings, in the '/txt' folder, the 'p315' text was lost due to a hard disk error.
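A quick way to confirm which speakers are affected on a local copy is to compare the audio folder against the transcript folder. The following is only a minimal sketch: the helper name `speakers_without_transcripts` is hypothetical, and the folder names `wav48` and `txt` are assumptions about the extracted layout, so adjust them to match your copy.

```python
import os


def speakers_without_transcripts(root, folder_audio="wav48", folder_txt="txt"):
    """Return speaker ids that have an audio folder but no transcript folder.

    `folder_audio` and `folder_txt` are assumed names for the sub-folders
    of the extracted corpus; they are not read from torchaudio.
    """
    audio_root = os.path.join(root, folder_audio)
    txt_root = os.path.join(root, folder_txt)
    missing = []
    for speaker_id in sorted(os.listdir(audio_root)):
        # Ignore stray files that are not speaker folders.
        if not os.path.isdir(os.path.join(audio_root, speaker_id)):
            continue
        if not os.path.isdir(os.path.join(txt_root, speaker_id)):
            missing.append(speaker_id)
    return missing
```

Calling `speakers_without_transcripts("./VCTK-Corpus")` on an intact download would be expected to list only the speaker whose transcripts were lost; a longer list would suggest an incomplete extraction instead.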
To Reproduce
Steps to reproduce the behavior:
!pip install "torch>=1.2.0"
!pip install torchaudio
!pip install librosa
%matplotlib inline
import torch
import torchaudio
import librosa
import matplotlib.pyplot as plt
import numpy as np
import torchaudio.datasets as dsets
vctk_data = dsets.VCTK(".", download=True)
vctk_data[1]
Expected behavior
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-25-a25a0369fbf9> in <module>()
----> 1 vctk_data[1]
1 frames
/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in __getitem__(self, n)
106 self._ext_txt,
107 self._folder_audio,
--> 108 self._folder_txt,
109 )
110
/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in load_vctk_item(fileid, path, ext_audio, ext_txt, folder_audio, folder_txt, downsample)
17 # Read text
18 file_txt = os.path.join(path, folder_txt, speaker_id, fileid + ext_txt)
---> 19 with open(file_txt) as file_text:
20 utterance = file_text.readlines()[0]
21
FileNotFoundError: [Errno 2] No such file or directory: './VCTK-Corpus/txt/p315/p315_041.txt'
Environment
- What commands did you use to install torchaudio (conda/pip/build from source)? pip
- If you are building from source, which commit is it? Not applicable
- What does torchaudio.__version__ print? (If applicable) 1.4.0
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] torch==1.4.0
[pip3] torchaudio==0.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect
Additional context
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Further, to avoid incorrectly downloaded data we can look into making use of md5 checksums.
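As a rough illustration of the checksum idea, the sketch below hashes a downloaded archive in chunks and compares the result with a known value. The function names `md5_of_file` and `verify_archive` are hypothetical, and the expected checksum would have to come from the dataset provider; this is not how torchaudio implements it.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Raise RuntimeError if the file's MD5 differs from `expected_md5`."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_md5}, got {actual}"
        )
```

Running such a check right after the download would separate a corrupted or partial archive, which should raise an error, from data that is missing in the published corpus itself.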
Good point about the distinction. The original issue reported here is an issue where somehow the data has not been properly downloaded/extracted but is in the original dataset. And so I do agree that this should raise an error.
That being said, the dataset itself is also missing data intrinsically.
As mentioned in the dataset documentation, speaker p315 is missing text data. In that sense, this missing file is part of the dataset.
torchvision currently also skips intrinsically missing files (silently?), as you suggest @cpuhrsch.
Moreover, since the _walker of each dataset in torchaudio walks either through some files or through some csv, it may already be silently skipping other missing data that is not part of what is being walked.
Let's just go for skipping intrinsically missing data. The solution proposed in #484 (option 3 above) with proper documentation would align with this, but hopefully the list to filter out is never long (e.g. for librispeech?). If users want more control, we can always add that back as a user-facing parameter later.
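The skipping approach could look roughly like the sketch below, which filters a list of file ids against a documented set of excluded speakers. The names `EXCLUDED_SPEAKERS` and `filter_walker` are hypothetical and this is not the code from #484; it only assumes that file ids follow the `<speaker>_<utterance>` pattern seen in the traceback (for example `p315_041`).

```python
# Speakers whose transcripts are intrinsically missing from the corpus.
# Hypothetical constant for illustration only.
EXCLUDED_SPEAKERS = frozenset({"p315"})


def filter_walker(fileids, excluded=EXCLUDED_SPEAKERS):
    """Drop file ids belonging to excluded speakers.

    Assumes each file id looks like "<speaker>_<utterance>", so the
    speaker id is the part before the first underscore.
    """
    return [fid for fid in fileids if fid.split("_")[0] not in excluded]
```

For instance, `filter_walker(["p225_001", "p315_041"])` keeps only `p225_001`. Making `excluded` an argument leaves room for the user-facing parameter mentioned above without changing the default behavior.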