VCTK missing txt file for 'p315'

🐛 Bug

The transcript text for speaker ‘p315’ is missing from the VCTK Corpus. This leads to a “FileNotFoundError” when accessing ‘./VCTK-Corpus/txt/p315’.

Please note while text files containing transcripts of the speech are provided for 109 of the 110 recordings, in the ‘/txt’ folder, the ‘p315’ text was lost due to a hard disk error.

https://datashare.is.ed.ac.uk/handle/10283/3443

To Reproduce

Steps to reproduce the behavior:

!pip install "torch>=1.2.0"
!pip install torchaudio
!pip install librosa
%matplotlib inline

import torch
import torchaudio
import librosa
import matplotlib.pyplot as plt
import numpy as np

import torchaudio.datasets as dsets
vctk_data = dsets.VCTK(".", download=True)
vctk_data[1]

Expected behavior

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-25-a25a0369fbf9> in <module>()
----> 1 vctk_data[1]

1 frames
/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in __getitem__(self, n)
    106             self._ext_txt,
    107             self._folder_audio,
--> 108             self._folder_txt,
    109         )
    110 

/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in load_vctk_item(fileid, path, ext_audio, ext_txt, folder_audio, folder_txt, downsample)
     17     # Read text
     18     file_txt = os.path.join(path, folder_txt, speaker_id, fileid + ext_txt)
---> 19     with open(file_txt) as file_text:
     20         utterance = file_text.readlines()[0]
     21 

FileNotFoundError: [Errno 2] No such file or directory: './VCTK-Corpus/txt/p315/p315_041.txt'

Environment

What commands did you use to install torchaudio (conda/pip/build from source)? pip
If you are building from source, which commit is it? Not applicable
What does torchaudio.__version__ print? (If applicable) 1.4.0

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] torch==1.4.0
[pip3] torchaudio==0.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect

Additional context

Top GitHub Comments

cpuhrsch commented, Apr 1, 2020

Further, to avoid incorrectly downloaded data we can look into making use of md5 checksums.
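
The checksum idea above can be sketched with the standard library alone. This is a minimal illustration, not torchaudio's implementation: `md5_of_file` and `verify_download` are hypothetical helper names, and the expected digest would have to come from the dataset publisher.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Hypothetical check: raise if the archive is missing or its MD5 does not match."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"archive not found: {path}")
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise RuntimeError(
            f"checksum mismatch for {path}: expected {expected_md5}, got {actual}"
        )
```

Running such a check before extraction would separate a corrupted or truncated download from data that is absent in the published corpus itself.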

vincentqb commented, Apr 1, 2020

Good point about the distinction. The issue originally reported here is one where the data is present in the original dataset but was somehow not downloaded or extracted properly, so I agree that this case should raise an error.

In [2]: from torchaudio.datasets import VCTK

In [3]: vctk_data = VCTK(".", download=True)
100%|██████████| 10.4G/10.4G [08:12<00:00, 22.7MB/s]

In [4]: vctk_data[1]
Out[4]: 
(tensor([[-0.0045, -0.0067, -0.0061,  ...,  0.0072,  0.0067,  0.0078]]),
 48000,
 'Ask her to bring these things with her from the store.\n',
 'p225',
 '002')

That being said, the dataset itself is also missing data intrinsically.


In [5]: def raise_error(i): 
    ...:     try:
    ...:         vctk_data[i]
    ...:         return False
    ...:     except FileNotFoundError:
    ...:         return True

In [6]: l = [i for i in range(len(vctk_data)) if raise_error(i)]

In [13]: len(l)
Out[13]: 172

In [14]: min(l)
Out[14]: 33867

In [15]: max(l)
Out[15]: 34038

In [17]: max(l)-min(l)+1
Out[17]: 172

In [21]: vctk_data._walker[min(l):max(l)+1]
['p315_001',
...
 'p315_393',
 'p315_397',
 'p315_403',
 'p315_405',
 'p315_406',
 'p315_408',
 'p315_414',
 'p315_418',
 'p315_421']

As mentioned in the dataset documentation, speaker p315 is missing text data. In that sense, this missing file is part of the dataset.
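
To confirm that a list of failing items belongs to a single speaker, the file IDs can be grouped by their speaker prefix. This is a small sketch that assumes IDs follow the `<speaker>_<utterance>` pattern seen in the walker output above; `speakers_of` is a hypothetical helper, and the sample list is illustrative rather than the full set of 172 entries.

```python
from collections import Counter


def speakers_of(file_ids):
    """Count file IDs per speaker, assuming IDs look like '<speaker>_<utterance>'."""
    return Counter(file_id.split("_", 1)[0] for file_id in file_ids)


# Illustrative subset of the failing IDs shown in the session above.
failing_ids = ["p315_001", "p315_393", "p315_421"]
print(speakers_of(failing_ids))
```

If the resulting counter has a single key, every failure traces back to one speaker, which matches the documented loss of the p315 transcripts.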

torchvision currently also skips intrinsically missing files (silently?), as you suggest @cpuhrsch.

Moreover, since the _walker in each torchaudio dataset walks through either a set of files or a CSV, it may already be silently skipping other missing data that is not part of what is being walked.

Let’s just go for skipping intrinsically missing data. 😃 The solution proposed in #484 (option 3 above) with proper documentation would align with this, but hopefully the list to filter out is never long (e.g. for librispeech?). If users want more control, we can always add that back as a user-facing parameter later.
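
One way to picture "skipping intrinsically missing data" is to filter the list of file IDs once, when the dataset is constructed, keeping only those whose transcript exists. The sketch below is not the change proposed in #484: `filter_walker` is a hypothetical helper, the path layout mirrors the `os.path.join(path, folder_txt, speaker_id, fileid + ext_txt)` call in the traceback, and deriving the speaker ID by splitting on `_` is an assumption.

```python
import os


def filter_walker(walker, root, folder_txt="txt", ext_txt=".txt"):
    """Keep only the file IDs whose transcript file exists under root."""
    kept = []
    for fileid in walker:
        # Assumption: the speaker ID is the part of the file ID before the first underscore.
        speaker_id = fileid.split("_", 1)[0]
        file_txt = os.path.join(root, folder_txt, speaker_id, fileid + ext_txt)
        if os.path.isfile(file_txt):
            kept.append(fileid)
    return kept
```

With a filter like this applied up front, `len(dataset)` and indexing stay consistent, and items such as `p315_041` never reach `__getitem__`. The trade-off is that the skip is silent, which is why the thread asks for it to be documented.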