VCTK missing txt file for 'p315'

🐛 Bug

The ‘p315’ text is missing from the VCTK Corpus. This leads to a “FileNotFoundError” when accessing ‘./VCTK-Corpus/txt/p315’.

Please note while text files containing transcripts of the speech are provided for 109 of the 110 recordings, in the ‘/txt’ folder, the ‘p315’ text was lost due to a hard disk error.

To Reproduce

Steps to reproduce the behavior:

!pip install "torch>=1.2.0"
!pip install torchaudio
!pip install librosa
%matplotlib inline

import torch
import torchaudio
import librosa
import matplotlib.pyplot as plt
import numpy as np

import torchaudio.datasets as dsets
vctk_data = dsets.VCTK(".", download=True)
vctk_data[1]

Expected behavior

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-25-a25a0369fbf9> in <module>()
----> 1 vctk_data[1]

1 frames
/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in __getitem__(self, n)
    106             self._ext_txt,
    107             self._folder_audio,
--> 108             self._folder_txt,
    109         )
    110 

/usr/local/lib/python3.6/dist-packages/torchaudio/datasets/vctk.py in load_vctk_item(fileid, path, ext_audio, ext_txt, folder_audio, folder_txt, downsample)
     17     # Read text
     18     file_txt = os.path.join(path, folder_txt, speaker_id, fileid + ext_txt)
---> 19     with open(file_txt) as file_text:
     20         utterance = file_text.readlines()[0]
     21 

FileNotFoundError: [Errno 2] No such file or directory: './VCTK-Corpus/txt/p315/p315_041.txt'

Environment

  • What commands did you use to install torchaudio (conda/pip/build from source)? pip
  • If you are building from source, which commit is it? Not applicable
  • What does torchaudio.__version__ print? (If applicable) 1.4.0
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] torch==1.4.0
[pip3] torchaudio==0.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect

Additional context

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
cpuhrsch commented, Apr 1, 2020

Further, to avoid incorrectly downloaded data we can look into making use of md5 checksums.
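A minimal sketch of what such a check could look like, using only the standard library. The helper names `md5_of_file` and `verify_archive` are hypothetical and not part of torchaudio, and the expected digest would have to come from whoever publishes the archive.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Raise RuntimeError if the file's MD5 differs from the expected digest."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise RuntimeError(
            "Checksum mismatch for {}: expected {}, got {}".format(
                path, expected_md5, actual
            )
        )
```

Calling `verify_archive` right after the download and before extraction would turn a truncated or corrupted archive into an immediate error instead of a later FileNotFoundError.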

1 reaction
vincentqb commented, Apr 1, 2020

Good point about the distinction. The original issue reported here is an issue where somehow the data has not been properly downloaded/extracted but is in the original dataset. And so I do agree that this should raise an error.

In [2]: from torchaudio.datasets import VCTK

In [3]: vctk_data = VCTK(".", download=True)
100%|██████████| 10.4G/10.4G [08:12<00:00, 22.7MB/s]

In [4]: vctk_data[1]
Out[4]: 
(tensor([[-0.0045, -0.0067, -0.0061,  ...,  0.0072,  0.0067,  0.0078]]),
 48000,
 'Ask her to bring these things with her from the store.\n',
 'p225',
 '002')

That being said, the dataset itself is also missing data intrinsically.


In [5]: def raise_error(i): 
    ...:     try:
    ...:         vctk_data[i]
    ...:         return False
    ...:     except FileNotFoundError:
    ...:         return True

In [6]: l = [i for i in range(len(vctk_data)) if raise_error(i)]

In [13]: len(l)
Out[13]: 172

In [14]: min(l)
Out[14]: 33867

In [15]: max(l)
Out[15]: 34038

In [17]: max(l)-min(l)+1
Out[17]: 172

In [21]: vctk_data._walker[min(l):max(l)+1]
['p315_001',
...
 'p315_393',
 'p315_397',
 'p315_403',
 'p315_405',
 'p315_406',
 'p315_408',
 'p315_414',
 'p315_418',
 'p315_421']

As mentioned in the dataset documentation, speaker p315 is missing text data. In that sense, this missing file is part of the dataset.

torchvision currently also skips intrinsically missing files (silently?), as you suggest @cpuhrsch.

Moreover, since the _walker of each dataset in torchaudio walks through either some files or some csv, it may already be silently skipping other missing data that is not part of what’s being walked.
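One way to see what a file-based walk would miss is to compare the audio tree against the transcript tree directly. The sketch below is only an illustration: `find_missing_transcripts` is a hypothetical helper, and the `wav48` folder name and `.wav` extension are assumptions about the extracted corpus layout (only the `txt` folder and `.txt` extension appear in the traceback above).

```python
import os


def find_missing_transcripts(root, folder_audio="wav48", folder_txt="txt",
                             ext_audio=".wav", ext_txt=".txt"):
    """Return file IDs that have an audio file but no transcript under root."""
    audio_dir = os.path.join(root, folder_audio)
    missing = []
    for speaker_id in sorted(os.listdir(audio_dir)):
        speaker_audio = os.path.join(audio_dir, speaker_id)
        if not os.path.isdir(speaker_audio):
            continue  # ignore stray files next to the speaker folders
        for name in sorted(os.listdir(speaker_audio)):
            fileid, ext = os.path.splitext(name)
            if ext != ext_audio:
                continue
            # Same path construction as in the traceback: txt/<speaker>/<fileid>.txt
            file_txt = os.path.join(root, folder_txt, speaker_id, fileid + ext_txt)
            if not os.path.isfile(file_txt):
                missing.append(fileid)
    return missing
```

Running it on `./VCTK-Corpus` avoids loading any audio, so it is much cheaper than indexing every item and catching FileNotFoundError.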

Let’s just go for skipping intrinsically missing data. 😃 The solution proposed in #484 (option 3 above) with proper documentation would align with this, but hopefully the list to filter out is never long (e.g. for librispeech?). If users want more control, we can always add that back as a user-facing parameter later.
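As a rough sketch of what skipping could look like, the walker list can be filtered by speaker before any item is loaded. `filter_walker` and its `excluded_speakers` parameter are hypothetical names used for illustration, not the change made in #484; the only assumption about the data is the `<speaker>_<utterance>` file ID format shown in the `_walker` output above.

```python
def filter_walker(walker, excluded_speakers=("p315",)):
    """Drop file IDs of the form '<speaker>_<utterance>' for excluded speakers."""
    excluded = set(excluded_speakers)
    # The speaker ID is everything before the first underscore.
    return [fileid for fileid in walker if fileid.split("_")[0] not in excluded]


walker = ["p225_001", "p315_001", "p315_393", "p316_002"]
print(filter_walker(walker))
```

Keeping the excluded speakers in one explicit, documented tuple makes the skip visible to users and easy to expose as a parameter later.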
