LJspeech recipe error: exceeded the bounds of its corresponding recording
The prepare_ljspeech recipe on lhotse==0.4.0 fails with the following error:
Traceback (most recent call last):
  File "examples/prepare_ljspeech.py", line 9, in <module>
    ljspeech = prepare_ljspeech("LJSpeech-1.1", "manifests")
  File "/home/jan/aligntts/venv/lib/python3.8/site-packages/lhotse/recipes/ljspeech.py", line 96, in prepare_ljspeech
    validate_recordings_and_supervisions(recording_set, supervision_set)
  File "/home/jan/aligntts/venv/lib/python3.8/site-packages/lhotse/qa.py", line 52, in validate_recordings_and_supervisions
    assert 0 <= s.start <= s.end <= r.duration, \
AssertionError: Supervision LJ001-0001: exceeded the bounds of its corresponding recording (supervision spans [0.0, 9.65501134]; recording spans [0, 9.65501133786848])
It runs fine on lhotse==0.3.0. There seems to be a related issue, #201, which also involves a rounding problem.
I am looking into it now, but I am not sure where the problem is.
Edit: It seems that the duration of a Recording is not rounded the way the end of a Supervision is, so the rounded supervision end can slightly exceed the unrounded recording duration.
Supervision:
    @property
    def end(self) -> Seconds:
        return round(self.start + self.duration, ndigits=8)
Recording:
Recording(
    id=recording_id if recording_id is not None else sph_path.stem,
    sampling_rate=sphf.format['sample_rate'],
    num_samples=sphf.format['sample_count'],
    duration=sphf.format['sample_count'] / sphf.format['sample_rate'],
    sources=[
        AudioSource(
            type='file',
            channels=list(range(sphf.format['channel_count'])),
            source=(
                '/'.join(sph_path.parts[-relative_path_depth:])
                if relative_path_depth is not None and relative_path_depth > 0
                else str(sph_path)
            )
        )
    ]
)
Is it possible that the Recording duration should be rounded as well? I can make a PR for this 😃
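To make the mismatch concrete, here is a minimal standalone sketch that reproduces the failing comparison using only the numbers from the AssertionError above. It does not import lhotse; `within_bounds` is a hypothetical helper that mirrors the assert shown in the traceback, and rounding the recording duration to 8 digits is the proposed change, not something the library is confirmed to do.

```python
NDIGITS = 8  # same precision as the Supervision.end snippet above


def within_bounds(start: float, end: float, recording_duration: float) -> bool:
    """Hypothetical stand-in for the check that fails in the traceback."""
    return 0 <= start <= end <= recording_duration


# Values copied from the AssertionError message.
raw_recording_duration = 9.65501133786848  # unrounded Recording duration
supervision_start = 0.0
supervision_end = round(supervision_start + 9.65501134, ndigits=NDIGITS)

# Current behaviour: the rounded supervision end exceeds the raw duration.
fails_without_rounding = not within_bounds(
    supervision_start, supervision_end, raw_recording_duration
)

# Proposed behaviour (assumption): round the recording duration the same way.
rounded_recording_duration = round(raw_recording_duration, ndigits=NDIGITS)
passes_with_rounding = within_bounds(
    supervision_start, supervision_end, rounded_recording_duration
)

print("fails without rounding:", fails_without_rounding)
print("passes with rounding:  ", passes_with_rounding)
```

The difference between the two durations is on the order of 1e-9 seconds, far below one sample, so applying the same rounding on both sides should not change which audio is covered.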
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
I thought about it but I haven’t analyzed yet how that would impact serialization/deserialization, overall speed and memory load (especially with a lot of objects around), and user experience. It could be a long-term solution though.
Yeah, as you can see this float rounding stuff is really delicate. Your suggestion sounds good to me.