
[Question] Implement preprocessing on datasets?


Coming from TensorFlowTTS, I find Coqui to be more functional and well maintained. (I still encounter NaN losses after 50k+ iterations, but I can leave that for later.)

One main issue is that each iteration takes about double the time, and memory consumption is higher, compared to TensorFlowTTS. From dataset.py, I can see that collate_fn computes the spectrograms while batching and does not cache them (unlike the phoneme_cache).

I will rewrite some parts to save the preprocessed phonemes and spectrograms, so I can train different models on the same dataset and visually compare the ground-truth spectrograms against the TTS output.

Also, I think LongTensors are not needed, as sequence lengths won't exceed 2 billion.
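A minimal sketch of that caching idea, mirroring how the phoneme cache works (compute once, reuse across runs). The helper name, the `.npy` layout, and `compute_fn` are my own assumptions, not the actual TTS code:

```python
import os
import numpy as np

def load_or_compute_spectrogram(wav_path, cache_dir, compute_fn):
    """Load a cached mel spectrogram if present, otherwise compute and save it.

    `compute_fn` stands in for whatever audio-to-mel function the trainer uses;
    any model trained on the same dataset can then reuse the cached arrays.
    """
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(wav_path))[0] + ".npy"
    cache_path = os.path.join(cache_dir, name)
    if os.path.exists(cache_path):
        # Cache hit: skip the expensive spectrogram computation entirely.
        return np.load(cache_path)
    mel = compute_fn(wav_path)
    np.save(cache_path, mel)
    return mel
```

A collate_fn could call this instead of computing spectrograms per batch, trading disk space for per-iteration time.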
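To illustrate the dtype point (shown here with numpy; torch tensors take the same kind of `dtype` argument), halving the integer width halves the memory spent on length tensors:

```python
import numpy as np

# Sequence lengths fit comfortably in int32 (max ~2.1 billion),
# so the 8-byte int64 behind a LongTensor is wasted on them.
lengths64 = np.array([120, 87, 300], dtype=np.int64)
lengths32 = lengths64.astype(np.int32)
assert lengths32.nbytes == lengths64.nbytes // 2
```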

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (5 by maintainers)

Top GitHub Comments

2 reactions
iamanigeeit commented, Jan 18, 2022

@erogol @vince62s I think I've found the bottleneck. For some reason, creating a phonemizer in gruut is very slow.

import gruut

text = 'this is a very very very very long sentence that you havent handled before'
language = 'en-us'

def testme(text, language, phone_sep='', word_sep=' '):
    phonemizer_args = {
        "remove_stress": True,
        "ipa_minor_breaks": False,  # don't replace commas/semi-colons with IPA |
        "ipa_major_breaks": False,  # don't replace periods with IPA ‖
    }
    ph_list = gruut.text_to_phonemes(
        text,
        lang=language,
        return_format="word_phonemes",
        phonemizer_args=phonemizer_args,
    )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text, language)
144 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def testme(text, language, phone_sep='', word_sep=' '):
    phonemizer_args = {}
    ph_list = gruut.text_to_phonemes(
        text,
        lang=language,
        return_format="word_phonemes",
        phonemizer_args=phonemizer_args,
    )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text, language)
1.19 ms ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With 8 text samples per CPU, this would slow every batch down by over 1 s. If we simply create the phonemizer once, before calling text_to_phonemes, the bottleneck goes away.

phonemizer_args = {
    "remove_stress": True,
    "ipa_minor_breaks": False,  # don't replace commas/semi-colons with IPA |
    "ipa_major_breaks": False,  # don't replace periods with IPA ‖
}
phonemizer = gruut.get_phonemizer(language, **phonemizer_args)

def testme(text, language, phone_sep='', word_sep=' ', phonemizer=phonemizer):
    ph_list = gruut.text_to_phonemes(
        text,
        lang=language,
        return_format="word_phonemes",
        phonemizer=phonemizer,
    )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text, language)
1.2 ms ± 898 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1 reaction
iamanigeeit commented, Jan 20, 2022

https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md

Phonemizer API is gonna change soon #1079

So if you send a PR make sure you check the new API first.

@erogol Thanks for the update! I am rushing a paper for Interspeech 2022, so I might only review the latest version at the end of March… Meanwhile, I have found that gruut.Phonemizer can't be pickled (i.e. I cannot pass it as an argument to _phoneme_worker, so every worker needs to create its own phonemizer).

My current hack is to create a global list of phonemizers with num_phonemizers = num_workers, then pass the worker_idx to _phoneme_worker.

phonemizers = []

def set_phonemizers(phoneme_language, phonemizer_args, use_espeak_phonemes, num_workers):
    if use_espeak_phonemes:
        # Use a lexicon/g2p model trained on eSpeak IPA instead of gruut IPA.
        # This is intended for backwards compatibility with TTS<=v0.0.13
        # pre-trained models.
        phonemizer_args["model_prefix"] = "espeak"
    global phonemizers
    phonemizers = []
    if phonemizer_args:
        for i in range(num_workers):
            phonemizers.append(gruut.get_phonemizer(phoneme_language, **phonemizer_args))
    else:
        for i in range(num_workers):
            phonemizers.append(gruut.get_phonemizer(phoneme_language))
    return phoneme_language

...

tqdm(
    p.imap(_phoneme_worker,
           [(item, cache_path, cleaner_name, phoneme_language, i % num_workers,
             custom_symbols, character_config, add_blank) for i, item in enumerate(items)]),
    total=len(items)
)
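An alternative to indexing into a global list is multiprocessing.Pool's initializer hook, which runs once in each worker process, so every worker builds its own unpicklable phonemizer and nothing ever needs to cross the process boundary. A sketch under that assumption (str.upper is a stand-in for the real gruut.get_phonemizer call):

```python
from multiprocessing import Pool

# Set once per worker process by the initializer below; this is where the
# unpicklable gruut phonemizer would live (str.upper is just a stand-in).
_worker_phonemizer = None

def _init_worker(language):
    global _worker_phonemizer
    # Real code would do: _worker_phonemizer = gruut.get_phonemizer(language, ...)
    _worker_phonemizer = str.upper

def _phoneme_worker(text):
    # Only picklable arguments (the text items) are sent to workers;
    # the phonemizer itself is never pickled.
    return _worker_phonemizer(text)

def phonemize_all(items, language="en-us", num_workers=2):
    with Pool(num_workers, initializer=_init_worker, initargs=(language,)) as p:
        return p.map(_phoneme_worker, items)
```

This removes the worker_idx bookkeeping from the task tuples, at the cost of moving the phonemizer setup out of set_phonemizers and into the pool construction.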