Audio parameters (hop_length / frame shift in ms, sample rate, etc.) for pre-trained models
Hi! Thank you for the recent README update and the convenient in-python interface! I would like to request a documentation update. For feature extraction with any upstream model it is crucial to know not only the input data type (wav, mel, MFCC, and others), but also which audio parameters should be used for best performance with the corresponding pre-trained model.
For example, the wav2vec 2.0 model takes raw waveforms and downsamples the sequence dimension by a factor of 320, which the paper's authors say corresponds to 20 ms of raw audio. Training was done on LibriSpeech (16 kHz sample rate). If I use this model to get representations for downstream pipelines, it is better to feed audio at the same sample rate and treat each output vector as a 20 ms frame representation. I would not use it on 22.05 kHz audio (or other rates), where the same stride corresponds to a frame shift of about 14.5 ms.
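To make the arithmetic concrete (a tiny illustrative snippet, not tied to any library; 320 is the downsampling factor reported for wav2vec 2.0):

```python
# Frame shift (in ms) implied by a model's waveform downsampling factor.
def frame_shift_ms(downsample_factor: int, sample_rate: int) -> float:
    return 1000.0 * downsample_factor / sample_rate

print(frame_shift_ms(320, 16000))   # 20.0 ms  -- matches the wav2vec 2.0 paper
print(frame_shift_ms(320, 22050))   # ~14.5 ms -- why a mismatched sample rate is a problem
```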
Thus, it would be nice to know which data each model was pre-trained on, simply to plug a given model into a pipeline correctly.
The same applies to spectrogram features, which use hop_length, window_length, and other parameters.
Hi,
In the Upstream README, the pre-training data is already listed. The link is also in the main README.
For baseline features, please see the config files here; I will add them to our documentation soon. The default hop and window lengths are 10 ms and 25 ms, respectively.
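To illustrate what those defaults mean in samples at 16 kHz (a sketch using torchaudio; the actual config keys in the repo may be named differently):

```python
import torch
import torchaudio

sample_rate = 16000
hop_length = int(0.010 * sample_rate)   # 10 ms hop    -> 160 samples
win_length = int(0.025 * sample_rate)   # 25 ms window -> 400 samples

# Illustrative log-mel-style extraction with the same hop/window values.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=win_length,
    win_length=win_length,
    hop_length=hop_length,
    n_mels=80,
)
waveform = torch.randn(1, sample_rate)   # 1 second of dummy audio
features = melspec(waveform)             # shape: (1, 80, ~101 frames), i.e. one frame per 10 ms
```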
FYI, currently all the pre-trained models from all papers use a 16 kHz sample rate; to the best of my knowledge, no other sample rate has been used yet.
I hope I’ve answered your questions.
Andy
@trangham283, upsampling from 8 kHz to 16 kHz generally won't (at least in theory) degrade the quality of the original audio. Hence, I think it is fine to upsample first and then extract representations, so that the output frames still correspond to a 10 ms shift of the original audio.
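For example, a resampling step before extraction could look like this (a sketch using torchaudio; the upstream call follows the in-python interface mentioned above as I understand it, and the model name and file path are just placeholders):

```python
import torch
import torchaudio
import s3prl.hub as hub

# Hypothetical 8 kHz recording; resample to the 16 kHz the pre-trained models expect.
waveform, orig_sr = torchaudio.load("utterance_8k.wav")
if orig_sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16000)(waveform)

# Extract representations with an upstream model (interface per the s3prl README,
# to the best of my knowledge); each output frame then spans the model's stride
# (e.g. 20 ms for wav2vec 2.0) of the original audio.
model = getattr(hub, "wav2vec2")()
with torch.no_grad():
    reps = model([waveform.squeeze(0)])["hidden_states"]
```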