Audio parameters (hop_length / frame shift in ms, sample rate, etc.) for pre-trained models
Hi! Thank you for the recent README update and the convenient in-python interface! I would like to request a documentation update. For feature extraction with any upstream model it is crucial to know not only the input data type (wav, mel, MFCC, and others), but also which audio parameters should be used for best performance with the corresponding pre-trained model.
For example, the wav2vec 2.0 model takes raw waveforms and downsamples the sequence dimension by a factor of 320, which the paper's authors say corresponds to 20 ms of raw audio. Training was done on LibriSpeech (16 kHz sample rate). If I use this model to get representations for downstream pipelines, it is better to feed audio at the same sample rate and treat each output vector as a 20 ms frame representation. I would not use it on 22.05 kHz audio (or other rates), where the same stride corresponds to a frame shift of about 14.5 ms.
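To make the arithmetic concrete (a tiny illustrative snippet, not tied to any library; 320 is the downsampling factor reported for wav2vec 2.0):

```python
# Frame shift (in ms) implied by a model's waveform downsampling factor.
def frame_shift_ms(downsample_factor: int, sample_rate: int) -> float:
    return 1000.0 * downsample_factor / sample_rate

print(frame_shift_ms(320, 16000))   # 20.0 ms  -- matches the wav2vec 2.0 paper
print(frame_shift_ms(320, 22050))   # ~14.5 ms -- why a mismatched sample rate is a problem
```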
Thus, it would be nice to know which data each model was pre-trained on, simply to plug a given model into a pipeline correctly.
The same applies to spectrogram features, which use hop_length, window_length, and other parameters.
Hi,
In the Upstream README, the pre-training data is already listed. The link is also in the main README.
For baseline features, please see the config files here; I will add them to our documentation soon. The default hop and window lengths are 10 ms and 25 ms, respectively.
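To illustrate what those defaults mean in samples at 16 kHz (a sketch using torchaudio; the actual config keys in the repo may be named differently):

```python
import torch
import torchaudio

sample_rate = 16000
hop_length = int(0.010 * sample_rate)   # 10 ms hop    -> 160 samples
win_length = int(0.025 * sample_rate)   # 25 ms window -> 400 samples

# Illustrative log-mel-style extraction with the same hop/window values.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=win_length,
    win_length=win_length,
    hop_length=hop_length,
    n_mels=80,
)
waveform = torch.randn(1, sample_rate)   # 1 second of dummy audio
features = melspec(waveform)             # shape: (1, 80, ~101 frames), i.e. one frame per 10 ms
```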
FYI, currently all the pre-trained models from all papers use a 16 kHz sample rate; to the best of my knowledge, no other sample rate has been used yet.
I hope I’ve answered your questions.
Andy
@trangham283, upsampling from 8 kHz to 16 kHz generally won't (at least in theory) degrade the quality of the original audio. Hence, I think it is fine to upsample first and then extract representations, so that the output frames still correspond to a 10 ms shift of the original audio.
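For example, a resampling step before extraction could look like this (a sketch using torchaudio; the upstream call follows the in-python interface mentioned above as I understand it, and the model name and file path are just placeholders):

```python
import torch
import torchaudio
import s3prl.hub as hub

# Hypothetical 8 kHz recording; resample to the 16 kHz the pre-trained models expect.
waveform, orig_sr = torchaudio.load("utterance_8k.wav")
if orig_sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16000)(waveform)

# Extract representations with an upstream model (interface per the s3prl README,
# to the best of my knowledge); each output frame then spans the model's stride
# (e.g. 20 ms for wav2vec 2.0) of the original audio.
model = getattr(hub, "wav2vec2")()
with torch.no_grad():
    reps = model([waveform.squeeze(0)])["hidden_states"]
```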