Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Unexpected result from using wav2vec 2.0 features for keyword spotting

See original GitHub issue

Hi, I’m trying to extract features using wav2vec 2.0 to do keyword spotting (with Dynamic Time Warping), but I’m getting some unexpected results when computing a distance matrix between two feature matrices using the XLSR53 checkpoint for feature extraction.

Say, for example, I have two audio files hello.wav and goodbye-hello-goodbye.wav. When I use librosa to extract MFCCs from each of these files and then compute a distance matrix, I get the expected outcome: a diagonal band showing a spectro-temporal correlation where the ‘hello’ is in the middle of the 2nd phrase.

Screen Shot 2021-01-29 at 3 42 50 PM

Doing the same calculations but using features from wav2vec 2.0, however, gives me this:

Screen Shot 2021-01-29 at 3 42 42 PM

I’m not sure if I’ve misunderstood the nature of the wav2vec 2.0 features and am mis-using it (let me know if that’s the case), or I’m missing something in the feature extraction process. To do feature extraction, I’m using the code that I found in #3134.

Thanks!

Code

import fairseq
import torch
import torchaudio
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Downloaded from https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt on 2021-01-27
wav2vec2_checkpoint_path = "xlsr_53_56k.pt"

# Code from https://github.com/pytorch/fairseq/issues/3134#issuecomment-761110102
checkpoint = torch.load(wav2vec2_checkpoint_path)
wav2vec2_encoder = fairseq.models.wav2vec.Wav2Vec2Model.build_model(checkpoint['cfg']['model'])
wav2vec2_encoder.load_state_dict(checkpoint['model'])

q_dat, q_sr = torchaudio.load("hello.wav")
r_dat, r_sr = torchaudio.load("goodbye-hello-goodbye.wav")

# Resample to 16 kHz
q_dat = torchaudio.transforms.Resample(q_sr, 16000)(q_dat)
r_dat = torchaudio.transforms.Resample(r_sr, 16000)(r_dat)

# Extract features
query_wav2vec2     = wav2vec2_encoder(q_dat, features_only=True, mask=False)['x'].detach().numpy().squeeze()
reference_wav2vec2 = wav2vec2_encoder(r_dat, features_only=True, mask=False)['x'].detach().numpy().squeeze()

# Calculate distance matrix
qr_dists_w2v2 = cdist(query_wav2vec2, reference_wav2vec2, 'euclidean', V = None)                    # Calculate distance matrix
qr_dists_w2v2 = ((qr_dists_w2v2 - qr_dists_w2v2.min())/(qr_dists_w2v2.max() - qr_dists_w2v2.min())) # Normalized to [0, 1]

# Plot distance matrix
plt.imshow(qr_dists_w2v2, interpolation='none')
plt.show()

What’s your environment?

Google CoLab: https://colab.research.google.com/drive/1_QfbOd1UyQ4d358G54DF53p36vsuJiRp

fairseq Version (e.g., 1.0 or master): 1.0.0a0+148327d
PyTorch Version (e.g., 1.7.1): 1.7.1
OS (e.g., Linux): Ubuntu 18.04 (Google CoLab)
How you installed fairseq (pip, source): pip install git+https://github.com/pytorch/fairseq.git
Python version: 3.6.9

Issue Analytics

State:
Created 3 years ago
Reactions:1
Comments:14 (1 by maintainers)

Top GitHub Comments

6reactions

Barteldscommented, Feb 12, 2021

Thanks for your suggestion @alexeib! However, it seems that probing the different transformer layers does not help in solving the problem. Similar to @fauxneticien, I computed the distance matrix between the audio samples provided in https://github.com/pytorch/fairseq/issues/3181#issue-797288871 using features from the different transformer layers. Meaningful results are obtained for the monolingual wav2vec 2.0 model, but not for the multilingual model (visualizations using features from layer 5, 10, 15, and the top layer are shown below).

Monolingual model, layer 5: dtw_layer5_mono

Multilingual model, layer 5: dtw_layer5_multi

Monolingual model, layer 10: dtw_layer10_mono

Multilingual model, layer 10: dtw_layer10_multi

Monolingual model, layer 15: dtw_layer15_mono

Multilingual model, layer 15: dtw_layer15_multi

Top layer monolingual model: dtw_toplayer_mono

Top layer multilingual model: dtw_toplayer_multi

Could you indicate if you have seen or obtained meaningful results by probing the different transformer layers in the multilingual model?

3reactions

fauxneticiencommented, Feb 2, 2021

Thanks @mdda — I’m not that technically proficient with digging inside models, but I’ve tried adapting some extraction code from a colleague of mine (see original code here).

For this code, I load the model in as follows:

wav2vec2_checkpoint_path = "xlsr_53_56k.pt"

model, cfg = fairseq.checkpoint_utils.load_model_ensemble([wav2vec2_checkpoint_path])
model = model[0]
model.eval()

Then, for feature extraction, I can use:

def extract_w2v2_feats(wav_data, model = model):
    x = model.feature_extractor(wav_data)
    x = x.transpose(1, 2)
    x = model.layer_norm(x)
    x = model.post_extract_proj(x)

    # Transformer encoder
    x_conv = model.encoder.pos_conv(x.transpose(1, 2))
    x_conv = x_conv.transpose(1, 2)
    x += x_conv

    x = model.encoder.layer_norm(x)
    x = x.squeeze(0).detach().cpu().numpy()

    return x

I don’t quite get what’s going on but this seems to produce more reasonable outputs:

Features for hello.wav (note time on x axis):

Screen Shot 2021-02-01 at 4 32 25 PM

Features for goodbye-hello-goodbye.wav:

Screen Shot 2021-02-01 at 4 32 33 PM

Distance matrix between two feature matrices (note the expected diagonal band):

Screen Shot 2021-02-01 at 4 32 38 PM

At the same time, I can’t seem to get it to work with my colleague’s layer extraction code. As reported here, Martijn found that using the middle layers (e.g. layer 10) from the wav2vec 2.0 English model offered better representations for doing automated pronunciation comparisons than the output layer, which he suspected was better suited for the original training task. In any case, trying this code with the XLSR model:

def extract_w2v2_feats(wav_data, layer_i = None, model = model):
    x = model.feature_extractor(wav_data)
    x = x.transpose(1, 2)
    x = model.layer_norm(x)
    x = model.post_extract_proj(x)

    # Transformer encoder
    x_conv = model.encoder.pos_conv(x.transpose(1, 2))
    x_conv = x_conv.transpose(1, 2)
    x += x_conv

    x = model.encoder.layer_norm(x)
    x = x.transpose(0, 1)

    layer_i = None if layer_i is None or layer_i < 0 else layer_i - 1

    for i, layer in enumerate(model.encoder.layers):
        x, z = layer(x, self_attn_padding_mask=None, need_weights=False)
        if i == layer_i:
            break

    x = x.transpose(0, 1)
    x = x.squeeze(0).detach().cpu().numpy()

    return x

Presumably extract_w2v2_feats(wav_data, None) should return the same final layer output as the code above (since the i == layer_i condition is never satisfied), but alas:

hello.wav is: Screen Shot 2021-02-01 at 4 52 10 PM

goodbye-hello-goodbye.wav is: Screen Shot 2021-02-01 at 4 52 14 PM

And the distance matrix is: Screen Shot 2021-02-01 at 4 52 22 PM

But I gather the XSLR model internals may be different to the mono-lingual models, so we might need to play with the code a bit to get this aspect to work (if any maintainers have clues on how to approach this, that’d be a big help!). Hope the first part helps, @mdda!