Saving and loading the downsampled audio results in a tensor with zeros.

🐛 Bug

I was using an audio file that I downloaded from https://www.lynxstudio.com/ to experiment with 6-channel audio. When I downsample it from 44.1 kHz to 8 kHz, everything seems fine and I am able to play the audio. However, if I save the downsampled file and load it again, the tensor I get back contains only zeros.

import torchaudio

waveform, sample_rate = torchaudio.load('ChannelPlacement.wav')

downsample_rate = 8000
downsample_resample = torchaudio.transforms.Resample(
    sample_rate, downsample_rate, resampling_method='sinc_interpolation')
down_sampled = downsample_resample(waveform)
print(down_sampled)

torchaudio.save('temp.wav', down_sampled, downsample_rate)
waveform2, sample_rate2 = torchaudio.load('temp.wav')
print(waveform2)

tensor([[-3.7585e-09, -3.3725e-09,  9.2130e-09,  ..., -4.0691e-08,
          7.2912e-09, -5.7485e-08],
        [-3.2915e-09,  7.5441e-09,  9.4772e-10,  ..., -7.3543e-09,
          3.1981e-08, -2.3025e-08],
        [ 1.5473e-08,  1.8003e-08,  2.5778e-09,  ..., -1.0129e-08,
         -2.0479e-08,  2.6770e-08],
        [-2.1108e-08, -3.9693e-08, -2.2911e-08,  ..., -2.4338e-08,
          3.7029e-08, -3.1360e-09],
        [-2.1277e-08, -1.9114e-09, -4.4245e-09,  ..., -2.3023e-08,
          6.9994e-09, -7.5472e-09],
        [ 5.2013e-09,  2.5186e-08,  2.1362e-08,  ..., -2.6036e-07,
         -1.5355e-07,  3.7919e-08]])
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

To Reproduce

Steps to reproduce the behavior:

1. Load ChannelPlacement.wav, which is included in the attached zip file.
2. Downsample to 8000 Hz. (8000 seems to be a magic number: if I change it to another value, such as 7999, everything works fine.)
3. Save the downsampled version.
4. Load the downsampled file we just saved.
5. The tensor that torchaudio.load returns contains only zeros.

Here is the gist https://gist.github.com/hiromis/3a0ce0e3b8a512465609c653364c02fe

The following zip file includes the jupyter notebook as well as the audio file: files.zip

Expected behavior

torchaudio.save('temp.wav', down_sampled, downsample_rate)
waveform, sample_rate = torchaudio.load('temp.wav')

I expect the waveform tensor to match the tensor I saved (down_sampled).

Environment

What commands did you use to install torchaudio (conda/pip/build from source)? pip
If you are building from source, which commit is it? N/A
What does torchaudio.__version__ print? (If applicable) ‘0.3.0’

PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 430.26
cuDNN version: /usr/local/cuda-10.0/lib64/libcudnn.so.7.5.0

Versions of relevant libraries:
[pip3] numpy==1.16.4
[pip3] torch==1.1.0
[conda] torch                     1.2.0                    pypi_0    pypi
[conda] torchaudio                0.3.0                    pypi_0    pypi
[conda] torchvision               0.4.0                    pypi_0    pypi

Additional context

I tried several other files, and it seems that the combination of this file and a sample rate of 8000 hits an edge case of some sort.

Top GitHub Comments

3 reactions

jamarshon commented, Aug 22, 2019

I don’t think this is related to resampling. I think it’s related to save/load, which has a wide variety of configurations (precision, normalization, and scaling). The signal starts off at 24 bits per sample; after loading, it is a float32 tensor (32 bits per sample), which is resampled and then saved to file.

down_sampled.abs().max() == 1.0006, which is greater than 1.0, so the tensor is neither scaled nor converted to int64/long: https://github.com/pytorch/audio/blob/a424509dda5b57c932fa8b5b780de93e60ed7ee2/torchaudio/__init__.py#L191 This float32 tensor is then stored in a sox_sample_t/sox_int32_t buffer before being written to a file: https://github.com/pytorch/audio/blob/a424509dda5b57c932fa8b5b780de93e60ed7ee2/torchaudio/torch_sox.cpp#L37 Copying float32 values into a sox_int32_t buffer truncates them, which converts all the values to zero.
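To make the truncation concrete, here is a minimal pure-Python sketch of the behavior described in this comment. The function `sketch_save_samples` is a hypothetical stand-in written for illustration, not torchaudio's or sox's actual code; it only assumes the two steps named above (scale when every sample is within [-1.0, 1.0], then copy into an integer buffer).

```python
INT32_SCALE = 1 << 31  # full-scale factor for 32-bit signed samples


def sketch_save_samples(samples):
    """Hypothetical stand-in for the save path described above (not torchaudio's code)."""
    if min(samples) >= -1.0 and max(samples) <= 1.0:
        # In range: scale to integer sample values before writing.
        samples = [x * INT32_SCALE for x in samples]
    # Copying floats into an integer buffer drops the fractional part.
    return [int(x) for x in samples]


in_range = [-3.7585e-09, 9.2130e-09, 0.5]
out_of_range = [-3.7585e-09, 9.2130e-09, 1.0006]  # one sample above 1.0

print(sketch_save_samples(in_range))
print(sketch_save_samples(out_of_range))
```

Under these assumptions, a single sample above 1.0 is enough to skip the scaling step, after which every sample with magnitude below 1 truncates to 0. That would be consistent with the all-zero tensor printed in the report.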

Tangentially related concern in the past: https://github.com/pytorch/audio/pull/119#discussion_r293929024

Some resources for reference: https://en.wikipedia.org/wiki/Single-precision_floating-point_format http://soundfile.sapp.org/doc/WaveFormat/ https://github.com/kaldi-asr/kaldi/blob/master/src/feat/wave-reader.cc

sox uses various sox_int32_t buffers for reading/writing where it does some conversion to various bits per sample. https://github.com/rbouqueau/SoX/blob/e29e9ceb7c25a2d83c09bc8a601de117fc65563c/src/wavpack.c#L129

I think the solution would be to rewrite or fix torchaudio load and save so that all the bits of the waveform are saved in the file (e.g. somehow convert the float32 tensor to int32 bitwise for the sox buffer, or not use sox at all). The current torchaudio implementation of load/save seems to lose some bits and could be improved and made clearer (e.g. input/output dtype, scale of the input/output).

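# Scale by 2**32 and convert to int64; the values printed below are
# consistent with the original `down_sampled << 32` acting as this multiplication.
down_sampled2 = (down_sampled * (1 << 32)).long()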
torchaudio.save('temp.wav', down_sampled2, downsample_rate, precision=32)

waveform2, sample_rate2 = torchaudio.load('temp.wav', normalization=None)

print('down_sampled2\n', down_sampled2)
print('waveform2\n', waveform2)

# Note: the tensors are not the same, as information is lost in saving
# down_sampled2
# tensor([[  -16,   -14,    39,  ...,  -174,    31,  -246],
#     [  -14,    32,     4,  ...,   -31,   137,   -98],
#     [   66,    77,    11,  ...,   -43,   -87,   114],
#     [  -90,  -170,   -98,  ...,  -104,   159,   -13],
#     [  -91,    -8,   -19,  ...,   -98,    30,   -32],
#     [   22,   108,    91,  ..., -1118,  -659,   162]])
# waveform2
# tensor([[  -16.,   -14.,    39.,  ...,  -174.,    31.,  -246.],
#     [  -14.,    32.,     4.,  ...,   -31.,   137.,   -98.],
#     [   66.,    77.,    11.,  ...,   -43.,   -87.,   114.],
#     [  -90.,  -170.,   -98.,  ...,  -104.,   159.,   -13.],
#     [  -91.,    -8.,   -19.,  ...,   -98.,    30.,   -32.],
#     [   22.,   108.,    91.,  ..., -1118.,  -659.,   162.]])
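If the analysis in this comment is right, one possible workaround (an assumption, not something confirmed in this thread) is to bring the peak back to at most 1.0 before saving so that the scaling step is not skipped. The sketch below uses plain Python lists and a hypothetical helper `peak_normalize` to show the idea.

```python
def peak_normalize(samples, headroom=1.0):
    """Scale samples so the largest magnitude is at most `headroom` (hypothetical helper)."""
    peak = max(abs(x) for x in samples)
    if peak <= headroom:
        return list(samples)  # already in range, leave values untouched
    # Divide by the peak first so the loudest sample maps exactly to `headroom`.
    return [x / peak * headroom for x in samples]


samples = [-3.7585e-09, 9.2130e-09, 1.0006]  # peak slightly above 1.0
normalized = peak_normalize(samples)
print(max(abs(x) for x in normalized))
```

With a tensor, the analogous step would be dividing `down_sampled` by `down_sampled.abs().max()` when that peak exceeds 1.0. Note that this changes the signal's gain, so the reloaded tensor would match the normalized tensor rather than the original one.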

2 reactions

vincentqb commented, Aug 28, 2019

FYI: We are currently discussing standardizing the data loading in pytorch/pytorch#24915 and making any post-processing more transparent.
