Saving and loading the downsampled audio results in a tensor with zeros.
🐛 Bug
I was using this particular audio file that I downloaded from https://www.lynxstudio.com/ to experiment with 6 channel audio. When I downsample it from 44.1 kHz to 8 kHz, everything seems fine and I am able to play the audio. However, if I save the downsampled file and load it again, the tensor I get back contains only zeros.
import torchaudio
waveform, sample_rate = torchaudio.load('ChannelPlacement.wav')
downsample_rate = 8000
downsample_resample = torchaudio.transforms.Resample(
    sample_rate, downsample_rate, resampling_method='sinc_interpolation')
down_sampled = downsample_resample(waveform)
print(down_sampled)
torchaudio.save('temp.wav', down_sampled, downsample_rate)
waveform2, sample_rate2 = torchaudio.load('temp.wav')
print(waveform2)
tensor([[-3.7585e-09, -3.3725e-09, 9.2130e-09, ..., -4.0691e-08,
7.2912e-09, -5.7485e-08],
[-3.2915e-09, 7.5441e-09, 9.4772e-10, ..., -7.3543e-09,
3.1981e-08, -2.3025e-08],
[ 1.5473e-08, 1.8003e-08, 2.5778e-09, ..., -1.0129e-08,
-2.0479e-08, 2.6770e-08],
[-2.1108e-08, -3.9693e-08, -2.2911e-08, ..., -2.4338e-08,
3.7029e-08, -3.1360e-09],
[-2.1277e-08, -1.9114e-09, -4.4245e-09, ..., -2.3023e-08,
6.9994e-09, -7.5472e-09],
[ 5.2013e-09, 2.5186e-08, 2.1362e-08, ..., -2.6036e-07,
-1.5355e-07, 3.7919e-08]])
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
To Reproduce
Steps to reproduce the behavior:
- Load ChannelPlacement.wav, which is included in the attached zip file
- Downsample to 8000 (this 8000 seems to be a magic number: if I change it to another number, like 7999, it works fine)
- Save the downsampled version
- Load the downsampled file we just saved
- The tensor that torchaudio.load returns has all zeros
Here is the gist https://gist.github.com/hiromis/3a0ce0e3b8a512465609c653364c02fe
The following zip file includes the jupyter notebook as well as the audio file: files.zip
Expected behavior
torchaudio.save('temp.wav', down_sampled, downsample_rate)
waveform, sample_rate = torchaudio.load('temp.wav')
I expect the waveform tensor to match the tensor I saved (down_sampled).
Environment
- What commands did you use to install torchaudio (conda/pip/build from source)? pip
- If you are building from source, which commit is it? N/A
- What does torchaudio.__version__ print? (If applicable) '0.3.0'
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 430.26
cuDNN version: /usr/local/cuda-10.0/lib64/libcudnn.so.7.5.0
Versions of relevant libraries:
[pip3] numpy==1.16.4
[pip3] torch==1.1.0
[conda] torch 1.2.0 pypi_0 pypi
[conda] torchaudio 0.3.0 pypi_0 pypi
[conda] torchvision 0.4.0 pypi_0 pypi
Additional context
I did try several other files, and it seems that this file combined with a sample rate of 8000 hits an edge case of some sort.
I don’t think this is related to resampling. I think it’s related to the save/load, which has a wide variety of configurations (precision, normalization, and scaling). The signal starts off at 24 bits per sample, and after loading it is a float32 tensor (32 bits per sample), which is resampled and then saved to file. Here down_sampled.abs().max() == 1.0006, which is greater than 1.0, so the tensor skips the step that scales it and converts it to int64/long https://github.com/pytorch/audio/blob/a424509dda5b57c932fa8b5b780de93e60ed7ee2/torchaudio/__init__.py#L191 This float32 tensor is then stored in a sox_sample_t/sox_int32_t buffer before being written to a file https://github.com/pytorch/audio/blob/a424509dda5b57c932fa8b5b780de93e60ed7ee2/torchaudio/torch_sox.cpp#L37. The copy from float32 to sox_int32_t truncates the values, which turns nearly all of them into zero.
Tangentially related concern in the past: https://github.com/pytorch/audio/pull/119#discussion_r293929024
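To make the mechanism above concrete, here is a minimal pure-Python sketch of the behavior as described in this comment. It is a model, not torchaudio's actual code: the function name simulate_save and the constant INT32_SCALE are hypothetical, and plain lists stand in for tensors.

```python
INT32_SCALE = 1 << 31  # assumed scale factor for the 32-bit sox sample range


def simulate_save(samples):
    """Hypothetical model of the save path described above (not torchaudio's code)."""
    peak = max(abs(s) for s in samples)
    if peak <= 1.0:
        # Signal looks normalized: scale it up to the integer range first.
        scaled = [s * INT32_SCALE for s in samples]
    else:
        # Peak above 1.0: assumed to be unnormalized already, so left untouched.
        scaled = list(samples)
    # Copying floats into an integer buffer truncates toward zero.
    return [int(s) for s in scaled]


normalized = [0.5, -0.25]
slightly_over = [0.5, -0.25, 1.0006]  # peak taken from the issue

print(simulate_save(normalized))     # [1073741824, -536870912]
print(simulate_save(slightly_over))  # [0, 0, 1]
```

Under this model, a single sample above 1.0 is enough to skip the scaling for the whole signal, so every sample with magnitude below 1.0 is truncated to zero.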
Some resources for reference: https://en.wikipedia.org/wiki/Single-precision_floating-point_format http://soundfile.sapp.org/doc/WaveFormat/ https://github.com/kaldi-asr/kaldi/blob/master/src/feat/wave-reader.cc
sox uses various sox_int32_t buffers for reading/writing, where it does some conversion to various bits per sample. https://github.com/rbouqueau/SoX/blob/e29e9ceb7c25a2d83c09bc8a601de117fc65563c/src/wavpack.c#L129
I think the solution to the issue would be to rewrite/fix torchaudio load and save so that all the bits of the waveform are saved in the file (e.g. somehow convert the float32 tensor to int32 bitwise for the sox buffer, or not use sox at all). The current torchaudio implementation of load/save seems to lose some bits and could be improved and made clearer (e.g. input/output dtype, scale of the input/output).
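Until load/save is reworked, one possible user-side workaround is to bring the signal back into [-1.0, 1.0] before saving, so that it takes the scaled path. The sketch below is an assumption on my part, not a fix from the maintainers; fit_to_unit_range is a hypothetical helper and plain lists stand in for tensors.

```python
def fit_to_unit_range(samples):
    """Hypothetical helper: divide by the peak only if the signal exceeds [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        # Already in range: return an unchanged copy.
        return list(samples)
    # Divide by the peak so the loudest sample lands exactly on +/-1.0.
    return [s / peak for s in samples]


over_range = [0.5, -0.25, 1.0006]  # peak taken from the issue
fitted = fit_to_unit_range(over_range)
print(max(abs(s) for s in fitted))  # 1.0
```

With tensors the same idea would be roughly down_sampled / down_sampled.abs().max(). Note that this changes the gain slightly, so the reloaded tensor will match the rescaled signal rather than the original down_sampled values.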
FYI: We are currently discussing standardizing the data loading in pytorch/pytorch#24915, and making any post-processing more transparent.