[Announcement] Improving I/O for correct and consistent experience
tl;dr: how to migrate to new backend/interface in 0.7
- If you are using `torchaudio` in Linux/macOS environments, please use `torchaudio.set_audio_backend("sox_io")` to adapt to the upcoming changes.
- If you are in a Windows environment, please set `torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False` and reload the backend to use the new interface.
- Note that this ships with some bug fixes for formats other than 16-bit signed integer WAV, so you might experience some BC-breaking changes as described in the section below.
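The two platform-specific steps above can be combined into one small helper. This is only a sketch: `pick_backend` and `configure_torchaudio` are hypothetical names introduced here, and it assumes that "reload backend" means calling `torchaudio.set_audio_backend("soundfile")` again after setting the flag. It targets the 0.7/0.8-era releases this announcement covers.

```python
import sys


def pick_backend(platform: str = sys.platform) -> str:
    """Return the backend name this announcement recommends for a platform."""
    # Windows has no libsox-based backend, so "soundfile" is used there.
    return "soundfile" if platform.startswith("win") else "sox_io"


def configure_torchaudio(platform: str = sys.platform) -> str:
    """Opt in to the new interface (hypothetical helper, 0.7/0.8-era API)."""
    import torchaudio  # imported lazily so pick_backend works without it

    backend = pick_backend(platform)
    if backend == "soundfile":
        # Flag named in the announcement; switches off the legacy interface.
        torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False
    # Assumption: re-selecting the backend is what "reload backend" refers to.
    torchaudio.set_audio_backend(backend)
    return backend
```

Calling `configure_torchaudio()` once at program start, before any `load`/`save`/`info` call, should be enough in this sketch.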
News
[UPDATE] 2021/03/06
- All the migration work has been completed on the master branch.
[UPDATE] 2021/02/12
- Added `bits_per_sample` and `encoding` arguments (replacing `dtype`) to the `save` function.
[UPDATE] 2021/01/29
- Added `encoding` to `AudioMetaData`.
[UPDATE] 2021/01/22
- Added `format` argument to the `load`/`info`/`save` functions.
- Added `bits_per_sample` to `AudioMetaData`.
[UPDATE] 2020/10/21
- Added description of the `"soundfile"` backend legacy interface.
[UPDATE] 2020/09/18
- Added migration guide for the `"soundfile"` backend.
- Moved the phase when the `"soundfile"` backend signatures change from 0.9.0 to 0.8.0 so that they match the `"sox_io"` backend, which becomes the default in 0.8.0.
[UPDATE] 2020/09/17
- Added information on deprecation of native `libsox` structures such as `signalinfo_t` and `encoding_t`.
Improving I/O for correct and consistent experience
This is an announcement for users that we are making backward-incompatible changes to the I/O functions of `torchaudio` backends from the 0.7.0 release through the 0.9.0 release.
What is affected?
- Public APIs
  - `torchaudio.load`
    - [Linux/macOS] By switching the default backend from the `"sox"` backend to the `"sox_io"` backend in 0.8.0, loading audio formats other than 16-bit signed integer WAV returns the correct tensor.
    - [Linux/macOS/Windows] The signature of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.
  - `torchaudio.save`
    - [Linux/macOS] By switching to the `"sox_io"` backend, saving audio files will no longer degrade the data. The supported formats will be restricted to the tested formats only. (Please refer to the doc for the supported formats.)
    - [Linux/macOS/Windows] The signature of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.
  - `torchaudio.info`
    - [Linux/macOS/Windows] The signature of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.
  - `torchaudio.load_wav`
    - Will be removed in 0.9.0. (The `load` function with `normalize=False` will provide the same functionality.)
- Internal APIs
  The following functions/classes of the `"sox"` backend were accidentally exposed and will be removed in 0.9.0. There is no replacement for them. Please use the `save`/`load`/`info` functions.
  - `torchaudio.save_encinfo`
    - will be removed in 0.9.0
  - `torchaudio.get_sox_signalinfo_t`
    - will be removed in 0.9.0
  - `torchaudio.get_sox_encodinginfo_t`
    - will be removed in 0.9.0
  - `torchaudio.get_sox_option_t`
    - will be removed in 0.9.0
  - `torchaudio.get_sox_bool`
    - will be removed in 0.9.0
  The signatures of the other backends are not planned to be changed within this overhaul plan.
  - Classes `torchaudio.SignalInfo` and `torchaudio.EncodingInfo`
    - will be replaced with `AudioMetaData` in 0.8.0 for the `"soundfile"` backend
    - will be removed in 0.9.0
Why
There are currently three backends in `torchaudio`. (Please refer to the documentation for details.)
The `"sox"` backend is the original backend, which binds `libsox` with `pybind11`. The functionalities (`load` / `save` / `info`) of this backend are not well-tested and have a number of issues. (See https://github.com/pytorch/audio/pull/726.)
Fixing these issues in a backward-compatible manner is not straightforward. Therefore, while we were adding TorchScript-compatible I/O functions, we decided to deprecate this original `"sox"` backend and replace it with the new backend (the `"sox_io"` backend), which is confirmed not to have those issues.
As we switch the default backend for Linux/macOS from `"sox"` to `"sox_io"`, we would like to align the interface of the `"soundfile"` backend. Therefore, we introduced a new interface (rather than a new backend, to reduce the number of public APIs) to the `"soundfile"` backend.
When / What Changes
The following is the timeline for the planned changes:
Phase | Expected Release | Expected Changes |
---|---|---|
1 | 0.7.0 (Oct 2020) | |
2 | 0.8.0 (March 2021) | |
3 | 0.9.0 | |
Planned signature changes of the `"soundfile"` backend in 0.8.0
The following are the planned signature changes of the `"soundfile"` backend functions in the 0.8.0 release.
`info` function
The `AudioMetaData` implementation can be found here. The placement of `AudioMetaData` might change.
~0.7.0 | 0.8.0 |
|
|
Migration
The values returned from the `info` function will change. Please use the corresponding new attributes.
~0.7.0 | 0.8.0 |
|
|
Note: If the attribute you are using is missing, please file a Feature Request issue.
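The attribute migration for `info` can be sketched without torchaudio installed. Everything below is illustrative: the three named tuples are stand-ins defined here (the real classes live in torchaudio), `to_metadata` is a hypothetical helper, and the legacy attribute names (`rate`, `length`, `channels`) are an assumption about the legacy interface rather than a copy of the table above. The `AudioMetaData` fields follow the names listed in the News section.

```python
from collections import namedtuple

# Stand-ins for illustration only; field names on the legacy side are assumed.
SignalInfo = namedtuple("SignalInfo", ["channels", "rate", "precision", "length"])
EncodingInfo = namedtuple("EncodingInfo", ["encoding", "bits_per_sample"])
AudioMetaData = namedtuple(
    "AudioMetaData",
    ["sample_rate", "num_frames", "num_channels", "bits_per_sample", "encoding"],
)


def to_metadata(si: SignalInfo, ei: EncodingInfo) -> AudioMetaData:
    """Map a legacy (SignalInfo, EncodingInfo) pair onto the new single object."""
    return AudioMetaData(
        sample_rate=si.rate,        # legacy `rate`     -> `sample_rate`
        num_frames=si.length,       # legacy `length`   -> `num_frames`
        num_channels=si.channels,   # legacy `channels` -> `num_channels`
        bits_per_sample=ei.bits_per_sample,
        encoding=ei.encoding,
    )


# Legacy call style: two objects come back and must be unpacked.
si = SignalInfo(channels=2, rate=16000, precision=16, length=32000)
ei = EncodingInfo(encoding="PCM_S", bits_per_sample=16)
metadata = to_metadata(si, ei)
```

With the new interface the call site shrinks to a single `metadata = torchaudio.info(path)` and reads attributes off `metadata` directly.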
`load` function
~0.7.0 | 0.8.0 |
|
|
Migration
Please change the argument names:
- `normalization` -> `normalize`
- `offset` -> `frame_offset`
~0.7.0 | 0.8.0 |
|
|
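For call sites that pass keyword arguments through a dictionary, the two renames above can be applied mechanically. A minimal sketch follows; `migrate_load_kwargs` is a hypothetical helper, not part of torchaudio, and it only renames keys without checking whether the values are still accepted by the new interface.

```python
# Old keyword -> new keyword, as listed in the migration note above.
RENAMED_LOAD_KWARGS = {
    "normalization": "normalize",
    "offset": "frame_offset",
}


def migrate_load_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with legacy `load` keyword names renamed."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_LOAD_KWARGS.get(key, key)
        if new_key in migrated:
            # Both the old and the new spelling were given; refuse to guess.
            raise TypeError(f"duplicate argument after renaming: {new_key!r}")
        migrated[new_key] = value
    return migrated


old_kwargs = {"normalization": False, "offset": 100, "num_frames": 400}
new_kwargs = migrate_load_kwargs(old_kwargs)
```

The result would then be passed as `torchaudio.load(path, **new_kwargs)`. Per the list of affected APIs, `load` with `normalize=False` is also the replacement for `load_wav`.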
`save` function
~0.7.0 | 0.8.0 |
|
|
Migration
~0.7.0 | 0.8.0 |
|
|
BC-breaking changes
Read and write operations on the formats other than WAV 16-bit signed integer were affected by small bugs.
Top GitHub Comments
(note: I updated the save un-normalization code snippet based on the suggestion.)
Hi @f0k
Thanks for the comment. Those are very good points.
Let me first tell you the context. The design principles for the new I/O modules are
For the normalization, it is because of principles 2 and 3 that we return the normalized value by default, and the normalization is performed with fixed coefficients (determined by dtype). If we normalized the resulting tensor with a value found in the tensor, users would have questions like “what was the normalization coefficient being used?”, to which they might never get an answer. Also, it is because of principle 1 that we want to provide the option to return the uncompressed data without normalization. This design is influenced by the `scipy.io.wavfile.read` function. If someone is working on a non-DL application and wants to decode audio data in a format that other Python libraries do not support, they can use `torchaudio`, as `PyTorch` provides zero-overhead conversion from Tensor to NumPy NDArray type.
Now, for the parameter name `"normalization"`, I get that it’s confusing. (There were other users who had the same confusion.) This is kind of historical. The previous backend had a similar argument, and when I started working on this module, we did not intend to introduce a BC-breaking change. As for your suggestion of `as_float` or `floatify`, I think there is still an ambiguity regarding the value range of the resulting Tensor. It is more explicit about the data type, but none of them is perfect, so I am in favor of keeping it as-is. However, I think the documentation should be updated to state that normalization is based on data type.
For the `dtype` argument, it would be nice to have, but that’s also something users can do easily. And since we expect floating-point type with the [-1.0, 1.0] value range throughout the library (except the kaldi module, which was introduced without design review and which we plan to address), and the use of integer types is reserved for user-specific cases, I think the use case is underdefined from our perspective.
About the un-normalization process. I looked into it in some detail, and now I think you are right. Let me explain why I suggested the formula. When I started writing the new loading function in C++, I wondered how I could know that my code was doing the right thing and that the resulting Tensor had the right values. I ended up with this. Internally, `libsox` represents samples as 32-bit signed integers, so normalization was needed. At the time I did not know how `libsox` internally does the conversion, so I set up the test and changed the normalization strategy until I found an acceptable one (that is, values are close to what the `sox` command generates, and there is no overflow). I ended up with this normalization, which is the reverse of what you pointed out. This achieved about `4e-05` (or `3e-03` for mp3) closeness, which was the best.
Now I understand the code base of `libsox` better, and I dug into it to find out how `libsox` does it and found the following. As you say, it does normalization with a single value and applies clipping.
https://github.com/dmkrepo/libsox/blob/b9dd1a86e71bbd62221904e3e59dfaa9e5e72046/src/sox.h#L994
I think I can update the implementation to do the same, and that should yield a result even closer to `sox`.
For the saving part, as @faroit suggested above, I am thinking of including un-normalization inside the save function and defaulting to 16-bit signed integer, so that users do not have to bother with un-normalization and the default covers most real-world use cases.
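The "single coefficient plus clipping" scheme discussed in this comment can be illustrated in plain Python. This is a sketch under stated assumptions, not the torchaudio or `libsox` implementation: the coefficient is taken to be `2 ** (bits - 1)`, rounding uses Python's built-in `round`, and `normalize` / `unnormalize` are hypothetical helper names.

```python
def normalize(samples, bits=16):
    """Scale signed integer samples to floats using one fixed coefficient."""
    scale = float(1 << (bits - 1))  # 32768.0 for 16-bit audio
    return [s / scale for s in samples]


def unnormalize(values, bits=16):
    """Scale floats back to signed integers, clipping out-of-range values."""
    scale = float(1 << (bits - 1))
    lowest = -(1 << (bits - 1))       # -32768 for 16-bit audio
    highest = (1 << (bits - 1)) - 1   # 32767 for 16-bit audio
    result = []
    for v in values:
        q = int(round(v * scale))     # rounding mode is an assumption
        result.append(max(lowest, min(highest, q)))
    return result


pcm = [-32768, -1, 0, 16384, 32767]
floats = normalize(pcm)
restored = unnormalize(floats)
clipped = unnormalize([1.0, -1.0, 2.0, -3.5])
```

Because the coefficient depends only on the bit depth, the round trip is exact for in-range integers, and a float of exactly `1.0` clips to the largest representable sample instead of overflowing.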
This is great news, this will definitely improve trust and adoption of torchaudio 🙂 !