transforms.AmplitudeToDB does not handle cut-off correctly for multi-channel or batched data
Bug
From my understanding (based on e.g. #337), all transforms should be able to operate on tensors with dimensions `(batch, channels, ...)` or `(channels, ...)`, with `...` depending on the type of data being processed, e.g. `time` for waveform data and `freq, time` for spectrograms. In this way, we can pass multiple chunks of data (waveforms, spectrograms, ...) at once and expect the same results as if we passed them one by one.
However, this is not the case for `transforms.AmplitudeToDB`: as is easily traceable in the source code of the corresponding functional, this transform blindly operates on the passed tensor without taking its dimensionality and the related semantics into account in any way.
This becomes a problem in the calculation of the cut-off. The purpose of this step is to clamp low dB values to a minimum some fixed number of decibels below the maximum value in the respective spectrogram. However, `amplitude_to_DB` uses the single global maximum of the passed tensor to calculate the cut-off for all contained spectrograms. Thus, when passing batched data, the result for one spectrogram depends on all the other spectrograms in the same batch, which to my understanding is not the correct behavior. My conclusion is that `AmplitudeToDB` silently outputs wrong data (in the sense of the general interface contract of `transforms`) for batched or multi-channel input, which I would consider really dangerous for applications.
Ideally, this should be fixed directly in `functional.amplitude_to_DB`, so that we can also pass batched data there.
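For concreteness, a minimal reproduction along the lines described above might look as follows (a sketch; the spectrogram sizes and scale factors are made up for illustration, and it assumes the `AmplitudeToDB(stype, top_db)` constructor of TorchAudio 0.7.0):

```python
import torch
import torchaudio

transform = torchaudio.transforms.AmplitudeToDB(stype="power", top_db=80.0)

# Two power spectrograms with very different maxima, stacked into one batch.
quiet = torch.rand(1, 128, 100) * 1e-6  # (channel, freq, time)
loud = torch.rand(1, 128, 100) * 1e2

batched = transform(torch.stack([quiet, loud]))  # (batch, channel, freq, time)
single = transform(quiet)

# With the global-maximum cut-off, the quiet spectrogram is clamped relative
# to the loud one's maximum, so the two results differ (expected: False).
print(torch.allclose(batched[0], single))
```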
Environment
TorchAudio 0.7.0
Top GitHub Comments
Hi @jcaw
You can go ahead and open a PR. That way it is easier to keep the discussion going. I would like to update our test infrastructure to catch this kind of bug, so it will help to think of a way to fix the tests.
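For illustration, a batch-consistency check of the kind described above might look like this (a hypothetical sketch, not the actual test infrastructure; the per-item scaling is there to expose the global-maximum cut-off):

```python
import torch
import torchaudio

def test_amplitude_to_db_batch_consistency():
    # Each item of a batch should transform to the same result as when it
    # is transformed on its own.
    torch.manual_seed(0)
    transform = torchaudio.transforms.AmplitudeToDB(stype="power", top_db=80.0)
    # Give each batch item a very different magnitude so a cut-off computed
    # from the global maximum would clamp the quiet items differently.
    scales = torch.tensor([1e-6, 1e-2, 1.0, 1e2]).view(4, 1, 1, 1)
    batch = torch.rand(4, 2, 128, 100) * scales  # (batch, channel, freq, time)
    batched_out = transform(batch)
    for i, item in enumerate(batch):
        assert torch.allclose(batched_out[i], transform(item)), f"mismatch at item {i}"
```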
That would be great, thanks!
Yes, we should retain the clamping behavior.
That's an interesting point: per channel or per spectrogram? Reading the code, I'd say the intended behavior was likely per spectrogram, since the batching came later. So let's keep this behavior.
Batches and channels were packed together to add batching support, so this does need to be fixed. In particular, this means the transform should likely not fold batches into channels to apply the transform.
The format should be `(..., freq, time)`. The function does not apply to complex tensors (which would have had a shape of `(..., freq, time, 2)`). Yes, indeed, though the comment above would help take care of replacing `clamp` by `min`/`max`.
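To make that last point concrete, here is a sketch of a per-spectrogram cut-off for the `(..., freq, time)` layout (an illustrative reimplementation with assumed default parameters, not the actual fix): because the cut-off threshold becomes a per-item tensor, an element-wise `torch.max` takes the place of the scalar-valued `clamp`.

```python
import torch

def amplitude_to_db_per_item(x, multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0):
    """Hypothetical sketch: dB conversion with a cut-off computed per spectrogram."""
    x_db = multiplier * torch.log10(torch.clamp(x, min=amin))
    x_db -= multiplier * db_multiplier
    if top_db is not None:
        shape = x_db.shape
        # Flatten each (freq, time) spectrogram so the maximum is computed
        # per item rather than over the whole batch.
        flat = x_db.reshape(-1, shape[-2] * shape[-1])
        max_per_item = flat.max(dim=-1, keepdim=True).values
        # Element-wise max against a per-item threshold replaces the scalar clamp.
        flat = torch.max(flat, max_per_item - top_db)
        x_db = flat.reshape(shape)
    return x_db
```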