Merging plan from torchaudio-contrib

Hi all, I think it’s good timing to discuss a potential merging plan from torchaudio-contrib to here, especially because there are going to be new features and changes by @jamarshon @cpuhrsch.
Main idea
A lot of things are well summarized in https://github.com/keunwoochoi/torchaudio-contrib. In short, we wanted to re-design torch-based audio processing so that
- things can be `Layers`, which are based on corresponding `Functionals`
- names for layers and arguments are carefully chosen
- all work for multi-channel
- complex numbers are supported when it makes sense (e.g., STFTs)
Review - layers

torchaudio-contrib already covers many of the functions that `transforms.py` currently provides, but not all of them. That’s why I feel it’s time to discuss this here. Let me list the classes in `transforms.py` one by one with some notes.
1. Already in torchaudio-contrib. Hoping we’d replace.

- `class Spectrogram`: we have it in torchaudio-contrib. On top of this, we also have an `STFT` layer which outputs complex representations (same as `torch.stft`, since we’re wrapping it).
- `class MelScale`: we have it, and would like to suggest changing the name to something more general. We named it `class MelFilterbank`, assuming there can be other types of filterbanks, too. It also supports `htk` and non-`htk` mel filterbanks.
- `class SpectrogramToDB`: we would like to propose a more general approach, `class AmplitudeToDb(ref=1.0, amin=1e-7)` and `class DbToAmplitude(ref=1.0)`, because decibel scaling changes the input’s unit, not its core content.
- `class MelSpectrogram`: we have it; it returns an `nn.Sequential` model consisting of a Spectrogram and a mel-scale filterbank.
- `class MuLawEncoding`, `class MuLawExpanding`: we have them; they are actually a 99% copy of the implementation here.
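To make the proposed decibel pair concrete, here is a minimal sketch of what `AmplitudeToDb(ref=1.0, amin=1e-7)` and `DbToAmplitude(ref=1.0)` could look like as layers. The `20 * log10` amplitude convention and the clamping detail are my assumptions, not a settled API:

```python
import torch
import torch.nn as nn

class AmplitudeToDb(nn.Module):
    """Sketch: convert an amplitude spectrogram to decibels.
    Assumes the amplitude (not power) convention, i.e. 20 * log10."""
    def __init__(self, ref=1.0, amin=1e-7):
        super().__init__()
        self.ref = ref
        self.amin = amin  # clamp floor, avoids log(0)

    def forward(self, x):
        x = torch.clamp(x, min=self.amin)
        return 20.0 * torch.log10(x / self.ref)

class DbToAmplitude(nn.Module):
    """Sketch: the inverse mapping, decibels back to amplitude."""
    def __init__(self, ref=1.0):
        super().__init__()
        self.ref = ref

    def forward(self, x):
        return self.ref * torch.pow(10.0, x / 20.0)
```

The point of splitting the pair this way is that each direction is a unit change that composes with any spectrogram, not something baked into `Spectrogram` itself.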
2. Wouldn’t need these

- `class Compose`: we wouldn’t need it, because once things are based on `Layers`, people can simply build an `nn.Sequential()`.
- `class Scale`: it does `int16` --> `float`. I think we need to deprecate this, because if we really need it, it should have a more intuitive and precise name, and it should probably support other conversions as well.
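For illustration, here is how `nn.Sequential` subsumes `Compose` once every transform is an `nn.Module`. The two toy transforms below are placeholders standing in for e.g. a spectrogram and a decibel layer, not actual torchaudio-contrib classes:

```python
import torch
import torch.nn as nn

# Toy transforms standing in for real layers (illustrative only):
class Square(nn.Module):
    def forward(self, x):
        return x ** 2

class Log(nn.Module):
    def forward(self, x):
        return torch.log(x)

# No custom Compose needed: chain layers with the built-in container.
pipeline = nn.Sequential(Square(), Log())
out = pipeline(torch.full((1, 100), 2.0))  # log(2 ** 2) for every sample
```

Because the result is itself an `nn.Module`, it moves to GPU with `.to(device)` and nests inside larger models for free, which `Compose` never gave us.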
3. To-be-added

- `class DownmixMono`: I would like to have one. But we are also considering a time-frequency representation-based downmix (an energy-preserving operation) (@faroit). I’m open to discussion. Personally I’d prefer to have separate classes, `DownmixWaveform()` and `DownmixSpecgram()`. Maybe until we have a better one, we should keep it as it is.
- `class MFCC`: we currently don’t have it. The current torch/audio implementation uses `s2db (SpectrogramToDB)`, but this class seems a little arbitrary to me, so we might want to re-implement it.
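The waveform half of the proposed split could be as simple as a channel mean. A minimal sketch, assuming a `(batch, channel, time)` layout (the layout and the class name here are assumptions, not a decided API):

```python
import torch
import torch.nn as nn

class DownmixWaveform(nn.Module):
    """Sketch: downmix a multi-channel waveform to mono by
    averaging over the channel dimension.

    Assumes input of shape (batch, channel, time)."""
    def forward(self, waveforms):
        return waveforms.mean(dim=1, keepdim=True)

stereo = torch.randn(8, 2, 16000)   # batch of 8 stereo clips, 1 s @ 16 kHz
mono = DownmixWaveform()(stereo)    # shape: (8, 1, 16000)
```

An energy-preserving `DownmixSpecgram()` would instead combine channel magnitudes in the time-frequency domain, which is why keeping the two as separate classes seems cleaner.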
4. Not sure about these

- `class PadTrim`: I don’t actually know why we need it exactly; would love to hear about this!
- `class LC2CL`: so far, torchaudio-contrib code hasn’t considered `channel-first` tensors. If it’s a thing, we’d i) update our code to make them compatible and ii) have the same or a similar class to this. But… do we really need this?
- `class BLC2CBL`: same as `LC2CL`; I’d like to know its use cases.
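For reference, my understanding of what these permutation classes do (if that understanding is right, they reduce to one-liners on top of `transpose`/`permute`, which is part of why I question having dedicated classes):

```python
import torch

# LC2CL: (length, channel) -> (channel, length); a plain transpose.
lc = torch.randn(16000, 2)      # 1 s of stereo audio, channel-last
cl = lc.transpose(0, 1)         # channel-first: (2, 16000)

# BLC2CBL: my guess is (bands/batch, length, channel) -> (channel, bands, length),
# which is a single permute.
blc = torch.randn(8, 16000, 2)
cbl = blc.permute(2, 0, 1)      # (2, 8, 16000)
```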
Review - argument and variable names

As summarised in https://github.com/keunwoochoi/torchaudio-contrib/issues/46, we’d like to use:

- `waveforms` for a batch of waveforms
- `real_specgrams` for magnitude spectrograms
- `complex_specgrams` for complex spectrograms (this is relatively less discussed)
Audio loading
@faroit has been working on replacing Sox with others. But here in this issue, I’d like to focus on the topics above.
So,

- Any opinion on this?
- Any answers to the questions I have?
- If it looks good, what else would you like to have in the one-shot PR that would replace the current `transforms.py`?
I’m not sure what the current plan is. They are definitely useful, but I don’t think they are best hosted as part of torchaudio. The installation (= dependency) issue will always be there, adding maintenance cost and potential risks. I can also hardly imagine their operations benefiting from GPUs. Due to these issues, I would avoid using sox as part of a system that requires some reliability and efficiency. For a quick and hacky use case, one can easily plug it into their preprocessing stage.
Side note - maybe some of the filters (e.g., EQ) could be re-implemented in torch 😃
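To make that side note concrete, here is a deliberately naive biquad sketch in pure torch. The sample-by-sample Python loop is far too slow for production, so this is only a proof that an EQ stage is expressible in torch, not a proposed implementation:

```python
import torch

def biquad(x, b0, b1, b2, a1, a2):
    """Naive direct-form-I biquad over the last dimension of x.

    Implements y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]
                      - a1*y[n-1] - a2*y[n-2].
    """
    y = torch.zeros_like(x)
    # Delay-line state, one value per leading (batch/channel) position:
    x1 = x2 = y1 = y2 = torch.zeros_like(x[..., 0])
    for n in range(x.shape[-1]):
        xn = x[..., n]
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y[..., n] = yn
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y
```

With coefficients from the usual peaking/shelving formulas, chains of these would cover most of sox’s EQ-style effects, and a vectorized or scripted version could run on GPU alongside the rest of the pipeline.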
@vincentqb We had a long discussion about this in https://github.com/keunwoochoi/torchaudio-contrib/issues/31. I think many people agree that fast audio loading is really important, and for many users of torchaudio the load functionality is probably the only place where they touch the sox lib. Therefore it would make a lot of sense if only `load` and `save` were replaced by native torch code (probably interfacing libsndfile, or just reading the wav bits from scripts like it’s being done in tensorflow.io).
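As a sketch of what a sox-free `load` could look like in the simplest case, here is a 16-bit PCM WAV reader built only on the Python stdlib `wave` module and numpy. The function name and return convention (channel-first float32 in [-1, 1], plus the sample rate) are mine, not a proposed API:

```python
import wave
import numpy as np
import torch

def load_wav(path):
    """Minimal WAV loader sketch, 16-bit PCM only.

    Returns (tensor of shape (channels, num_frames), sample_rate),
    with samples scaled to [-1, 1].
    """
    with wave.open(path, "rb") as f:
        assert f.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = f.readframes(f.getnframes())
        data = np.frombuffer(frames, dtype=np.int16)
        # Interleaved samples -> (channels, frames):
        data = data.reshape(-1, f.getnchannels()).T
        return torch.from_numpy(data.copy()).float() / 32768.0, f.getframerate()
```

A real replacement would of course need the other bit depths, compressed formats, and streaming, which is where interfacing libsndfile starts to look more attractive than hand-rolling the parsing.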