Data Augmentation
Hi, I am following this paper to perform data augmentation on MUSDB.
I am applying librosa's time_stretch and pitch_shift to each track of the MUSDB dataset.
I then use stempeg to build a new stem file.
Unfortunately, the Wave-U-Net preprocessing reports statistics that do not look good for re-training the network properly:
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_bass.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_drums.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_other.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_vocals.wav
Maximum absolute deviation from source additivity constraint: 1.015533447265625
Mean absolute deviation from source additivity constraint: 0.09679516069867423
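For context, the check behind these numbers can be sketched like this (a minimal illustration, assuming the mixture and the four sources have been decoded to float arrays of shape (samples, channels); the function name is mine, not the Wave-U-Net code):

import numpy as np

def additivity_deviation(mix, sources):
    # mix: (samples, channels); sources: list of arrays with the same shape.
    # The additivity constraint says the sources should sum to the mixture.
    diff = np.abs(mix - np.sum(sources, axis=0))
    return diff.max(), diff.mean()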
On the musdb website it is also stated that:
Since the mixture is separately encoded as AAC, there is a small difference between the sum of all sources and the mixture. This difference has no impact on the bsseval evaluation performance.
Some of my code:
import os
import numpy as np
import librosa
import stempeg

SR = 44100  # MUSDB sample rate
R = 0.1     # time-stretch rate passed to librosa

def timeStretch(y, rate=1.0):
    # y: (samples, channels); stretch each channel with the same rate
    y_left = y[:, 0]
    y_right = y[:, 1]
    y_stretched_L = librosa.effects.time_stretch(y_left, rate=rate)
    y_stretched_R = librosa.effects.time_stretch(y_right, rate=rate)
    return np.array([y_stretched_L, y_stretched_R])  # (channels, samples)

# open stem and retrieve all sources
stem_path = os.path.join(ORIGINAL_STEMS_DIR, f)
info = stempeg.Info(stem_path)
S, _ = stempeg.read_stems(stem_path, info=info)  # (stems, samples, channels)

stretched_list = []
for audio_to_process in S:  # mixture, drums, bass, other, vocals
    stretched_list.append(timeStretch(audio_to_process, rate=R))

# create and save stem
S = np.array(stretched_list)
S = np.swapaxes(S, 1, 2)  # stems x samples x channels
stempeg.write_stems(S, output_mp4, rate=SR)
Do you have any idea what the problem could be here? Thanks a lot!

There might be some problem with the encoding, seeing that the mean absolute deviation is not so high but the maximum one is. So it might be alright overall, but locally some encoding inconsistencies could produce a high error…
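If you want to verify that the error really is local, a quick way is to locate where the peak deviation occurs (a sketch reusing the same decoded arrays as above; locate_peak_deviation is an illustrative name):

import numpy as np

def locate_peak_deviation(mix, sources, sr=44100):
    # Per-sample deviation, taking the worst channel at each sample.
    diff = np.abs(mix - np.sum(sources, axis=0)).max(axis=-1)
    idx = int(np.argmax(diff))
    return idx / sr, float(diff[idx])  # peak position in seconds, and its size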
Solution 1: Export your audio to wave, and modify the MUSDB data loading code to load the wave files directly, then you know there should be absolutely no deviation between sum of sources and mix as you don’t have any encoding inaccuracy.
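As a sketch, such a decoding step could look like this (using stempeg to read and soundfile to write; the stem order and output naming here are assumptions, not the actual Wave-U-Net preparation code):

import os
import stempeg
import soundfile as sf

STEM_NAMES = ["mix", "drums", "bass", "other", "vocals"]  # assumed MUSDB stem order

def stems_to_wav(stem_path, out_dir):
    S, rate = stempeg.read_stems(stem_path)  # S: (stems, samples, channels)
    base = os.path.splitext(os.path.basename(stem_path))[0]
    for i, name in enumerate(STEM_NAMES):
        sf.write(os.path.join(out_dir, f"{base}_{name}.wav"), S[i], int(rate))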
Solution 2: If you are absolutely sure you are inputting “proper” data into the system, go ahead and ignore the warning and/or use output_type: direct in the Wave-U-Net to allow it to output all sources unconstrained, so it is capable of outputting sources that do NOT add up to the original mix as well. I would definitely listen to the dataset you produced in this case though, to make sure everything is alright.

I am clipping the accompaniment audio just to be sure that I don't generate values outside the [-1, 1] range, since it's a sum of the individual audio signals, so the amplitudes are summed up. It should not be necessary if the dataset is proper, but it doesn't hurt either.
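The clipping itself is a one-liner; a minimal sketch, assuming the accompaniment is built by summing the non-vocal stems as float arrays in [-1, 1]:

import numpy as np

def make_accompaniment(drums, bass, other):
    # Amplitudes add up when summing sources, so the sum can leave the
    # valid range; clip it back to [-1, 1] before writing to disk.
    return np.clip(drums + bass + other, -1.0, 1.0)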
I am not too well versed in the ffmpeg part of the story, but I personally wouldn't trust it to encode things to the degree of accuracy we require. I also had synchronisation issues when loading encoded audio, where the audio was suddenly misaligned in time, which is obviously very bad in our setting.
But yeah, it looks like the ffmpeg encoding (settings) is to blame here. I decode all the stems to wave as part of data preparation anyway, as it's much faster to load the audio during training that way, so you should probably use Solution 1 I proposed and cut out the whole stempeg part completely.
Another solution, if time-stretching is not too CPU-intensive, is to do it on the fly as part of the data augmentation pipeline during training. This saves disk space, but might slow down training since batches take longer to prepare.
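A minimal sketch of what that on-the-fly variant could look like (assuming each training example arrives as a list of (samples, channels) float arrays, one per source; the names and rate range are illustrative):

import random
import numpy as np
import librosa

def augment_example(sources, rates=(0.9, 1.1)):
    # Draw one stretch rate per example so all stems stay time-aligned.
    rate = random.uniform(*rates)
    stretched = []
    for y in sources:
        chans = [librosa.effects.time_stretch(y[:, c], rate=rate)
                 for c in range(y.shape[1])]
        stretched.append(np.stack(chans, axis=1))
    # Re-sum the stretched sources so the additivity constraint holds exactly.
    mix = np.clip(np.sum(stretched, axis=0), -1.0, 1.0)
    return mix, stretched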