[Bug] Restoring from checkpoint failed when synchronously separating to file within a loop
Description
Inside a for loop, I create a new Separator instance and then invoke separate_to_file. After the first iteration, TensorFlow errors out while trying to restore from checkpoint, which I take to mean the pretrained model for the stem configuration I chose (the files under pretrained_models). (Bold styling added by me for emphasis.)
Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
**Key batch_normalization_24/beta not found in checkpoint** [[node save_1/RestoreV2 (defined at […]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
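My best guess at the mechanism (not confirmed) is that each new Separator rebuilds the model into the same TensorFlow default graph, so the auto-generated layer names keep counting up (batch_normalization_24, _48, _100, ...) while the checkpoint only contains the names from a single build; the freshly created Saver then asks the checkpoint for keys that were never written. Below is a minimal sketch of that naming behaviour, assuming TensorFlow 1.x graph mode (the version installed alongside Spleeter here) and using a single batch-norm layer as a stand-in for the U-Net:

```python
# Illustration only (not Spleeter code): building the same layers twice into one
# TF1 default graph shifts the auto-generated variable names, so a Saver created
# after the second build looks for keys that were never saved to the checkpoint.
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # graph mode, as in TF 1.x


def build_model():
    # One batch-norm layer stands in for the many that the Spleeter U-Net creates.
    x = tf1.placeholder(tf.float32, shape=(None, 4))
    return tf1.layers.batch_normalization(x)


build_model()  # first build  -> variables named batch_normalization/beta, gamma, ...
build_model()  # second build -> variables named batch_normalization_1/beta, gamma, ...

for v in tf1.global_variables():
    # The *_1 variables have no counterpart in a checkpoint saved from a single build.
    print(v.name)
```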
Steps to reproduce
- Create virtual environment
- Enter virtual environment
- Run `pip3 install spleeter`
- Save script as spleeter_loop_error.py:
# spleeter_loop_error.py
import argparse
from pathlib import Path
from shutil import rmtree

from spleeter.separator import Separator


def debug_print(message):
    print('\n----')
    print(f'\t{message}')
    print('----\n')


# Create a temporary directory in the directory that the script is saved to
def init_outdir():
    outdir = Path(__file__).parent.absolute() / '__example_temp'
    if outdir.is_dir():
        rmtree(outdir)
    outdir.mkdir()
    return outdir


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--filename', type=str, required=True)
    parser.add_argument('-o', '--outdir', help='where spleeter will save output files', type=str, required=False)
    args = parser.parse_args()

    if args.outdir:
        outdir = Path(args.outdir)  # wrap in Path so `outdir / str(i)` below works
    else:
        outdir = init_outdir()

    # A fresh Separator is built on every iteration; the second iteration fails.
    for i in range(2):
        separator = Separator('spleeter:2stems', stft_backend='librosa')
        separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
        debug_print(f'Successfully split ({i+1})')
- Run `python3 spleeter_loop_error.py -f [path to audio file]`
- The error occurs during the second iteration of the loop (see the workaround sketch just after this list)
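A possible workaround sketch (I have not verified it, and the file and directory names below are placeholders) is to construct the Separator once, outside the loop, and reuse it, so the graph is only ever built once:

```python
# Possible workaround (untested): create the Separator once and reuse it across
# iterations, so the TensorFlow graph is only constructed a single time.
from pathlib import Path

from spleeter.separator import Separator

separator = Separator('spleeter:2stems', stft_backend='librosa')  # built once, outside the loop

audio_file = 'audio.wav'          # placeholder input path
outdir = Path('__example_temp')   # placeholder output directory
for i in range(2):
    separator.separate_to_file(audio_file, outdir / str(i),
                               filename_format='{instrument}.{codec}')
```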
Output
([...] replaces the path from the filesystem root to the virtual environment)
WARNING:tensorflow:From [...]/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From [...]/lib/python3.7/site-packages/spleeter/separator.py:144: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From [...]/lib/python3.7/site-packages/spleeter/separator.py:146: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2020-04-01 16:47:40.900998: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-01 16:47:40.901023: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-01 16:47:40.901042: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (thatman-Z68A-D3H-B3): /proc/driver/nvidia/version does not exist
2020-04-01 16:47:40.925211: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3309580000 Hz
2020-04-01 16:47:40.925541: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x63692e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-01 16:47:40.925577: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:spleeter:File [...]/__example_temp/0/vocals.wav written succesfully
INFO:spleeter:File [...]/__example_temp/0/accompaniment.wav written succesfully
----
Successfully split (1)
----
2020-04-01 16:48:00.224581: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key batch_normalization_24/beta not found in checkpoint
Traceback (most recent call last):
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key batch_normalization_24/beta not found in checkpoint
[[{{node save_1/RestoreV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key batch_normalization_24/beta not found in checkpoint
[[node save_1/RestoreV2 (defined at [...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'save_1/RestoreV2':
File "bare_loopy_spleeter.py", line 38, in <module>
separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
sources = self.separate(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
return self.separate_librosa(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 144, in separate_librosa
saver = tf.train.Saver()
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
restore_sequentially, reshape)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bare_loopy_spleeter.py", line 38, in <module>
separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
sources = self.separate(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
return self.separate_librosa(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 147, in separate_librosa
saver.restore(sess, latest_checkpoint)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key batch_normalization_24/beta not found in checkpoint
[[node save_1/RestoreV2 (defined at [...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'save_1/RestoreV2':
File "bare_loopy_spleeter.py", line 38, in <module>
separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
sources = self.separate(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
return self.separate_librosa(waveform, audio_descriptor)
File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 144, in separate_librosa
saver = tf.train.Saver()
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
restore_sequentially, reshape)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
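The save_1/RestoreV2 node name in the traceback also suggests that a second Saver was built into the same default graph as the first one. If reusing a single Separator is not an option, another mitigation that comes to mind (purely speculative, not tested with Spleeter, placeholder paths again) would be to clear the default graph between iterations so the layer and saver name counters start from zero each time:

```python
# Speculative mitigation (untested with Spleeter): reset the default graph before
# each Separator is built, so the generated variable names match the checkpoint.
import tensorflow as tf
from pathlib import Path

from spleeter.separator import Separator

audio_file = 'audio.wav'          # placeholder input path
outdir = Path('__example_temp')   # placeholder output directory
for i in range(2):
    tf.compat.v1.reset_default_graph()  # drop variables/savers left over from the previous pass
    separator = Separator('spleeter:2stems', stft_backend='librosa')
    separator.separate_to_file(audio_file, outdir / str(i),
                               filename_format='{instrument}.{codec}')
```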
Environment
| | |
| --- | --- |
| OS | Ubuntu 19.10 (eoan) |
| Installation type | pip 20.0.2, Python 3.7.5 |
| RAM available | 13 GB |
| Hardware spec | Intel i5-2500, Radeon HD 6850 |
Additional context
How I discovered this issue:
In my actual use case, I am invoking Separator.separate_to_file from within a Python-based server. It works fine in my development environment. However, when I switched to an asynchronous server, Spleeter errors out in much the same way, only the key that TensorFlow expects is slightly different. (Bold styling again added for emphasis.)
- I start the server.
- Hitting the endpoint the first time works as expected.
- After hitting the same endpoint a second time:

  INFO: Apply unet for vocals_spectrogram
  INFO: Apply unet for drums_spectrogram
  INFO: Apply unet for bass_spectrogram
  INFO: Apply unet for other_spectrogram
  INFO: Restoring parameters from pretrained_models/4stems/model
  2020-04-01 17:13:17.556825: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: **Key batch_normalization_48/beta not found in checkpoint**

- And the third time:

  INFO: Apply unet for vocals_spectrogram
  INFO: Apply unet for drums_spectrogram
  INFO: Apply unet for bass_spectrogram
  INFO: Apply unet for other_spectrogram
  INFO: Restoring parameters from pretrained_models/4stems/model
  2020-04-01 17:15:57.301420: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: **Key batch_normalization_100/beta not found in checkpoint**

- As far as I can tell, every subsequent request expects the same batch_normalization_100/beta key.
It is my hope that if I can resolve the issue described above, I can solve my server-based issue in much the same way. Thank you for making this amazing project available to the public!
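For the server case, one possible structure (a hypothetical sketch of my own, not something from the Spleeter docs; the handler name and arguments are placeholders) is one long-lived Separator per process, guarded by a lock so concurrent requests never trigger a second graph build or interleave calls into the same session:

```python
# Hypothetical server-side sketch: one long-lived Separator per process,
# serialised with a lock so requests never rebuild the TensorFlow graph.
import threading

from spleeter.separator import Separator

_separator = Separator('spleeter:4stems')
_separator_lock = threading.Lock()


def handle_split_request(audio_path, out_dir):
    # Called from each request handler; the argument values come from the request.
    with _separator_lock:
        _separator.separate_to_file(audio_path, out_dir,
                                    filename_format='{instrument}.{codec}')
```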
See PR https://github.com/deezer/spleeter/pull/319
Sure, I created a new issue: #370