
[Bug] Restoring from checkpoint failed when synchronously separating to file within a loop


Description

Inside a for loop, I create a new Separator instance and then invoke separate_to_file. After the first iteration, TensorFlow errors out when trying to restore from the checkpoint, which I take to refer to the contents of pretrained_models for the stem type I chose:

Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key batch_normalization_24/beta not found in checkpoint [[node save_1/RestoreV2 (defined at […]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
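
A plausible reading of this error (an assumption, not something stated in the issue): each new Separator rebuilds the model's Keras layers in the same process, so the layers receive uniquified names such as batch_normalization_24 that no longer match the keys stored in the pretrained checkpoint. The standalone snippet below only illustrates that name-uniquification behaviour and does not involve spleeter at all:

# Illustration only: Keras appends a numeric suffix when the same layer type
# is created again in the same process/graph, so a second model build produces
# variable names that the original checkpoint has never seen.
import tensorflow as tf

first = tf.keras.layers.BatchNormalization()
second = tf.keras.layers.BatchNormalization()

print(first.name)   # batch_normalization
print(second.name)  # batch_normalization_1 -- a key that a checkpoint written
                    # against the first build would not contain

If that is what happens here, the Saver created on the second pass asks the checkpoint for keys like batch_normalization_24/beta, and the restore fails exactly as shown above.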

Steps to reproduce

  1. Create virtual environment
  2. Enter virtual environment
  3. Run pip3 install spleeter
  4. Save script as spleeter_loop_error.py:
# spleeter_loop_error.py
import argparse
from pathlib import Path
from shutil import rmtree
from spleeter.separator import Separator

def debug_print(message):
  print('\n----')
  print(f'\t{message}')
  print('----\n')


# Create temporary directory in the directory that the script is saved to
def init_outdir():
  outdir = Path(__file__).parent.absolute() / '__example_temp'
  if outdir.is_dir():
    rmtree(outdir)

  outdir.mkdir()
  return outdir


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('-f', '--filename', type=str, required=True)
  parser.add_argument('-o', '--outdir', help='where spleeter will save output files', type=str, required=False)
  args = parser.parse_args()

  # Keep outdir a Path so the `/` joins in the loop below work in both branches
  if args.outdir:
    outdir = Path(args.outdir)
  else:
    outdir = init_outdir()

  for i in range(2):
    # A brand-new Separator is built on every iteration; the restore error
    # appears on the second pass.
    separator = Separator('spleeter:2stems', stft_backend='librosa')
    separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
    debug_print(f'Successfully split ({i+1})')
  5. Run python3 spleeter_loop_error.py -f [path to audio file]
  6. The error occurs during the second loop iteration (a possible workaround is sketched below)
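
One workaround that may sidestep the rebuild entirely (a sketch, not a fix confirmed in this issue) is to hoist the Separator out of the loop and reuse the single instance for every file, so the model graph is built and the checkpoint restored only once. The input path and output directory below are placeholders standing in for the script's arguments:

from pathlib import Path
from spleeter.separator import Separator

# One shared Separator instead of one per iteration (hypothetical workaround).
separator = Separator('spleeter:2stems', stft_backend='librosa')
outdir = Path('__example_temp')  # placeholder for the script's outdir

for i in range(2):
    # Reusing the instance means no second graph build and no second restore.
    separator.separate_to_file('input.mp3', outdir / str(i),
                               filename_format='{instrument}.{codec}')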

Output

([…] replaces path from root to virtual environment)

WARNING:tensorflow:From [...]/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From [...]/lib/python3.7/site-packages/spleeter/separator.py:144: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From [...]/lib/python3.7/site-packages/spleeter/separator.py:146: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-04-01 16:47:40.900998: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-01 16:47:40.901023: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-01 16:47:40.901042: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (thatman-Z68A-D3H-B3): /proc/driver/nvidia/version does not exist
2020-04-01 16:47:40.925211: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3309580000 Hz
2020-04-01 16:47:40.925541: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x63692e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-01 16:47:40.925577: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:spleeter:File [...]/__example_temp/0/vocals.wav written succesfully
INFO:spleeter:File [...]/__example_temp/0/accompaniment.wav written succesfully

----
	Successfully split (1)
----

2020-04-01 16:48:00.224581: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key batch_normalization_24/beta not found in checkpoint
Traceback (most recent call last):
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key batch_normalization_24/beta not found in checkpoint
	 [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1290, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key batch_normalization_24/beta not found in checkpoint
	 [[node save_1/RestoreV2 (defined at [...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save_1/RestoreV2':
  File "bare_loopy_spleeter.py", line 38, in <module>
    separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
    return self.separate_librosa(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 144, in separate_librosa
    saver = tf.train.Saver()
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1300, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bare_loopy_spleeter.py", line 38, in <module>
    separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
    return self.separate_librosa(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 147, in separate_librosa
    saver.restore(sess, latest_checkpoint)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key batch_normalization_24/beta not found in checkpoint
	 [[node save_1/RestoreV2 (defined at [...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save_1/RestoreV2':
  File "bare_loopy_spleeter.py", line 38, in <module>
    separator.separate_to_file(args.filename, outdir / str(i), filename_format='{instrument}.{codec}')
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 191, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 157, in separate
    return self.separate_librosa(waveform, audio_descriptor)
  File "[...]/lib/python3.7/site-packages/spleeter/separator.py", line 144, in separate_librosa
    saver = tf.train.Saver()
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "[...]/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Environment

  • OS: Ubuntu 19.10 (eoan)
  • Installation type: pip 20.0.2, Python 3.7.5
  • RAM available: 13 GB
  • Hardware spec: Intel i5-2500, Radeon HD 6850

Additional context

How I discovered this issue:

In my actual use case, I invoke Separator.separate_to_file inside a Python-based server. It works fine in my development environment. However, when I switched to an asynchronous server, spleeter errors out in much the same way; only the key that TensorFlow expects is slightly different.

  1. I start the server.
  2. Hitting the end point the first time works as expected.
  3. After hitting the same endpoint the second time,

INFO: Apply unet for vocals_spectrogram
INFO: Apply unet for drums_spectrogram
INFO: Apply unet for bass_spectrogram
INFO: Apply unet for other_spectrogram
INFO: Restoring parameters from pretrained_models/4stems/model
2020-04-01 17:13:17.556825: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key batch_normalization_48/beta not found in checkpoint

  4. And the third time,

INFO: Apply unet for vocals_spectrogram
INFO: Apply unet for drums_spectrogram
INFO: Apply unet for bass_spectrogram
INFO: Apply unet for other_spectrogram
INFO: Restoring parameters from pretrained_models/4stems/model
2020-04-01 17:15:57.301420: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key batch_normalization_100/beta not found in checkpoint

  5. As far as I can tell, every subsequent request expects the same batch_normalization_100/beta key.

It is my hope that if I can resolve the issue described above, I can solve my server-based issue in much the same way. Thank you for making this amazing project available to the public!
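
For the server case, the same single-instance idea may carry over (a sketch only, with a hypothetical handler name and no particular web framework assumed): build the Separator once at worker startup and let every request reuse it instead of constructing a new one per request. The lock is an extra precaution for concurrent handlers, since the sketch assumes one separation runs at a time:

from threading import Lock
from spleeter.separator import Separator

# Built once when the worker process starts; 4stems matches the model
# referenced in the server logs above.
SEPARATOR = Separator('spleeter:4stems')
SEPARATOR_LOCK = Lock()

def handle_split_request(audio_path: str, output_dir: str) -> None:
    # Hypothetical per-request handler: reuses the shared instance rather
    # than rebuilding the model graph on every request.
    with SEPARATOR_LOCK:
        SEPARATOR.separate_to_file(audio_path, output_dir,
                                   filename_format='{instrument}.{codec}')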

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 15

Top GitHub Comments

2 reactions
alreadytaikeune commented, Apr 12, 2020

0 reactions
JeffreyCA commented, May 10, 2020

Hi @JeffreyCA

I suspect this is a different issue than the original one, which should be fixed with #319. Can you please create a new ticket so we can keep track of it?

Sure, I created a new issue: #370


