tensorflow.python.framework.errors_impl.InvalidArgumentError message while trying to train a model

(musika) C:\Users\ПК>python X:\musika\musika_train.py --train_path X:\musika\encodings --log_path X:\logs --mixed_precision False


Using GPU without mixed precision...

Calculating total number of samples in data folder...
Found 1720 total samples
Dataset is ready!
Checking if models are already available...
Models are available!
X:\cnd\envs\musika\lib\site-packages\keras\initializers\initializers_v2.py:120: UserWarning: The initializer HeUniform is unseeded and being called multiple times, which will return identical values  each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
  warnings.warn(
Encoders/Decoders loaded from checkpoints/ae
Networks initialized
Critic params: 20786689
Generator params: 15499530
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
CLICK ON LINK BELOW TO OPEN GRADIO INTERFACE
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
Traceback (most recent call last):
  File "X:\musika\musika_train.py", line 31, in <module>
    T.train(ds, models_ls)
  File "X:\musika\train.py", line 134, in train
    train_summary_writer = tf.summary.create_file_writer(train_log_dir)
  File "X:\cnd\envs\musika\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py", line 559, in create_file_writer_v2
    return _ResourceSummaryWriter(
  File "X:\cnd\envs\musika\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py", line 311, in __init__
    self._init_op = init_op_fn(self._resource)
  File "X:\cnd\envs\musika\lib\site-packages\tensorflow\python\ops\gen_summary_ops.py", line 145, in create_summary_file_writer
    _ops.raise_from_not_ok_status(e, name)
  File "X:\cnd\envs\musika\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__CreateSummaryFileWriter_device_/job:localhost/replica:0/task:0/device:CPU:0}} Failed to create a NewWriteableFile: X:\logs/MUSIKA_latlen_256_latdepth_64_sr_44100/20221119-234355/train/events.out.tfevents.1668890635.??-??.6756.0.v2 : ?????????????? ?????? ? ????? ?????, ????? ????? ??? ????? ????.
; no protocol option
        Creating writable file X:\logs/MUSIKA_latlen_256_latdepth_64_sr_44100/20221119-234355/train/events.out.tfevents.1668890635.??-??.6756.0.v2
        Could not initialize events writer. [Op:CreateSummaryFileWriter]

Are there any options to solve this issue?
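
One thing worth checking, as a guess rather than a confirmed diagnosis: TensorBoard event file names include the machine's hostname, and in the path above that part is printed as ??-??, which suggests the hostname contains non-ASCII characters. The sketch below only reports which values contain non-ASCII characters; the helper name non_ascii_parts is hypothetical, and the log path is the one from the command above.

```python
import socket


def non_ascii_parts(**named_strings):
    """Return only the named values that contain non-ASCII characters."""
    return {
        name: value
        for name, value in named_strings.items()
        if not value.isascii()
    }


# Values that end up in the event file path: the log directory and the hostname.
suspects = non_ascii_parts(
    hostname=socket.gethostname(),
    log_path=r"X:\logs",  # path from the command above; adjust as needed
)
print(suspects or "hostname and log path are ASCII-only")
```

If the hostname shows up in the output, renaming the machine to an ASCII-only name might be worth trying, though nothing in this thread confirms that this resolves the error.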

Issue Analytics

Created 10 months ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

marcoppasini commented, Nov 19, 2022

I will have to investigate this. If anyone else reading this thread is experiencing the same problem, please report it!

In the meantime, as a temporary solution, you can manually comment out the two uses of train_summary_writer in train.py:

train_summary_writer = tf.summary.create_file_writer(train_log_dir)

and

with train_summary_writer.as_default():
    tf.summary.scalar("disc_loss_r", dloss_tr, step=m)
    tf.summary.scalar("disc_loss_f", dloss_tf, step=m)
    tf.summary.scalar("gen_loss", gloss_t, step=m)
    tf.summary.scalar("gradient_penalty", dloss_id, step=m)
    tf.summary.scalar("gp_weight", -switch.value() * self.args.gp_max_weight, step=m)
    tf.summary.scalar("lr", self.args.lr, step=m)

You will not be able to use TensorBoard to track losses, but if training collapses you will still notice it (NaN values) from the loss values in the tqdm bar.
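
A less invasive variant of this workaround, sketched under assumptions: instead of commenting code out, wrap the writer creation so that a failure falls back to a writer that does nothing. The names NullSummaryWriter and safe_file_writer are hypothetical and not part of musika or TensorFlow. The factory is passed in as an argument so the sketch runs without TensorFlow installed.

```python
import contextlib


class NullSummaryWriter:
    """Hypothetical stand-in writer: as_default() enters an empty context."""

    def as_default(self):
        return contextlib.nullcontext()


def safe_file_writer(create_writer, log_dir):
    """Return create_writer(log_dir), or a do-nothing writer if creation fails."""
    try:
        return create_writer(log_dir)
    except Exception as err:  # the traceback above shows InvalidArgumentError here
        print(f"TensorBoard logging disabled: {err}")
        return NullSummaryWriter()


# Assumed usage in train.py (not tested against musika itself):
# train_summary_writer = safe_file_writer(tf.summary.create_file_writer, train_log_dir)
```

With the fallback in place, the `with train_summary_writer.as_default():` block can stay as it is. As far as I know, tf.summary.scalar writes nothing when no default writer is set, so training should continue without TensorBoard logs. TensorFlow also provides tf.summary.create_noop_writer(), which could replace the stand-in class.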

BasicAutism commented, Nov 20, 2022

This worked for me, but there was a problem with XLA (with --xla False, musika_train.py works fine). First there was the common error InternalError: libdevice not found at ./libdevice.10.bc, which was solved by adding a new system variable XLA_FLAGS with the value --xla_gpu_cuda_data_dir=X:\cnd\envs\musika. However, after that there was a problem with ptxas:

(musika) C:\Users\ПК>python X:\musika\musika_train.py --train_path X:\musika\encodings --mixed_precision False --load_path X:\musika\saveexp\MUSIKA_latlen_256_latdepth_64_sr_44100_time_20221120-162659\MUSIKA_iterations-9k_losses-0.8499346-0.5206635-0.5622146 --save_path X:\musika\saveexp


Using GPU without mixed precision...

Calculating total number of samples in data folder...
Found 1720 total samples
Dataset is ready!
Checking if models are already available...
Models are available!
X:\cnd\envs\musika\lib\site-packages\keras\initializers\initializers_v2.py:120: UserWarning: The initializer HeUniform is unseeded and being called multiple times, which will return identical values  each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
  warnings.warn(
Networks loaded from X:\musika\saveexp\MUSIKA_latlen_256_latdepth_64_sr_44100_time_20221120-162659\MUSIKA_iterations-9k_losses-0.8499346-0.5206635-0.5622146
Critic params: 20786689
Generator params: 15499530
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
CLICK ON LINK BELOW TO OPEN GRADIO INTERFACE
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
CLICK ON LINK BELOW TO OPEN TENSORBOARD INTERFACE
http://localhost:6006/
--------------------------------
--------------------------------
--------------------------------
--------------------------------
--------------------------------
WARNING:tensorflow:From X:\cnd\envs\musika\lib\site-packages\tensorflow\python\training\moving_averages.py:553: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Preparing for Training (this can take one or two minutes)...
Epoch 0/250:   0%|                                                                            | 0/9375 [00:00<?, ?it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.10.0 at http://localhost:6006/ (Press CTRL+C to quit)
2022-11-20 18:56:35.349899: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INVALID_ARGUMENT: Failed to create a NewWriteableFile: C:\Users\90C5~1\AppData\Local\Temp\/tempfile-??-??-428-7444-5ede8fbbf0657 : ?????????????? ?????? ? ????? ?????, ????? ????? ??? ????? ????.
; no protocol option'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

I can say for sure that there is enough disk space. I also installed cuda-nvcc in the conda environment by running conda install -c nvidia cuda-nvcc. Could this be an issue with the version of tensorflow, cudnn, cudatoolkit, or cuda-nvcc, or a path issue?
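
The XLA_FLAGS setting described above can also be applied per process instead of system-wide. This is a minimal sketch, assuming the variable is set before TensorFlow is imported. The helper name add_xla_flag is hypothetical, and the path is the one from this comment. It covers only the libdevice part and does not address the ptxas error.

```python
import os


def add_xla_flag(flag, env=os.environ):
    """Append flag to XLA_FLAGS in env unless it is already present.

    Flags are split on whitespace, so this sketch does not handle
    paths that contain spaces.
    """
    current = env.get("XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    env["XLA_FLAGS"] = " ".join(current)
    return env["XLA_FLAGS"]


# Path taken from the comment above; adjust to your own environment.
add_xla_flag(r"--xla_gpu_cuda_data_dir=X:\cnd\envs\musika")
# import tensorflow as tf  # import only after the variable is set
```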