Model loaded from checkpoint has bad accuracy

What is your question?

I have a model that I train with EarlyStopping and ModelCheckpoint on a custom metric (MAP). Training works fine: after 2 epochs the model reaches 96% MAP. However, when I load it from the checkpoint and test it with the exact same function, the MAP is 16% (the same as an untrained model). I must be doing something wrong, but what?

Code

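import torch
import pytorch_lightning as pl

# TAPKG, Graph3D and MAP are the author's own classes/functions; their imports are not shown.


def default_model(dataset: str):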
	if torch.cuda.is_available():
		print("Using the GPU")
		device = torch.device("cuda")
	else:
		print("Using the CPU")
		device = torch.device("cpu")
	kwargs = {
		"dataset": dataset, "embed_size": 50, "depth": 3,
		"vmap": Graph3D.from_dataset(dataset).vocabulary,
		"neg_per_pos": 5, "max_paths": 255, "device": device
	}
	try:
		model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs).to(device)
		return model
	except OSError as e:
		print(f"Couldn't load the save for the model, training instead. ({e.__class__.__name__})")
		model = TAPKG(**kwargs).to(device)
	cpt = pl.callbacks.ModelCheckpoint(monitor="MAP", mode="max", dirpath="Checkpoints", save_top_k=1)
	trainer = pl.Trainer(
		gpus=1,
		check_val_every_n_epoch=1,
		callbacks=[
			cpt,
			pl.callbacks.EarlyStopping(monitor="MAP", mode="max", min_delta=.002, patience=2)
		],
		auto_lr_find=True
	)
	# noinspection PyTypeChecker
	trainer.fit(model)
	print(cpt.best_model_path, cpt.best_model_score)
	return model

def eval_link_completion(dataset):
	model = default_model(dataset)
	ranks = model.link_completion_rank()
	MAP(ranks, plot=True)

Right after training, eval_link_completion shows a MAP of 96%. When I load the model from the checkpoint, however, it is back to 16%.
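The checkpoint filename in default_model is hard-coded, whereas ModelCheckpoint names its files after the epoch and step at which they were saved, so the name can differ from one run to the next. A minimal sketch of one way to avoid the hard-coded name, assuming the checkpoints live in the Checkpoints directory used above; latest_checkpoint is a hypothetical helper written for illustration, not part of PyTorch Lightning:

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(dirpath: str = "Checkpoints") -> Optional[Path]:
    """Return the most recently modified .ckpt file in dirpath, or None if there is none."""
    candidates = Path(dirpath).glob("*.ckpt")
    # Pick the file with the newest modification time; default=None covers an empty directory.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path could then be passed to TAPKG.load_from_checkpoint in place of the hard-coded filename. Right after training, cpt.best_model_path from the code above is the more direct source.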

OS: Kubuntu 20.04
Packaging: pip
Version: 1.2.0
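When a loaded model scores like an untrained one, a quick check is whether its parameters actually match the tensors stored in the checkpoint. The following is only a sketch of such a check, not the fix for this issue (the cause was never identified in the thread); compare_to_checkpoint is a hypothetical helper, and it assumes plain PyTorch modules:

```python
import torch
from torch import nn


def compare_to_checkpoint(model: nn.Module, state_dict: dict) -> dict:
    """Compare a model's current parameters with a saved state dict."""
    current = model.state_dict()
    shared = set(current) & set(state_dict)

    def differs(key: str) -> bool:
        a, b = current[key].cpu(), state_dict[key].cpu()
        # Treat a shape or dtype mismatch as a difference before comparing values.
        return a.shape != b.shape or a.dtype != b.dtype or not torch.equal(a, b)

    return {
        # Parameters the model has but the checkpoint lacks.
        "missing": sorted(set(current) - set(state_dict)),
        # Entries in the checkpoint that the model does not use.
        "unexpected": sorted(set(state_dict) - set(current)),
        # Parameters present in both whose values are not identical.
        "differing": sorted(k for k in shared if differs(k)),
    }
```

A PyTorch Lightning checkpoint file stores the weights under the "state_dict" key, so the second argument could be obtained with torch.load(path, map_location="cpu")["state_dict"]. If all three lists are empty for the loaded model, the weights were restored and the discrepancy lies elsewhere, for example in the evaluation code or its inputs.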

Top GitHub Comments

Inspirateur commented, Feb 24, 2021

Yep, I’m sorry: my loading/saving code was fine, I just had another issue somewhere. Thanks for your time.

Inspirateur commented, Nov 13, 2022

I’m afraid I can’t help you. It’s been more than a year, and I can no longer remember what the problem was.

