Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And if you're still stuck at the end, we're happy to hop on a call to see how we can help out.

Does Ignite's checkpoint saving take up video memory?

See original GitHub issue

🐛 Bug description

Environment

  • PyTorch Version (e.g., 1.4): 1.7
  • Ignite Version (e.g., 0.3.0): 0.4.1
  • OS (e.g., Linux): Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7
  • Any other relevant information:
for i in range(n):
    # handler registration happens inside the loop, i.e. once per iteration:
    common.setup_common_training_handlers()
    common.gen_save_best_models_by_val_score()
    trainer.run()

During training, GPU (video) memory usage gradually increases and eventually leads to a CUDA out-of-memory error. Is this a problem caused by saving checkpoints, or by something else?
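One way to narrow this down (an editorial sketch, not from the original thread) is to log allocated CUDA memory at fixed points in the loop; log_gpu_mem is a hypothetical helper built on two standard PyTorch calls, and n/trainer are the names from the snippet above:

import torch

def log_gpu_mem(tag):
    # memory_allocated() counts live tensors; memory_reserved() is what the
    # caching allocator holds. Steady growth of the allocated number across
    # iterations points at references that are never released.
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB")

for i in range(n):
    log_gpu_mem(f"before run {i}")
    trainer.run()
    log_gpu_mem(f"after run {i}")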

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 16

Top GitHub Comments

2 reactions
Chucy2020 commented, Nov 18, 2022

In this case you do not need to pass train_sampler to setup_common_training_handlers, as it is only needed for DDP.

Okay, I get it, thank you very much. Thank you for being so professional, responsible, and patient!
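For reference, a single-process (non-DDP) call simply omits train_sampler. A minimal sketch, assuming trainer, model, and optimizer already exist and using a placeholder output directory:

from ignite.contrib.engines import common

# train_sampler is omitted: it is only used to call set_epoch() on a
# DistributedSampler when training with DDP.
common.setup_common_training_handlers(
    trainer,
    to_save={"trainer": trainer, "model": model, "optimizer": optimizer},
    output_path="checkpoints",
    save_every_iters=1000,
)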

1 reaction
Chucy2020 commented, Nov 17, 2022

If you could share runnable code, it would be simpler to help with the issue.

Where do you create the model and move it to cuda?

I really want to provide it to you and solve this problem as soon as possible, but it is difficult to transfer the code from the intranet.

The way I am using it is as follows:

model = AutoModel.from_pretrained(model_path)
model = model.to(device)

optimizer = ...
### one line = one training sample
lines = [json.loads(line) for line in open(data_path, "r")]


for iter in range(n):
    #### train on a different chunk of the data in each pass
    for i in range(0, len(lines), data_onetime_train_lines):
        val_result = eval(model, val_dataloader)  # -->
        train_data_i = lines[i:i + data_onetime_train_lines]
        train_dataloader = DataLoader(train_data_i)
        trainer = Engine(update)
        to_save = {"trainer": trainer, "model": model, "optimizer": optimizer}
        common.setup_common_training_handlers(to_save=to_save, ...)  # <---- here you add the same handlers
                                                                     #       multiple times in the loop

        ## Load the checkpoint trained on the previous data_train_i
        resume_from(checkpoint_train_data_(i - 1))   # if i != 0:
        # code that loads the checkpoint trained on the previous data_train_i:
        """
        if checkpoint_fp.is_dir():
            checkpoint_fp = max(
                filter(lambda x: x.name.startswith("training_checkpoint_"), checkpoint_fp.iterdir()),
                key=lambda x: int(x.stem.split('_')[-1]),
            )
        checkpoint = torch.load(checkpoint_fp.as_posix(), map_location="cpu")
        """

        common.gen_save_best_models_by_val_score(...)   # <---- same here
        trainer.run(train_dataloader, max_epochs=epochs)


# Finally, the checkpoint from the last loop iteration is evaluated
result = eval(checkpoint_path_last_iter)
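The inline annotations point at the likely cause: a new Engine is created and the same handlers are registered for every data chunk, so checkpointed objects accumulate across iterations. A sketch of the restructuring the comments imply (keeping the snippet's placeholder names and ellipses) registers everything once and reuses the engine:

trainer = Engine(update)
to_save = {"trainer": trainer, "model": model, "optimizer": optimizer}
# Register the handlers once, outside the data loop
# (remaining arguments as in the snippet above):
common.setup_common_training_handlers(trainer, to_save=to_save)
common.gen_save_best_models_by_val_score(...)

for i in range(0, len(lines), data_onetime_train_lines):
    train_dataloader = DataLoader(lines[i:i + data_onetime_train_lines])
    # Handlers are already attached; only the data changes between runs.
    trainer.run(train_dataloader, max_epochs=epochs)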
Read more comments on GitHub >

Top Results From Across the Web

Ignite Persistent Store - under the hood - Apache Ignite
We can define checkpointing as a process of storing dirty pages from RAM on a disk, with results of consistent memory state is...
Read more >
Checkpoint — PyTorch-Ignite v0.4.10 Documentation
This class can use specific save handlers to store on the disk or a cloud storage, etc. The Checkpoint handler (if used with... — see the sketch after this list.
Read more >
How to resume learning? · Issue #2569 · pytorch/ignite - GitHub
Hi, support teams. This is my first time asking a question. I believe the following code will load the checkpoints. ... If the...
Read more >
Distributed Training with Ignite on CIFAR10
This tutorial is a brief introduction on how you can do distributed training with Ignite on one or more CPUs, GPUs or TPUs....
Read more >
PyTorch Lightning vs Ignite: What Are the Differences?
Lightning is a high-level python framework built on top of Pytorch. ... Saving the model as a PyTorch checkpoint; Converting the model to ......
Read more >
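
Tying the Checkpoint documentation entry above to this issue, here is a minimal usage sketch; model, optimizer, and trainer are assumed to exist, and the paths are placeholders:

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver

to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}
# DiskSaver writes state dicts to disk; Checkpoint keeps at most n_saved files.
handler = Checkpoint(to_save, DiskSaver("checkpoints", create_dir=True), n_saved=2)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

# Resuming later (cf. the "How to resume learning?" result above):
# checkpoint = torch.load("checkpoints/checkpoint_1000.pt", map_location="cpu")
# Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)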
