Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And if you're still stuck at the end, we're happy to hop on a call to see how we can help out.

Does Ignite's checkpoint saving take up video memory?

See original GitHub issue

🐛 Bug description

Environment

  • PyTorch Version (e.g., 1.4): 1.7
  • Ignite Version (e.g., 0.3.0): 0.4.1
  • OS (e.g., Linux): Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7
  • Any other relevant information:
for i in range(n):
    # handler registration happens inside the loop, i.e. once per iteration:
    common.setup_common_training_handlers()
    common.gen_save_best_models_by_val_score()
    trainer.run()

During training, GPU (video) memory usage gradually increases and eventually leads to a CUDA out-of-memory error. Is this a problem caused by saving checkpoints, or by something else?
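One way to narrow this down (an editorial sketch, not from the original thread) is to log allocated CUDA memory at fixed points in the loop; log_gpu_mem is a hypothetical helper built on two standard PyTorch calls, and n/trainer are the names from the snippet above:

import torch

def log_gpu_mem(tag):
    # memory_allocated() counts live tensors; memory_reserved() is what the
    # caching allocator holds. Steady growth of the allocated number across
    # iterations points at references that are never released.
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB")

for i in range(n):
    log_gpu_mem(f"before run {i}")
    trainer.run()
    log_gpu_mem(f"after run {i}")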

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 16

Top GitHub Comments

2 reactions
Chucy2020 commented, Nov 18, 2022

In this case you do not need to pass train_sampler to setup_common_training_handlers, as it is only needed for DDP.

Okay, I get it, thank you very much. Thank you for being so professional, responsible, and patient!
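For reference, a single-process (non-DDP) call simply omits train_sampler. A minimal sketch, assuming trainer, model, and optimizer already exist and using a placeholder output directory:

from ignite.contrib.engines import common

# train_sampler is omitted: it is only used to call set_epoch() on a
# DistributedSampler when training with DDP.
common.setup_common_training_handlers(
    trainer,
    to_save={"trainer": trainer, "model": model, "optimizer": optimizer},
    output_path="checkpoints",
    save_every_iters=1000,
)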

1 reaction
Chucy2020 commented, Nov 17, 2022

If you could share runnable code, it would be simpler to help with the issue.

Where do you create the model and move it to cuda?

I really want to provide it to you and solve this problem as soon as possible, but it is difficult to transfer the code from the intranet.

The way I am using it is as follows:

model = AutoModel.from_pretrained(model_path)
model = model.to(device)

optimizer = ...
### one line = one training sample
lines = [json.loads(line) for line in open(data_path, "r")]


for iter in range(n):
    #### train on a different chunk of the data in each pass
    for i in range(0, len(lines), data_onetime_train_lines):
        val_result = eval(model, val_dataloader)  # -->
        train_data_i = lines[i:i + data_onetime_train_lines]
        train_dataloader = DataLoader(train_data_i)
        trainer = Engine(update)
        to_save = {"trainer": trainer, "model": model, "optimizer": optimizer}
        common.setup_common_training_handlers(to_save=to_save, ...)  # <---- here you add the same handlers
                                                                     #       multiple times in the loop

        ## Load the checkpoint trained on the previous data_train_i
        resume_from(checkpoint_train_data_(i - 1))   # if i != 0:
        # code that loads the checkpoint trained on the previous data_train_i:
        """
        if checkpoint_fp.is_dir():
            checkpoint_fp = max(
                filter(lambda x: x.name.startswith("training_checkpoint_"), checkpoint_fp.iterdir()),
                key=lambda x: int(x.stem.split('_')[-1]),
            )
        checkpoint = torch.load(checkpoint_fp.as_posix(), map_location="cpu")
        """

        common.gen_save_best_models_by_val_score(...)   # <---- same here
        trainer.run(train_dataloader, max_epochs=epochs)


# Finally, the checkpoint from the last loop iteration is evaluated
result = eval(checkpoint_path_last_iter)
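The inline annotations point at the likely cause: a new Engine is created and the same handlers are registered for every data chunk, so checkpointed objects accumulate across iterations. A sketch of the restructuring the comments imply (keeping the snippet's placeholder names and ellipses) registers everything once and reuses the engine:

trainer = Engine(update)
to_save = {"trainer": trainer, "model": model, "optimizer": optimizer}
# Register the handlers once, outside the data loop
# (remaining arguments as in the snippet above):
common.setup_common_training_handlers(trainer, to_save=to_save)
common.gen_save_best_models_by_val_score(...)

for i in range(0, len(lines), data_onetime_train_lines):
    train_dataloader = DataLoader(lines[i:i + data_onetime_train_lines])
    # Handlers are already attached; only the data changes between runs.
    trainer.run(train_dataloader, max_epochs=epochs)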
Read more comments on GitHub >

Top Results From Across the Web

Ignite Persistent Store - under the hood - Apache Ignite
We can define checkpointing as a process of storing dirty pages from RAM on a disk, with results of consistent memory state is...
Read more >
Checkpoint — PyTorch-Ignite v0.4.10 Documentation
This class can use specific save handlers to store on the disk or a cloud storage, etc. The Checkpoint handler (if used with... — see the sketch after this list.
Read more >
How to resume learning? · Issue #2569 · pytorch/ignite - GitHub
Hi, support teams. This is my first time asking a question. I believe the following code will load the checkpoints. ... If the...
Read more >
Distributed Training with Ignite on CIFAR10
This tutorial is a brief introduction on how you can do distributed training with Ignite on one or more CPUs, GPUs or TPUs....
Read more >
PyTorch Lightning vs Ignite: What Are the Differences?
Lightning is a high-level python framework built on top of Pytorch. ... Saving the model as a PyTorch checkpoint; Converting the model to ......
Read more >
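
Tying the Checkpoint documentation entry above to this issue, here is a minimal usage sketch; model, optimizer, and trainer are assumed to exist, and the paths are placeholders:

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver

to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}
# DiskSaver writes state dicts to disk; Checkpoint keeps at most n_saved files.
handler = Checkpoint(to_save, DiskSaver("checkpoints", create_dir=True), n_saved=2)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

# Resuming later (cf. the "How to resume learning?" result above):
# checkpoint = torch.load("checkpoints/checkpoint_1000.pt", map_location="cpu")
# Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)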
