
Error when doing deepcopy of the model

See original GitHub issue

Hi, thanks for this awesome project!

I built my transformer model on top of the MoeMlp layer, and I use EMA (an exponential moving average of the weights) for better performance. However, when I try to initialize the EMA model with ema_model = copy.deepcopy(my_transformer_model), I get this error:

File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroupNCCL' object

Could you help me with that? How can I use ema with tutel? Thanks!
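For context, deepcopy fails here because the module tree holds a reference to a ProcessGroupNCCL communicator, which supports neither pickling nor copying, so deepcopy's fallback to the pickle protocol (the rv = reductor(4) line in the traceback) raises. A common general-purpose workaround, sketched below with stdlib stand-ins (the class and attribute names are made up for illustration; a threading.Lock plays the role of ProcessGroupNCCL), is to define __deepcopy__ so the unpicklable handle is shared rather than copied:

```python
import copy
import threading

class ToyMoE:
    """Toy stand-in for a module that keeps an unpicklable communicator
    handle (here a threading.Lock plays the role of ProcessGroupNCCL)."""
    def __init__(self):
        self.weights = [1.0, 2.0]
        self.group = threading.Lock()  # unpicklable, like ProcessGroupNCCL

    def __deepcopy__(self, memo):
        # Copy every attribute except the communicator, which is shared.
        clone = self.__class__.__new__(self.__class__)
        memo[id(self)] = clone
        for name, value in self.__dict__.items():
            if name == "group":
                setattr(clone, name, value)  # share the handle, don't copy
            else:
                setattr(clone, name, copy.deepcopy(value, memo))
        return clone

m = ToyMoE()
ema = copy.deepcopy(m)           # works: the lock is shared, not pickled
assert ema.group is m.group
assert ema.weights is not m.weights

# Without __deepcopy__, the same structure fails just like the traceback:
try:
    copy.deepcopy({"group": threading.Lock()})
except TypeError as e:
    print(e)  # cannot pickle '_thread.lock' object
```

In practice, many EMA implementations sidestep deepcopy entirely: construct a second model instance the same way as the first and copy the trained weights into it via state_dict / load_state_dict, which never touches the process-group handle.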

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
ghostplant commented, Aug 9, 2022

> Thanks for your quick update for this feature! I notice you use mpiexec to launch the job and save the ckpt. If I use torch.distributed.launch to train my moe, is it still valid to use the tutel/checkpoint/gather.py to combine my checkpoints?

Yes, both are compatible, as mpiexec is just an alternative way to launch cross-node processes instead of torch.distributed.launch.
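For reference, the two launch styles the answer treats as interchangeable look roughly like this (the script name and process count are placeholders, not taken from the issue):

```shell
# Option A: torch's launcher (single node, 8 processes / GPUs)
python -m torch.distributed.launch --nproc_per_node=8 train_moe.py

# Option B: mpiexec starting the same 8 processes
mpiexec -n 8 python train_moe.py

# Either way, per-rank checkpoints can afterwards be merged with
# tutel/checkpoint/gather.py as discussed above.
```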

0 reactions
yzxing87 commented, Aug 9, 2022

Thanks for your quick update for this feature! I notice you use mpiexec to launch the job and save the ckpt. If I use torch.distributed.launch to train my moe, is it still valid to use the tutel/checkpoint/gather.py to combine my checkpoints?

Read more comments on GitHub >

Top Results From Across the Web

Python: copy.deepcopy produces an error - Stack Overflow
This occurs in many instances when one accidentally tries to clone the iterator to a class. For instance, in PIL ...
Read more >
[solved]Error with copy.deepcopy() - PyTorch Forums
I want to return model after train, so I copy it first: best_model = copy.deepcopy(model) but I got error: RuntimeError: Only Variables ......
Read more >
Deepcopy error when trying to airflow clear - Google Groups
When I try to clear tasks either through the UI or through the command line with `airflow clear`, I get an error with...
Read more >
copy — Shallow and deep copy operations — Python 3.11.1 ...
Raised for module specific errors. The difference between shallow and deep copying is only relevant for compound objects (objects that contain other objects, ......
Read more >
11. Shallow and Deep Copy | Python Tutorial
A solution to the described problem is provided by the module copy . This module provides the method "deepcopy", which allows a complete...
Read more >
