Error when doing deepcopy of the model
Hi, thanks for this awesome project!
I built my transformer model on top of the MoeMlp layer and use EMA (exponential moving average) weights for better performance. However, when I try to initialize my EMA model with ema_model = copy.deepcopy(my_transformer_model), I encounter the following error:
File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
value = deepcopy(value, memo)
File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
state = deepcopy(state, memo)
File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroupNCCL' object
Could you help me with that? How can I use EMA with Tutel? Thanks!
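A minimal sketch of one possible workaround, not taken from this thread: the failure comes from deepcopy trying to pickle the NCCL process-group handle stored inside the distributed MoE layers, so instead of deep-copying the trained model, the EMA copy can be built as a second instance of the same architecture and kept in sync by copying tensors. Everything below is illustrative: build_model() is a hypothetical stand-in for however the transformer is actually constructed, and a plain nn.Sequential is used so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def build_model():
    # Hypothetical placeholder: replace with the real transformer that
    # contains the MoeMlp layers.
    return nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

model = build_model()

# Instead of ema_model = copy.deepcopy(model), create a second instance and
# load the trained weights into it, so the process-group objects held by the
# MoE layers are never pickled.
ema_model = build_model()
ema_model.load_state_dict(model.state_dict())
for p in ema_model.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    # Standard EMA update: ema = decay * ema + (1 - decay) * current.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
    # Buffers (e.g. running statistics) are copied over directly.
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)

update_ema(ema_model, model)
```

The same idea should carry over to the real Tutel model as long as the EMA instance is constructed with the same arguments on each rank, so its parameters line up one-to-one with the trained model's.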
Thanks for your quick update for this feature! I notice you use mpiexec to launch the job and save the checkpoint. If I use torch.distributed.launch to train my MoE, is it still valid to use tutel/checkpoint/gather.py to combine my checkpoints?

Yes, both are compatible, as mpiexec is just an alternative way to launch cross-node processes instead of torch.distributed.launch.
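For context on why the launcher choice does not matter, here is a small sketch (not Tutel's own launch code, and assuming OpenMPI when mpiexec is used): torch.distributed.launch/torchrun export RANK and WORLD_SIZE directly, while OpenMPI exposes OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE, so a short bridge lets the same training script initialize the same process group under either launcher. The master address and port defaults are placeholders.

```python
import os
import torch.distributed as dist

def init_distributed():
    # torch.distributed.launch / torchrun already export RANK and WORLD_SIZE.
    # Under mpiexec with OpenMPI, map the OMPI_* variables onto the same names.
    if "RANK" not in os.environ and "OMPI_COMM_WORLD_RANK" in os.environ:
        os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder
        os.environ.setdefault("MASTER_PORT", "29500")      # placeholder
    dist.init_process_group(backend="nccl")
    return dist.get_rank(), dist.get_world_size()
```

Because both paths end in the same dist.init_process_group call, the per-rank checkpoints written during training come out the same either way, which is consistent with the answer above.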