
Error when doing deepcopy of the model

See original GitHub issue

Hi, thanks for this awesome project!

I built my transformer model on top of the MoeMlp layer, and I use EMA (an exponential moving average of the weights) for better performance. However, when I try to initialize the EMA model with ema_model = copy.deepcopy(my_transformer_model), I get this error:

File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroupNCCL' object

Could you help me with that? How can I use ema with tutel? Thanks!
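For context, deepcopy fails here because the module tree holds a reference to a ProcessGroupNCCL communicator, which supports neither pickling nor copying, so deepcopy's fallback to the pickle protocol (the rv = reductor(4) line in the traceback) raises. A common general-purpose workaround, sketched below with stdlib stand-ins (the class and attribute names are made up for illustration; a threading.Lock plays the role of ProcessGroupNCCL), is to define __deepcopy__ so the unpicklable handle is shared rather than copied:

```python
import copy
import threading

class ToyMoE:
    """Toy stand-in for a module that keeps an unpicklable communicator
    handle (here a threading.Lock plays the role of ProcessGroupNCCL)."""
    def __init__(self):
        self.weights = [1.0, 2.0]
        self.group = threading.Lock()  # unpicklable, like ProcessGroupNCCL

    def __deepcopy__(self, memo):
        # Copy every attribute except the communicator, which is shared.
        clone = self.__class__.__new__(self.__class__)
        memo[id(self)] = clone
        for name, value in self.__dict__.items():
            if name == "group":
                setattr(clone, name, value)  # share the handle, don't copy
            else:
                setattr(clone, name, copy.deepcopy(value, memo))
        return clone

m = ToyMoE()
ema = copy.deepcopy(m)           # works: the lock is shared, not pickled
assert ema.group is m.group
assert ema.weights is not m.weights

# Without __deepcopy__, the same structure fails just like the traceback:
try:
    copy.deepcopy({"group": threading.Lock()})
except TypeError as e:
    print(e)  # cannot pickle '_thread.lock' object
```

In practice, many EMA implementations sidestep deepcopy entirely: construct a second model instance the same way as the first and copy the trained weights into it via state_dict / load_state_dict, which never touches the process-group handle.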

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
ghostplant commented, Aug 9, 2022

> Thanks for your quick update for this feature! I notice you use mpiexec to launch the job and save the ckpt. If I use torch.distributed.launch to train my moe, is it still valid to use the tutel/checkpoint/gather.py to combine my checkpoints?

Yes, both are compatible, as mpiexec is just an alternative way to launch cross-node processes instead of torch.distributed.launch.
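For reference, the two launch styles the answer treats as interchangeable look roughly like this (the script name and process count are placeholders, not taken from the issue):

```shell
# Option A: torch's launcher (single node, 8 processes / GPUs)
python -m torch.distributed.launch --nproc_per_node=8 train_moe.py

# Option B: mpiexec starting the same 8 processes
mpiexec -n 8 python train_moe.py

# Either way, per-rank checkpoints can afterwards be merged with
# tutel/checkpoint/gather.py as discussed above.
```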

0 reactions
yzxing87 commented, Aug 9, 2022

Thanks for your quick update for this feature! I notice you use mpiexec to launch the job and save the ckpt. If I use torch.distributed.launch to train my moe, is it still valid to use the tutel/checkpoint/gather.py to combine my checkpoints?

Read more comments on GitHub >

Top Results From Across the Web

Python: copy.deepcopy produces an error - Stack Overflow
This occurs in many instances when one accidentally tries to clone the iterator to a class. For instance, in PIL ...
Read more >
[solved]Error with copy.deepcopy() - PyTorch Forums
I want to return model after train, so I copy it first: best_model = copy.deepcopy(model) but I got error: RuntimeError: Only Variables ......
Read more >
Deepcopy error when trying to airflow clear - Google Groups
When I try to clear tasks either through the UI or through the command line with `airflow clear`, I get an error with...
Read more >
copy — Shallow and deep copy operations — Python 3.11.1 ...
Raised for module specific errors. The difference between shallow and deep copying is only relevant for compound objects (objects that contain other objects, ......
Read more >
11. Shallow and Deep Copy | Python Tutorial
A solution to the described problem is provided by the module copy . This module provides the method "deepcopy", which allows a complete...
Read more >
