TypeError: cannot pickle '_LazyModule' object
See the original GitHub issue. @stas00 edit: please see https://github.com/huggingface/transformers/issues/12549#issuecomment-875287701 for the short reproduction script.
Environment info
- `transformers` version: 4.9.0.dev0
- Platform: Linux with Nvidia P40
- Python version: 3.8.0
- PyTorch version (GPU?): 1.8.0
- Tensorflow version (GPU?):
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help
@stas00, @patrickvonplaten, @LysandreJik
Information
Model I am using (Bert, XLNet …): GPT2
The problem arises when using:
- the official example scripts: (give details below)
- [√] my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- [√] my own task or dataset: (give details below)
To reproduce
I am running the minimal command:
```bash
python run_clm.py \
    --model_name_or_path /mycheckpoin/ \
    --train_file train.txt \
    --validation_file eval.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/ \
    --no_cuda False \
    --fp16 \
    --sharded_ddp simple \
    --num_train_epochs 3.0 \
    --disable_tqdm False \
    --save_steps 100 \
    --preprocessing_num_workers 32 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4
```
I modified the following parts of run_clm.py (imports shown for context), passing the parameter `rank` as `training_args.local_rank`:
```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_process(rank, size, fn, backend='gloo'):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    # main() was modified to accept (rank, size); rank is used as
    # training_args.local_rank.
    fn(rank, size)


if __name__ == "__main__":
    # main()
    # size = int(os.environ['WORLD_SIZE'])
    size = int(torch.cuda.device_count())
    print(size)
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, main))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```
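As an aside, `torch.multiprocessing.spawn` wraps this same pattern: it passes the process index as the first argument to the worker and joins all workers on exit. A minimal sketch, not part of the original script:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # spawn() passes the process index as the first argument.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    # main(rank, world_size) would run here.
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Starts world_size processes and joins them, replacing the manual
    # Process/start/join loop above.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```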
The traceback is:

```
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 511, in init_process
    fn(rank, size)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 367, in main
    tokenized_datasets = raw_datasets.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 471, in map
    {
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 472, in <dictcomp>
    k: dataset.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in map
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 692, in save_reduce
    save(args)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle '_LazyModule' object
```
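The bottom of the traceback shows `dill` serializing a function together with its module's globals (`save_function` → `save_module_dict`) until it reaches an object whose type it cannot pickle. A hypothetical illustration of that path, not taken from the issue, which only fails with `transformers` versions that predate the fix discussed below:

```python
import dill
import transformers  # a _LazyModule instance lives in this script's globals


def tokenize(example):
    # A function defined in __main__ carries __main__'s globals dict,
    # which dill serializes by value, including the module object above,
    # even though the function never uses it.
    return example


# multiprocess does the equivalent of this when num_proc > 1; with a
# pre-fix transformers this raised:
#   TypeError: cannot pickle '_LazyModule' object
payload = dill.dumps(tokenize)
```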
Running the following command with the unmodified script works well. The reason I don't use it is that our cluster doesn't support passing parameters this way: `-m torch.distributed.launch --nproc_per_node=4`.
```bash
python -m torch.distributed.launch --nproc_per_node=4 run_clm.py \
    --model_name_or_path /mycheckpoin/ \
    --train_file train.txt \
    --validation_file eval.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/ \
    --no_cuda False \
    --fp16 \
    --sharded_ddp simple \
    --num_train_epochs 3.0 \
    --disable_tqdm False \
    --save_steps 100 \
    --preprocessing_num_workers 32 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4
```
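As an aside, if the cluster's own launcher exports the standard `torch.distributed` environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, `LOCAL_RANK`) for each worker, the script can initialize from the environment without `-m torch.distributed.launch` and without manual spawning. A minimal sketch under that assumption:

```python
import os

import torch.distributed as dist

# Hypothetical: assumes the cluster launcher exports these per worker.
local_rank = int(os.environ.get('LOCAL_RANK', 0))

# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the
# environment.
dist.init_process_group(backend='nccl', init_method='env://')

# run_clm.py's main() can then use local_rank as training_args.local_rank.
```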
Expected behavior
Top GitHub Comments
Note that we can easily make `_LazyModule` picklable. I can open a PR if needed to implement a `__reduce__` method for `_LazyModule`. It's the only object that prevents `transformers` from being picklable.

EDIT: here it is: https://github.com/huggingface/transformers/pull/12552

This is just a way to easily fix this issue, but I think we should definitely keep trying to figure out why it tried to pickle `transformers` in the first place. This might come from `dill`, which pickles the globals of some environments when pickling any object.
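For illustration, a minimal sketch of the `__reduce__` idea described in that comment; the real `_LazyModule` in `transformers` has more constructor arguments, and the actual change is in the PR linked above:

```python
from types import ModuleType


class _LazyModule(ModuleType):
    # Simplified stand-in for transformers' lazy module wrapper.
    def __init__(self, name, module_file, import_structure):
        super().__init__(name)
        self._name = name
        self.__file__ = module_file
        self._import_structure = import_structure

    def __reduce__(self):
        # Tell pickle to rebuild an equivalent instance from the
        # constructor arguments instead of serializing the module's
        # (unpicklable) state.
        return (self.__class__, (self._name, self.__file__, self._import_structure))
```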
OK, here is the minimal reproducible script. Totally unrelated to `transformers` it seems, except for the import of `transformers`: this still fails with the same error.
But if you either:
- remove the `import transformers`, or
- use `num_proc=1` in `datasets.map` (instead of `n > 1`),

all is good.

@lhoestq, @albertvillanova, does this ring any bells? Clearly `transformers` loads some module lazily and trips up `datasets`, even though `transformers` isn't really used here directly. Thank you.
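The short script itself lives behind the link in the issue header; below is a hedged reconstruction from the description above (an otherwise unused `transformers` import plus `datasets.map` with `num_proc > 1`; the data file name is hypothetical):

```python
import transformers  # unused, but it plants a _LazyModule in the globals

from datasets import load_dataset

if __name__ == "__main__":
    ds = load_dataset("text", data_files="train.txt")["train"]
    # num_proc > 1 makes multiprocess/dill pickle the mapped function,
    # dragging the module above along with it; num_proc=1, or removing
    # the import, avoids the error.
    ds.map(lambda example: example, num_proc=2)
```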