[examples] add `main_process_first` context manager to datasets map calls
We need to replay this addition, which was modelled in run_translation.py
in https://github.com/huggingface/transformers/pull/12351, to all other PyTorch examples.
The actual changes for the model example are:
https://github.com/huggingface/transformers/pull/12351/files#diff-09777f56cee1060a535a72ce99a6c96cdb7f330c8cc3f9dcca442b3f7768237a
(just run_translation.py)
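For readers unfamiliar with the pattern, here is a rough sketch of the idea behind the context manager: the main worker runs the body first while replicas wait, and the replicas run it only once the main worker has finished (at which point a dataset `.map` result would already be cached). This is not the transformers implementation. The `main_process_first` function below is a hypothetical stand-in that uses two threads and a `threading.Barrier` in place of distributed processes.

```python
import contextlib
import threading


@contextlib.contextmanager
def main_process_first(is_main: bool, barrier: threading.Barrier):
    """Hypothetical stand-in: replicas wait until the main worker finishes the body."""
    try:
        if not is_main:
            barrier.wait()  # replica blocks here until the main worker is done
        yield
    finally:
        if is_main:
            barrier.wait()  # main worker has finished the body; release the replica


events = []
barrier = threading.Barrier(2)  # one main worker + one replica


def worker(is_main: bool) -> None:
    with main_process_first(is_main, barrier):
        # In the example scripts this body would be the dataset `.map(...)` call.
        events.append("main: map and cache" if is_main else "replica: load from cache")


# Start the replica first to show that it still runs its body second.
threads = [threading.Thread(target=worker, args=(flag,)) for flag in (False, True)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(events)
```

In the real scripts the equivalent is simply wrapping the `.map` call in `with training_args.main_process_first(desc=...):`, as done in run_translation.py.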
Here is a time-saver:
find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(train_dataset = train_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="train dataset map pre-processing"):\n$p$t] } }' {} \;
find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(eval_dataset = eval_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="validation dataset map pre-processing"):\n$p$t] } }' {} \;
find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(predict_dataset = predict_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="prediction dataset map pre-processing"):\n$p$t] } }' {} \;
git checkout examples/pytorch/translation/run_translation.py
make fixup
I noticed other scripts may have other datasets.map calls, which get automatically rewritten by the scripts above, so please review the changes to see if the desc needs to be modified. We want to use the context manager on all of these calls, and it's possible that the perl rewrite scripts didn't catch some.
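To make the review easier, here is a Python sketch of roughly what the perl one-liners above do to a single source string. It is not a line-for-line port: the `wrap_map_call` name is made up for illustration, and it captures only spaces and tabs as indentation (the perl version uses `\s+`). Like the perl version, the non-greedy match stops at the first closing parenthesis, so a `.map(...)` call with nested parentheses in its arguments would likely need a manual fix.

```python
import re


def wrap_map_call(source: str, var: str, desc: str) -> str:
    """Wrap `<var> = <var>.map(...)` in a `main_process_first` with-block (illustrative helper)."""
    pattern = re.compile(
        rf"^([ \t]+)({re.escape(var)} = {re.escape(var)}\.map\(.*?\))",
        re.MULTILINE | re.DOTALL,
    )

    def repl(match: re.Match) -> str:
        indent, call = match.group(1), match.group(2)
        # Indent every line of the original call by one extra level (4 spaces).
        body = re.sub(r"^", "    ", call, flags=re.MULTILINE)
        return f'{indent}with training_args.main_process_first(desc="{desc}"):\n{indent}{body}'

    return pattern.sub(repl, source)


before = (
    "def main():\n"
    "    train_dataset = train_dataset.map(\n"
    "        preprocess_function,\n"
    "        batched=True,\n"
    "    )\n"
)
rewritten = wrap_map_call(before, "train_dataset", "train dataset map pre-processing")
print(rewritten)
```

The printed result shows the call moved one level deeper under the new `with` line, which is the shape to look for when reviewing the diff.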
- Also, this template needs to have this change as well:
templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py
You can do this via perl, manually, or whatever other way works for you.
And please validate that the scripts still work, by either running:
RUN_SLOW=1 pytest examples/pytorch/test_examples.py
or running each script manually as explained in its corresponding README.md file.
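Since the rewrite commands may have missed some calls, a quick scan can complement the test run. The sketch below is only a heuristic: the `unwrapped_map_calls` helper is hypothetical, and it only checks whether the closest non-blank line before a `.map(` call mentions `main_process_first(`, so its hits are candidates to inspect by hand rather than definite misses.

```python
import re

MAP_CALL = re.compile(r"\.map\(")


def unwrapped_map_calls(source: str) -> list:
    """Return 1-based line numbers of `.map(` calls whose previous non-blank line
    does not open a `main_process_first` block (heuristic, illustrative only)."""
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if not MAP_CALL.search(line):
            continue
        # Closest previous non-blank line, or "" if there is none.
        prev = next((l for l in reversed(lines[:i]) if l.strip()), "")
        if "main_process_first(" not in prev:
            hits.append(i + 1)
    return hits


sample = (
    'with training_args.main_process_first(desc="train dataset map pre-processing"):\n'
    "    train_dataset = train_dataset.map(fn, batched=True)\n"
    "\n"
    "tokenized = raw_datasets.map(fn, batched=True)\n"
)
print(unwrapped_map_calls(sample))
```

To use it on the examples, one could read each file under examples/pytorch and print the file name alongside any reported line numbers.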
This issue is open to all and should be very simple to complete; the main effort is to validate.
And thank you for your contribution!
Top GitHub Comments
I have committed changes in the open PR for the fix of this warning!
Yes, except you now need to assign the return value since this is no longer an in-place edit. Therefore in both places it'll now be:
with the right x of course.
thank you for fixing it.
reference: https://huggingface.co/docs/datasets/processing.html#removing-one-or-several-columns-remove-columns