[examples] add `main_process_first` context manager to datasets map calls

We need to replicate the addition modelled in run_translation.py in https://github.com/huggingface/transformers/pull/12351 across all other PyTorch examples.

The actual changes for the model example are: https://github.com/huggingface/transformers/pull/12351/files#diff-09777f56cee1060a535a72ce99a6c96cdb7f330c8cc3f9dcca442b3f7768237a (just run_translation.py)
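
For readers unfamiliar with the pattern, here is a minimal sketch of how a "main process first" context manager can work. It is not the library's implementation: threads stand in for distributed processes, and the `barrier` argument, `is_main` flag, and `run_demo` helper are hypothetical names chosen for illustration. The idea is that replicas wait at a barrier before doing the work, while the main process does the work first and then releases them.

```python
import threading
from contextlib import contextmanager


@contextmanager
def main_process_first(is_main, barrier):
    """Hypothetical stand-in: let the main worker run the body before the others."""
    if not is_main:
        # Replicas block here until the main worker has finished its body.
        barrier.wait()
    try:
        yield
    finally:
        if is_main:
            # The main worker is done, so it releases the waiting replicas.
            barrier.wait()


def run_demo():
    """Run one main worker (rank 0) and one replica (rank 1); return execution order."""
    barrier = threading.Barrier(2)
    order = []

    def worker(rank):
        with main_process_first(is_main=(rank == 0), barrier=barrier):
            # In the example scripts this would be the dataset.map(...) call.
            order.append(rank)

    # Start the replica first to show that it still runs its body second.
    threads = [threading.Thread(target=worker, args=(rank,)) for rank in (1, 0)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return order


if __name__ == "__main__":
    print(run_demo())
```

In the example scripts the body of the `with` block is the `map` call, so the main process would do the pre-processing first and the other processes would run the same call afterwards.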

Here is a time-saver:

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(train_dataset = train_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="train dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(eval_dataset = eval_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="validation dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(predict_dataset = predict_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="prediction dataset map pre-processing"):\n$p$t] } }' {} \;

git checkout examples/pytorch/translation/run_translation.py

make fixup
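
To make it easier to review what the perl one-liners above produce, here is a rough Python equivalent of a single rewrite, applied to an in-memory string rather than to files. It is a sketch under assumptions: `wrap_map_call` and `SAMPLE` are hypothetical names, and, like the perl version, the non-greedy match stops at the first closing parenthesis, so a `map` call containing nested parentheses would be cut short and needs manual review.

```python
import re
import textwrap


def wrap_map_call(source, name, desc):
    """Wrap `<name> = <name>.map(...)` in a main_process_first context manager."""
    pattern = re.compile(
        rf"^(\s+)({re.escape(name)} = {re.escape(name)}\.map\(.*?\))",
        flags=re.MULTILINE | re.DOTALL,
    )

    def repl(match):
        prefix, call = match.group(1), match.group(2)
        # Push the original call one level deeper so it sits inside the with block.
        indented = textwrap.indent(call, "    ")
        header = f'{prefix}with training_args.main_process_first(desc="{desc}"):'
        return f"{header}\n{prefix}{indented}"

    return pattern.sub(repl, source)


# A made-up snippet shaped like the map calls in the example scripts.
SAMPLE = (
    "    if training_args.do_train:\n"
    "        train_dataset = train_dataset.map(\n"
    "            preprocess_function,\n"
    "            batched=True,\n"
    "        )\n"
)

if __name__ == "__main__":
    print(wrap_map_call(SAMPLE, "train_dataset", "train dataset map pre-processing"))
```

Running it on the sample shows the `with` line inserted at the original indentation and the whole `map` call shifted four spaces to the right.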

I noticed that other scripts may contain additional datasets.map calls that get rewritten automatically by the scripts above, so please review the changes to see whether the desc needs to be modified. We want the context manager on all of these calls, and it's possible that the perl rewrite scripts didn't catch some of them.

This template needs the same change: templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py. You can do it via perl, manually, or in whatever other way works for you.

And please validate that scripts still work, by either running:

RUN_SLOW=1 pytest examples/pytorch/test_examples.py

or running each script manually as explained in its corresponding README.md file.

This issue is open to all and should be very simple to complete; the main effort is validation.

And thank you for your contribution!

Top GitHub Comments

bhadreshpsavani commented, Jun 26, 2021

I have committed changes in the open PR for the fix of this warning!

stas00 commented, Jun 26, 2021

Yes, except you now need to assign the return value, since this is no longer an in-place edit. Therefore, in both places it will now be:

x = x.remove_columns("label")

with the right x of course.

thank you for fixing it.

reference: https://huggingface.co/docs/datasets/processing.html#removing-one-or-several-columns-remove-columns
