[examples] add `main_process_first` context manager to datasets map calls

We need to replicate the addition that was modelled in run_translation.py in https://github.com/huggingface/transformers/pull/12351 across all the other PyTorch examples.

The actual changes for the model example are: https://github.com/huggingface/transformers/pull/12351/files#diff-09777f56cee1060a535a72ce99a6c96cdb7f330c8cc3f9dcca442b3f7768237a (just run_translation.py)

Here is a time-saver:

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(train_dataset = train_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="train dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(eval_dataset = eval_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="validation dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(predict_dataset = predict_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="prediction dataset map pre-processing"):\n$p$t] } }' {} \;

git checkout examples/pytorch/translation/run_translation.py

make fixup

I noticed that other scripts may contain other datasets.map calls, which get automatically rewritten by the scripts above, so please review the changes to see whether the desc needs to be modified. We want to use the context manager on all of these calls, and it’s possible that the perl rewrite scripts didn’t catch some of them.
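For reviewers checking the rewritten scripts, the shape the perl commands produce is `with training_args.main_process_first(desc=...):` followed by the indented `.map(...)` assignment. The sketch below illustrates the ordering such a context manager is meant to enforce. It is a simplified stand-in, not the transformers implementation: the `barrier` callable, `preprocess`, `run`, and the list-based "dataset" are all hypothetical names introduced for illustration, and the assumption is that replicas wait before the body while the main process releases them afterwards.

```python
from contextlib import contextmanager


@contextmanager
def main_process_first(is_main_process, barrier, desc="work"):
    """Simplified stand-in (assumption): replicas block before the body,
    and the main process signals the barrier once the body has finished."""
    try:
        if not is_main_process:
            # Replicas wait here until the main process has done the work.
            barrier(f"replica waiting before: {desc}")
        yield
    finally:
        if is_main_process:
            # The main process releases the waiting replicas.
            barrier(f"main process done with: {desc}")


def preprocess(example):
    # Hypothetical preprocessing step standing in for a tokenizer call.
    return {"text": example["text"].lower()}


def run(is_main_process):
    events = []  # records barrier calls and the map step, in order
    train_dataset = [{"text": "Hello"}, {"text": "World"}]
    with main_process_first(
        is_main_process, events.append, desc="train dataset map pre-processing"
    ):
        # Same shape as the rewritten scripts: the map call sits inside the block.
        train_dataset = [preprocess(ex) for ex in train_dataset]
        events.append("map")
    return train_dataset, events
```

With this ordering, only the main process does the preprocessing work first; the replicas that run the same body afterwards would, in the real scripts, be able to reuse the cached result instead of recomputing it.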

  • This template also needs the same change: templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py. You can do this via perl, manually, or in whatever other way works for you.

And please validate that scripts still work, by either running:

RUN_SLOW=1 pytest examples/pytorch/test_examples.py

or running each script manually as explained in its corresponding README.md file.

This issue is open to all and should be very simple to complete; the main effort is validation.

And thank you for your contribution!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

bhadreshpsavani commented, Jun 26, 2021 (1 reaction)

I have committed changes in the open PR for the fix of this warning!

stas00 commented, Jun 26, 2021 (1 reaction)

Yes, except you now need to assign the return value, since this is no longer an in-place edit. Therefore, in both places it will now be:

x = x.remove_columns("label")

with the right x of course.

thank you for fixing it.

reference: https://huggingface.co/docs/datasets/processing.html#removing-one-or-several-columns-remove-columns
