create_pretraining_data.py keeps getting killed

See original GitHub issue

Hello. I’m working on a BERT pretraining project using GCP (Google Cloud Platform).

Before I can move on to running run_pretraining.py on a TPU, I’m stuck at the step of creating the pretraining data.

Here is the .sh script that runs create_pretraining_data.py:

python3 create_pretraining_data.py \
  --input_file $DATA_DIR/data_1.txt \
  --output_file $OUTPUT_DIR \
  --do_lower_case=True \
  --do_whole_word_mask=True \
  --max_seq_length 512 \
  --max_predictions_per_seq 70 \
  --masked_lm_prob 0.15 \
  --vocab_file $VOCAB_DIR \
  --codes_file $CODES_DIR \
  --dupe_factor 1

The input text is about 40GB in total, which seemed too large, so I split the data into 18 files of about 1.2GB each.

At first I set dupe_factor to 10, but that also seemed to cause memory issues, so I set dupe_factor to 1 and instead repeat the run 10 times with a different random_seed each time.
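
For illustration, that “repeat with different seeds” approach could look like the loop below. This is only a sketch, not from the original post: the output file naming is a hypothetical convention, though --random_seed is a real flag of BERT’s create_pretraining_data.py.

# Run the same input 10 times, once per seed, writing one output file per run.
# Output naming (data_1_seed*.tfrecord) is an assumption for illustration.
for seed in 1 2 3 4 5 6 7 8 9 10; do
  python3 create_pretraining_data.py \
    --input_file $DATA_DIR/data_1.txt \
    --output_file $OUTPUT_DIR/data_1_seed${seed}.tfrecord \
    --do_lower_case=True \
    --do_whole_word_mask=True \
    --max_seq_length 512 \
    --max_predictions_per_seq 70 \
    --masked_lm_prob 0.15 \
    --vocab_file $VOCAB_DIR \
    --dupe_factor 1 \
    --random_seed ${seed}
done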

Even when I run create_pretraining_data.py in a minimal environment, the process keeps getting killed, and so far I have finished only 1 of the 18 files.

This happens both on GCP and on my local machine.

Does anyone have an idea how to solve this “catastrophic” situation? The project has been delayed because of this issue and I don’t know what to do anymore…
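
As a side note, a bare “Killed” message on Linux usually means the kernel’s out-of-memory (OOM) killer terminated the process. On most Linux systems this can be confirmed from the kernel log, for example:

# Look for OOM-killer entries in the kernel log
# (exact log wording varies by kernel version):
dmesg | grep -i -E 'killed process|out of memory'

If such entries appear, the issue is memory pressure rather than a bug in the script itself, which is consistent with the sharding advice in the comments below.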

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:7

Top GitHub Comments

15 reactions
anshoomehra commented, Aug 27, 2019

If this helps anyone:

I started with a data file over 3GB in size, containing over 7 million sentences. The VM was running out of RAM after a couple of hours of running (I had about 102GB of RAM on the VM), eventually driving the system into resource starvation with weird errors.

If you do not have unlimited RAM, you can instead shard the data file as a remediation, like below:

split -d -l 250000 data_file.txt data_file_shard

I chose 250k lines per file and it worked. You can try a different size based on your system configuration.

After this, I am able to generate any number of tf_record files. The run_pretraining.py step can take its input as a glob like tf_examples.tf_record*, so this small additional step solved the issue, completing over 3GB of data processing in about 2-3 hours. I can share scripts if anyone still has trouble with how to split the data and loop over the n shard files to create the tfrecords…
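
Putting the pieces together, the shard-and-loop workflow described above might look like the sketch below. All paths and flag values here are assumptions for illustration; the shard names follow the default numeric suffixes produced by the split command:

# 1. Shard the corpus into 250k-line files:
#    data_file_shard00, data_file_shard01, ...
split -d -l 250000 data_file.txt data_file_shard

# 2. Create one tf_record per shard.
for shard in data_file_shard*; do
  python3 create_pretraining_data.py \
    --input_file ${shard} \
    --output_file tf_examples.tf_record_${shard##*shard} \
    --vocab_file $VOCAB_DIR \
    --do_lower_case=True \
    --max_seq_length 512 \
    --max_predictions_per_seq 70 \
    --masked_lm_prob 0.15 \
    --dupe_factor 1
done

# 3. run_pretraining.py can then consume all shards via a glob:
#    python3 run_pretraining.py --input_file=tf_examples.tf_record* ...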

Good luck!!

2 reactions
calusbr commented, Oct 19, 2019

(Quotes anshoomehra’s comment above in full.)

Could you share those scripts? Thanks!

Read more comments on GitHub >

Top Results From Across the Web

What does 'killed' mean when processing a huge CSV with ...
In Python 2, items returns a list of the keys and values in the dictionary, which might require a lot of memory if...
