Doing data preprocessing in a separate run

System Info

I am trying to run the file run_speech_recognition_ctc.py on a custom dataset. I use the argument preprocessing_only to run the data preprocessing as a separate step. My question is how to start model training as a second step, since there is no previous checkpoint.

Thanks in advance.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

none

Expected behavior

none

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
sanchit-gandhi commented, Dec 12, 2022

Hey @fahad7033! Cool to see that you’re using the CTC example script for training 🤗 The argument --preprocessing_only will run the fine-tuning script up to the end of the dataset pre-processing: https://github.com/huggingface/transformers/blob/0ba94aceb6e1ab448e0acc896764a4496759cb14/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L656

Once this run has completed, disable the flag --preprocessing_only (remove it from your args or set --preprocessing_only="False") and re-run the training script. This time, the training script will use the cached dataset (i.e. it will re-use the pre-processed dataset files that you prepared in your pre-processing run) and then commence training.
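
As a minimal sketch of the two-step workflow (the dataset name, model name and output directory below are placeholders, and you would add your usual training arguments), the two runs could look like:

# Step 1: pre-process and cache the dataset, then exit before training starts
python run_speech_recognition_ctc.py \
        --dataset_name="your_dataset" \
        --model_name_or_path="your_model" \
        --output_dir="./output" \
        --preprocessing_num_workers="4" \
        --preprocessing_only="True" \
        --do_train \
        --do_eval

# Step 2: the same command without the flag; the cached dataset is re-used and
# training starts from the pre-trained weights given by --model_name_or_path,
# so no previous checkpoint is needed
python run_speech_recognition_ctc.py \
        --dataset_name="your_dataset" \
        --model_name_or_path="your_model" \
        --output_dir="./output" \
        --preprocessing_num_workers="4" \
        --do_train \
        --do_eval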

It’s worth noting that using the --preprocessing_only flag is only recommended for distributed training when there is a risk of a timeout during pre-processing. In that case, we switch to a non-distributed set-up, run the script once with the --preprocessing_only flag set, and then go back to the distributed training set-up with our dataset ready in the cache for training.

If you are not running distributed training, or aren’t at risk of a timeout (i.e. you don’t have a very large dataset), it’ll be faster and easier to just run the script once without the --preprocessing_only argument.

Let me know if you have any other questions, happy to help!

0 reactions
sanchit-gandhi commented, Dec 19, 2022

Hey @fahad7033! I’ve tried to reproduce this behaviour with a minimum working example.

System info:

  • transformers version: 4.26.0.dev0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 2.0.0.dev20221210+cu117 (True)

The script uses a tiny subset of the LibriSpeech ASR dataset (~9 MB) and fine-tunes a tiny random Wav2Vec2 CTC model:

python run_speech_recognition_ctc.py \
        --dataset_name="hf-internal-testing/librispeech_asr_dummy" \
        --model_name_or_path="hf-internal-testing/tiny-random-wav2vec2" \
        --dataset_config_name="clean" \
        --train_split_name="validation" \
        --eval_split_name="validation" \
        --output_dir="./" \
        --max_steps="10" \
        --per_device_train_batch_size="16" \
        --per_device_eval_batch_size="16" \
        --learning_rate="3e-4" \
        --warmup_steps="5" \
        --evaluation_strategy="steps" \
        --length_column_name="input_length" \
        --save_strategy="no" \
        --eval_steps="5" \
        --preprocessing_only="True" \
        --preprocessing_num_workers="4" \
        --freeze_feature_encoder \
        --fp16 \
        --overwrite_output_dir \
        --group_by_length \
        --do_train \
        --do_eval

Output:
12/19/2022 15:29:32 - INFO - __main__ - Data preprocessing finished. Files cached at 
{'train': [{'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-dc486168c3937e95.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-53095567e8277865.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-e089d2a96576c6bb.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-6d3d1c061f60c29b.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00000_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00001_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00002_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00003_of_00004.arrow'}], 
'eval': [{'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-dc486168c3937e95.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-53095567e8277865.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-e089d2a96576c6bb.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-6d3d1c061f60c29b.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00000_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00001_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00002_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00003_of_00004.arrow'}]}

We can see here that the dataset has been correctly prepared and cached, so the script is working for me with this toy example. Do you have a reproducible script that I could use to re-create your run? It’s impossible for me to say what the issue is without being able to reproduce the error on my side!
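
If you want to sanity-check on your side that the pre-processing run actually wrote the cache files before launching training, something along these lines should work (assuming the default cache location; adjust the path if you have set HF_DATASETS_CACHE):

# list the cached Arrow files written by the pre-processing run
find ~/.cache/huggingface/datasets -name "cache-*.arrow" -exec ls -lh {} +

# check the total size of the datasets cache
du -sh ~/.cache/huggingface/datasets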

Also re-iterating a point raised in my previous message: unless you’re fine-tuning using a large dataset on multiple GPUs, there is no need to use the flag --preprocessing_only. For a large dataset on a single GPU, it’s better not to use this flag and just run training directly.
