Doing data preprocessing in a separate run

System Info

I am trying to run the file run_speech_recognition_ctc.py on a custom dataset. I use the argument preprocessing_only to run the data preprocessing as a separate step. My question is how to start model training as a second step, since there is no previous checkpoint.

Thanks in advance.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

none

Expected behavior

none

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
sanchit-gandhi commented, Dec 12, 2022

Hey @fahad7033! Cool to see that you’re using the CTC example script for training 🤗 The argument --preprocessing_only will run the fine-tuning script up to the end of the dataset pre-processing: https://github.com/huggingface/transformers/blob/0ba94aceb6e1ab448e0acc896764a4496759cb14/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L656

Once this run has completed, disable the flag --preprocessing_only (remove it from your args or set --preprocessing_only="False") and re-run the training script. This time, the training script will use the cached dataset (i.e. it will re-use the pre-processed dataset files that you prepared in your pre-processing run) and then commence training.
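
As a minimal sketch of the two-step workflow (the dataset name, model name and output directory below are placeholders, and you would add your usual training arguments), the two runs could look like:

# Step 1: pre-process and cache the dataset, then exit before training starts
python run_speech_recognition_ctc.py \
        --dataset_name="your_dataset" \
        --model_name_or_path="your_model" \
        --output_dir="./output" \
        --preprocessing_num_workers="4" \
        --preprocessing_only="True" \
        --do_train \
        --do_eval

# Step 2: the same command without the flag; the cached dataset is re-used and
# training starts from the pre-trained weights given by --model_name_or_path,
# so no previous checkpoint is needed
python run_speech_recognition_ctc.py \
        --dataset_name="your_dataset" \
        --model_name_or_path="your_model" \
        --output_dir="./output" \
        --preprocessing_num_workers="4" \
        --do_train \
        --do_eval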

It’s worth noting that using the --preprocessing_only flag is only recommended for distributed training when there is a risk of a timeout during pre-processing. In that case, we switch to a non-distributed set-up, run the script once with the --preprocessing_only flag set, and then go back to the distributed training set-up with our dataset ready in the cache for training.

If you are not running distributed training, or aren’t at risk of a timeout (i.e. you don’t have a very large dataset), it’ll be faster and easier to just run the script once without the --preprocessing_only argument.

Let me know if you have any other questions, happy to help!

0 reactions
sanchit-gandhi commented, Dec 19, 2022

Hey @fahad7033! I’ve tried to reproduce this behaviour with a minimum working example.

System info:

  • transformers version: 4.26.0.dev0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 2.0.0.dev20221210+cu117 (True)

The script uses a tiny subset of the LibriSpeech ASR dataset (~9 MB) and fine-tunes a tiny random Wav2Vec2 CTC model:

python run_speech_recognition_ctc.py \
        --dataset_name="hf-internal-testing/librispeech_asr_dummy" \
        --model_name_or_path="hf-internal-testing/tiny-random-wav2vec2" \
        --dataset_config_name="clean" \
        --train_split_name="validation" \
        --eval_split_name="validation" \
        --output_dir="./" \
        --max_steps="10" \
        --per_device_train_batch_size="16" \
        --per_device_eval_batch_size="16" \
        --learning_rate="3e-4" \
        --warmup_steps="5" \
        --evaluation_strategy="steps" \
        --length_column_name="input_length" \
        --save_strategy="no" \
        --eval_steps="5" \
        --preprocessing_only="True" \
        --preprocessing_num_workers="4" \
        --freeze_feature_encoder \
        --fp16 \
        --overwrite_output_dir \
        --group_by_length \
        --do_train \
        --do_eval

Output:
12/19/2022 15:29:32 - INFO - __main__ - Data preprocessing finished. Files cached at 
{'train': [{'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-dc486168c3937e95.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-53095567e8277865.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-e089d2a96576c6bb.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-6d3d1c061f60c29b.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00000_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00001_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00002_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00003_of_00004.arrow'}], 
'eval': [{'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-dc486168c3937e95.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-53095567e8277865.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-e089d2a96576c6bb.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-6d3d1c061f60c29b.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00000_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00001_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00002_of_00004.arrow'}, {'filename': '/home/ubuntu/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b/cache-41f1795b92412228_00003_of_00004.arrow'}]}

We can see here that the dataset has been correctly prepared and cached, so the script is working for me with this toy example. Do you have a reproducible script that I could use to re-create your run? It’s impossible for me to say what the issue is without being able to reproduce the error on my side!
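
If you want to sanity-check on your side that the pre-processing run actually wrote the cache files before launching training, something along these lines should work (assuming the default cache location; adjust the path if you have set HF_DATASETS_CACHE):

# list the cached Arrow files written by the pre-processing run
find ~/.cache/huggingface/datasets -name "cache-*.arrow" -exec ls -lh {} +

# check the total size of the datasets cache
du -sh ~/.cache/huggingface/datasets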

Also re-iterating a point raised in my previous message: unless you’re fine-tuning using a large dataset on multiple GPUs, there is no need to use the flag --preprocessing_only. For a large dataset on a single GPU, it’s better not to use this flag and just run training directly.
