Doing data preprocessing in a separate run
See original GitHub issue

System Info
I am trying to run the file run_speech_recognition_ctc.py on a custom dataset. I use the argument `preprocessing_only` to run the data preprocessing as a separate step. My question is: how do I start model training as a second step, since there is no previous checkpoint?
Thanks in advance.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
none
Expected behavior
none
Top GitHub Comments
Hey @fahad7033! Cool to see that you’re using the CTC example script for training 🤗 The argument `--preprocessing_only` will run the fine-tuning script up to the end of the dataset pre-processing: https://github.com/huggingface/transformers/blob/0ba94aceb6e1ab448e0acc896764a4496759cb14/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L656

Once this run has completed, disable the flag `--preprocessing_only` (remove it from your args or set `--preprocessing_only="False"`) and re-run the training script. This time, the training script will use the cached dataset (i.e. it will re-use the pre-processed dataset files that you prepared in your pre-processing run) and then commence training, as in the sketch below.
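A minimal sketch of the two-step workflow (the model name, dataset path, and output directory are placeholders rather than values from this thread; substitute your own):

```bash
# Step 1: pre-process only, no training. The processed dataset is written
# to the Hugging Face datasets cache.
python run_speech_recognition_ctc.py \
    --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
    --dataset_name="path/to/your/custom_dataset" \
    --output_dir="./wav2vec2-ctc-custom" \
    --do_train --do_eval \
    --preprocessing_only

# Step 2: the identical command without --preprocessing_only. The script
# re-runs the same .map() calls, hits the cache from step 1, and starts
# training from the pre-trained model given by --model_name_or_path -- no
# checkpoint from the pre-processing run is needed.
python run_speech_recognition_ctc.py \
    --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
    --dataset_name="path/to/your/custom_dataset" \
    --output_dir="./wav2vec2-ctc-custom" \
    --do_train --do_eval
```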
It’s worth noting that using the `--preprocessing_only` flag is only recommended in distributed training when there is a risk of a timeout. If this happens, we switch to a non-distributed set-up and set the `--preprocessing_only` flag. We can then go back to the distributed training set-up and have our dataset ready in cache for training.

If you are not running distributed training or aren’t at risk of a timeout (i.e. you don’t have a very large dataset), it’ll be faster and easier for you to just run the script once without the `--preprocessing_only` argument.

Let me know if you have any other questions, happy to help!
Hey @fahad7033! I’ve tried to reproduce this behaviour with a minimum working example.

System info:

- `transformers` version: 4.26.0.dev0

The script uses a tiny subset of the LibriSpeech ASR dataset (~9MB) and fine-tunes a tiny Wav2Vec2 CTC model:
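A sketch of such a command, assuming the small test artifacts hosted on the Hub (`hf-internal-testing/librispeech_asr_dummy` for the data, `hf-internal-testing/tiny-random-wav2vec2` for the checkpoint); the exact arguments are an assumption, not a record of the original run:

```bash
# Toy pre-processing run: a ~9MB LibriSpeech dummy subset and a tiny
# randomly-initialised Wav2Vec2 CTC checkpoint.
python run_speech_recognition_ctc.py \
    --dataset_name="hf-internal-testing/librispeech_asr_dummy" \
    --dataset_config_name="clean" \
    --train_split_name="validation" \
    --eval_split_name="validation" \
    --model_name_or_path="hf-internal-testing/tiny-random-wav2vec2" \
    --output_dir="./tiny-ctc-test" \
    --do_train --do_eval \
    --preprocessing_only
```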
The output shows that the dataset is correctly prepared and cached, so the script is working for me with this toy example. Do you have a reproducible script that I could use to re-create your run? It’s impossible for me to say what the issue is without being able to reproduce the error on my side!
Also reiterating a point raised in my previous message: unless you’re fine-tuning with a large dataset on multiple GPUs, there is no need to use the `--preprocessing_only` flag. Even for a large dataset on a single GPU, it’s better not to use this flag and just run training directly.
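One practical caveat (this describes general `datasets` caching behaviour rather than anything specific to this script, so verify it for your set-up): the second run only re-uses the cached files if the pre-processing arguments are identical, because `datasets` fingerprints each `.map()` call. If the two runs happen on different machines or with different cache locations, point both at the same cache explicitly:

```bash
# Use one shared cache location for both the pre-processing run and the
# training run, so the fingerprinted .map() results can be found again.
export HF_DATASETS_CACHE="/path/to/shared/cache"
```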