Bad eval results on RTE and CoLA
I tried fine-tuning the ALBERT-base model on the two smallest GLUE tasks (RTE and CoLA), but got only about 66% accuracy on both. I was using a single GPU (2080 Ti). The GLUE fine-tuning script has a bug in the evaluation part; I tried to fix it, but I am quite new to TensorFlow, so I am not sure whether something is still wrong with the script. Below is the script I am using:
set -ex
OUTPUT_DIR="glue_baseline"
# To start from a custom pretrained checkpoint, set ALBERT_HUB_MODULE_HANDLE
# below to an empty string and set INIT_CHECKPOINT to your checkpoint path.
ALBERT_HUB_MODULE_HANDLE="https://tfhub.dev/google/albert_base/1"
INIT_CHECKPOINT=""
ALBERT_ROOT=pretrained/albert_base
# Usage: run_task TASK WARMUP_STEPS LEARNING_RATE TRAIN_STEPS SAVE_EVERY BATCH_SIZE
function run_task() {
  COMMON_ARGS=(
    "--output_dir=${OUTPUT_DIR}/$1"
    "--data_dir=${ALBERT_ROOT}/glue"
    "--vocab_file=${ALBERT_ROOT}/vocab.txt"
    "--spm_model_file=${ALBERT_ROOT}/30k-clean.model"
    "--do_lower_case"
    "--max_seq_length=128"
    "--optimizer=adamw"
    "--task_name=$1"
    "--warmup_step=$2"
    "--learning_rate=$3"
    "--train_step=$4"
    "--save_checkpoints_steps=$5"
    "--train_batch_size=$6"
  )
  # First pass: fine-tune only.
  python3 -m run_classifier \
    "${COMMON_ARGS[@]}" \
    --do_train \
    --nodo_eval \
    --nodo_predict \
    --albert_hub_module_handle="${ALBERT_HUB_MODULE_HANDLE}" \
    --init_checkpoint="${INIT_CHECKPOINT}"
  # Second pass: evaluate and predict from the checkpoints written above.
  python3 -m run_classifier \
    "${COMMON_ARGS[@]}" \
    --nodo_train \
    --do_eval \
    --do_predict \
    --albert_hub_module_handle="${ALBERT_HUB_MODULE_HANDLE}"
}
run_task RTE 200 3e-5 800 100 32
I tried printing the training loss and it seems to have converged, but somehow the eval results are nearly random. The eval accuracy differs across checkpoints, so I believe the checkpoints are being loaded.
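One way to sanity-check that the eval run is actually restoring the fine-tuned weights (and not just the initial ALBERT checkpoint) is to inspect the latest checkpoint in the output directory. Here is a minimal sketch, assuming the TF1-style API the ALBERT repo uses and assuming the classification head variables are named output_weights/output_bias (both assumptions; adjust the names to whatever your run_classifier creates):

import tensorflow.compat.v1 as tf

ckpt_dir = "glue_baseline/RTE"  # OUTPUT_DIR/$1 from the script above
ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
print("latest checkpoint:", ckpt_path)

reader = tf.train.load_checkpoint(ckpt_path)
# The classification head only exists after fine-tuning, so finding it here
# (with non-trivial values) confirms the eval run can see the fine-tuned weights.
for name, shape in reader.get_variable_to_shape_map().items():
    if "output_weights" in name or "output_bias" in name:
        print(name, shape, reader.get_tensor(name).ravel()[:3])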
In addition, I am also curious about the CoLA dev result you got. I found this task to be very sensitive to the random seed. Looking forward to your reply. Thanks!
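Note that CoLA is officially scored with Matthews correlation rather than accuracy, and its dev set is unbalanced, so a model stuck near the majority class can still report a decent-looking accuracy while its MCC is close to zero. A toy illustration (the labels below are made up, with roughly the 70/30 skew CoLA dev has):

from sklearn.metrics import accuracy_score, matthews_corrcoef

gold = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
preds = [1] * len(gold)  # degenerate model: always predict the majority class

print("accuracy:", accuracy_score(gold, preds))  # 0.7 -- looks passable
print("mcc:", matthews_corrcoef(gold, preds))    # 0.0 -- no signal at all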
Hi, I am also fine-tuning ALBERT-base v2 on MRPC. Could you please share what dev acc_and_f1 result you got on the MRPC dataset? I'm not sure if I tuned it well. Thanks!
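For reference, the acc_and_f1 number reported for MRPC is conventionally the unweighted mean of accuracy and F1 on the positive (paraphrase) class. A quick sketch of computing it from predictions (the label values here are made up for illustration):

from sklearn.metrics import accuracy_score, f1_score

gold = [1, 0, 1, 1, 0, 1, 0, 1]   # made-up MRPC-style binary labels
preds = [1, 0, 1, 0, 0, 1, 1, 1]  # made-up model predictions

acc = accuracy_score(gold, preds)
f1 = f1_score(gold, preds)        # F1 on the positive class
print({"acc": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2})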