Bad eval results on RTE and CoLA
I tried fine-tuning the ALBERT-base model on the two smallest GLUE tasks (RTE and CoLA), but got only about 66% accuracy on both. I was using a single GPU (2080 Ti). The GLUE fine-tuning script has a bug in the evaluation part; I tried to fix it, but I am quite new to TensorFlow, so I am not sure whether something is still wrong with the script. Below is the script I am using:
set -ex
OUTPUT_DIR="glue_baseline"
# To start from a custom pretrained checkpoint, set ALBERT_HUB_MODULE_HANDLE
# below to an empty string and set INIT_CHECKPOINT to your checkpoint path.
ALBERT_HUB_MODULE_HANDLE="https://tfhub.dev/google/albert_base/1"
INIT_CHECKPOINT=""
ALBERT_ROOT=pretrained/albert_base
# Usage: run_task TASK WARMUP_STEPS LEARNING_RATE TRAIN_STEPS SAVE_EVERY BATCH_SIZE
function run_task() {
  COMMON_ARGS=(
    "--output_dir=${OUTPUT_DIR}/$1"
    "--data_dir=${ALBERT_ROOT}/glue"
    "--vocab_file=${ALBERT_ROOT}/vocab.txt"
    "--spm_model_file=${ALBERT_ROOT}/30k-clean.model"
    "--do_lower_case"
    "--max_seq_length=128"
    "--optimizer=adamw"
    "--task_name=$1"
    "--warmup_step=$2"
    "--learning_rate=$3"
    "--train_step=$4"
    "--save_checkpoints_steps=$5"
    "--train_batch_size=$6"
  )
  # First pass: fine-tune only.
  python3 -m run_classifier \
    "${COMMON_ARGS[@]}" \
    --do_train \
    --nodo_eval \
    --nodo_predict \
    --albert_hub_module_handle="${ALBERT_HUB_MODULE_HANDLE}" \
    --init_checkpoint="${INIT_CHECKPOINT}"
  # Second pass: evaluate and predict from the checkpoints written above.
  python3 -m run_classifier \
    "${COMMON_ARGS[@]}" \
    --nodo_train \
    --do_eval \
    --do_predict \
    --albert_hub_module_handle="${ALBERT_HUB_MODULE_HANDLE}"
}
run_task RTE 200 3e-5 800 100 32
I tried printing the training loss and it seems to have converged, but somehow the eval results are nearly random. The eval accuracy differs across checkpoints, so I believe the checkpoints are being loaded.
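One way to sanity-check that the eval run is actually restoring the fine-tuned weights (and not just the initial ALBERT checkpoint) is to inspect the latest checkpoint in the output directory. Here is a minimal sketch, assuming the TF1-style API the ALBERT repo uses and assuming the classification head variables are named output_weights/output_bias (both assumptions; adjust the names to whatever your run_classifier creates):

import tensorflow.compat.v1 as tf

ckpt_dir = "glue_baseline/RTE"  # OUTPUT_DIR/$1 from the script above
ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
print("latest checkpoint:", ckpt_path)

reader = tf.train.load_checkpoint(ckpt_path)
# The classification head only exists after fine-tuning, so finding it here
# (with non-trivial values) confirms the eval run can see the fine-tuned weights.
for name, shape in reader.get_variable_to_shape_map().items():
    if "output_weights" in name or "output_bias" in name:
        print(name, shape, reader.get_tensor(name).ravel()[:3])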
In addition, I am also curious about the CoLA dev result you got. I found this task to be very sensitive to the random seed. Looking forward to your reply. Thanks!
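Note that CoLA is officially scored with Matthews correlation rather than accuracy, and its dev set is unbalanced, so a model stuck near the majority class can still report a decent-looking accuracy while its MCC is close to zero. A toy illustration (the labels below are made up, with roughly the 70/30 skew CoLA dev has):

from sklearn.metrics import accuracy_score, matthews_corrcoef

gold = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
preds = [1] * len(gold)  # degenerate model: always predict the majority class

print("accuracy:", accuracy_score(gold, preds))  # 0.7 -- looks passable
print("mcc:", matthews_corrcoef(gold, preds))    # 0.0 -- no signal at all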
Hi, I am also fine-tuning ALBERT-base v2 on MRPC. Could you please share what dev acc_and_f1 result you got on the MRPC dataset? I'm not sure if I tuned it well. Thanks!
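For reference, the acc_and_f1 number reported for MRPC is conventionally the unweighted mean of accuracy and F1 on the positive (paraphrase) class. A quick sketch of computing it from predictions (the label values here are made up for illustration):

from sklearn.metrics import accuracy_score, f1_score

gold = [1, 0, 1, 1, 0, 1, 0, 1]   # made-up MRPC-style binary labels
preds = [1, 0, 1, 0, 0, 1, 1, 1]  # made-up model predictions

acc = accuracy_score(gold, preds)
f1 = f1_score(gold, preds)        # F1 on the positive class
print({"acc": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2})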