
Cannot reproduce the evaluation results of the small model on the 6k multi-ref dataset

See original GitHub issue

I first extract the contexts from test.refs.txt (6000 lines):

cat test.refs.txt | cut -f 1 > test.source

and then extract the multi-reference files (using up to 15 per sample):

for (( i=2; i<=15; i++ ))
do
    cat test.refs.txt | cut -f $i > refs/ref_$i.txt
done

Then I use the following script to generate responses on the 6k multi-ref dataset.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from nltk import word_tokenize
from tqdm import tqdm, trange

model_path = '/path/to/DialoGPT-small'
file_path = '/path/to/test.source'
out_path = '/path/to/gpt_test.txt'

tokenizer = AutoTokenizer.from_pretrained(model_path)
# left-pad batches so generation continues directly from the last context token
tokenizer.padding_side = "left"
SEP = tokenizer.eos_token
tokenizer.add_special_tokens({'pad_token': SEP})

model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()
batch_size = 64

# read context
lines = []
with open(file_path, encoding='utf-8') as f:
    for line in f:
        new_line = SEP.join(line.strip().split(' EOS ')[-5:]) + SEP
        lines.append(new_line)

preds = []

# predict
for i in trange(0, len(lines), batch_size):
    batchs = lines[i:i+batch_size]
    batch_encoding = tokenizer.batch_encode_plus(
        batchs,
        max_length=256,
        padding=True, truncation=True,
        return_tensors="pt",
    )
    input_ids = batch_encoding['input_ids']
    attention_mask = batch_encoding['attention_mask']
    dyn_seq_len = input_ids.shape[1]
    # greedy decoding (num_beams=1); the context tokens are sliced off below
    preds_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=1, pad_token_id=tokenizer.eos_token_id)
    preds_ids = preds_ids[:, dyn_seq_len:].tolist()
    batch_preds = [tokenizer.decode(ids, skip_special_tokens=True) for ids in preds_ids]
    preds.extend(batch_preds)

# write predictions
with open(out_path, 'w', encoding='utf-8') as f:
    for pred in preds:
        line = ' '.join(word_tokenize(pred)) + '\n'
        f.write(line)

However, there is a large gap between my evaluation results and the numbers reported in the paper.

My evaluation results

NIST: [3.372, 3.7761, 3.8364, 3.8455]
BLEU: [0.4679, 0.1924, 0.0928, 0.0505]
METEOR: 0.10545417931305287
Entropy: [4.9949875062421425, 7.123308932861081, 8.000309028686685, 8.413536358302238]
Distinct: [0.0619184959030736, 0.22404933196300103]
avg_len: 13.811166666666667

Reported in the paper

Experiment      NIST2  NIST4  BLEU2   BLEU4  METEOR  ENT-4  DIST-1  DIST-2  Avg. Len
DialoGPT 117M   2.39   2.41   10.54%  1.55%  7.53%   10.78  8.60%   39.90%  12.8
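
For anyone reproducing this comparison, below is a rough sketch of multi-reference corpus BLEU with NLTK over the /path/to/gpt_test.txt predictions and the refs/ref_*.txt files extracted above. This is only an assumption about the evaluation code (the official DSTC scripts tokenize and smooth differently), so treat it as illustrative rather than as the exact pipeline behind the numbers above.

import glob
from nltk.translate.bleu_score import corpus_bleu

def read_lines(path):
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f]

# hypotheses: one tokenized prediction per line
hyps = [line.split() for line in read_lines('/path/to/gpt_test.txt')]

# references: refs/ref_2.txt ... refs/ref_15.txt, one line per test sample
ref_columns = [read_lines(p) for p in sorted(glob.glob('refs/ref_*.txt'))]
# keep only the non-empty references for each sample
refs = [[col[i].split() for col in ref_columns if col[i]] for i in range(len(hyps))]

print('BLEU-2:', corpus_bleu(refs, hyps, weights=(0.5, 0.5)))
print('BLEU-4:', corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25)))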

Here are the predictions for the first 20 test samples:

I 'm not fasting , I 'm fasting because I 'm fasting .
I 'm waiting for someone to say something stupid and then I can see it over a r iamverysmart
I 'm not sure if I should be excited or scared .
I 'm going to be a millionaire by the end of this .
I love this post and the art . Do I 40 love it ? Well it does come framed , and it 's so absurd ... idk I just might .
I 'm not sure I trust him .
I have a few of those . I 'll have to check out the other ones .
I 'm watching the Oilers game on TV .
How hard is it to play snooker ?
Deshaun Watson is playing tonight .
What was your time ?
Artie Burns
What 's a screwdriver ?
I 'm not sure if I 'm missing something , but I do n't get it .
I think it 's a title defense .
I 'm not sure if it 's free , but I 've been to a few parks and they 're pretty cool .
I 'm not sure what you 're trying to say .
I 'm not sure what you 're trying to say .
I have the most chromosomes .
John Wick .

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
liuslnp commented, Feb 25, 2021

Here are the evaluation results for the medium and large models. As you can see, the gap between my NIST/BLEU/DIST scores and the official results is still relatively large.

DialoGPT-medium

NIST: [3.6142, 4.1402, 4.2257, 4.2379]
BLEU: [0.5054, 0.2272, 0.1161, 0.0658]
METEOR: 0.11448456319410923
Entropy: [5.110969324425441, 7.4741025550415054, 8.487332812728265, 8.96638167676112]
Distinct: [0.063865246873529, 0.2401520577378657]
avg_len: 13.1005

DialoGPT-large

NIST: [3.9302, 4.5571, 4.6678, 4.6848]
BLEU: [0.5454, 0.2555, 0.1352, 0.0788]
METEOR: 0.11694036328599848
Entropy: [5.376659255260651, 8.038661195818934, 9.129731989024675, 9.630095839832428]
Distinct: [0.07617776246662647, 0.29050042408821036]
avg_len: 11.611

Official

Experiment       NIST2  NIST4  BLEU2   BLEU4  METEOR  ENT-4  DIST-1  DIST-2  Avg. Len
Human response   3.41   4.25   17.90%  7.48%  10.64%  11     14.50%  63.00%  13.1
DialoGPT 117M    2.39   2.41   10.54%  1.55%  7.53%   10.78  8.60%   39.90%  12.8
DialoGPT 345M    3      3.06   16.96%  4.56%  9.81%   9.13   6.80%   26.30%  12.2
DialoGPT 762M    2.84   2.9    18.66%  5.25%  9.66%   9.72   7.76%   29.93%  11.2
0 reactions
Mayanksoni20 commented, Nov 10, 2021

preds_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=1, pad_token_id=tokenizer.eos_token_id)

From the DialoGPT paper:

Beam search (with beam width 10) dramatically improves BLEU and DIST scores, and marginally improves NIST and METEOR.

The paper mentions that the reported results were obtained with beam width 10, while you generated with beam width 1. Maybe try generating responses with num_beams=10 and observe whether that closes the gap.
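
A minimal sketch of that change (everything else in the original script stays as posted):

# beam search with beam width 10, as in the paper's reported decoding setup,
# instead of the greedy num_beams=1 call above
preds_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=512,
    num_beams=10,
    pad_token_id=tokenizer.eos_token_id,
)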
