TF-IDF IR Baseline model performance on bAbI_dialog

Hey! I’m trying to use the TF-IDF model provided by the framework to get results comparable to those in the bAbI Dialog Tasks paper, but I got some strange results that I don’t understand.

For the eval_model script:

$ python eval_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid

[download_path:/home/chait/ParlAI/downloads]
[parlai_home:/home/chait/ParlAI]
[datatype:valid]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[model_file:]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[display_examples:False]
[model_params:]
[numthreads:1]
[num_examples:1000]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.18181818181818182, 'total': 1, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.09090909090909091, 'total': 2, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.06060606060606061, 'total': 3, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 4, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 5, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.030303030303030304, 'total': 6, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.05194805194805195, 'total': 7, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 8, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 9, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 10, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03305785123966942, 'total': 11, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04545454545454545, 'total': 12, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04195804195804195, 'total': 13, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03896103896103896, 'total': 14, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 15, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03409090909090909, 'total': 16, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04278074866310161, 'total': 17, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 18, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03827751196172249, 'total': 19, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 20, 'accuracy': 0.0}
.
.
. (for each of the 1000 dialogs in the dataset)
.
.
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03135435992578862, 'total': 980, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03132239829487548, 'total': 981, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03129050175893365, 'total': 982, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03125867011930097, 'total': 983, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031226903178122812, 'total': 984, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03119520073834807, 'total': 985, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031163562603724996, 'total': 986, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03113198857879721, 'total': 987, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03128450496871562, 'total': 988, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031252872506664336, 'total': 989, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.0312213039485768, 'total': 990, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031189799101000032, 'total': 991, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03115835777126112, 'total': 992, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031126979767463273, 'total': 993, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03127858057435535, 'total': 994, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031247144814984133, 'total': 995, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031215772179627725, 'total': 996, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031184462478344246, 'total': 997, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03133539806886513, 'total': 998, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03130403130403143, 'total': 999, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031272727272727396, 'total': 1000, 'accuracy': 0.0}

And for the display_model script:

$ python display_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid

[model_params:]
[datatype:valid]
[parlai_home:/home/chait/ParlAI]
[model_file:]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[numthreads:1]
[download_path:/home/chait/ParlAI/downloads]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[num_examples:10]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
[dialog_babi:Task:1]: hello
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you book a table for six people with french food
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: in bombay
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: i am looking for a cheap restaurant
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
- - - - - - - - - - - - - - - - - - - - -
~~
[dialog_babi:Task:1]: hi
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you make a restaurant reservation with italian cuisine for six people in a cheap price range
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: rome please
   [IRBaselineAgent]: I don't know.
~~

Is this how the model is expected to perform? The per-dialog accuracy is zero for all of them. Is there a way to obtain the per-response accuracy?
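As a rough illustration of what a per-response accuracy could look like, here is a minimal sketch that scores a list of (prediction, label) pairs by exact match and also tracks a cumulative mean. The f1 values in the log above look like such a running average (0.1818 at total 1, then 0.0909 at total 2), but that reading is an inference from the numbers. The function names `per_response_accuracy` and `running_mean` and the toy strings are hypothetical and are not part of ParlAI.

```python
def per_response_accuracy(pairs):
    """Fraction of responses that exactly match their label.

    `pairs` is a list of (prediction, label) strings. Matching ignores case
    and surrounding whitespace; this normalization is an assumption.
    """
    if not pairs:
        return 0.0
    correct = sum(
        1 for pred, label in pairs
        if pred.strip().lower() == label.strip().lower()
    )
    return correct / len(pairs)


def running_mean(values):
    """Cumulative mean after each value, like a metric reported per example."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


# Toy data, invented for illustration only.
pairs = [
    ("I don't know.", "expected reply one"),
    ("expected reply two", "Expected reply two"),
]
print(per_response_accuracy(pairs))
print(running_mean([0.5, 0.0, 1.0]))
```

Scoring each response separately makes it easier to see whether the model is wrong on every turn or only on some of them.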

Issue Analytics

Created 6 years ago
Comments: 11 (7 by maintainers)

Top GitHub Comments

alexholdenmiller commented, Dec 13, 2017

@DShijiao the current ir_baseline model has very little state (just the TF-IDF dictionary); it selects the most similar candidate sentence based on word overlap.
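To make the idea of selecting a candidate by weighted word overlap concrete, here is a minimal sketch in plain Python. It is not the ir_baseline source: the function names (`build_idf`, `score`, `pick_best`), the smoothed IDF formula, and the way `length_penalty` is applied are all assumptions. The 0.5 value and the "I don't know." fallback only echo what appears in the log above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not ParlAI's tokenizer)."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for each word in `documents`."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}


def score(query, candidate, idf, length_penalty=0.5):
    """Sum the IDF weights of shared words, damped by candidate length."""
    cand_tokens = tokenize(candidate)
    if not cand_tokens:
        return 0.0
    shared = set(tokenize(query)) & set(cand_tokens)
    weight = sum(idf.get(w, 1.0) for w in shared)
    return weight / (len(cand_tokens) ** length_penalty)


def pick_best(query, candidates, idf, default="I don't know."):
    """Return the highest-scoring candidate, or `default` if there are none."""
    if not candidates:
        return default
    return max(candidates, key=lambda c: score(query, c, idf))


# Toy candidates, invented for illustration only.
candidates = ["what cuisine would you like", "where should it be", "i'm on it"]
idf = build_idf(candidates)
query = "can you book a table with french cuisine"
print(pick_best(query, candidates, idf))
print(pick_best(query, [], idf))
```

With an empty candidate list the sketch returns the fallback string, so a scorer like this can only do as well as the candidates it is handed.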

interactive.py doesn’t provide any candidates, of course, since a human is speaking to the model directly. However, you could modify the ir_baseline model to load a set of candidates from a file when it launches and select one of those instead.
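The suggestion above, loading a fixed candidate set from a file at startup, might look roughly like the following. This is a standalone sketch, not a patch to ParlAI: the class name `FixedCandidateResponder`, the one-candidate-per-line file format, and the unweighted overlap count are all hypothetical choices made for illustration.

```python
import os
import tempfile
from pathlib import Path


def load_candidates(path):
    """Read one candidate per non-empty line, dropping duplicates, keeping order."""
    seen, candidates = set(), []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            candidates.append(line)
    return candidates


class FixedCandidateResponder:
    """Hypothetical responder that picks from candidates loaded once at launch."""

    def __init__(self, candidate_file):
        self.candidates = load_candidates(candidate_file)

    def respond(self, message):
        words = set(message.lower().split())
        best, best_overlap = None, 0
        for cand in self.candidates:
            overlap = len(words & set(cand.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = cand, overlap
        # Fall back when nothing shares a word with the message.
        return best if best is not None else "I don't know."


# Demo with a temporary candidate file; the contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "candidates.txt")
    Path(path).write_text(
        "what cuisine would you like\nwhere should it be\n\nwhere should it be\n",
        encoding="utf-8",
    )
    bot = FixedCandidateResponder(path)

reply = bot.respond("french cuisine please")
fallback = bot.respond("hello")
print(reply)
print(fallback)
```

Loading the file once in the constructor keeps each reply cheap. A real change would also need to decide where the candidate file comes from, for example the set of responses seen in the training data.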

shijiaod commented, Dec 13, 2017

OK, I’ll look into it. Thanks for your replies, @alexholdenmiller.
