TF-IDF IR Baseline model performance on bAbI_dialog

See original GitHub issue

Hey! I’m trying to use the TF-IDF model provided by the framework to get results comparable to those in the bAbI Dialog Tasks paper, but I got some strange results that I don’t understand.

For the eval_model script:

$ python eval_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid

[download_path:/home/chait/ParlAI/downloads]
[parlai_home:/home/chait/ParlAI]
[datatype:valid]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[model_file:]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[display_examples:False]
[model_params:]
[numthreads:1]
[num_examples:1000]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.18181818181818182, 'total': 1, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.09090909090909091, 'total': 2, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.06060606060606061, 'total': 3, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 4, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 5, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.030303030303030304, 'total': 6, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.05194805194805195, 'total': 7, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 8, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 9, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 10, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03305785123966942, 'total': 11, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04545454545454545, 'total': 12, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04195804195804195, 'total': 13, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03896103896103896, 'total': 14, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 15, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03409090909090909, 'total': 16, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04278074866310161, 'total': 17, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 18, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03827751196172249, 'total': 19, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 20, 'accuracy': 0.0}
.
.
. (for each of the 1000 dialogs in the dataset)
.
.
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03135435992578862, 'total': 980, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03132239829487548, 'total': 981, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03129050175893365, 'total': 982, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03125867011930097, 'total': 983, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031226903178122812, 'total': 984, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03119520073834807, 'total': 985, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031163562603724996, 'total': 986, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03113198857879721, 'total': 987, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03128450496871562, 'total': 988, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031252872506664336, 'total': 989, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.0312213039485768, 'total': 990, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031189799101000032, 'total': 991, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03115835777126112, 'total': 992, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031126979767463273, 'total': 993, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03127858057435535, 'total': 994, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031247144814984133, 'total': 995, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031215772179627725, 'total': 996, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031184462478344246, 'total': 997, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03133539806886513, 'total': 998, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03130403130403143, 'total': 999, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031272727272727396, 'total': 1000, 'accuracy': 0.0}

And for the display_model script:

$ python display_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid

[model_params:]
[datatype:valid]
[parlai_home:/home/chait/ParlAI]
[model_file:]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[numthreads:1]
[download_path:/home/chait/ParlAI/downloads]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[num_examples:10]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
[dialog_babi:Task:1]: hello
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you book a table for six people with french food
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: in bombay
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: i am looking for a cheap restaurant
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
- - - - - - - - - - - - - - - - - - - - -
~~
[dialog_babi:Task:1]: hi
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you make a restaurant reservation with italian cuisine for six people in a cheap price range
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
   [IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: rome please
   [IRBaselineAgent]: I don't know.
~~

Is this how the model is expected to perform? The per-dialog accuracy is zero for all of them. Is there a way to obtain the per-response accuracy?
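For illustration only, per-response accuracy can be computed outside the framework once the model's replies and the gold labels have been collected. The sketch below is a minimal example under stated assumptions: the function names `normalize` and `response_accuracy` and the `(dialog_id, predicted, labels)` tuple layout are hypothetical, not part of ParlAI, and exact match after lowercasing and whitespace normalization is an assumed scoring rule.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(text.lower().split())


def response_accuracy(turns):
    """Compute per-response and per-dialog exact-match accuracy.

    `turns` is an iterable of (dialog_id, predicted, labels) tuples, where
    `labels` is a list of acceptable gold responses for that turn.
    This layout is hypothetical and chosen only for the example.
    """
    per_dialog = {}  # dialog_id -> (correct turns, total turns)
    correct = 0
    total = 0
    for dialog_id, predicted, labels in turns:
        hit = normalize(predicted) in {normalize(label) for label in labels}
        correct += hit
        total += 1
        ok, n = per_dialog.get(dialog_id, (0, 0))
        per_dialog[dialog_id] = (ok + hit, n + 1)

    per_response = correct / total if total else 0.0
    # A dialog counts as correct only if every one of its turns was correct.
    solved = sum(1 for ok, n in per_dialog.values() if ok == n)
    per_dialog_acc = solved / len(per_dialog) if per_dialog else 0.0
    return per_response, per_dialog_acc
```

Keeping the two numbers separate shows whether a model fails every turn or only misses one turn per dialog, which a single per-dialog figure hides.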

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
alexholdenmiller commented, Dec 13, 2017

@DShijiao the current ir_baseline model has very little state (just the TF-IDF dictionary). It selects the most similar candidate sentence based on word overlap.

interactive.py doesn’t provide any candidates, of course, since it’s a human speaking to the model directly. However, you could modify the ir_baseline model to load up a set of candidates from file when it launches and select from one of those instead.
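The idea above (load candidates from a file at startup, then pick the one with the highest word overlap) can be sketched roughly as follows. This is not ParlAI's implementation: every function name here is hypothetical, the smoothed IDF formula is an assumption, and dividing the score by the candidate length raised to `length_penalty` is only a guess at how the `length_penalty:0.5` option in the logs might be applied. The fallback string mirrors the reply seen in the display_model output.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def load_candidates(path):
    """Read one candidate response per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_idf(candidates):
    """Smoothed inverse document frequency over the candidate set (assumed formula)."""
    n = len(candidates)
    df = Counter()
    for cand in candidates:
        df.update(set(tokenize(cand)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}


def rank_candidates(query, candidates, idf, length_penalty=0.5):
    """Order candidates by IDF-weighted word overlap with the query, best first."""
    query_words = set(tokenize(query))
    scored = []
    for cand in candidates:
        words = tokenize(cand)
        if not words:
            continue
        overlap = sum(idf.get(w, 0.0) for w in set(words) & query_words)
        # Assumed penalty: longer candidates need more overlap to score as high.
        scored.append((overlap / (len(words) ** length_penalty), cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]


def respond(query, candidates, idf, fallback="I don't know."):
    """Return the best candidate, or the fallback when no candidates are available."""
    ranked = rank_candidates(query, candidates, idf)
    return ranked[0] if ranked else fallback
```

In use, `load_candidates` would be called once when the agent starts, `build_idf` once on the loaded list, and `respond` for each incoming message.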

0 reactions
shijiaod commented, Dec 13, 2017

OK, I’ll take a look. Thanks for your replies, @alexholdenmiller.
