TF-IDF IR Baseline model performance on bAbI_dialog
Hey! I’m trying to use the TF-IDF model provided by the framework to get results comparable to the bAbI Dialog Tasks paper. I got some strange results that I don’t understand.
For the eval_model script:
$ python eval_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid
[download_path:/home/chait/ParlAI/downloads]
[parlai_home:/home/chait/ParlAI]
[datatype:valid]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[model_file:]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[display_examples:False]
[model_params:]
[numthreads:1]
[num_examples:1000]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.18181818181818182, 'total': 1, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.09090909090909091, 'total': 2, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.06060606060606061, 'total': 3, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 4, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 5, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.030303030303030304, 'total': 6, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.05194805194805195, 'total': 7, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.045454545454545456, 'total': 8, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 9, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 10, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03305785123966942, 'total': 11, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04545454545454545, 'total': 12, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04195804195804195, 'total': 13, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03896103896103896, 'total': 14, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 15, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03409090909090909, 'total': 16, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04278074866310161, 'total': 17, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.04040404040404041, 'total': 18, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03827751196172249, 'total': 19, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03636363636363636, 'total': 20, 'accuracy': 0.0}
.
.
. (for each of the 1000 dialogs in the dataset)
.
.
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03135435992578862, 'total': 980, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03132239829487548, 'total': 981, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03129050175893365, 'total': 982, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03125867011930097, 'total': 983, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031226903178122812, 'total': 984, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03119520073834807, 'total': 985, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031163562603724996, 'total': 986, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03113198857879721, 'total': 987, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03128450496871562, 'total': 988, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031252872506664336, 'total': 989, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.0312213039485768, 'total': 990, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031189799101000032, 'total': 991, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03115835777126112, 'total': 992, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031126979767463273, 'total': 993, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03127858057435535, 'total': 994, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031247144814984133, 'total': 995, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031215772179627725, 'total': 996, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031184462478344246, 'total': 997, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03133539806886513, 'total': 998, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.03130403130403143, 'total': 999, 'accuracy': 0.0}
---
{'hits@k': {1: 0.0, 10: 0.0, 50: 0.0, 100: 0.0, 5: 0.0}, 'f1': 0.031272727272727396, 'total': 1000, 'accuracy': 0.0}
And for the display_model script:
$ python display_model.py -m ir_baseline -t dialog_babi:Task:1 -dt valid
[model_params:]
[datatype:valid]
[parlai_home:/home/chait/ParlAI]
[model_file:]
[task:dialog_babi:Task:1]
[model:ir_baseline]
[numthreads:1]
[download_path:/home/chait/ParlAI/downloads]
[datapath:/home/chait/ParlAI/data]
[batchsize:1]
[num_examples:10]
IrBaselineAgent
[Agent initializing.]
[length_penalty:0.5]
[parlai_home:/home/chait/ParlAI]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
[dialog_babi:Task:1]: hello
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you book a table for six people with french food
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: in bombay
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: i am looking for a cheap restaurant
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
[IRBaselineAgent]: I don't know.
- - - - - - - - - - - - - - - - - - - - -
~~
[dialog_babi:Task:1]: hi
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: can you make a restaurant reservation with italian cuisine for six people in a cheap price range
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: <SILENCE>
[IRBaselineAgent]: I don't know.
~~
[dialog_babi:Task:1]: rome please
[IRBaselineAgent]: I don't know.
~~
Is this how the model is expected to perform? (The per-dialog accuracy is zero for all of them. Is there a way to obtain the per-response accuracy?)
Issue Analytics
- State:
- Created 6 years ago
- Comments:11 (7 by maintainers)
@DShijiao the current ir_baseline model has very little state (just the TF-IDF dictionary); it selects the most similar candidate sentence based on word overlap. interactive.py doesn’t provide any candidates, of course, since it’s a human speaking to the model directly. However, you could modify the ir_baseline model to load a set of candidates from a file when it launches and select from those instead.

Ok, I’ll see it. Thanks for your replies. @alexholdenmiller
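
For anyone reading along, here is a rough, standalone sketch of the idea suggested above: load candidate responses from a file, then pick the one with the highest IDF-weighted word overlap with the incoming message. This is not ParlAI's actual IrBaselineAgent code. The function names (tokenize, load_candidates, build_idf, score, rank), the one-candidate-per-line file format, and the exact scoring formula are all assumptions made for illustration. The length_penalty name is borrowed from the log output above, but how ParlAI applies it is not confirmed here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into simple word tokens (hypothetical tokenizer).
    return re.findall(r"[a-z0-9_']+", text.lower())


def load_candidates(path):
    # Assumed file format: one candidate response per line, blanks skipped.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_idf(candidates):
    # Smoothed inverse document frequency over the candidate set.
    n = len(candidates)
    df = Counter()
    for cand in candidates:
        df.update(set(tokenize(cand)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def score(query, candidate, idf, length_penalty=0.5):
    # Sum IDF weights of shared words, then damp long candidates.
    # The penalty form (length ** length_penalty) is an assumption.
    cand_tokens = tokenize(candidate)
    if not cand_tokens:
        return 0.0
    overlap = set(tokenize(query)) & set(cand_tokens)
    raw = sum(idf.get(w, 0.0) for w in overlap)
    return raw / (len(cand_tokens) ** length_penalty)


def rank(query, candidates, idf, fallback="I don't know."):
    # Return candidates sorted best-first, or the fallback reply when
    # there are no candidates or nothing overlaps with the query.
    ranked = sorted(candidates, key=lambda c: score(query, c, idf), reverse=True)
    if not ranked or score(query, ranked[0], idf) == 0.0:
        return [fallback]
    return ranked


if __name__ == "__main__":
    # Made-up candidates for illustration; not taken from the dataset files.
    cands = [
        "hello what can i help you with today",
        "which price range are looking for",
        "api_call french bombay six cheap",
    ]
    idf = build_idf(cands)
    print(rank("i am looking for a cheap restaurant", cands, idf)[0])
```

With no candidates at all, rank falls back to "I don't know.", which is at least consistent with the display_model output above, where the agent gives that reply on every turn.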