KeyError during pseudo labeling
Hi,
I am getting a KeyError during pseudo labeling. It looks like the selected pos_pid is not found in the corpus.
INFO [gpl.toolkit.pl.run:60] Begin pseudo labeling
.....
File ~gpl/toolkit/dataset.py:78, in HardNegativeDataset._sample_tuple(self, query_dict)
75 query_text = self.queries[query_dict['qid']]
77 pos_pid = random.choice(pos_pids)
---> 78 pos_text = concat_title_and_body(pos_pid, self.corpus, self.sep)
80 neg_pid = random.choice(list(neg_pids))
81 neg_text = concat_title_and_body(neg_pid, self.corpus, self.sep)
File ~gpl/toolkit/dataset.py:12, in concat_title_and_body(did, corpus, sep)
10 def concat_title_and_body(did, corpus, sep):
11 document = []
---> 12 title = corpus[did]['title'].strip()
13 body = corpus[did]['text'].strip()
14 if len(title):
KeyError: '92974'
The corpus I have has the structure below. Does the order of the _id and the numbers matter?
{"text":"This is the domain text","_id":3,"title":"","metadata":{}}
{"text":"This is the domain text 2","_id":4,"title":"","metadata":{}}
Code to train:
gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    mnrl_output_dir="mnrl_output_dir",
    mnrl_evaluation_output="mnrl_evaluation_output",
    base_ckpt="distilbert-base-uncased",
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',
    # The starting checkpoint of the experiments in the paper
    gpl_score_function="dot",
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl=64,
    gpl_steps=140000,
    new_size=-1,
    # Resize the corpus to `new_size` (|corpus|) if needed. When set to None
    # (by default), the |corpus| will be the full size. When set to -1, the
    # |corpus| will be set automatically: If QPP * |corpus| <= 250K, |corpus|
    # will be the full size; else QPP will be set to 3 and |corpus| will be
    # set to 250K / 3
    queries_per_passage=-1,
    # Number of Queries Per Passage (QPP) in the query generation step. When
    # set to -1 (by default), the QPP will be chosen automatically: If
    # QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else
    # QPP will be set to 3 and |corpus| will be set to 250K / 3
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    # Note that these two retriever models work with cosine-similarity
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    # This prefix will appear as part of the (folder/file) names for
    # query-generation results: For example, we will have "qgen-qrels/" and
    # "qgen-queries.jsonl" by default.
    do_evaluation=True,
    # --use_amp # One can use this flag for enabling the efficient float16 precision
)
Could you help me figure out what I am missing or doing wrong?
Issue Analytics
- State:
- Created a year ago
- Comments:5 (2 by maintainers)
Oh well! Our mistake is that corpus.jsonl has the ids as ints, not strings. The data loader expects them to be strings, so the lookup fails on that key.
Change the corpus.jsonl to have string _ids.
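For anyone who wants to apply that fix without editing the file by hand, here is a minimal sketch that rewrites a JSONL corpus so every `_id` is a string. The function name `stringify_ids` and the file paths in the usage line are made up for illustration; they are not part of GPL.

```python
import json
from pathlib import Path


def stringify_ids(src: Path, dst: Path) -> int:
    """Copy a JSONL corpus from src to dst, casting every `_id` to str.

    Returns the number of records whose `_id` had to be converted.
    """
    changed = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if not isinstance(record["_id"], str):
                record["_id"] = str(record["_id"])
                changed += 1
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
    return changed
```

Usage would look something like `stringify_ids(Path("corpus.jsonl"), Path("corpus.fixed.jsonl"))`, after which the fixed file replaces the original. Writing to a separate file first keeps the original intact in case something goes wrong.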
I'm exactly here 😃 and still trying to figure it out. Some thoughts:
I wonder what happens to that corpus in between being read from file and getting to that point?!
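A quick way to see the mismatch in isolation, as a sketch only: I have not traced GPL's actual loader here, so this assumes the corpus ends up as a dict keyed by the raw `_id` value from each JSON line, while the pids come from a text file and are therefore strings.

```python
import json

# Two corpus lines in the shape from the question (integer `_id`).
lines = [
    '{"text": "This is the domain text", "_id": 3, "title": "", "metadata": {}}',
    '{"text": "This is the domain text 2", "_id": 4, "title": "", "metadata": {}}',
]

# Assumption: the corpus dict is keyed by the raw `_id` value as parsed by json.
corpus = {}
for line in lines:
    doc = json.loads(line)
    corpus[doc["_id"]] = {"title": doc["title"], "text": doc["text"]}

# Assumption: ids read back from a text file (e.g. a TSV) are strings.
pos_pid = "3"

print(pos_pid in corpus)       # False: the str "3" is not the int 3
print(int(pos_pid) in corpus)  # True: the document is there, under an int key
```

With string ids in corpus.jsonl, both sides would be strings and `corpus[pos_pid]` would resolve.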