Wrong indent for "results" variable prevents dataset processing of SQuAD-like JSON file
Describe the bug
The variable “results” has the wrong indent. In the current version it is indented one level too deep, so the assignment is only executed when the else block runs. That’s why, in my case, the results variable is not defined and I get the following error:
Error message
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
188 train_file = self.processor.data_dir / self.processor.train_filename
189 logger.info("Loading train set from: {} ".format(train_file))
--> 190 self.data["train"], self.tensor_names = self._get_dataset(train_file)
191
192 # dev data
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
159
160 with tqdm(total=len(dicts), unit=' Dicts', desc="Preprocessing Dataset") as pbar:
--> 161 for dataset, tensor_names in results:
162 datasets.append(dataset)
163 pbar.update(multiprocessing_chunk_size)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
746 if success:
747 return value
--> 748 raise value
749
750 __next__ = next # XXX
IndexError: Cannot choose from an empty sequence
Expected behavior
The sequence should not be empty.
To Reproduce
Just try:
# and fine-tune it on your own custom dataset (should be in SQuAD like format)
train_data = "training_data"
reader.train(data_dir=train_data, train_filename="2020-02-09_answers.json", use_gpu=False, n_epochs=1)
System:
- OS: MacOS Mojave
- GPU/CPU: CPU
- FARM version: 0.4.1
FIX: Under data_handler > data_silo.py > DataSilo > _get_dataset
with ExitStack() as stack:
    if self.max_processes > 1:  # use multiprocessing only when max_processes > 1
        p = stack.enter_context(mp.Pool(processes=num_cpus_used))
        logger.info(
            f"Got ya {num_cpus_used} parallel workers to convert {num_dicts} dictionaries "
            f"to pytorch datasets (chunksize = {multiprocessing_chunk_size})..."
        )
        log_ascii_workers(num_cpus_used, logger)
        results = p.imap(
            partial(self._dataset_from_chunk, processor=self.processor),
            grouper(dicts, multiprocessing_chunk_size),
            chunksize=1,
        )
    else:
        logger.info(
            f"Multiprocessing disabled, using a single worker to convert {num_dicts}"
            f"dictionaries to pytorch datasets."
        )
    #######################################################
    #### fix indent here ###############
    results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts))
    #########################################
    datasets = []
    with tqdm(total=len(dicts), unit=' Dicts', desc="Preprocessing Dataset") as pbar:
        for dataset, tensor_names in results:
            datasets.append(dataset)
            pbar.update(multiprocessing_chunk_size)
    concat_datasets = ConcatDataset(datasets)
    return concat_datasets, tensor_names
I would have created a pull request but I’m not able to create a new branch. Hopefully this helps. Let me know if you have further questions.
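A side note on reading the traceback: both map and Pool.imap are lazy, so an exception raised while converting a chunk only surfaces when the for loop pulls the next item from results. The sketch below illustrates that with plain map. The convert_chunk helper is hypothetical, and random.choice is only an assumed stand-in for whatever raised the IndexError in the worker.

```python
import random


def convert_chunk(chunk):
    # Hypothetical stand-in for the per-chunk conversion step.
    # random.choice raises IndexError when the sequence is empty.
    return random.choice(chunk)


chunks = [["a"], []]  # the second chunk is empty

# Nothing is converted here: map only builds a lazy iterator.
results = map(convert_chunk, chunks)

converted = []
error = None
try:
    # The conversion runs one item at a time inside this loop,
    # so the failure appears here and not at the map(...) line.
    for item in results:
        converted.append(item)
except IndexError as exc:
    error = exc

print(converted, type(error).__name__)
```

The first chunk converts normally and the error only appears once the empty chunk is reached, which matches a traceback that points at the loop over results.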
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (3 by maintainers)
@tanaysoni Should I go ahead and close the issue? That’s fine for me.
Hi @RobKnop, thanks again for raising the issue. The empty qas is indeed a valid case that FARM and possibly the labelling tool (not exporting documents without answers) should deal with.
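Until this is handled upstream, one possible workaround is to drop paragraphs with an empty qas list before training. The following is a minimal sketch, assuming the standard SQuAD layout (data > paragraphs > qas). The function name drop_paragraphs_without_qas and the sample data are made up for illustration and are not part of FARM.

```python
import json


def drop_paragraphs_without_qas(squad_dict):
    """Return a copy of a SQuAD-like dict without paragraphs whose "qas" list is empty."""
    cleaned = {**squad_dict, "data": []}
    for document in squad_dict["data"]:
        # Keep only paragraphs that carry at least one question.
        paragraphs = [p for p in document["paragraphs"] if p.get("qas")]
        if paragraphs:
            cleaned["data"].append({**document, "paragraphs": paragraphs})
    return cleaned


# Made-up sample in SQuAD-like format with two empty "qas" lists.
example = {
    "version": "v2.0",
    "data": [
        {
            "title": "doc1",
            "paragraphs": [
                {"context": "FARM is a framework.", "qas": []},
                {
                    "context": "Berlin is in Germany.",
                    "qas": [
                        {
                            "id": "1",
                            "question": "Where is Berlin?",
                            "answers": [{"text": "Germany", "answer_start": 13}],
                            "is_impossible": False,
                        }
                    ],
                },
            ],
        },
        {"title": "doc2", "paragraphs": [{"context": "No questions here.", "qas": []}]},
    ],
}

cleaned = drop_paragraphs_without_qas(example)
print(json.dumps(cleaned, indent=2))
```

The input dict is left unchanged, so the cleaned copy can be written to a separate file and passed as train_filename.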