Wrong indent for "results" variable prevents data set processing of squad like json file

Describe the bug

The “results” variable is indented incorrectly. In the current version it sits one level too deep, so it is only assigned when the else block runs. In my case the results variable ends up undefined and I get the following error:

Error message

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    188             train_file = self.processor.data_dir / self.processor.train_filename
    189             logger.info("Loading train set from: {} ".format(train_file))
--> 190             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    191 
    192         # dev data

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    159 
    160             with tqdm(total=len(dicts), unit=' Dicts', desc="Preprocessing Dataset") as pbar:
--> 161                 for dataset, tensor_names in results:
    162                     datasets.append(dataset)
    163                     pbar.update(multiprocessing_chunk_size)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

IndexError: Cannot choose from an empty sequence
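
The last frame of the traceback above is in multiprocessing/pool.py, where the result iterator re-raises an exception. The sketch below illustrates that general behaviour only: an error raised inside a pool worker surfaces later, when the caller iterates over the results. This is not FARM code. ThreadPool stands in for mp.Pool to keep the example small, and pick_answer and collect are hypothetical names chosen for illustration.

```python
import random
from multiprocessing.pool import ThreadPool


def pick_answer(answers):
    # Hypothetical worker: random.choice raises IndexError on an empty sequence.
    return random.choice(answers)


def collect(chunks):
    """Run pick_answer over chunks and report the first worker error, if any."""
    collected = []
    with ThreadPool(processes=2) as pool:
        # imap returns a lazy iterator; worker exceptions are stored
        # and re-raised when the matching result is requested.
        results = pool.imap(pick_answer, chunks, chunksize=1)
        try:
            for item in results:
                collected.append(item)
        except IndexError as exc:
            return collected, exc
    return collected, None


if __name__ == "__main__":
    items, error = collect([["a"], []])
    print(items, repr(error))
```

The exception appears at the for loop in the caller, which is the same place the traceback points to in _get_dataset.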

Expected behavior

The sequence should not be empty.

To Reproduce

Just try:

# and fine-tune it on your own custom dataset (should be in SQuAD like format)
train_data = "training_data"
reader.train(data_dir=train_data, train_filename="2020-02-09_answers.json", use_gpu=False, n_epochs=1)

System:

  • OS: macOS Mojave
  • GPU/CPU: CPU
  • FARM version: 0.4.1

FIX: Under data_handler > data_silo.py > DataSilo > _get_dataset

with ExitStack() as stack:
            if self.max_processes > 1:  # use multiprocessing only when max_processes > 1
                p = stack.enter_context(mp.Pool(processes=num_cpus_used))

                logger.info(
                    f"Got ya {num_cpus_used} parallel workers to convert {num_dicts} dictionaries "
                    f"to pytorch datasets (chunksize = {multiprocessing_chunk_size})..."
                )
                log_ascii_workers(num_cpus_used, logger)

                results = p.imap(
                    partial(self._dataset_from_chunk, processor=self.processor),
                    grouper(dicts, multiprocessing_chunk_size),
                    chunksize=1,
                )
            else:
                logger.info(
                    f"Multiprocessing disabled, using a single worker to convert {num_dicts} "
                    f"dictionaries to pytorch datasets."
                )
            # FIX: indentation changed here (this line was one level deeper, inside the else block)
            results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts))
            datasets = []

            with tqdm(total=len(dicts), unit=' Dicts', desc="Preprocessing Dataset") as pbar:
                for dataset, tensor_names in results:
                    datasets.append(dataset)
                    pbar.update(multiprocessing_chunk_size)

            concat_datasets = ConcatDataset(datasets)
            return concat_datasets, tensor_names
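
As a general illustration of the pattern in _get_dataset, here is a minimal, self-contained sketch in which each branch binds results before the loop consumes it. It is not the FARM implementation. ThreadPool replaces mp.Pool for brevity, and convert and process are hypothetical names.

```python
from contextlib import ExitStack
from functools import partial
from multiprocessing.pool import ThreadPool


def convert(value, offset=0):
    # Hypothetical stand-in for the per-chunk conversion step.
    return value * value + offset


def process(items, max_processes):
    """Convert items with a pool when max_processes > 1, otherwise serially."""
    with ExitStack() as stack:
        if max_processes > 1:
            # The pool is only created on this path; ExitStack cleans it up.
            pool = stack.enter_context(ThreadPool(processes=max_processes))
            results = pool.imap(partial(convert, offset=1), items, chunksize=1)
        else:
            # Serial path: a plain lazy map, no pool required.
            results = map(partial(convert, offset=1), items)

        # Consume the iterator while the pool (if any) is still open.
        return list(results)


if __name__ == "__main__":
    print(process([1, 2, 3], max_processes=2))
    print(process([1, 2, 3], max_processes=1))
```

Both calls go through the same consuming code, so a missing assignment on either path would show up as an UnboundLocalError at that point.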

I would have created a pull request but I’m not able to create a new branch. Hopefully this helps. Let me know if you have further questions.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

RobKnop commented, Mar 28, 2020

@tanaysoni Should I go ahead and close the issue? That’s fine for me.

tanaysoni commented, Feb 18, 2020

Hi @RobKnop, thanks again for raising the issue. The empty qas is indeed a valid case that FARM and possibly the labelling tool (not exporting documents without answers) should deal with.
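
To check a dataset for the empty qas case mentioned above before training, a small helper along these lines may be useful. This is a sketch that assumes the usual SQuAD layout (data, then paragraphs, then qas); find_empty_qas and the sample dictionary are hypothetical and not part of FARM.

```python
def find_empty_qas(squad_dict):
    """Return (title, paragraph_index) for every paragraph without questions."""
    empty = []
    for article in squad_dict.get("data", []):
        title = article.get("title", "")
        for index, paragraph in enumerate(article.get("paragraphs", [])):
            # A missing key and an empty list are treated the same way.
            if not paragraph.get("qas"):
                empty.append((title, index))
    return empty


# Tiny made-up sample in SQuAD-like shape, for illustration only.
sample = {
    "data": [
        {
            "title": "doc1",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Germany?",
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                        }
                    ],
                },
                {"context": "A paragraph nobody labelled.", "qas": []},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(find_empty_qas(sample))
```

For a real file, the dictionary would come from json.load on the training file, and the reported paragraphs could then be removed or labelled.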
