
BertLMDataBunch.from_raw_corpus : `ValueError: num_samples should be a positive integer value, but got num_samples=0`

See original GitHub issue

I am trying to fine-tune a model, but I am encountering a ValueError when creating the DataBunch from the raw corpus.

With the following synthetic data:

text_list = ['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
             'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
             'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
             ]

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True, 
    model_type='bert',
    logger=logger)

I get the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202 

~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    275             self.train_batch_size = self.batch_size_per_gpu * max(1, self.n_gpu)
    276 
--> 277             train_sampler = RandomSampler(train_dataset)
    278             self.train_dl = DataLoader(
    279                 train_dataset, sampler=train_sampler, batch_size=self.train_batch_size

~/envs/my_env/lib/python3.7/site-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples)
     92         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
     93             raise ValueError("num_samples should be a positive integer "
---> 94                              "value, but got num_samples={}".format(self.num_samples))
     95 
     96     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

The intermediate files lm_train.txt and lm_val.txt are created, so I suspect something is going wrong at the level of the tokenizer.

My env has python 3.7.6 and contains

pytorch-lamb              1.0.0                    pypi_0    pypi
torch                     1.4.0                    pypi_0    pypi
torchvision               0.5.0                    pypi_0    pypi
fast-bert                 1.6.2                    pypi_0    pypi
tokenizers                0.5.2                    pypi_0    pypi
transformers              2.5.1                    pypi_0    pypi

Anyway, let me know if you need any further information from my side!

Issue Analytics

  • State:open
  • Created 4 years ago
  • Comments:7

Top GitHub Comments

1 reaction
godefv commented, Sep 28, 2020

Clear your cache! This function silently uses the cache when one is available, completely ignoring the data you pass as input. In my case, creating the whole dataset was too slow, so I tried passing just a few lines of text, which created an empty dataset in my cache (because a few lines of text is too small). From then on I got this error regardless of what data I used, until I cleared the cache.
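One way to force a rebuild is to pass clear_cache=True to from_raw_corpus (the parameter is visible in the signature in the traceback above). Alternatively, here is a minimal sketch for deleting the cache by hand; it assumes fast-bert keeps its cached features in a cache/ subdirectory of the data_dir you passed in, so check your data_dir if the layout differs:

```python
import shutil
from pathlib import Path

DATA_PATH = Path("data")         # the same data_dir passed to from_raw_corpus
cache_dir = DATA_PATH / "cache"  # assumed fast-bert cache location

# Remove any stale cached dataset so the next call rebuilds from text_list
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"removed {cache_dir}")
```

After this, the next call to BertLMDataBunch.from_raw_corpus has nothing cached to fall back on and must tokenize your actual input.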

I strongly recommend activating 'info' logging, as follows, so that you can see whether the function is using the cache:

import logging

logger.setLevel(logging.INFO)
consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.INFO)
logger.addHandler(consoleHandler)

By the way, I consider this a bug: calling BertLMDataBunch.from_raw_corpus should never read from the cache.

1 reaction
Q-lds commented, Mar 9, 2020

So, in my case the issue was that the text … was too small.

Basically, the loop at line 137 of fast_bert/data_lm.py:

            while len(tokenized_text) >= block_size:  # Truncate in block of block_size

                self.examples.append(
                    tokenizer.build_inputs_with_special_tokens(
                        tokenized_text[:block_size]
                    )
                )
                tokenized_text = tokenized_text[block_size:]

never executes if len(tokenized_text) is smaller than the given block size, so no training examples are created.
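The effect is easy to reproduce in isolation. Here is a minimal sketch of that truncation loop (build_examples is a hypothetical stand-in; the real code also runs each block through tokenizer.build_inputs_with_special_tokens):

```python
def build_examples(tokenized_text, block_size):
    # Mirrors the loop at data_lm.py line 137: only complete blocks of
    # block_size tokens are kept; any shorter remainder is discarded.
    examples = []
    while len(tokenized_text) >= block_size:
        examples.append(tokenized_text[:block_size])
        tokenized_text = tokenized_text[block_size:]
    return examples

# A corpus shorter than block_size yields an empty dataset, which is
# exactly what makes RandomSampler raise num_samples=0.
print(len(build_examples(list(range(50)), 128)))   # 0
print(len(build_examples(list(range(300)), 128)))  # 2 (the 44-token tail is dropped)
```

So the three short sentences in the example above tokenize to far fewer than max_seq_length=128 tokens, and the resulting train_dataset is empty.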

Bear in mind that the process may also take a really long time, since it runs on a single core. In my case it took 14 hours 😄
