BertLMDataBunch.from_raw_corpus: `ValueError: num_samples should be a positive integer value, but got num_samples=0`
I am trying to fine-tune a model, but I am encountering a ValueError when creating the DataBunch from the raw corpus.
With the following synthetic data:
text_list = ['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
             'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
             'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
             ]

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True,
    model_type='bert',
    logger=logger)
I get the following ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<timed exec> in <module>
~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
198 logger=logger,
199 clear_cache=clear_cache,
--> 200 no_cache=no_cache,
201 )
202
~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
275 self.train_batch_size = self.batch_size_per_gpu * max(1, self.n_gpu)
276
--> 277 train_sampler = RandomSampler(train_dataset)
278 self.train_dl = DataLoader(
279 train_dataset, sampler=train_sampler, batch_size=self.train_batch_size
~/envs/my_env/lib/python3.7/site-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples)
92 if not isinstance(self.num_samples, int) or self.num_samples <= 0:
93 raise ValueError("num_samples should be a positive integer "
---> 94 "value, but got num_samples={}".format(self.num_samples))
95
96 @property
ValueError: num_samples should be a positive integer value, but got num_samples=0
The intermediate files lm_train.txt and lm_val.txt are created, so I suspect something is going wrong at the level of the tokenizer.
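Something like the following can confirm that those intermediate files actually contain text (a quick sketch, assuming they are written into DATA_PATH, the data_dir passed to from_raw_corpus above; that location is an assumption on my part):

from pathlib import Path

# Hypothetical sanity check: the intermediate files are assumed to be
# written into DATA_PATH, the data_dir passed to from_raw_corpus above.
for name in ("lm_train.txt", "lm_val.txt"):
    lines = (Path(DATA_PATH) / name).read_text(encoding="utf-8").splitlines()
    print(f"{name}: {len(lines)} lines")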
My env has Python 3.7.6 and contains:
pytorch-lamb 1.0.0 pypi_0 pypi
torch 1.4.0 pypi_0 pypi
torchvision 0.5.0 pypi_0 pypi
fast-bert 1.6.2 pypi_0 pypi
tokenizers 0.5.2 pypi_0 pypi
transformers 2.5.1 pypi_0 pypi
Anyway, let me know if you need any further information from my side!
Top GitHub Comments
Clear your cache! This function silently uses the cache if one is available, completely ignoring the data you pass in. In my case, creating the whole dataset was too slow, so I tried passing just a few lines of text, which created an empty dataset in my cache (because a few lines of text is too small). After that I got this error whatever data I used, until I cleared the cache.
I strongly recommend activating 'info' logging, as in the sketch below, so that you can see whether the function uses the cache or not.
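Something along these lines (a minimal sketch reusing the call from the original report, with DATA_PATH and text_list as defined above; clear_cache and logger are parameters that appear in the from_raw_corpus signature shown in the traceback):

import logging

from fast_bert.data_lm import BertLMDataBunch

# Info-level logging makes fast-bert report whether it is building the
# dataset from scratch or silently loading it from the cache.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# clear_cache=True should discard any stale (possibly empty) cached
# dataset before rebuilding it from text_list.
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True,
    model_type='bert',
    logger=logger,
    clear_cache=True)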
By the way, I consider this a bug: calling BertLMDataBunch.from_raw_corpus should never read from the cache.

So, in my case the issue was that the text … was too small. Basically, line 137 in fast_bert/data_lm.py never executes if len(tokenized_text) is smaller than the given block size; see the sketch after this comment. Bear in mind that the process may also take a really long time, since it runs on a single core. In my case it ended up being 14 hours 😄
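An illustrative sketch of the kind of block-chunking loop being described (not the actual fast_bert code; block_size and the loop structure are assumptions for illustration only). When the tokenized text is shorter than the block size, the loop body never runs, the dataset ends up empty, and RandomSampler raises the num_samples=0 error:

# Illustrative only - not the actual fast_bert code.
tokenized_text = list(range(50))   # pretend we only have 50 token ids
block_size = 128                   # e.g. max_seq_length

examples = []
# range(0, 50 - 128 + 1) is empty, so no training example is ever created.
for i in range(0, len(tokenized_text) - block_size + 1, block_size):
    examples.append(tokenized_text[i : i + block_size])

print(len(examples))  # 0 -> empty dataset -> num_samples=0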