Runtime error when training Kitty with custom training data
- Contextualized Topic Models version: 2.2.0
- Python version: 3.7.12
- Operating System: Google Colab
Description
I've been following the Kitty tutorial and adapted it to use my own training data, read from a CSV file in my Google Drive whose `title` column contains the text data:

```python
import csv

with open("/content/drive/MyDrive/mydata.csv", newline="") as f:
    reader = csv.DictReader(f)
    training = [line["title"].strip() for line in reader]
```
The training data has 1270 entries and contains no empty lines:

```python
>>> len(training)
1270
>>> all(t for t in training)
True
```

and `training` is a list of strings:

```python
>>> type(training)
list
>>> all(type(t) == str for t in training)
True
```
What I Did
So I ran the training cell, unchanged from the tutorial except for my own training data:

```python
kt = Kitty()
kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")
```
It seems to process the batches fine, but then raises a `RuntimeError`:

```
Batches: 100%
7/7 [00:02<00:00, 3.18it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-22-8486cab2bc57> in <module>()
      1 kt = Kitty()
----> 2 kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

3 frames
/usr/local/lib/python3.7/dist-packages/contextualized_topic_models/models/ctm.py in _loss(self, inputs, word_dists, prior_mean, prior_variance, posterior_mean, posterior_variance, posterior_log_variance)
    159
    160         # Reconstruction term
--> 161         RL = -torch.sum(inputs * torch.log(word_dists + 1e-10), dim=1)
    162
    163         # loss = self.weights["beta"]*KL + RL

RuntimeError: The size of tensor a (1989) must match the size of tensor b (2000) at non-singleton dimension 1
```
Are there any particular requirements on the training data (size, content, etc.)? Unfortunately, I cannot share my training data, but I am very much willing to check for other particularities.
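For what it's worth, the error message says the bag-of-words input ended up with 1989 vocabulary columns while the model's word distribution was sized for 2000. A minimal NumPy sketch (purely illustrative, not the library's internals) of why the element-wise product in the reconstruction term fails:

```python
import numpy as np

# Illustrative shapes taken from the traceback: the BoW input has 1989
# vocabulary columns, the model output expects 2000. The element-wise
# product in the loss requires both to match.
inputs = np.zeros((8, 1989))           # bag-of-words built from the corpus
word_dists = np.full((8, 2000), 1e-3)  # predicted word distributions

try:
    rl = -np.sum(inputs * np.log(word_dists + 1e-10), axis=1)
except ValueError as exc:
    print("shape mismatch:", exc)
```

NumPy raises a `ValueError` where torch raises a `RuntimeError`, but the underlying problem is the same: the two vocabulary dimensions disagree.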
Issue Analytics
- Created: 2 years ago
- Comments: 9 (4 by maintainers)
Hi! Can you please check whether your text contains some special Unicode characters, for example these characters? Removing these special characters might solve your issue.
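One possible cleanup, using only the standard library and assuming invisible control/format characters are what throws the vectorizer off (the Unicode categories filtered here are a plausible set, not an official Kitty recommendation):

```python
import unicodedata

def strip_special_chars(text):
    # Drop control/format/private-use/unassigned code points, e.g. the
    # zero-width space U+200B or the soft hyphen U+00AD. These are
    # invisible in the text but can make the vectorizer and the model
    # disagree on the vocabulary size.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in {"Cc", "Cf", "Co", "Cn"}
    )

print(strip_special_chars("data\u200bscience\u00ad"))  # -> datascience
```

Applying `strip_special_chars` to every entry of `training` before calling `kt.train` would tell you quickly whether such characters are the culprit.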
Yea, that is indeed something we might want to cover. I am still wondering whether this would be something to handle internally in Kitty or something to provide as an additional preprocessing class.
Anyway, thanks a lot 😃
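If that cleanup ends up living outside Kitty, a hypothetical preprocessing helper along the lines the maintainer describes (not part of the library's API, just a sketch) could look like:

```python
import unicodedata

class TextCleaner:
    """Hypothetical preprocessing helper, not part of Kitty's API:
    strips control/format characters and collapses whitespace."""

    def clean(self, documents):
        cleaned = []
        for doc in documents:
            # Remove invisible control/format code points.
            doc = "".join(
                ch for ch in doc
                if unicodedata.category(ch) not in {"Cc", "Cf"}
            )
            # Collapse runs of whitespace left behind by the removal.
            cleaned.append(" ".join(doc.split()))
        return cleaned

print(TextCleaner().clean(["a\u200b b   c"]))  # -> ['a b c']
```

The output of `clean` can then be passed straight to `kt.train` in place of the raw `training` list.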