Runtime error when training Kitty with custom training data

  • Contextualized Topic Models version: 2.2.0
  • Python version: 3.7.12
  • Operating System: Google Colab

Description

I’ve been following the Kitty tutorial and adapted it to use my own training data, which I read from a CSV file in my Google Drive that holds the text in its title column:

import csv
with open("/content/drive/MyDrive/mydata.csv") as f:
  reader = csv.DictReader(f)
  training = [line["title"].strip() for line in reader]

The training data size is 1270 entries, and there are no empty lines:

len(training)
1270

all(t for t in training)
True

Asserting that training is a list of strings:

type(training) 
list

all(type(t) == str for t in training)
True

What I Did

So I ran the training cell, unchanged from the tutorial except for my own training data:

kt = Kitty()
kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

It seems to process the batches fine, but then raises a RuntimeError:

Batches: 100%
7/7 [00:02<00:00, 3.18it/s]

0it [00:00, ?it/s]

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-22-8486cab2bc57> in <module>()
      1 kt = Kitty()
----> 2 kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

3 frames

/usr/local/lib/python3.7/dist-packages/contextualized_topic_models/models/ctm.py in _loss(self, inputs, word_dists, prior_mean, prior_variance, posterior_mean, posterior_variance, posterior_log_variance)
    159 
    160         # Reconstruction term
--> 161         RL = -torch.sum(inputs * torch.log(word_dists + 1e-10), dim=1)
    162 
    163         #loss = self.weights["beta"]*KL + RL

RuntimeError: The size of tensor a (1989) must match the size of tensor b (2000) at non-singleton dimension 1

Are there any particular requirements on the training data in terms of size, content etc.?

Unfortunately, I cannot share my training data, but I am very much willing to check for other particularities.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
silviatti commented, Oct 13, 2021

Hi! Can you please check whether your text contains any special Unicode characters? For example, these characters.
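
One way to run that check is sketched below. It is only an illustration: find_non_ascii is a hypothetical helper, not part of Contextualized Topic Models, and it assumes that "special" means any character outside the ASCII range, which may be broader than what actually affects the preprocessing.

```python
import unicodedata


def find_non_ascii(docs):
    """Return (index, character, Unicode name) for every non-ASCII character.

    Hypothetical helper: treating "non-ASCII" as "special" is an assumption.
    """
    hits = []
    for i, doc in enumerate(docs):
        for ch in doc:
            if ord(ch) > 127:
                # Fall back to "UNKNOWN" for code points without a name.
                hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Made-up sample documents; replace with the real training list.
sample = ["plain title", "caf\u00e9 menu", "zero\u200bwidth"]
for index, char, name in find_non_ascii(sample):
    print(index, repr(char), name)
```

Running the same function on the training list would show which entries contain such characters and what they are.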

The removal of these special characters might solve your issue.
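
A minimal sketch of such a cleanup step, using only the standard library. strip_special_chars is a hypothetical name, and dropping everything that does not survive NFKD normalization to ASCII is an assumption; it is lossy for non-English text, so it should be adapted to the data rather than used as-is.

```python
import unicodedata


def strip_special_chars(text):
    """Normalize text and drop characters that have no ASCII equivalent."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops whatever is still non-ASCII.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join(ascii_only.split())


# Made-up sample documents; replace with the real training list.
titles = ["caf\u00e9 menu", "zero\u200bwidth", "smart \u201cquotes\u201d"]
cleaned = [strip_special_chars(t) for t in titles]
# Drop documents that became empty after cleaning.
cleaned = [t for t in cleaned if t]
print(cleaned)
```

The cleaned list could then be passed to kt.train(...) in place of the original training list.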

0 reactions
vinid commented, Oct 15, 2021

Yea, that is indeed something we might want to cover. I am still wondering if this would be something to handle internally in Kitty or something to provide as an additional preprocessing class.

Anyway! thanks a lot 😃
