Runtime error when training Kitty with custom training data

Contextualized Topic Models version: 2.2.0
Python version: 3.7.12
Operating System: Google Colab

Description

I’ve been following the Kitty tutorial and adapted it to use my own training data, read from a CSV file in my Google Drive that holds the text in its title column:

import csv

# Read the "title" column of every row into a list of stripped strings.
with open("/content/drive/MyDrive/mydata.csv", newline="") as f:
    reader = csv.DictReader(f)
    training = [row["title"].strip() for row in reader]

The training data size is 1270 entries, and there are no empty lines:

len(training)
1270

all(t for t in training)
True

Asserting that training is a list of strings:

type(training) 
list

all(type(t) == str for t in training)
True

What I Did

I then ran the training cell from the tutorial, changing only the training data:

kt = Kitty()
kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

It seems to process the batches fine, but then raises a RuntimeError:

Batches: 100%
7/7 [00:02<00:00, 3.18it/s]

0it [00:00, ?it/s]

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-22-8486cab2bc57> in <module>()
      1 kt = Kitty()
----> 2 kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

3 frames

/usr/local/lib/python3.7/dist-packages/contextualized_topic_models/models/ctm.py in _loss(self, inputs, word_dists, prior_mean, prior_variance, posterior_mean, posterior_variance, posterior_log_variance)
    159 
    160         # Reconstruction term
--> 161         RL = -torch.sum(inputs * torch.log(word_dists + 1e-10), dim=1)
    162 
    163         #loss = self.weights["beta"]*KL + RL

RuntimeError: The size of tensor a (1989) must match the size of tensor b (2000) at non-singleton dimension 1

Are there any particular requirements on the training data in terms of size, content etc.?

Unfortunately, I cannot share my training data, but I am very much willing to check for other particularities.

Top GitHub Comments

silviatti commented, Oct 13, 2021

Hi! Can you please check if you have some special Unicode characters in your text? For example these characters.

The removal of these special characters might solve your issue.
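
One way to run that check is sketched below. The list of characters linked above did not survive on this page, so the definition of "special" here is an assumption: any character in a Unicode "Other" category (control, format, private use, unassigned) plus line and paragraph separators. The helper names `is_special`, `find_special_chars` and `clean_text` are hypothetical and not part of Kitty or contextualized_topic_models, and the thread does not confirm that this cleaning resolves the RuntimeError.

```python
import unicodedata


def is_special(ch):
    """Assumed definition of 'special': Unicode category C* (control, format,
    private use, unassigned) or a line/paragraph separator (Zl, Zp)."""
    category = unicodedata.category(ch)
    return category.startswith("C") or category in ("Zl", "Zp")


def find_special_chars(texts):
    """Map the index of each affected text to its sorted special characters."""
    report = {}
    for index, text in enumerate(texts):
        found = {ch for ch in text if is_special(ch)}
        if found:
            report[index] = sorted(found)
    return report


def clean_text(text):
    """Replace special characters with a space, then collapse whitespace."""
    replaced = "".join(" " if is_special(ch) else ch for ch in text)
    return " ".join(replaced.split())


# Hypothetical sample data standing in for the private training set.
sample = ["plain title", "zero\u200bwidth", "line\u2028sep", "tab\there"]
report = find_special_chars(sample)
cleaned = [clean_text(t) for t in sample]
```

Applied to the data from the issue, this would be `training = [clean_text(t) for t in training]`, followed by re-running the `all(t for t in training)` check, since cleaning can leave a string empty. Accented letters and emoji are left untouched because they fall outside the assumed categories.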

vinid commented, Oct 15, 2021

Yea, that is indeed something we might want to cover. I am still wondering if this would be something to handle internally in Kitty or something to provide as an additional preprocessing class.

Anyway! thanks a lot 😃