How do I load a dataset? How to do multi-label classification with OCTIS?

  • OCTIS version: any
  • Python version: any
  • Operating System: any

Description

I am trying to evaluate topic model algorithms with a provided dataset, without success.

What I Did

I am trying to run the following code:

from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA


dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)

acc = AccuracyScore(dataset)
output = model.train_model(dataset)

Here X is my text data and y holds the (multilabel) topics for each text. The last line raises this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
      1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
    164 
    165         if self.use_partitions:
--> 166             train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
    167         else:
    168             train_corpus = dataset.get_corpus()

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
     41     # Partitioned Corpus getter
     42     def get_partitioned_corpus(self, use_validation=True):
---> 43         last_training_doc = self.__metadata["last-training-doc"]
     44         # gestire l'eccezione se last_validation_doc non è definito, restituire
     45         # il validation vuoto

TypeError: 'NoneType' object is not subscriptable
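
The traceback shows that `self.__metadata` is `None` at the point where `get_partitioned_corpus` indexes it, which is consistent with a `Dataset` built from `corpus` and `labels` alone, with no partition metadata. The sketch below is a simplified stand-in, not OCTIS source: the class `TinyDataset` and its fields are hypothetical, and only the `last-training-doc` key is taken from the traceback.

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset with optional partition metadata."""

    def __init__(self, corpus, metadata=None):
        self.corpus = corpus
        self.metadata = metadata  # stays None unless the caller supplies it

    def get_partitioned_corpus(self):
        # Same indexing pattern as the line flagged in the traceback.
        last_training_doc = self.metadata["last-training-doc"]
        return self.corpus[:last_training_doc], self.corpus[last_training_doc:]


docs = [["topic", "model"], ["label", "data"], ["test", "doc"]]

# Without metadata, indexing None raises the TypeError from the report.
try:
    TinyDataset(docs).get_partitioned_corpus()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With the split position supplied, the same call succeeds.
train, test = TinyDataset(docs, {"last-training-doc": 2}).get_partitioned_corpus()
```

Under that reading, the failure comes from the missing train/test split information rather than from the multilabel targets themselves.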

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
silviatti commented, May 25, 2021

Hi! I tried to add the multi-label classification functionality. You can find it in the branch dev-multilabel-classification. Note that this is an experimental feature, so there may be bugs.

If you have to preprocess your dataset, you need a document file, where each line is one document, and a label file, where each line holds the whitespace-separated labels of the corresponding document. To preprocess the dataset you need to do something like this:

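from octis.preprocessing.preprocessing import Preprocessing

# corpus.txt: one document per line; labels.txt: whitespace-separated labels per line
p = Preprocessing()
dataset = p.preprocess_dataset('path/corpus.txt', 'path/labels.txt', multilabel=True)
dataset.save('multilabel_dataset/')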
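
As a rough illustration of the two input files described above, the sketch below writes a tiny corpus and a matching label file using only the standard library. The file names, documents, and labels are made up; the only constraints taken from this comment are one document per line and whitespace-separated labels on the corresponding line.

```python
import tempfile
from pathlib import Path

# Made-up example data: the third document carries two labels.
documents = [
    "the striker scored twice in the final",
    "parliament passed the new budget",
    "the club owner announced a run for mayor",
]
labels = [["sports"], ["politics"], ["sports", "politics"]]

out_dir = Path(tempfile.mkdtemp())
corpus_path = out_dir / "corpus.txt"
labels_path = out_dir / "labels.txt"

# One document per line.
corpus_path.write_text("\n".join(documents) + "\n", encoding="utf-8")

# Line i holds the labels of document i, joined by single spaces.
label_rows = [" ".join(doc_labels) for doc_labels in labels]
labels_path.write_text("\n".join(label_rows) + "\n", encoding="utf-8")

corpus_lines = corpus_path.read_text(encoding="utf-8").splitlines()
label_lines = labels_path.read_text(encoding="utf-8").splitlines()
```

Because labels are split on whitespace, a label cannot itself contain a space; something like `world_news` would be needed instead of `world news`.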

Make sure to set multilabel to True. If you already have a preprocessed dataset, you can load it with the usual method, dataset.load_custom_dataset_from_folder(path, multilabel=True), again with the “multilabel” parameter set to True. A code snippet to load the dataset and run classification should look like this:

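from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.models.LDA import LDA

d = Dataset()
d.load_custom_dataset_from_folder('path/to/preprocessed/dataset/', multilabel=True)
model = LDA(num_topics=100)
output = model.train_model(d)

# Accuracy of a classifier trained on the document-topic vectors
metric = AccuracyScore(dataset=d)
score = metric.score(output)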

The format of the preprocessed dataset is as described above (a .tsv file for the corpus and a .txt file for the vocabulary), except that the labels are separated by whitespace.
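
To make the whitespace-separated label column concrete, here is a minimal sketch that parses a few tab-separated rows with the standard library. The rows are invented, and the three-column order (document, partition, labels) is an assumption about the layout of corpus.tsv; `read_multilabel_corpus` is a hypothetical helper, not part of OCTIS.

```python
import csv
import io

# Invented rows: document, partition, whitespace-separated labels.
raw_tsv = (
    "striker scored final\ttrain\tsports\n"
    "parliament passed budget\ttrain\tpolitics\n"
    "club owner run mayor\ttest\tsports politics\n"
)


def read_multilabel_corpus(handle):
    """Return (tokens, partition, labels) tuples from a tab-separated corpus."""
    rows = []
    for document, partition, label_field in csv.reader(handle, delimiter="\t"):
        # Both the document and the label field are split on whitespace.
        rows.append((document.split(), partition, label_field.split()))
    return rows


rows = read_multilabel_corpus(io.StringIO(raw_tsv))
test_labels = [labels for _, partition, labels in rows if partition == "test"]
```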

I used the RandomForestClassifier class from scikit-learn because it supports multilabel classification according to the documentation. But you can use any other multilabel classifier by modifying line 63 of the file classification_metrics.py. In that case, you should clone the repo (branch: dev-multilabel-classification), modify the file and run pip install -e . to install the library (the “.” is important).
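
For context, scikit-learn's multilabel estimators take their targets as a binary indicator matrix with one column per label. The plain-Python sketch below shows that conversion for label lists; it mirrors the idea behind scikit-learn's `MultiLabelBinarizer`. Whether classification_metrics.py on the branch prepares its targets in exactly this way is an assumption, and `binarize_labels` is a hypothetical helper.

```python
def binarize_labels(label_lists):
    """Turn per-document label lists into a 0/1 indicator matrix."""
    # Fix a column order by sorting the distinct labels.
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: position for position, label in enumerate(classes)}

    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        matrix.append(row)
    return classes, matrix


classes, y = binarize_labels([["sports"], ["politics"], ["sports", "politics"]])
```

Any replacement classifier would need to accept targets of this shape, one row per document and one column per label.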

Let me know if this helped 😃

Silvia

0 reactions
jadermcs commented, May 31, 2021

Hi Silvia, it was pretty fast, thank you very much!

I tested with the same code, changing only the label separator, and it worked nicely. The corpus.tsv file has the columns text, split, and label, so the labels aren’t in a different file. I will analyze further, but so far I haven’t detected any problems.

Sorry for the late reply, I had a busy week and thanks again for the help!
