How do I load a dataset? How to do multi-label classification with OCTIS?
- OCTIS version: any
- Python version: any
- Operating System: any
Description
I am trying to evaluate topic modeling algorithms on a dataset that I provide myself, without success.
What I Did
I am trying to run the following code:
from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)
acc = AccuracyScore(dataset)
output = model.train_model(dataset)
Here X is my text data and y holds the (multi-label) topics for each text. The last line returns this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)
~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
164
165 if self.use_partitions:
--> 166 train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
167 else:
168 train_corpus = dataset.get_corpus()
~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
41 # Partitioned Corpus getter
42 def get_partitioned_corpus(self, use_validation=True):
---> 43 last_training_doc = self.__metadata["last-training-doc"]
     44         # handle the exception if last_validation_doc is not defined, return
     45         # the empty validation set
TypeError: 'NoneType' object is not subscriptable
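Reading the traceback, train_model only calls get_partitioned_corpus because use_partitions defaults to True, and a Dataset built in memory from corpus and labels carries no partition metadata, so self.__metadata is None. A minimal sketch of a possible workaround, assuming the installed LDA wrapper accepts use_partitions as a constructor argument (untested here); note that the classification metrics may still expect a train/test split, so the preprocessing route in the reply below is the more complete solution:

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA

dataset = Dataset(corpus=X, labels=y)  # X and y as in the snippet above
# Disable partitioning so train_model falls back to dataset.get_corpus(),
# as the 'else' branch shown in the traceback suggests.
model = LDA(num_topics=5, alpha=0.1, use_partitions=False)
output = model.train_model(dataset)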

Hi! I tried to add the multi-label classification functionality. You can find it in the branch dev-multilabel-classification. Note that this is an experimental feature, so there may be bugs.
If you have to preprocess your dataset, you need a document file (where each line represents a document) and a label file (where each line contains the labels of the corresponding document, separated by whitespace). To preprocess the dataset you need to do something like this:
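A minimal sketch of that preprocessing step, assuming the Preprocessing class on this branch accepts the multilabel flag described below (the file and folder names are placeholders):

from octis.preprocessing.preprocessing import Preprocessing

# documents.txt: one document per line
# labels.txt: one line per document, labels separated by whitespace
preprocessor = Preprocessing(lemmatize=True, stopword_list='english',
                             multilabel=True)  # experimental flag on this branch
dataset = preprocessor.preprocess_dataset(documents_path='documents.txt',
                                          labels_path='labels.txt')
dataset.save('my_multilabel_dataset')  # writes the corpus .tsv and vocabulary .txt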
Make sure to set multilabel to True. If you already have the preprocessed dataset, you can load it with the usual method,
dataset.load_custom_dataset_from_folder(path, multilabel=True)
just make sure that the parameter "multilabel" is set to True. The format of the preprocessed dataset is as described above (the .tsv file for the corpus and the .txt file for the vocabulary), except that the labels are separated by whitespace. In other words, a code snippet to load the dataset and run classification should look like this:
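A minimal sketch of such a snippet, assuming the folder saved above ('my_multilabel_dataset' is a placeholder path) and the usual metric.score(model_output) API:

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.classification_metrics import AccuracyScore

dataset = Dataset()
# corpus.tsv is tab-separated: text, partition (train/val/test), labels,
# with a document's labels separated by whitespace
dataset.load_custom_dataset_from_folder('my_multilabel_dataset', multilabel=True)

model = LDA(num_topics=5, alpha=0.1)
output = model.train_model(dataset)

acc = AccuracyScore(dataset)
print(acc.score(output))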
I used the RandomForestClassifier class from scikit-learn because, according to its documentation, it supports multilabel classification. But you can use any other multilabel classifier by modifying line 63 of the file classification_metrics.py. In that case, you should clone the repo (branch: dev-multilabel-classification), modify the file, and run
pip install -e .
to install the library (the "." is important). Let me know if this helped 😃
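For illustration only, a minimal standalone sketch (made-up arrays, not the actual code in classification_metrics.py) showing that RandomForestClassifier accepts a multilabel indicator target out of the box:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(8, 5)                  # e.g. 8 documents x 5 topic proportions
Y = np.random.randint(0, 2, size=(8, 3))  # 3 possible labels per document, 0/1 each
clf = RandomForestClassifier(n_estimators=10).fit(X, Y)
print(clf.predict(X[:2]))                 # shape (2, 3): one 0/1 column per label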
Silvia
Hi Silvia, it was pretty fast, thank you very much!
I tested with the same code, modifying only the labels separator, and it worked nicely. The corpus.tsv file has the columns text, split, and label, so the labels aren't in a separate file. I will analyze it further, but so far I haven't detected any problems.
Sorry for the late reply, I had a busy week and thanks again for the help!