How should one modify the code to successfully run text classification?
Hi,
I am new to PyTorch (but still more at ease with it than TF), so I thought to experiment with @thomwolf’s implementation in this repo (thanks for sharing it!!)
I would like to try out the code to perform binary text classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.
These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):
- Create two sets, `snippets_val.csv` and `snippets_test.csv`, containing two columns, `text` (a string) and `class` (an int equal to 0 or 1).
- In `datasets.py`, create two new functions: `_snippets`, returning two lists `st, y`, and `snippets`, defined with different values of `n_train` and `n_valid` and whose return statement looks like `return (trX, trY), (vaX, vaY), (teX,)` (see the first sketch after this list).
- In `train.py`, rewrite `transform_roc` into a `transform_snippet` that doesn’t use `[delimiter]` and takes only one argument as input <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py`, in the encoding bit and afterwards (see the second sketch after this list):
  - modify the tuple in output of `encode_dataset` to match the output of the `snippets` function redefined above
  - get rid of `encoder['_delimiter_'] = len(encoder)`
  - set `n_special = 2`, as we got rid of `['_delimiter_']`
  - get rid of the vars containing `2` and `3` in their name (?), e.g. in the definition of `n_ctx` <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py` (see the third sketch after this list):
  - modify the call to `dh_model` to use `('classification', 2)` instead of `'multiple_choice'`
  - use (unless it’s bugged!) `ClassificationLossCompute` instead of `MultipleChoiceLossCompute`
- In `analysis.py`, create a new function `snippets` so as to invoke `_snippets` (from `datasets.py`) and read in `snippets_test.csv`, and adjust its call to `_snippets` to take into account that it outputs two lists (not 4) (see the fourth sketch after this list).
- Modify imports in `train.py` coherently with all of the above.
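To make the plan more concrete, here is a rough, untested sketch of what I have in mind for the two `datasets.py` functions (the pandas/`train_test_split` reading is just how I would do it, not necessarily how the repo reads its CSVs, and the `n_train`/`n_valid` defaults are arbitrary):

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary, only to make the train/validation split reproducible


def _snippets(path):
    # read a CSV with a `text` column (string) and a `class` column (0 or 1)
    df = pd.read_csv(path)
    st = df['text'].astype(str).tolist()
    y = df['class'].astype(int).tolist()
    return st, y


def snippets(data_dir, n_train=1000, n_valid=200):
    # the "val" CSV is split into train/validation; the test CSV is kept apart
    st, y = _snippets(os.path.join(data_dir, 'snippets_val.csv'))
    teX, _ = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    trX, vaX, trY, vaY = train_test_split(
        st, y, train_size=n_train, test_size=n_valid, random_state=SEED)
    return (trX, trY), (vaX, vaY), (teX,)
```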
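For the encoding bit, I imagine something along these lines (this assumes the special tokens are currently added as `_start_`, `_delimiter_` and `_classify_`, and that `n_ctx` is currently computed from the three ROCStories inputs with a `+ 3`; `trX`, `vaX`, `teX` below would be the already-BPE-encoded snippets):

```python
# special tokens: only a start token and a classification token, no delimiter
encoder['_start_'] = len(encoder)
encoder['_classify_'] = len(encoder)
clf_token = encoder['_classify_']
n_special = 2  # was 3 when '_delimiter_' existed

# context size: longest snippet plus the two special tokens (instead of the
# '+ 3' used for the paired ROCStories inputs), capped at the model's n_ctx
max_len = n_ctx - 2
n_ctx = min(max(len(x[:max_len]) for x in trX + vaX + teX) + 2, n_ctx)
```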
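For the model and loss, I would expect the change to boil down to something like this (assuming `DoubleHeadModel` accepts a `('classification', n_class)` task head and that `ClassificationLossCompute` takes the same constructor arguments as `MultipleChoiceLossCompute`; I haven’t verified either):

```python
# binary classification head instead of the multiple-choice head
dh_model = DoubleHeadModel(args, clf_token, ('classification', 2), vocab, n_ctx)

criterion = nn.CrossEntropyLoss(reduce=False)
# classification loss (plus the language-modelling auxiliary loss weighted by lm_coef)
compute_loss_fct = ClassificationLossCompute(criterion, criterion,
                                             args.lm_coef, model_opt)
```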
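And for `analysis.py`, a sketch of the new `snippets` function, modelled on what I understand the ROCStories one does (I’m assuming the predictions file is tab-separated with a `prediction` column, as the existing code seems to write; `_snippets` is the loader sketched above):

```python
import os

import pandas as pd
from sklearn.metrics import accuracy_score

from datasets import _snippets


def snippets(data_dir, pred_path, log_path):
    preds = pd.read_csv(pred_path, delimiter='\t')['prediction'].values.tolist()
    # _snippets returns two lists (texts and labels), not four as _rocstories does
    _, labels = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    test_accuracy = accuracy_score(labels, preds) * 100.
    print('Snippets test accuracy: %.2f' % test_accuracy)
```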
Does all of the above make sense as a plan, or can somebody fill in missing bits or provide an alternative list of “sub-steps”?
Also, can someone provide some guidance on how to rewrite `transform_roc`? (Comments on the original code would be fantastic; I am glad to annotate the original function and contribute to the repo as a result of this!)
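In case it helps to have something concrete to comment on, here is my rough, untested attempt at `transform_snippet`, modelled on `transform_roc` but with a single input and no delimiter. I’m passing `encoder`, `max_len`, `n_ctx`, `n_vocab`, `n_special` and `clf_token` explicitly only to keep the sketch self-contained; in `train.py` I believe `transform_roc` just picks them up from the surrounding scope:

```python
import numpy as np


def transform_snippet(X1, encoder, max_len, n_ctx, n_vocab, n_special, clf_token):
    """Turn a list of BPE-encoded snippets into (xmb, mmb) arrays for the model."""
    n_batch = len(X1)
    # a size-1 'choice' dimension (no pair of endings to compare); channel 0
    # holds token ids and channel 1 holds position ids
    xmb = np.zeros((n_batch, 1, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 1, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, x1 in enumerate(X1):  # x1 is one snippet, i.e. a list of token ids
        x = [start] + x1[:max_len] + [clf_token]
        l = len(x)
        xmb[i, 0, :l, 0] = x
        mmb[i, 0, :l] = 1  # mask marks the real (non-padding) positions
    # position ids start after the vocabulary and the special tokens
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
```

Whether the size-1 second dimension should be kept or squeezed for the `('classification', 2)` head is exactly the part I’m unsure about.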
Thanks to anyone patiently reading this!
Issue Analytics
- Created: 5 years ago
- Comments: 7 (1 by maintainers)
Top GitHub Comments
Hi @thomwolf, thanks for your reply and tip!
As advertised, I forked the code; you can find the result at https://github.com/huggingface/pytorch-openai-transformer-lm/compare/master...davidefiocco:master and that specific edit can be found at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26
In the fork’s naming, `news` stands for “newsgroup”, as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven’t been successful in using the algorithm yet (the code now runs without errors, but iterations don’t seem to converge). I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I’ll be very happy to work on it.
Hi @davidefiocco, your `transform_snippet` function should be the way to go. I think it’s just a Python typo. Looks like your `l12` is equal to one. Probably comes from this line: `for i, (x1), in enumerate(X1)`. Try using `for i, x1 in enumerate(X1)`.