How should one modify the code to successfully run text classification?

Hi,

I am new to PyTorch (but still more at ease with it than with TF), so I thought I would experiment with @thomwolf's implementation in this repo (thanks for sharing it!).

I would like to use the code for binary classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.

These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):

  1. Create two files, snippets_val.csv and snippets_test.csv, each containing two columns: text (a string) and class (an int equal to 0 or 1).
  2. In datasets.py create two new functions:
    • _snippets returning two lists st, y, and
    • snippets defined with different values of n_train and n_valid and whose return statement looks like return (trX, trY), (vaX, vaY), (teX, )
  3. In train.py, rewrite transform_roc into a transform_snippet that doesn’t use [delimiter] and takes only one input argument. This part is somewhat tricky for me; can anyone provide some guidance?
  4. In train.py, in the encoding bit and afterwards:
  5. In train.py:
  6. In analysis.py:
    • create a new function snippets that invokes _snippets (from datasets.py) to read in snippets_test.csv, and adjust its call to _snippets to account for the fact that it outputs two lists (not 4)
  7. Modify imports in train.py coherently with all of the above.

Does all of the above make sense as a plan, or can somebody fill in the missing bits or provide an alternative list of “sub-steps”? Also, can someone provide some guidance on how to rewrite transform_roc? Comments on the original code would be fantastic; I would be glad to annotate the original function and contribute the result to the repo.

Thanks to anyone patiently reading this!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

2 reactions
davidefiocco commented, Oct 6, 2018

Hi @thomwolf, thanks for your reply and tip!

As advertised, I forked the code. You can find the result at https://github.com/huggingface/pytorch-openai-transformer-lm/compare/master...davidefiocco:master, and that specific edit can be found at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26

In the fork's naming, news stands for “newsgroup”, as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven’t been successful in using the algorithm yet (the code now runs without errors, but training doesn’t seem to converge).

I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I’ll be very happy to work on it.

2 reactions
thomwolf commented, Oct 5, 2018

Hi @davidefiocco, your transform_snippet function should be the way to go. I think it’s just a Python typo: it looks like your l12 is equal to one, which probably comes from this line: for i, (x1), in enumerate(X1). Try using for i, x1 in enumerate(X1) instead.
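
For readers following along, a single-input transform of the kind discussed here could look roughly like the sketch below. It is not the code from the fork. It assumes each snippet is already encoded as a list of token ids; start_token, clf_token, n_ctx, max_len and pos_offset are hypothetical parameters standing in for whatever train.py defines. Plain Python lists keep the sketch dependency-free, whereas real training code would more likely build arrays.

```python
def transform_snippet(X1, start_token, clf_token, n_ctx, max_len, pos_offset):
    """Pad single-text inputs to n_ctx positions and build the matching mask.

    X1 is a list of token-id lists. Returns (xmb, mmb) where
    xmb[i][t] == [token_id, position_id] and mmb[i][t] is 1.0 for real
    tokens and 0.0 for padding. All parameter names are illustrative.
    """
    xmb, mmb = [], []
    # Position ids live in their own id range, starting at pos_offset.
    positions = list(range(pos_offset, pos_offset + n_ctx))
    for i, x1 in enumerate(X1):
        # No delimiter is needed: there is only one text per example.
        x = [start_token] + list(x1[:max_len]) + [clf_token]
        l1 = len(x)
        if l1 > n_ctx:
            raise ValueError(f"snippet {i} needs {l1} positions but n_ctx is {n_ctx}")
        tokens = x + [0] * (n_ctx - l1)  # pad with id 0; the mask hides padding
        xmb.append([[tok, pos] for tok, pos in zip(tokens, positions)])
        mmb.append([1.0] * l1 + [0.0] * (n_ctx - l1))
    return xmb, mmb
```

A quick sanity check for the problem described above is to print l1 for a few snippets: it should equal the truncated snippet length plus two, never a constant one.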
