How should one modify the code to successfully run text classification?
Hi,
I am new to PyTorch (but still more at ease with it than TF), so I thought to experiment with @thomwolf’s implementation in this repo (thanks for sharing it!!)
I would like to try out the code to perform binary text classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.
These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):
- Create two sets, `snippets_val.csv` and `snippets_test.csv`, containing two columns, `text` (a string) and `class` (an int equal to 0 or 1).
- In `datasets.py`, create two new functions: `_snippets`, returning two lists `st, y`, and `snippets`, defined with different values of `n_train` and `n_valid` and whose return statement looks like `return (trX, trY), (vaX, vaY), (teX,)` (see the first sketch after this list).
- In `train.py`, rewrite `transform_roc` into a `transform_snippet` that doesn’t use `[delimiter]` and takes only one argument as input <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py`, in the encoding bit and afterwards (see the second sketch after this list):
  - modify the tuple in output of `encode_dataset` to match the output of the `snippets` function redefined above
  - get rid of `encoder['_delimiter_'] = len(encoder)`
  - set `n_special = 2`, as we got rid of `['_delimiter_']`
  - get rid of the vars containing `2` and `3` in their name (?), e.g. in the definition of `n_ctx` <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py` (see the third sketch after this list):
  - modify the call to `dh_model` to use `('classification', 2)` instead of `'multiple_choice'`
  - use (unless it’s bugged!) `ClassificationLossCompute` instead of `MultipleChoiceLossCompute`
- In `analysis.py`, create a new function `snippets` so as to invoke `_snippets` (from `datasets.py`) and read in `snippets_test.csv`, and adjust its call to `_snippets` to take into account that it outputs two lists (not 4) (see the fourth sketch after this list).
- Modify imports in `train.py` coherently with all of the above.
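To make the plan more concrete, here is a rough, untested sketch of what I have in mind for the two `datasets.py` functions (the pandas/`train_test_split` reading is just how I would do it, not necessarily how the repo reads its CSVs, and the `n_train`/`n_valid` defaults are arbitrary):

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary, only to make the train/validation split reproducible


def _snippets(path):
    # read a CSV with a `text` column (string) and a `class` column (0 or 1)
    df = pd.read_csv(path)
    st = df['text'].astype(str).tolist()
    y = df['class'].astype(int).tolist()
    return st, y


def snippets(data_dir, n_train=1000, n_valid=200):
    # the "val" CSV is split into train/validation; the test CSV is kept apart
    st, y = _snippets(os.path.join(data_dir, 'snippets_val.csv'))
    teX, _ = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    trX, vaX, trY, vaY = train_test_split(
        st, y, train_size=n_train, test_size=n_valid, random_state=SEED)
    return (trX, trY), (vaX, vaY), (teX,)
```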
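For the encoding bit, I imagine something along these lines (this assumes the special tokens are currently added as `_start_`, `_delimiter_` and `_classify_`, and that `n_ctx` is currently computed from the three ROCStories inputs with a `+ 3`; `trX`, `vaX`, `teX` below would be the already-BPE-encoded snippets):

```python
# special tokens: only a start token and a classification token, no delimiter
encoder['_start_'] = len(encoder)
encoder['_classify_'] = len(encoder)
clf_token = encoder['_classify_']
n_special = 2  # was 3 when '_delimiter_' existed

# context size: longest snippet plus the two special tokens (instead of the
# '+ 3' used for the paired ROCStories inputs), capped at the model's n_ctx
max_len = n_ctx - 2
n_ctx = min(max(len(x[:max_len]) for x in trX + vaX + teX) + 2, n_ctx)
```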
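For the model and loss, I would expect the change to boil down to something like this (assuming `DoubleHeadModel` accepts a `('classification', n_class)` task head and that `ClassificationLossCompute` takes the same constructor arguments as `MultipleChoiceLossCompute`; I haven’t verified either):

```python
# binary classification head instead of the multiple-choice head
dh_model = DoubleHeadModel(args, clf_token, ('classification', 2), vocab, n_ctx)

criterion = nn.CrossEntropyLoss(reduce=False)
# classification loss (plus the language-modelling auxiliary loss weighted by lm_coef)
compute_loss_fct = ClassificationLossCompute(criterion, criterion,
                                             args.lm_coef, model_opt)
```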
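And for `analysis.py`, a sketch of the new `snippets` function, modelled on what I understand the ROCStories one does (I’m assuming the predictions file is tab-separated with a `prediction` column, as the existing code seems to write; `_snippets` is the loader sketched above):

```python
import os

import pandas as pd
from sklearn.metrics import accuracy_score

from datasets import _snippets


def snippets(data_dir, pred_path, log_path):
    preds = pd.read_csv(pred_path, delimiter='\t')['prediction'].values.tolist()
    # _snippets returns two lists (texts and labels), not four as _rocstories does
    _, labels = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    test_accuracy = accuracy_score(labels, preds) * 100.
    print('Snippets test accuracy: %.2f' % test_accuracy)
```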
Does all of the above make sense as a plan, or can somebody fill in missing bits or provide an alternative list of “sub-steps”?
Also, can someone provide some guidance on how to rewrite `transform_roc`? (Comments on the original code would be fantastic; I am glad to annotate the original function and contribute to the repo as a result of this!)
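In case it helps to have something concrete to comment on, here is my rough, untested attempt at `transform_snippet`, modelled on `transform_roc` but with a single input and no delimiter. I’m passing `encoder`, `max_len`, `n_ctx`, `n_vocab`, `n_special` and `clf_token` explicitly only to keep the sketch self-contained; in `train.py` I believe `transform_roc` just picks them up from the surrounding scope:

```python
import numpy as np


def transform_snippet(X1, encoder, max_len, n_ctx, n_vocab, n_special, clf_token):
    """Turn a list of BPE-encoded snippets into (xmb, mmb) arrays for the model."""
    n_batch = len(X1)
    # a size-1 'choice' dimension (no pair of endings to compare); channel 0
    # holds token ids and channel 1 holds position ids
    xmb = np.zeros((n_batch, 1, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 1, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, x1 in enumerate(X1):  # x1 is one snippet, i.e. a list of token ids
        x = [start] + x1[:max_len] + [clf_token]
        l = len(x)
        xmb[i, 0, :l, 0] = x
        mmb[i, 0, :l] = 1  # mask marks the real (non-padding) positions
    # position ids start after the vocabulary and the special tokens
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
```

Whether the size-1 second dimension should be kept or squeezed for the `('classification', 2)` head is exactly the part I’m unsure about.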
Thanks to anyone patiently reading this!
Issue Analytics
- Created: 5 years ago
- Comments: 7 (1 by maintainers)
Top GitHub Comments
Hi @thomwolf, thanks for your reply and tip!
As advertised, I forked the code; you can find the result at https://github.com/huggingface/pytorch-openai-transformer-lm/compare/master...davidefiocco:master and that specific edit can be found at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26
In the fork’s naming, `news` stands for “newsgroup”, as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven’t been successful in using the algorithm yet (the code now runs without errors, but iterations don’t seem to converge). I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I’ll be very happy to work on it.
Hi @davidefiocco, your `transform_snippet` function should be the way to go. I think it’s just a Python typo. Looks like your `l12` is equal to one. Probably comes from this line: `for i, (x1), in enumerate(X1)`. Try using `for i, x1 in enumerate(X1)`.