Does max_seq_length specify the maximum number of words?
See original GitHub issue

I’m trying to figure out how the --max_seq_length parameter works in run_classifier. Based on the source, it seems like it represents the number of words? Is that correct?
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (3 by maintainers)

max_seq_length specifies the maximum number of tokens in the input. The number of tokens is greater than or equal to the number of words, because the WordPiece tokenizer can split a single word into several subword tokens.
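The split into subword tokens can be sketched with a greedy longest-match WordPiece-style tokenizer. This is a minimal illustration with a tiny hypothetical vocabulary, not BERT's real ~30k-entry vocab, so the exact splits are assumptions:

```python
# Toy vocabulary for illustration only; BERT's actual WordPiece vocab differs.
VOCAB = {"un", "##aff", "##able", "deal", "##s"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split one word into subword tokens via greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no match: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("deals"))      # ['deal', '##s']
```

So one word can become two or three tokens, which is why an input's token count can exceed its word count.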
And depending on the task, 2 to 3 additional special tokens ([CLS] and [SEP]) are added to the input to format it.

@tsungruihon yes, just use smaller sentences.
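The special tokens count toward max_seq_length, so the budget left for actual text is max_seq_length minus 2 (single sentence) or minus 3 (sentence pair). A small sketch of that formatting, with illustrative names rather than the real run_classifier code:

```python
def build_input(tokens_a, tokens_b=None, max_seq_length=128):
    """Format tokens the way BERT expects and check the length budget."""
    # [CLS] A [SEP] uses 2 special tokens; [CLS] A [SEP] B [SEP] uses 3.
    specials = 3 if tokens_b else 2
    budget = max_seq_length - specials
    assert len(tokens_a) + len(tokens_b or []) <= budget, "input too long"
    seq = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b:
        seq += tokens_b + ["[SEP]"]
    return seq

print(build_input(["hello", "world"]))
# ['[CLS]', 'hello', 'world', '[SEP]']
```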
@echan00 no, there is no automatic cut-off, but the tokenizer warns when your inputs are too long and the model will throw an error. You have to limit the size manually.
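Limiting the size manually can be as simple as truncating the token list before the special tokens are added. A hedged sketch, with hypothetical names (not from run_classifier):

```python
def truncate_tokens(tokens, max_seq_length=512, num_special=2):
    """Keep only as many tokens as fit once [CLS]/[SEP] are accounted for."""
    return tokens[: max_seq_length - num_special]

# 600 tokens won't fit in BERT's 512 limit; truncate to 510 + 2 specials.
toks = ["tok%d" % i for i in range(600)]
print(len(truncate_tokens(toks, max_seq_length=512)))  # 510
```

For long documents, people also split the text into overlapping chunks and classify each chunk, but simple truncation is the most direct fix.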