
Cannot specify max_length more than 512

See original GitHub issue

Hello,

I’ve tried to use a max_length greater than 512 to featurize text:

import finetune

model = finetune.Classifier()
trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)

But I got the following exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-d3d9e8b820e5> in <module>()
----> 1 trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)

/opt/conda/lib/python3.6/site-packages/finetune/classifier.py in featurize(self, X, max_length)
     24         :returns: np.array of features of shape (n_examples, embedding_size).
     25         """
---> 26         return super().featurize(X, max_length=max_length)
     27 
     28     def predict(self, X, max_length=None):

/opt/conda/lib/python3.6/site-packages/finetune/base.py in featurize(self, *args, **kwargs)
    386         These features are the same that are fed into the target_model.
    387         """
--> 388         return self._featurize(*args, **kwargs)
    389 
    390     @classmethod

/opt/conda/lib/python3.6/site-packages/finetune/base.py in _featurize(self, Xs, max_length)
    371             warnings.filterwarnings("ignore")
    372             max_length = max_length or self.config.max_length
--> 373             for xmb, mmb in self._infer_prep(Xs, max_length=max_length):
    374                 feature_batch = self.sess.run(self.features, {
    375                     self.X: xmb,

/opt/conda/lib/python3.6/site-packages/finetune/base.py in _infer_prep(self, Xs, max_length)
    400     def _infer_prep(self, Xs, max_length=None):
    401         max_length = max_length or self.config.max_length
--> 402         arr_encoded = self._text_to_ids(Xs, max_length=max_length)
    403         n_batch_train = self.config.batch_size * max(len(self.config.visible_gpus), 1)
    404         self._build_model(n_updates_total=0, target_dim=self.target_dim, train=False)

/opt/conda/lib/python3.6/site-packages/finetune/base.py in _text_to_ids(self, Xs, Y, max_length)
    156         else:
    157             encoder_out = self.encoder.encode_multi_input(Xs, Y=Y, max_length=max_length)
--> 158             return self._array_format(encoder_out)
    159 
    160 

/opt/conda/lib/python3.6/site-packages/finetune/base.py in _array_format(self, encoded_output)
    421         for i, seq_length in enumerate(seq_lengths):
    422             # BPE embedding
--> 423             x[i, :seq_length, 0] = encoded_output.token_ids[i]
    424             # masking: value of 1 means "consider this in cross-entropy LM loss"
    425             mask[i, 1:seq_length] = 1

ValueError: cannot copy sequence with size 667 to array axis with dimension 512
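
For anyone hitting the same limit: one common workaround, not taken from this thread, is to split long documents into chunks that fit within the 512-token cap, featurize each chunk, and pool the resulting vectors. A minimal sketch assuming the Classifier API above; the character-based chunk size and mean pooling are illustrative assumptions, not finetune features:

import numpy as np
import finetune

model = finetune.Classifier()

# Rough character budget per chunk, chosen so each chunk stays safely
# under the 512-token cap (assumption: a few characters per BPE token).
CHUNK_CHARS = 1500

def featurize_long(texts):
    """Featurize arbitrarily long documents by chunking and mean-pooling."""
    pooled = []
    for text in texts:
        chunks = [text[i:i + CHUNK_CHARS]
                  for i in range(0, len(text), CHUNK_CHARS)] or [""]
        vecs = model.featurize(chunks)        # (n_chunks, embedding_size)
        pooled.append(np.mean(vecs, axis=0))  # one vector per document
    return np.array(pooled)

trn_X_q_vecs = featurize_long(trn_X_q)

Mean pooling is only the simplest choice; max pooling, or weighting each chunk by its length, would be equally reasonable.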

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
madisonmay commented, Sep 7, 2018

@thinline72 also thanks for the recommendation to take a peek at sentencepiece! We might have to use that the next time we’re training a model from scratch; tokenization + de-tokenization has been a huge pain in this repository, and perhaps a project like sentencepiece could help clean that up.
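
For context, sentencepiece trains a subword model whose encode/decode round trip is reversible, which is exactly what makes de-tokenization painless. A minimal sketch, independent of the finetune codebase; the corpus file, model prefix, and vocabulary size are illustrative assumptions:

import sentencepiece as spm

# Train a small subword model on a plain-text corpus (one sentence per
# line). "corpus.txt", the model prefix, and vocab_size are illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy", vocab_size=8000
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
ids = sp.encode("Cannot specify max_length more than 512", out_type=int)
# decode() reverses encode(), so no custom de-tokenization logic is needed.
print(sp.decode(ids))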

1 reaction
madisonmay commented, Sep 7, 2018

@thinline72 correct, this was a necessity inherited from the original OpenAI repository. We’re keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.
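
The 512 cap exists because the pretrained checkpoint ships a fixed-size positional embedding table, so positions beyond it have no learned vector. A toy illustration of the constraint; the shapes are assumptions based on the original OpenAI transformer, not values read from the finetune source:

import numpy as np

# Stand-in for the checkpoint's learned positional embeddings.
n_positions, d_model = 512, 768
pos_embed = np.random.randn(n_positions, d_model)

positions = np.arange(667)  # the 667-token sequence from the traceback
try:
    _ = pos_embed[positions]  # rows 512..666 do not exist
except IndexError as err:
    print(err)  # index 512 is out of bounds for axis 0 with size 512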


Top Results From Across the Web

token indices sequence length is longer than the specified ...
When I use Bert, the "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)"... (a truncation sketch follows this list)

Dictionary property 'Max Length' is not enforced
Check that the dictionary entry for Description specifies a max length of 512. Edit one of the sys_properties descriptions with text that is...

How to use Bert for long text classification? - nlp
Typically set this to something large just in case (e.g., 512 or 1024 or 2048)....

HTML attribute: maxlength - MDN Web Docs
The maxlength attribute defines the maximum number of characters (as UTF-16 code units) the user can enter into an <input> or <textarea> ....

Password length is 128 but error message says it is 255
When the password length was 129 this message is shown, 'Password cannot be longer than 128 characters but is currently 129 characters long....
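
As a postscript to the first result above: on the BERT side, the usual fix is to truncate at encode time. A minimal sketch using the Hugging Face transformers tokenizer — a different library from the finetune package in this issue, and the model name is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# truncation=True caps the sequence at max_length instead of emitting
# the "longer than the specified maximum" warning.
encoded = tokenizer("some very long document " * 200,
                    truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # <= 512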
