Cannot specify max_length more than 512
Hello,
I’ve tried to use a max_length greater than 512 to featurize text:
import finetune

model = finetune.Classifier()
trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)
But I got the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-d3d9e8b820e5> in <module>()
----> 1 trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)
/opt/conda/lib/python3.6/site-packages/finetune/classifier.py in featurize(self, X, max_length)
24 :returns: np.array of features of shape (n_examples, embedding_size).
25 """
---> 26 return super().featurize(X, max_length=max_length)
27
28 def predict(self, X, max_length=None):
/opt/conda/lib/python3.6/site-packages/finetune/base.py in featurize(self, *args, **kwargs)
386 These features are the same that are fed into the target_model.
387 """
--> 388 return self._featurize(*args, **kwargs)
389
390 @classmethod
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _featurize(self, Xs, max_length)
371 warnings.filterwarnings("ignore")
372 max_length = max_length or self.config.max_length
--> 373 for xmb, mmb in self._infer_prep(Xs, max_length=max_length):
374 feature_batch = self.sess.run(self.features, {
375 self.X: xmb,
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _infer_prep(self, Xs, max_length)
400 def _infer_prep(self, Xs, max_length=None):
401 max_length = max_length or self.config.max_length
--> 402 arr_encoded = self._text_to_ids(Xs, max_length=max_length)
403 n_batch_train = self.config.batch_size * max(len(self.config.visible_gpus), 1)
404 self._build_model(n_updates_total=0, target_dim=self.target_dim, train=False)
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _text_to_ids(self, Xs, Y, max_length)
156 else:
157 encoder_out = self.encoder.encode_multi_input(Xs, Y=Y, max_length=max_length)
--> 158 return self._array_format(encoder_out)
159
160
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _array_format(self, encoded_output)
421 for i, seq_length in enumerate(seq_lengths):
422 # BPE embedding
--> 423 x[i, :seq_length, 0] = encoded_output.token_ids[i]
424 # masking: value of 1 means "consider this in cross-entropy LM loss"
425 mask[i, 1:seq_length] = 1
ValueError: cannot copy sequence with size 667 to array axis with dimension 512
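The traceback shows where the limit bites: _array_format allocates the model input with a sequence axis of 512 positions (the positional-embedding limit inherited from the original OpenAI transformer, as the maintainers note below), so the 667-token encoding produced under max_length=1000 cannot be copied into it. Until the model itself supports longer contexts, one workaround is to keep each input under the limit, for example by splitting long documents into chunks and pooling the chunk features. The sketch below is illustrative only: featurize_long, the 300-word chunk size, and mean pooling are assumptions, not part of the finetune API, and the word-count heuristic does not guarantee the BPE encoding stays under 512 tokens.

import numpy as np
import finetune

def featurize_long(model, texts, words_per_chunk=300):
    # Hypothetical helper: split each document into word-level chunks that are
    # likely to encode to fewer than 512 BPE tokens, featurize the chunks, and
    # mean-pool the chunk vectors into a single document vector.
    doc_vectors = []
    for text in texts:
        words = text.split()
        chunks = [
            " ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)
        ] or [text]
        chunk_feats = model.featurize(chunks)  # (n_chunks, embedding_size)
        doc_vectors.append(chunk_feats.mean(axis=0))
    return np.stack(doc_vectors)

model = finetune.Classifier()
trn_X_q_vecs = featurize_long(model, trn_X_q)

Mean pooling is only the simplest aggregation; max pooling or keeping just the first chunk are equally easy swaps depending on the task.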
@thinline72 also thanks for the recommendation to take a peek at sentencepiece! Might have to use that the next time we’re training a model from scratch – tokenization + de-tokenization has been a huge pain in this repository; perhaps a project like sentencepiece could help clean that up.

@thinline72 correct, this was a necessity inherited from the original OpenAI repository. We’re keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.
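For reference, the kind of lossless tokenize/de-tokenize round trip being discussed looks roughly like the following with sentencepiece. The corpus file corpus.txt, the model prefix m, and the vocabulary size are placeholder values, and none of this is part of finetune.

import sentencepiece as spm

# Train a small subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.Train("--input=corpus.txt --model_prefix=m --vocab_size=8000")

sp = spm.SentencePieceProcessor()
sp.Load("m.model")

pieces = sp.EncodeAsPieces("Cannot specify max_length more than 512")
ids = sp.EncodeAsIds("Cannot specify max_length more than 512")
text = sp.DecodePieces(pieces)  # de-tokenization recovers the input (modulo normalization)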