Custom tokenizer layer
Hi, I would like to incorporate the tokenization process into a model that uses a BERT layer. Here is my custom layer:
class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_length, **kwargs):
        self.vocab_path = vocab_path
        self.length = max_length
        self.tokenizer = bert.bert_tokenization.FullTokenizer(vocab_path, do_lower_case=False)
        super(TokenizationLayer, self).__init__(**kwargs)

    def call(self, inputs):
        tokens = self.tokenizer.tokenize(inputs)
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        ids += [self.tokenizer.vocab['[PAD]']] * (self.length - len(ids))
        return ids
And here is my code to test the custom layer within a dummy model:
inputs = tf.keras.layers.Input(shape=(), dtype='string')
tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
I get the following traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-68-8df4885e5c7a> in <module>
1 inputs = tf.keras.layers.Input(shape=(), dtype='string')
2 tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
----> 3 outputs = tokenization_layer(inputs)
4 model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
924 if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
925 return self._functional_construction_call(inputs, args, kwargs,
--> 926 input_list)
927
928 # Maintains info about the `Layer.call` stack.
/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
1115 try:
1116 with ops.enable_auto_cast_variables(self._compute_dtype_object):
-> 1117 outputs = call_fn(cast_inputs, *args, **kwargs)
1118
1119 except errors.OperatorNotAllowedInGraphError as e:
/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
256 except Exception as e: # pylint:disable=broad-except
257 if hasattr(e, 'ag_error_metadata'):
--> 258 raise e.ag_error_metadata.to_exception(e)
259 else:
260 raise
ValueError: in user code:
<ipython-input-60-d6c12f7d1b14>:17 call *
tokens = self.tokenizer.tokenize(inputs)
/s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:172 tokenize *
for token in self.basic_tokenizer.tokenize(text):
/s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:198 tokenize *
text = convert_to_unicode(text)
/s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:86 convert_to_unicode *
raise ValueError("Unsupported string type: %s" % (type(text)))
ValueError: Unsupported string type: <class 'tensorflow.python.framework.ops.Tensor'>
Can you please help me solve this issue? I think the problem is that the tokenizer receives tensors rather than strings, which is why it can't tokenize them. If that is the case, how should I make this work? Thanks
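The diagnosis matches the last frame of the traceback: the tokenizer's type check accepts only Python `str` or `bytes`, and what Keras hands to `call` while building the functional model is a symbolic tensor. The plain-Python sketch below imitates that check so the failure can be reproduced without TensorFlow. It is a simplified illustration, not the library's actual source: `FakeSymbolicTensor` and `accepts` are hypothetical names introduced here, and `convert_to_unicode` is a reduced version of the function named in the traceback.

```python
class FakeSymbolicTensor:
    """Hypothetical stand-in for the symbolic tensor Keras passes to call()."""

    def __init__(self, name):
        self.name = name


def convert_to_unicode(text):
    """Simplified imitation of the type check that raises in the traceback."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        # Raw bytes are decoded to text, ignoring invalid sequences.
        return text.decode("utf-8", "ignore")
    # Anything else, including a tensor object, is rejected.
    raise ValueError("Unsupported string type: %s" % type(text))


def accepts(value):
    """Return True if the check passes, False if it raises ValueError."""
    try:
        convert_to_unicode(value)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for candidate in ("some text", b"some bytes", FakeSymbolicTensor("inputs")):
        print(type(candidate).__name__, accepts(candidate))
```

Because the check depends on the Python type and not on the tensor's dtype, passing `dtype=tf.string` to the layer does not change the outcome. The usual directions are therefore either to run the Python tokenizer outside the graph (for example, as a preprocessing step before the model) or to use a tokenizer implemented with TensorFlow ops, such as the in-graph WordPiece tokenizers; which one fits depends on whether the model has to be saved and served with the tokenizer included.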
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
hope that helps:
and then try something along those lines:
It didn't work; it still throws OperatorNotAllowedInGraphError.