How to properly apply a tokenizer map function to a TensorFlow batched dataset?
See original GitHub issue.
Considering the following batched_dataset:
import tensorflow as tf

samples = [
    {"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
    {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
    {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
    {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"},
]

dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})
batched_dataset = dataset.batch(2)
#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is one relevant document regarding query 1',
# b'this is one relevant document regarding query 2'], dtype=object)>,
#
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is a query 1',
# b'this is a query 2'], dtype=object)>
#}
and a map function to tokenize this batched_dataset:
def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc)
I could tokenize the entire batched_dataset using a for-loop:
for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 1037, 23032, 1015, 102, 0],
# [ 101, 2023, 2003, 1037, 23032, 1016, 102, 0]],
# dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 0],
# [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>},
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102],
# [ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102]], dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
# ...
However, when using tf.data.Dataset.map, the following error arises:
tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'
How, then, does one properly apply a tokenizer map function to a batched dataset?
Note: I published a working example on Google Colab.
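For context, a common workaround here (a sketch, not part of the original issue) is to wrap the eager tokenization in tf.py_function, which hands the wrapped function eager tensors so .numpy() works inside map. The bert-base-uncased checkpoint and max_length=8 are assumptions chosen to mirror the outputs shown above:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

def tokenize_eager(query, doc):
    # Inside tf.py_function the tensors are eager, so .numpy() is available.
    q = tokenizer([s.decode("utf-8") for s in query.numpy()],
                  padding="max_length", max_length=8, truncation=True, return_tensors="tf")
    d = tokenizer([s.decode("utf-8") for s in doc.numpy()],
                  padding="max_length", max_length=8, truncation=True, return_tensors="tf")
    return q["input_ids"], q["attention_mask"], d["input_ids"], d["attention_mask"]

def tokenize_map(batch):
    # tf.py_function only passes flat tensors, so the dicts are rebuilt here.
    q_ids, q_mask, d_ids, d_mask = tf.py_function(
        tokenize_eager,
        inp=[batch["query"], batch["doc"]],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    return ({"input_ids": q_ids, "attention_mask": q_mask},
            {"input_ids": d_ids, "attention_mask": d_mask})

tokenized_dataset = batched_dataset.map(tokenize_map)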
Issue Analytics
- Created: 3 years ago
- Comments: 12 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The ideal would be to follow the pipeline (read from the file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.
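As a rough illustration of that ordering (a sketch, not from the original comment; the file name, tab-separated layout, and parse_line helper are hypothetical, and tokenize_map is the tf.py_function wrapper sketched earlier):

import tensorflow as tf

def parse_line(line):
    # Hypothetical parser: assumes one tab-separated "query\tdoc" pair per line.
    parts = tf.strings.split(line, "\t")
    return {"query": parts[0], "doc": parts[1]}

dataset = (
    tf.data.TextLineDataset("pairs.tsv")  # read from the file
    .map(parse_line)                      # structure each line
    .batch(32)                            # generate batches
    .map(tokenize_map)                    # tokenize
    .prefetch(tf.data.AUTOTUNE))          # overlap preprocessing with training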
TensorFlow, when dealing with text, generates string tensors that are stored as byte strings (e.g. b'this is a query 1' in the batch output above). However, I didn't find an efficient way to decode this kind of tensor as a list of strings. It's even worse if the byte string contains a non-ASCII character.
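For what it's worth, one way to do that decoding on an eager batch (a sketch, assuming the text is UTF-8 encoded):

# Works only on eager tensors, e.g. when iterating the dataset directly;
# .decode("utf-8") also handles non-ASCII bytes in UTF-8 text.
strings = [b.decode("utf-8") for b in batch["query"].numpy()]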
What I really need is one of these two options:
Thank you very much for all your help.
This seems like more of a TF-related question than a Transformers-related question. The issue seems to stem from your code trying to get the value of a non-eager tensor using numpy. I believe the tf.data.Dataset.map method must trace its inputs, resulting in the tensors not being eager. Couldn't you build the tf.data.Dataset with already-tokenized inputs instead?
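A sketch of that suggestion, reusing the samples list from the issue (the checkpoint and padding settings are assumptions, not code from the thread):

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

queries = [s["query"] for s in samples]
docs = [s["doc"] for s in samples]

# Tokenize eagerly, before entering the tf.data pipeline, so no .numpy()
# calls are needed inside map.
q = tokenizer(queries, padding=True, truncation=True, return_tensors="tf")
d = tokenizer(docs, padding=True, truncation=True, return_tensors="tf")

tokenized_dataset = tf.data.Dataset.from_tensor_slices((dict(q), dict(d))).batch(2)

Since the tensors are built up front, map never has to leave graph mode, which sidesteps the AttributeError entirely.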