
How to properly apply a tokenizer map function to a TensorFlow batched dataset?


Consider the following batched_dataset:

import tensorflow as tf

samples = [{"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
           {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
           {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
           {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"}]

dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})

batched_dataset = dataset.batch(2)

#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is one relevant document regarding query 1',
#      b'this is one relevant document regarding query 2'], dtype=object)>,
# 
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is a query 1', 
#      b'this is a query 2'], dtype=object)>
#}

and a map function to tokenize this batched_dataset:

def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc) 

I could tokenize the entire batched_dataset using a for-loop:

for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[  101,  2023,  2003,  1037, 23032,  1015,   102,     0],
#          [  101,  2023,  2003,  1037, 23032,  1016,   102,     0]],
#      dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 0],
#          [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}, 

# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102],
#          [ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102]], dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 1],
#          [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
#  ...

However, when using tf.data.Dataset.map, the following error arises:

tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'

Then, how does one properly apply a tokenizer map function to a batched dataset?

Note: I published a working example on Google Colab.
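
For reference, a common TensorFlow workaround for this error is to wrap the eager tokenization in tf.py_function, which makes .numpy() available inside map. Below is a minimal sketch, not taken from the Colab example, assuming a Hugging Face tokenizer such as bert-base-uncased:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer

def tokenize_eager(query, doc):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    q = tokenizer.batch_encode_plus([s.decode("utf-8") for s in query.numpy()],
                                    padding=True, truncation=True, return_tensors="tf")
    d = tokenizer.batch_encode_plus([s.decode("utf-8") for s in doc.numpy()],
                                    padding=True, truncation=True, return_tensors="tf")
    return q["input_ids"], q["attention_mask"], d["input_ids"], d["attention_mask"]

def tokenize_map(batch):
    # tf.py_function only returns flat tensors, so repack them into dicts afterwards.
    q_ids, q_mask, d_ids, d_mask = tf.py_function(
        tokenize_eager,
        inp=[batch["query"], batch["doc"]],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    return ({"input_ids": q_ids, "attention_mask": q_mask},
            {"input_ids": d_ids, "attention_mask": d_mask})

tokenized_dataset = batched_dataset.map(tokenize_map)

Note that tf.py_function runs the Python code outside the traced graph, so it trades some of tf.data's performance for the convenience of calling the tokenizer directly.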

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

3 reactions
celsofranssa commented, Apr 24, 2020

The ideal would be to follow the pipeline (read from the file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.

When dealing with text, TensorFlow generates string tensors that are stored as byte strings:

<tf.Tensor: shape=(2,), dtype=string, numpy=array(
     [b'Thê first utf-8 string of the batçh.',
      b'Thê secônd utf-8 string of the batçh.'], dtype=object)>

However, I didn’t find an efficient way to decode this kind of tensor into a list of strings. It’s even worse if the byte strings contain non-ASCII characters.

What I really need is one of these two options:

  1. a tokenizer that is able to accept the aforementioned byte string tensor as input; or
  2. a vectorized approach to transforming a byte string tensor into a string list (a rough sketch of a workaround follows this comment).

Thank you very much for all your help.
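
Regarding option 2, here is a rough sketch of one way to turn a byte string tensor into a Python list of str. It decodes element by element via UTF-8 rather than being truly vectorized, but it avoids the non-ASCII problem mentioned above:

import tensorflow as tf

batch = tf.constant(["Thê first utf-8 string of the batçh.",
                     "Thê secônd utf-8 string of the batçh."])

# .numpy() yields an array of bytes objects; decode each one explicitly as UTF-8.
strings = [b.decode("utf-8") for b in batch.numpy()]
print(strings)  # ['Thê first utf-8 string of the batçh.', ...]

This still requires an eager tensor, so inside Dataset.map it would again need the tf.py_function wrapping sketched earlier.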

2 reactions
LysandreJik commented, Apr 23, 2020

This seems like more of a TF-related question than a Transformers-related question. The issue seems to stem from your code trying to get the value of a tensor which is not eager, using numpy. I believe the tf.data.Dataset.map method must trace its inputs, resulting in the tensors not being eager.

Couldn’t you build the tf.data.Dataset with already tokenized inputs instead?
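
A minimal sketch of that suggestion, assuming a Hugging Face tokenizer and reusing the samples list from the question: tokenize the plain Python strings up front, then build the tf.data.Dataset from the already tokenized tensors.

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer

queries = [s["query"] for s in samples]  # plain Python strings, before any tf.data step
docs = [s["doc"] for s in samples]

# Tokenize up front, while the text is still ordinary Python data.
q_enc = tokenizer(queries, padding=True, truncation=True, return_tensors="tf")
d_enc = tokenizer(docs, padding=True, truncation=True, return_tensors="tf")

# Build the dataset from the already tokenized tensors, then batch as before.
dataset = tf.data.Dataset.from_tensor_slices((dict(q_enc), dict(d_enc)))
batched_dataset = dataset.batch(2)

This keeps the whole pipeline traceable by tf.data, at the cost of tokenizing the full dataset in memory before training.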
