
ANTIQUE tokenization technique

See original GitHub issue

In the new “Tutorial: TF-Ranking for sparse features” notebook, an interesting QA dataset, ANTIQUE, is introduced. I was inspecting the proto TFRecords and saw that an interesting tokenization routine was used. For example, consider an original query from the dataset and its tokenized version:

Original: why do human beeing hv to belive in god ?
Tokenized: ['why' 'do' 'human' 'bee' '##ing' 'h' '##v' 'to' 'bel' '##ive' 'in' 'god' '?']

## markers are inserted, words are split into subwords, etc. What is this technique called?
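The `##` continuation prefix is characteristic of WordPiece-style subword tokenization (the scheme BERT uses): out-of-vocabulary words are split into the longest vocabulary pieces available, and every non-initial piece is marked with `##`. As a hedged illustration (the toy vocabulary below is an assumption, not the real one), a greedy longest-match-first lookup reproduces the splits above:

```python
# Minimal sketch of greedy longest-match-first subword tokenization,
# the "##" continuation-marker style seen in the ANTIQUE records.
# The vocabulary is a toy assumption for illustration.
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until
        # a vocabulary piece matches.
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark non-initial pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no subword covers this position
        tokens.append(cur)
        start = end
    return tokens

vocab = {"why", "bee", "##ing", "h", "##v", "bel", "##ive"}
print(wordpiece_tokenize("beeing", vocab))  # ['bee', '##ing']
print(wordpiece_tokenize("hv", vocab))      # ['h', '##v']
```

Real WordPiece tokenizers work against a learned vocabulary of tens of thousands of pieces; the greedy matching logic is the same.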

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:10 (4 by maintainers)

Top GitHub Comments

2 reactions
eggie5 commented, Aug 6, 2019

Also, here’s some Python code to create an EIE (example-in-example) proto for TF-Ranking:

import tensorflow as tf

def make_eie(context: dict, examples: list) -> tf.train.Example:
    """Build an example-in-example (EIE) proto for TF-Ranking."""
    # Serialize the shared context (query-level) features.
    context_feature = {
      'qid': tf.train.Feature(bytes_list=tf.train.BytesList(value=[context["qid"].encode('ascii')])),
      'uid': tf.train.Feature(bytes_list=tf.train.BytesList(value=[context["uid"].encode('ascii')])),
    }
    context_proto = tf.train.Example(features=tf.train.Features(feature=context_feature))
    serialized_context = context_proto.SerializeToString()

    # Serialize each per-example (document-level) proto.
    serialized_examples = []
    for example in examples:
        example_feature = {
          'rid': tf.train.Feature(int64_list=tf.train.Int64List(value=[example["rid"]])),
          'rel': tf.train.Feature(int64_list=tf.train.Int64List(value=[example["rel"]])),
        }
        example_proto = tf.train.Example(features=tf.train.Features(feature=example_feature))
        serialized_examples.append(example_proto.SerializeToString())

    # Wrap the serialized context and examples in an outer container proto.
    eie_features = {
        "serialized_context": tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
        "serialized_examples": tf.train.Feature(bytes_list=tf.train.BytesList(value=serialized_examples)),
    }
    eie_proto = tf.train.Example(features=tf.train.Features(feature=eie_features))
    return eie_proto
    
context = {"qid": "asdf", "uid": "fdsa"}
examples = [{"rid": 1, "rel": 1}]
eie = make_eie(context, examples)
print(eie)
print(eie.SerializeToString())

First Serialization

features {
  feature {
    key: "serialized_context"
    value {
      bytes_list {
        value: "\n\"\n\017\n\003qid\022\010\n\006\n\004asdf\n\017\n\003uid\022\010\n\006\n\004fdsa"
      }
    }
  }
  feature {
    key: "serialized_examples"
    value {
      bytes_list {
        value: "\n\034\n\014\n\003rid\022\005\032\003\n\001\001\n\014\n\003rel\022\005\032\003\n\001\001"
      }
    }
  }
}

Second Serialization

b'\n{\n>\n\x12serialized_context\x12(\n&\n$\n"\n\x0f\n\x03qid\x12\x08\n\x06\n\x04asdf\n\x0f\n\x03uid\x12\x08\n\x06\n\x04fdsa\n9\n\x13serialized_examples\x12"\n \n\x1e\n\x1c\n\x0c\n\x03rid\x12\x05\x1a\x03\n\x01\x01\n\x0c\n\x03rel\x12\x05\x1a\x03\n\x01\x01'
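In practice the serialized EIE protos are written to a TFRecord file for training. As a hedged sketch (the file path and the single-feature record are illustrative assumptions, not the notebook's actual pipeline), writing and reading back looks like this:

```python
import os
import tempfile

import tensorflow as tf

# Illustrative record: a minimal Example so the snippet is
# self-contained; in practice you'd write make_eie(...) outputs.
record = tf.train.Example(features=tf.train.Features(feature={
    "qid": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"q1"])),
})).SerializeToString()

# The path is a temporary-file assumption for illustration.
path = os.path.join(tempfile.mkdtemp(), "antique.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(record)

# Read the records back with tf.data.
ds = tf.data.TFRecordDataset(path)
n = sum(1 for _ in ds)
print(n)  # 1
```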

2 reactions
eggie5 commented, Aug 6, 2019

@rbrackney I don’t think there is any discrepancy here. As you can see, one of the features in your proto above is document_tokens, and this is the same feature that the categorical_column_with_vocabulary_file function looks for in the ANTIQUE notebook.

Maybe the confusion comes from the fact that you have to deserialize the EIE twice to get what you pasted above. The first deserialization gives you two features, serialized_context and serialized_examples, both of which are serialized Example protos. The output you posted looks like it comes from deserializing serialized_examples, right?
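The two-step deserialization described above can be sketched as follows. This is a hedged illustration matching the writer earlier in the thread: the toy `qid` feature and the inline record construction are assumptions, not the notebook's actual parsing code (TF-Ranking ships its own parsing utilities).

```python
import tensorflow as tf

# Build a tiny EIE record with the same outer layout as make_eie above
# ("serialized_context" / "serialized_examples"); the "qid" feature is
# a toy assumption for illustration.
def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

ctx = tf.train.Example(features=tf.train.Features(feature={
    "qid": bytes_feature([b"asdf"]),
})).SerializeToString()
eie = tf.train.Example(features=tf.train.Features(feature={
    "serialized_context": bytes_feature([ctx]),
    "serialized_examples": bytes_feature([ctx]),  # placeholder payload
})).SerializeToString()

# First deserialization: recover the two serialized sub-protos.
outer = tf.io.parse_single_example(eie, {
    "serialized_context": tf.io.FixedLenFeature([], tf.string),
    "serialized_examples": tf.io.VarLenFeature(tf.string),
})

# Second deserialization: decode the context proto itself.
context = tf.io.parse_single_example(outer["serialized_context"], {
    "qid": tf.io.FixedLenFeature([], tf.string),
})
print(context["qid"].numpy())  # b'asdf'
```

The per-document features come out the same way, by parsing each entry of `serialized_examples` with its own feature spec.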

Read more comments on GitHub >

