
ANTIQUE tokenization technique

See original GitHub issue

In the new “Tutorial: TF-Ranking for sparse features” notebook, an interesting QA dataset, ANTIQUE, is introduced. I was inspecting the proto TFRecords and saw that an interesting tokenization routine was used. For example, consider an original query from the dataset and its tokenized version:

Original: why do human beeing hv to belive in god ?
Tokenized: ['why' 'do' 'human' 'bee' '##ing' 'h' '##v' 'to' 'bel' '##ive' 'in' 'god' '?']

## markers are inserted, words are split into subwords, etc. What is this technique called?
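The `##` continuation prefix is characteristic of WordPiece-style subword tokenization (the scheme BERT uses): out-of-vocabulary words are split into the longest vocabulary pieces available, and every non-initial piece is marked with `##`. As a hedged illustration (the toy vocabulary below is an assumption, not the real one), a greedy longest-match-first lookup reproduces the splits above:

```python
# Minimal sketch of greedy longest-match-first subword tokenization,
# the "##" continuation-marker style seen in the ANTIQUE records.
# The vocabulary is a toy assumption for illustration.
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until
        # a vocabulary piece matches.
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark non-initial pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no subword covers this position
        tokens.append(cur)
        start = end
    return tokens

vocab = {"why", "bee", "##ing", "h", "##v", "bel", "##ive"}
print(wordpiece_tokenize("beeing", vocab))  # ['bee', '##ing']
print(wordpiece_tokenize("hv", vocab))      # ['h', '##v']
```

Real WordPiece tokenizers work against a learned vocabulary of tens of thousands of pieces; the greedy matching logic is the same.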

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:10 (4 by maintainers)

Top GitHub Comments

2 reactions
eggie5 commented, Aug 6, 2019

Also, here’s some Python code to create an EIE (example-in-example) proto for TF-Ranking:

import tensorflow as tf

def make_eie(context: dict, examples: list) -> tf.train.Example:
    """Build an example-in-example (EIE) proto for TF-Ranking."""
    # Serialize the shared context (query-level) features.
    context_feature = {
      'qid': tf.train.Feature(bytes_list=tf.train.BytesList(value=[context["qid"].encode('ascii')])),
      'uid': tf.train.Feature(bytes_list=tf.train.BytesList(value=[context["uid"].encode('ascii')])),
    }
    context_proto = tf.train.Example(features=tf.train.Features(feature=context_feature))
    serialized_context = context_proto.SerializeToString()

    # Serialize each per-example (document-level) proto.
    serialized_examples = []
    for example in examples:
        example_feature = {
          'rid': tf.train.Feature(int64_list=tf.train.Int64List(value=[example["rid"]])),
          'rel': tf.train.Feature(int64_list=tf.train.Int64List(value=[example["rel"]])),
        }
        example_proto = tf.train.Example(features=tf.train.Features(feature=example_feature))
        serialized_examples.append(example_proto.SerializeToString())

    # Wrap the serialized context and examples in an outer container proto.
    eie_features = {
        "serialized_context": tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
        "serialized_examples": tf.train.Feature(bytes_list=tf.train.BytesList(value=serialized_examples)),
    }
    eie_proto = tf.train.Example(features=tf.train.Features(feature=eie_features))
    return eie_proto
    
context = {"qid": "asdf", "uid": "fdsa"}
examples = [{"rid": 1, "rel": 1}]
eie = make_eie(context, examples)
print(eie)
print(eie.SerializeToString())

First Serialization

features {
  feature {
    key: "serialized_context"
    value {
      bytes_list {
        value: "\n\"\n\017\n\003qid\022\010\n\006\n\004asdf\n\017\n\003uid\022\010\n\006\n\004fdsa"
      }
    }
  }
  feature {
    key: "serialized_examples"
    value {
      bytes_list {
        value: "\n\034\n\014\n\003rid\022\005\032\003\n\001\001\n\014\n\003rel\022\005\032\003\n\001\001"
      }
    }
  }
}

Second Serialization

b'\n{\n>\n\x12serialized_context\x12(\n&\n$\n"\n\x0f\n\x03qid\x12\x08\n\x06\n\x04asdf\n\x0f\n\x03uid\x12\x08\n\x06\n\x04fdsa\n9\n\x13serialized_examples\x12"\n \n\x1e\n\x1c\n\x0c\n\x03rid\x12\x05\x1a\x03\n\x01\x01\n\x0c\n\x03rel\x12\x05\x1a\x03\n\x01\x01'
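In practice the serialized EIE protos are written to a TFRecord file for training. As a hedged sketch (the file path and the single-feature record are illustrative assumptions, not the notebook's actual pipeline), writing and reading back looks like this:

```python
import os
import tempfile

import tensorflow as tf

# Illustrative record: a minimal Example so the snippet is
# self-contained; in practice you'd write make_eie(...) outputs.
record = tf.train.Example(features=tf.train.Features(feature={
    "qid": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"q1"])),
})).SerializeToString()

# The path is a temporary-file assumption for illustration.
path = os.path.join(tempfile.mkdtemp(), "antique.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(record)

# Read the records back with tf.data.
ds = tf.data.TFRecordDataset(path)
n = sum(1 for _ in ds)
print(n)  # 1
```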

2 reactions
eggie5 commented, Aug 6, 2019

@rbrackney I don’t think there is any discrepancy here. As you can see, one of the features in your proto above is document_tokens, and this is the same feature that the categorical_column_with_vocabulary_file function looks for in the ANTIQUE notebook.

Maybe the confusion comes from the fact that you have to deserialize the EIE twice to get what you pasted above. The first deserialization gives you two features, serialized_context and serialized_examples, both of which are serialized Example protos. The output you posted looks like it comes from deserializing serialized_examples, right?
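The two-step deserialization described above can be sketched as follows. This is a hedged illustration matching the writer earlier in the thread: the toy `qid` feature and the inline record construction are assumptions, not the notebook's actual parsing code (TF-Ranking ships its own parsing utilities).

```python
import tensorflow as tf

# Build a tiny EIE record with the same outer layout as make_eie above
# ("serialized_context" / "serialized_examples"); the "qid" feature is
# a toy assumption for illustration.
def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

ctx = tf.train.Example(features=tf.train.Features(feature={
    "qid": bytes_feature([b"asdf"]),
})).SerializeToString()
eie = tf.train.Example(features=tf.train.Features(feature={
    "serialized_context": bytes_feature([ctx]),
    "serialized_examples": bytes_feature([ctx]),  # placeholder payload
})).SerializeToString()

# First deserialization: recover the two serialized sub-protos.
outer = tf.io.parse_single_example(eie, {
    "serialized_context": tf.io.FixedLenFeature([], tf.string),
    "serialized_examples": tf.io.VarLenFeature(tf.string),
})

# Second deserialization: decode the context proto itself.
context = tf.io.parse_single_example(outer["serialized_context"], {
    "qid": tf.io.FixedLenFeature([], tf.string),
})
print(context["qid"].numpy())  # b'asdf'
```

The per-document features come out the same way, by parsing each entry of `serialized_examples` with its own feature spec.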

Read more comments on GitHub >

