ANTIQUE tokenization technique
In the new “Tutorial: TF-Ranking for sparse features” notebook, an interesting QA dataset, ANTIQUE, is introduced. I was inspecting the proto TFRecords and saw that an interesting tokenization routine was used. For example, consider an original query from the dataset and its tokenized version:
Original: why do human beeing hv to belive in god ?
Tokenized: ['why' 'do' 'human' 'bee' '##ing' 'h' '##v' 'to' 'bel' '##ive' 'in' 'god' '?']
'##' markers are inserted, words are split into subword pieces, etc. What is this technique called?
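The '##' prefixes look like the continuation markers used by WordPiece subword tokenization (the scheme BERT uses), where out-of-vocabulary words are split into smaller pieces from a fixed vocabulary. As a rough illustration only (the Hugging Face transformers package and the bert-base-uncased vocabulary are assumptions here, not necessarily what the notebook used):

```python
from transformers import BertTokenizer

# Load a pretrained WordPiece vocabulary; the exact splits depend on which
# vocabulary file was used when the TFRecords were built.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("why do human beeing hv to belive in god ?"))
# Words not in the vocabulary are broken into subword pieces, with '##'
# marking a piece that continues the previous word, e.g.
# ['why', 'do', 'human', 'bee', '##ing', 'h', '##v', ...]
```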
Issue Analytics
- Created 4 years ago
- Comments: 10 (4 by maintainers)

Also, here’s some code to create an EIE in Python: a first serialization (the inner context and per-document Examples), followed by a second serialization (the outer EIE wrapper).
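A minimal sketch of what those two steps could look like, assuming TensorFlow's tf.train.Example protos; the inner feature names query_tokens and document_tokens are illustrative (only document_tokens appears in the discussion below), while serialized_context and serialized_examples are the outer EIE feature names:

```python
import tensorflow as tf

def _bytes_feature(values):
    # Wrap a list of byte strings in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def _tokens_example(key, tokens):
    # Build a tf.train.Example holding a list of string tokens under `key`.
    return tf.train.Example(features=tf.train.Features(feature={
        key: _bytes_feature([t.encode("utf-8") for t in tokens]),
    }))

# First serialization: the context (query) Example and each per-document
# Example are serialized on their own.
context = _tokens_example("query_tokens", ["why", "do", "human", "bee", "##ing"])
documents = [
    _tokens_example("document_tokens", ["i", "think", "..."]),
    _tokens_example("document_tokens", ["because", "..."]),
]
serialized_context = context.SerializeToString()
serialized_examples = [d.SerializeToString() for d in documents]

# Second serialization: the already-serialized protos are wrapped in an outer
# Example-in-Example (EIE) record, which is serialized again before being
# written out, e.g. to a TFRecord file.
eie = tf.train.Example(features=tf.train.Features(feature={
    "serialized_context": _bytes_feature([serialized_context]),
    "serialized_examples": _bytes_feature(serialized_examples),
}))
serialized_eie = eie.SerializeToString()
```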
@rbrackney I don’t think there is any discrepancy here. As you can see, one of the features in your proto above is document_tokens, and this is the same feature that the categorical_column_with_vocabulary_file function looks for in the ANTIQUE notebook. Maybe the confusion comes from the fact that you have to deserialize the EIE twice to get what you pasted above. The first deserialization gives you two features, serialized_context and serialized_examples, which are both serialized Example protos. The output you posted above looks like it comes from deserializing the serialized_examples, right?
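A minimal sketch of that two-step deserialization (assuming serialized_eie holds the raw EIE bytes, e.g. one record read back from the TFRecord file):

```python
import tensorflow as tf

# First deserialization: the outer record is a plain tf.train.Example whose
# two features are themselves serialized protos.
outer = tf.train.Example.FromString(serialized_eie)
serialized_context = outer.features.feature["serialized_context"].bytes_list.value[0]
serialized_examples = outer.features.feature["serialized_examples"].bytes_list.value

# Second deserialization: each blob is again a tf.train.Example; the
# per-document ones carry the document_tokens feature shown above.
context = tf.train.Example.FromString(serialized_context)
documents = [tf.train.Example.FromString(s) for s in serialized_examples]
print(documents[0].features.feature["document_tokens"].bytes_list.value)
```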