allow objects to define their own tokenize
Recently, I found it convenient for objects to define their own tokenization, basically by checking whether each object has a tokenize attribute. Would this be something that could be supported by dask?
If this has already been discussed, I apologize in advance; I only searched very quickly.
Thanks!
Example:

from dask import base  # dask's built-in tokenize computes the actual hash

def tokenize(*args, **kwargs):
    # Check whether each argument/keyword argument handles its own tokenization
    argslist = list()
    for arg in args:
        if hasattr(arg, 'tokenize'):
            argslist.append(arg.tokenize())
        else:
            argslist.append(arg)
    kwargsdict = dict()
    for key, val in kwargs.items():
        if hasattr(val, 'tokenize'):
            kwargsdict[key] = val.tokenize()
        else:
            kwargsdict[key] = val
    # Defer to dask's own tokenize for the final hash; calling the bare
    # name here would recurse into this wrapper forever. (dask.base.tokenize
    # takes no pure= keyword, so that is dropped.)
    return base.tokenize(*argslist, **kwargsdict)
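An object could then opt in simply by defining a tokenize method. A minimal sketch (the Grid class here is made up purely for illustration; only the wrapper above is assumed):

class Grid:
    def __init__(self, shape, spacing):
        self.shape = shape
        self.spacing = spacing

    def tokenize(self):
        # Hash only the fields that define the object's identity
        return ('Grid', self.shape, self.spacing)

# Two equivalent Grid instances now produce the same deterministic token
assert tokenize(Grid((10, 10), 0.5)) == tokenize(Grid((10, 10), 0.5))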
Top GitHub Comments
For those like me coming here after Oct 2017: the method __dask_tokenize__ was added (http://docs.dask.org/en/latest/custom-collections.html#deterministic-hashing).

I'm not 100% set on this being public API yet, but I wouldn't expect it to change anytime soon, so you should be fine relying on it.
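As a minimal sketch of the hook the linked docs describe (the Point class is made up for illustration; dask.base.tokenize and the __dask_tokenize__ method are the real dask API):

from dask.base import tokenize

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __dask_tokenize__(self):
        # Return something dask already knows how to hash deterministically,
        # e.g. a tuple of plain values
        return (type(self).__name__, self.x, self.y)

# Equal points yield equal tokens; different points yield different ones
assert tokenize(Point(1, 2)) == tokenize(Point(1, 2))
assert tokenize(Point(1, 2)) != tokenize(Point(1, 3))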