Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Support for custom Encoders

See original GitHub issue

If you currently use word-level embeddings (e.g. fastText), whatlies supports embeddings for sentences by summing the individual word embeddings. While this is reasonable default behaviour, its also an arbitrary and inflexible choice. Ideally whatlies can support standard encoding schemes such as sum, average or max, and otherwise offer the use of callables for any custom operation that users want.

Issue Analytics

State:
Created 3 years ago
Comments:10 (9 by maintainers)

Top GitHub Comments

1reaction

mkazecommented, Aug 31, 2020

It might be worth following what scikit-learn does when both strings and callables are supported. For example in sklearn.metrics.pairwise_distances, metric can be a string such as 'cosine' (as this is supported by the library internally), or it can be a callable. In the latter case, the API also has a **kwds parameter, which passes its contents on to the custom metric callable.

Supporting string metric values as well as callables is not an issue at all and could be done easily. However, I am not in favor of adding that **kwds argument mainly because we already have that in some of the language classes (for example here in HFTransformers) for passing additional arguments to underlying language backend constructor. So if we want to use __init__ for this purpose and also support the callables as well as custom keyword arguments for them, we should either:

introduce another argument for __init__ like combiner_kwd, or
put the burden on the user and expect that the callable has been already augmented with additional keyword arguments (e.g. by wrapping it in lambda or using functools.partial).

1reaction

mkazecommented, Aug 31, 2020

That’s why we should also have a factory setting.

That’s not different from having a default value of None for the relevant config value and therefore use whatever the language backend does by default in that case; so no worries there!

For BERT style models you wouldn’t want the representation for the entire utterance, say [play ping pong], to be defined as the mean/sum/max of it’s parts.

Actually, that’s not entirely true. The __CLS__ token which is present in some of the transformer models has been added and finetuned on downstream tasks so that it could be a good representation of the entire input sequence; however, nothing prevents you to use other alternative representation for the entire sequence based on the contextualized token embeddings of sequence, and also there is no guarantee that the __CLS__ token would out-perform all of the other representations. Actually, as I have already mentioned this in #92, the spacy package uses average of token embeddings and the spacy-transformers uses the sum of token embeddings (not the __CLS__ token). You can even go further, as I also mentioned this in #92, and use the representation given by the intermediate transformer layers or even the embedding layer itself. So this usually depends on the downstream task, the data or the analysis you want to perform (just as an example, see the result of different contextualized token embedding combinations in BERT paper for NER; here is a visual summary).

Top Results From Across the Web

Custom Encoder Mapping - YouTube

Encoder Mapping allows you to determine how parameters are assigned or populated on the encoders. Specific mappings can be created based on ...

Custom Encoder / Decoder supporting RawRepresentable

I give up figuring this out myself. A struct that conforms to RawRepresentable and to Codable whose rawValue conforms to Codable should have ......

Encoding and Decoding Custom Types - Apple Developer

To support both encoding and decoding, declare conformance to Codable , which combines the Encodable and Decodable protocols. This process is known as...

Custom Encoders - WCF - Microsoft Learn

This topic describes how to extend the Text , Binary , and MTOM message encoders that are included in WCF, or create your...

Support for adding custom encoders · Issue #330 - GitHub

I would like a way to add custom encoders using the config here. There is an Encoding option here that allows both console...