Research(?) : Alternative missing-value masks
Feature request
The current `RandomObfuscator` implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0. But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token THE as your `[MASK]` for an English text model pre-training task.
I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
What is the expected behavior?
I suggest two primary options:
- Offer configurable alternative masking strategies (e.g. different constants) for users to select
- (Preferred) Implement embedding-aware attention per #122 and offer option to embed fields with an additional mask column so e.g. scalars become 2-vectors of [value, mask]
Embedding-aware attention should be a prerequisite for the second option, because otherwise the introduction of extra mask-flag columns would add lots of extra parameters and double the input dimensionality… whereas if it's done in a model-aware way, results could be much better.
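For illustration only (not pytorch-tabnet's actual API), here is a minimal sketch of what the preferred option could look like at the input stage, where each scalar feature is embedded as a `[value, mask]` 2-vector; the class and argument names are hypothetical:

```python
import torch
import torch.nn as nn


class MaskAwareScalarEmbedding(nn.Module):
    """Hypothetical sketch: expand each scalar feature into a [value, is_masked] pair.

    Masked (or missing) entries are zero-filled, but the extra flag column lets the
    network distinguish "genuinely zero" from "masked/missing".
    """

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x:    (batch, n_features) raw scalar inputs
        # mask: (batch, n_features) 1.0 where the value is masked/missing, else 0.0
        values = x * (1.0 - mask)  # zero out masked entries
        # Interleave so each feature becomes the 2-vector [value, mask]
        return torch.stack([values, mask], dim=-1).reshape(x.shape[0], -1)


# Toy usage: feature 1 of row 0 is masked, so its value is hidden but flagged.
emb = MaskAwareScalarEmbedding()
x = torch.tensor([[0.0, 3.5], [1.2, 0.0]])
mask = torch.tensor([[0.0, 1.0], [0.0, 0.0]])
print(emb(x, mask))  # rows of [v0, m0, v1, m1], doubling input width
```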
What is the motivation or use case for adding/changing the behavior?
I’ve lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields that I haven’t yet bothered to “fix” into proper TabNet categorical fields). Even after experimenting with a range of parameters, I’m finding the model loves to converge to unsupervised losses of ~7.130, when it should really be <1.0 per the README (1.0 is equivalent to just always predicting the average value for the feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year before pre-training was available, and found that in the supervised case I got better performance adding a flag column than simply selecting a different mask value (old draft code is here).
…So from my background playing with this dataset, I’m highly suspicious that the poor pre-training losses I’m currently observing are being skewed by the model’s inability to tell when binary fields are =0 vs masked… and I have seen good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
- Implement per-field / “embedding-aware” attention, perhaps something like #217
- Implement missing/masked value handling as logic in the embedding layer (perhaps something like athewsey/feat/tra) so users can control how missing values are embedded per-field similarly to how they control categorical embeddings, and one of these options is to add an extra flag column to the embedding
- Modify `RandomObfuscator` to use a non-finite value like `nan` as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in `X` (a minimal sketch of this is included below).
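As a rough sketch of the `nan`-masking idea in the last bullet (assuming a `RandomObfuscator`-style masking step; the function name and signature below are illustrative, not the library's code):

```python
import torch


def obfuscate_with_nan(x: torch.Tensor, pretraining_ratio: float = 0.2):
    """Hypothetical nan-masking variant of a RandomObfuscator-style step.

    Returns the masked batch and the binary mask, using nan (rather than 0)
    so that a downstream embedding layer can treat masked values the same
    way as genuinely-missing values in X.
    """
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio))
    x_masked = x.clone()
    x_masked[mask.bool()] = float("nan")
    return x_masked, mask


# Usage: later layers can recover the mask with ~torch.isfinite(x_masked)
x = torch.randn(4, 6)
x_masked, mask = obfuscate_with_nan(x, pretraining_ratio=0.3)
```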
Are you willing to work on this yourself?
yes
Issue Analytics
- Created: 2 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
Thanks both for your insights! Very useful as I try to wrap my head around it all too.
To @Optimox 's first point, I think that’s my bad: I used “embedding-aware attention” above to refer quite narrowly to a #217-like implementation (rather than the range of perhaps different ways you could think about doing that)… and also “embedding” to refer quite broadly to the general translation from the training dataset `X` to initial batch-norm inputs. I’d maybe characterize the #217 method further as:

…So although adding “is missing” flag columns for scalars would still double `FeatureTransformer` input dimensionality, it need not complicate the `AttentiveTransformer`’s task at all (assuming constant `n_a`, `n_d`) - so the impact on task complexity need not be as bad as, say, doubling up your input columns for a plain XGBoost model.

I do hear & agree with the point about zero being intrinsically special as a “no contribution” value at many points in the network (especially e.g. when summing up the output contributions, and at attention-weighted `FeatureTransformer` inputs)… and I’d maybe think of the input side of this as a limitation closely related to what I’m trying to alleviate with the masking?

I wonder if e.g. `is_present` indicator fields would work measurably better than `is_missing` in practice? Or even if `+1/-1` indicator fields would perform better than `1/0`, so `FeatureTransformer`s see an obvious difference between a zero-valued feature that is present, absent, or not currently attended to.

The idea of swap-based noise rather than masking is also an interesting possibility - I wonder if there’s a way it could be implemented that still works nicely & naturally on input datasets with missing values? I’m particularly interested in pre-training as a potential treatment for missing values, since somehow every dataset always seems to be at least a little bit garbage 😂
On the `nan` piece, I would add that AFAIK it doesn’t necessarily need to be a blocker for backprop-based training: `nan * 0 = nan`, but you can still do normalizing operations performantly in-graph, e.g. using functions like `torch.isfinite()` and tensor indexing. The `for` loops come from iterating over features (because they might have different masking configurations) rather than from the logic for handling an individual feature (`feature[~torch.isfinite(x_feat)] = feat_nonfinite_mask`). It could probably be vectorized with some more time/brains.

…But encapsulating this in the PyTorch module itself would hopefully be more easily usable with `nan`-containing inputs (e.g. gappy Pandas dataframes), and not a noticeable performance hit over e.g. having to do it in a DataLoader anyway? Just a bit less optimal than if you wanted to e.g. pre-process the dataset once and then run many training jobs with it.

Of course I guess the above assumes the Obfuscator comes before the “embedding” layer and has no backprop-trainable parameters - I’d have to take a closer look at the new pretraining loss function stuff to understand that a bit more and follow your comments on that & the impact to the decoder!
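A minimal sketch of that in-graph treatment, under the assumptions above (the per-feature `for` loop is left unvectorized as described, and `feat_nonfinite_mask` is just an illustrative per-feature fill value, not an existing library parameter):

```python
import torch


def fill_nonfinite_per_feature(x: torch.Tensor, fill_values: list) -> torch.Tensor:
    """Illustrative sketch: replace non-finite entries feature-by-feature.

    The loop exists because each feature might use a different fill/mask value;
    the per-feature logic itself is a single indexed assignment.
    """
    x = x.clone()
    for j, feat_nonfinite_mask in enumerate(fill_values):
        x_feat = x[:, j]  # view into column j, so assignment writes through to x
        x_feat[~torch.isfinite(x_feat)] = feat_nonfinite_mask
    return x


# Usage: nan/inf gaps in each column replaced with a per-feature constant
x = torch.tensor([[1.0, float("nan")], [float("inf"), 2.0]])
print(fill_nonfinite_per_feature(x, fill_values=[0.0, -1.0]))
```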
@eduardocarvp something like this (from https://www.kaggle.com/davidedwards1/tabularmarch21-dae-starter):
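As a generic illustration of swap noise as used in tabular denoising autoencoders (not the linked notebook's exact code; names and the default rate are arbitrary):

```python
import numpy as np


def swap_noise(x: np.ndarray, swap_rate: float = 0.15, seed: int = 0) -> np.ndarray:
    """Generic swap-noise sketch: with probability swap_rate, replace each entry
    with the same column's value taken from a randomly chosen other row."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = x.shape
    noisy = x.copy()
    for j in range(n_cols):
        swap = rng.random(n_rows) < swap_rate          # which rows to corrupt in column j
        donors = rng.integers(0, n_rows, size=n_rows)  # random source rows
        noisy[swap, j] = x[donors[swap], j]
    return noisy


# Usage: the autoencoder is then trained to reconstruct x from swap_noise(x)
```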