Research(?) : Alternative missing-value masks
Feature request
The current `RandomObfuscator` implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0. But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token THE as your `[MASK]` for an English text model pre-training task.
I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
What is the expected behavior?
I suggest two primary options:
- Offer configurable alternative masking strategies (e.g. different constants) for users to select
- (Preferred) Implement embedding-aware attention per #122 and offer option to embed fields with an additional mask column so e.g. scalars become 2-vectors of [value, mask]
Embedding-aware attention should be a prerequisite for the second option, because otherwise the introduction of extra mask-flag columns would add lots of extra parameters and double the input dimensionality… whereas if it's done in a model-aware way, results could be much better.
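For illustration only (not pytorch-tabnet's actual API), here is a minimal sketch of what the preferred option could look like at the input stage, where each scalar feature is embedded as a `[value, mask]` 2-vector; the class and argument names are hypothetical:

```python
import torch
import torch.nn as nn


class MaskAwareScalarEmbedding(nn.Module):
    """Hypothetical sketch: expand each scalar feature into a [value, is_masked] pair.

    Masked (or missing) entries are zero-filled, but the extra flag column lets the
    network distinguish "genuinely zero" from "masked/missing".
    """

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x:    (batch, n_features) raw scalar inputs
        # mask: (batch, n_features) 1.0 where the value is masked/missing, else 0.0
        values = x * (1.0 - mask)  # zero out masked entries
        # Interleave so each feature becomes the 2-vector [value, mask]
        return torch.stack([values, mask], dim=-1).reshape(x.shape[0], -1)


# Toy usage: feature 1 of row 0 is masked, so its value is hidden but flagged.
emb = MaskAwareScalarEmbedding()
x = torch.tensor([[0.0, 3.5], [1.2, 0.0]])
mask = torch.tensor([[0.0, 1.0], [0.0, 0.0]])
print(emb(x, mask))  # rows of [v0, m0, v1, m1], doubling input width
```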
What is the motivation or use case for adding/changing the behavior?
I’ve lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields that I haven’t yet bothered to “fix” into proper TabNet categorical fields). Even after experimenting with a range of parameters, I’m finding the model loves to converge to unsupervised losses of ~7.130, when it should really be <1.0 per the README (1.0 is equivalent to just always predicting the average value for the feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year before pre-training was available, and found that in the supervised case I got better performance adding a flag column than simply selecting a different mask value (old draft code is here).
…So from my background playing with this dataset, I’m highly suspicious that the poor pre-training losses I’m currently observing are being skewed by the model’s inability to tell when binary fields are =0 vs masked… and I have seen good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
- Implement per-field / “embedding-aware” attention, perhaps something like #217
- Implement missing/masked value handling as logic in the embedding layer (perhaps something like athewsey/feat/tra) so users can control how missing values are embedded per-field similarly to how they control categorical embeddings, and one of these options is to add an extra flag column to the embedding
- Modify `RandomObfuscator` to use a non-finite value like `nan` as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in `X` (a minimal sketch of this is included below).
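As a rough sketch of the `nan`-masking idea in the last bullet (assuming a `RandomObfuscator`-style masking step; the function name and signature below are illustrative, not the library's code):

```python
import torch


def obfuscate_with_nan(x: torch.Tensor, pretraining_ratio: float = 0.2):
    """Hypothetical nan-masking variant of a RandomObfuscator-style step.

    Returns the masked batch and the binary mask, using nan (rather than 0)
    so that a downstream embedding layer can treat masked values the same
    way as genuinely-missing values in X.
    """
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio))
    x_masked = x.clone()
    x_masked[mask.bool()] = float("nan")
    return x_masked, mask


# Usage: later layers can recover the mask with ~torch.isfinite(x_masked)
x = torch.randn(4, 6)
x_masked, mask = obfuscate_with_nan(x, pretraining_ratio=0.3)
```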
Are you willing to work on this yourself?
yes
Issue Analytics
- Created: 2 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
Thanks both for your insights! Very useful as I try to wrap my head around it all too.
To @Optimox 's first point, I think that’s my bad: I used “embedding-aware attention” above to refer quite narrowly to a #217-like implementation (rather than the range of perhaps different ways you could think about doing that)… and also “embedding” to refer quite broadly to the general translation from the training dataset `X` to initial batch-norm inputs. I’d maybe characterize the #217 method further as:

…So although adding “is missing” flag columns for scalars would still double `FeatureTransformer` input dimensionality, it need not complicate the `AttentiveTransformer`’s task at all (assuming constant `n_a`, `n_d`) - so the impact on task complexity need not be as bad as, say, doubling up your input columns for a plain XGBoost model.

I do hear & agree with the point about zero being intrinsically special as a “no contribution” value at many points in the network (especially e.g. when summing up the output contributions, and at attention-weighted `FeatureTransformer` inputs)… and I’d maybe think of the input side of this as a limitation closely related to what I’m trying to alleviate with the masking?

I wonder if e.g. `is_present` indicator fields would work measurably better than `is_missing` in practice? Or even if `+1/-1` indicator fields would perform better than `1/0`, so `FeatureTransformer`s see an obvious difference between a zero-valued feature that is present, absent, or not currently attended to.

The idea of swap-based noise rather than masking is also an interesting possibility - I wonder if there’s a way it could be implemented that still works nicely & naturally on input datasets with missing values? I’m particularly interested in pre-training as a potential treatment for missing values, since somehow every dataset always seems to be at least a little bit garbage 😂
On the `nan` piece, I would add that AFAIK it doesn’t necessarily need to be a blocker for backprop-based training: `nan * 0 = nan`, but you can still do normalizing operations performantly in-graph, e.g. using functions like `torch.isfinite()` and tensor indexing. The `for` loops come from iterating over features (because they might have different masking configurations) rather than from the logic for handling an individual feature (`feature[~torch.isfinite(x_feat)] = feat_nonfinite_mask`). It could probably be vectorized with some more time/brains.

…But encapsulating this in the PyTorch module itself would hopefully be more easily usable with `nan`-containing inputs (e.g. gappy Pandas dataframes), and not a noticeable performance hit over e.g. having to do it in a DataLoader anyway? Just a bit less optimal than if you wanted to e.g. pre-process the dataset once and then run many training jobs with it.

Of course I guess the above assumes the Obfuscator comes before the “embedding” layer and has no backprop-trainable parameters - I’d have to take a closer look at the new pretraining loss function stuff to understand that a bit more and follow your comments on that & the impact to the decoder!
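A minimal sketch of that in-graph treatment, under the assumptions above (the per-feature `for` loop is left unvectorized as described, and `feat_nonfinite_mask` is just an illustrative per-feature fill value, not an existing library parameter):

```python
import torch


def fill_nonfinite_per_feature(x: torch.Tensor, fill_values: list) -> torch.Tensor:
    """Illustrative sketch: replace non-finite entries feature-by-feature.

    The loop exists because each feature might use a different fill/mask value;
    the per-feature logic itself is a single indexed assignment.
    """
    x = x.clone()
    for j, feat_nonfinite_mask in enumerate(fill_values):
        x_feat = x[:, j]  # view into column j, so assignment writes through to x
        x_feat[~torch.isfinite(x_feat)] = feat_nonfinite_mask
    return x


# Usage: nan/inf gaps in each column replaced with a per-feature constant
x = torch.tensor([[1.0, float("nan")], [float("inf"), 2.0]])
print(fill_nonfinite_per_feature(x, fill_values=[0.0, -1.0]))
```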
@eduardocarvp something like this (from https://www.kaggle.com/davidedwards1/tabularmarch21-dae-starter):
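As a generic illustration of swap noise as used in tabular denoising autoencoders (not the linked notebook's exact code; names and the default rate are arbitrary):

```python
import numpy as np


def swap_noise(x: np.ndarray, swap_rate: float = 0.15, seed: int = 0) -> np.ndarray:
    """Generic swap-noise sketch: with probability swap_rate, replace each entry
    with the same column's value taken from a randomly chosen other row."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = x.shape
    noisy = x.copy()
    for j in range(n_cols):
        swap = rng.random(n_rows) < swap_rate          # which rows to corrupt in column j
        donors = rng.integers(0, n_rows, size=n_rows)  # random source rows
        noisy[swap, j] = x[donors[swap], j]
    return noisy


# Usage: the autoencoder is then trained to reconstruct x from swap_noise(x)
```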