How to provide class weights for an imbalanced dataset?
Hi,
I have a fairly imbalanced dataset. I am using this to learn a feature and am relying on dependency learning to get better results, i.e., as explained here, I have a coarse feature whose identification helps in identifying a fine feature.
Below is my model configuration for the above goal:
```yaml
input_features:
  - name: nhl
    type: sequence
    encoder: rnn
    cell_type: lstm
    num_layers: 4
    reduce_output: null

output_features:
  - name: mode
    type: category
    num_fc_layers: 2
  - name: volpiano
    type: sequence
    decoder: generator
    cell_type: lstm
    attention: bahdanau
    num_fc_layers: 1
    dependencies:
      - mode
    loss:
      type: sampled_softmax_cross_entropy
```
The problem is (or at least in my opinion) that the `mode` output, on which the `volpiano` output depends, is very imbalanced. Below is the distribution of this feature:

As can be seen, modes 8, 1, and 7 are much better represented than the other categories in the dataset.
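(For reference, a quick way to inspect such a distribution is a `value_counts()` call; a minimal sketch, assuming the training data lives in a CSV file with a `mode` column — the filename `chants.csv` is hypothetical:)

```python
import pandas as pd

# Load the training data; "chants.csv" is a hypothetical filename.
df = pd.read_csv("chants.csv")

# Count how many rows belong to each mode class, most frequent first.
print(df["mode"].value_counts())
```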
Is there any way to perform weighted learning to reduce this class imbalance? I found this issue discussing it as well: https://github.com/ludwig-ai/ludwig/issues/615, and I know that I can use `class_weights` to achieve this… but am not very sure how to do it.
How do I find the `class_weights` values? And with this many classes, how can I be sure which weight to associate with which? I mean, if I write `class_weights: [8, 7, 6, 6, 3, ....]`, how can I be sure which weight is associated with which label?
Thanks
Top GitHub Comments
Hi @farazk86, you may be right: dealing with the class imbalance of `mode` can potentially improve your results.
My suggestions in that case would be to do either of two things; one of them is to use the `class_weights` parameter. Regarding the association, you can provide the weights in the same order, from most to least frequent class, that Ludwig figures out when mapping strings to integers. To recover that order, check the `training_set_metadata.json` file, which contains `idx2str`. An alternative is to provide a dictionary instead, like `{"class_1": 2, "class_2": 0.3, ...}`, so that the mapping is explicit.
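For instance, the `mode` output's `loss` section could carry the weights keyed by label (a minimal sketch; the class labels and weight values below are placeholders, not taken from the issue):

```yaml
output_features:
  - name: mode
    type: category
    num_fc_layers: 2
    loss:
      # Hypothetical weights keyed by class label; larger weights
      # increase the loss contribution of rarer classes.
      class_weights:
        "8": 0.3
        "1": 0.4
        "7": 0.5
```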
Regarding figuring out what those values should actually be, there's no exact way. If you want to compensate for the long-tail distribution, you could assign a `min(frequencies)/class_frequency` weight to each class, but that may be a bit too strong for very frequent classes. I would say that ideally what you want to do, though, is give smaller weights to more frequent classes. Hopefully this helps!
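As a rough sketch of that heuristic (the per-class counts below are made up for illustration; with real data they could come from `value_counts()` as shown earlier):

```python
# Per-class frequencies; these numbers are hypothetical.
frequencies = {"8": 1200, "1": 950, "7": 800, "2": 150, "5": 90}

# min(frequencies)/class_frequency: the rarest class gets weight 1.0,
# and more frequent classes get proportionally smaller weights.
min_freq = min(frequencies.values())
class_weights = {label: min_freq / freq for label, freq in frequencies.items()}

print(class_weights)  # e.g. {"8": 0.075, "1": 0.0947..., ..., "5": 1.0}
```

The resulting dictionary can then be dropped into the `loss` section as `class_weights`, as in the YAML sketch above.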
Closing this issue since the original issue is resolved.