SoftmaxLikelihood and building a multiclass classification model
Hi,
My issue is very similar to #994 by @cherepanovic, but I still don't understand multiclass classification after reading #994. I'm opening a new issue so as not to clutter #994 with my question.
I am trying to build a multiclass classification model using `SoftmaxLikelihood`. However, I am not sure what the arguments of this likelihood mean (the documentation on it is very succinct), and I haven't been able to figure it out by myself because I can't reproduce the only example I've found (SVDKL on CIFAR). Running it without any modification, I get the error:
---> 50 from densenet import DenseNet
51
52 class DenseNetFeatureExtractor(DenseNet):
ModuleNotFoundError: No module named 'densenet'
If I define DenseNet as in your file densenet.py, I obtain the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-aa721b7a7db0> in <module>
177 for epoch in range(1, n_epochs + 1):
178 with gpytorch.settings.use_toeplitz(False):
--> 179 train(epoch)
180 test()
181 scheduler.step()
<ipython-input-8-aa721b7a7db0> in train(epoch)
153 optimizer.zero_grad()
154 output = model(data)
--> 155 loss = -mll(output, target)
156 loss.backward()
157 optimizer.step()
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/module.py in __call__(self, *inputs, **kwargs)
22
23 def __call__(self, *inputs, **kwargs):
---> 24 outputs = self.forward(*inputs, **kwargs)
25 if isinstance(outputs, list):
26 return [_validate_module_outputs(output) for output in outputs]
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/mlls/variational_elbo.py in forward(self, variational_dist_f, target, **kwargs)
75 :return: Variational ELBO. Output shape corresponds to batch shape of the model/input data.
76 """
---> 77 return super().forward(variational_dist_f, target, **kwargs)
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/mlls/_approximate_mll.py in forward(self, approximate_dist_f, target, **kwargs)
55 # Get likelihood term and KL term
56 num_batch = approximate_dist_f.event_shape.numel()
---> 57 log_likelihood = self._log_likelihood_term(approximate_dist_f, target, **kwargs).div(num_batch)
58 kl_divergence = self.model.variational_strategy.kl_divergence().div(self.num_data / self.beta)
59
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/mlls/variational_elbo.py in _log_likelihood_term(self, variational_dist_f, target, **kwargs)
59
60 def _log_likelihood_term(self, variational_dist_f, target, **kwargs):
---> 61 return self.likelihood.expected_log_prob(target, variational_dist_f, **kwargs).sum(-1)
62
63 def forward(self, variational_dist_f, target, **kwargs):
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/likelihoods/likelihood.py in expected_log_prob(self, observations, function_dist, *args, **kwargs)
37
38 def expected_log_prob(self, observations, function_dist, *args, **kwargs):
---> 39 likelihood_samples = self._draw_likelihood_samples(function_dist, *args, **kwargs)
40 res = likelihood_samples.log_prob(observations).mean(dim=0)
41 return res
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/likelihoods/likelihood.py in _draw_likelihood_samples(self, function_dist, sample_shape, *args, **kwargs)
34 function_dist = base_distributions.Independent(function_dist, num_event_dims - 1)
35 function_samples = function_dist.rsample(sample_shape)
---> 36 return self.forward(function_samples, *args, **kwargs)
37
38 def expected_log_prob(self, observations, function_dist, *args, **kwargs):
~/housekeeping/virtualenv/ml2/lib/python3.6/site-packages/gpytorch/likelihoods/softmax_likelihood.py in forward(self, function_samples, *params, **kwargs)
36 num_features, num_data = function_samples.shape[-2:]
37 if num_features != self.num_features:
---> 38 raise RuntimeError("There should be %d features" % self.num_features)
39
40 if self.mixing_weights is not None:
RuntimeError: There should be 132 features
I may well be making a mistake, but I am copying and pasting everything from the example without modifying any code myself.
My questions are:

- What does the argument `num_features` to `SoftmaxLikelihood` mean? Is this the number of features of the inputs (e.g. if the inputs are MNIST images, is `num_features` 28*28=784)? If so, why does the likelihood function need access to the number of input features if it doesn't deal with the inputs themselves?
- What does the argument `num_classes` to `SoftmaxLikelihood` mean? I'm assuming it is just the number of classes? E.g. in MNIST `num_classes` would be 10?
- What are the mixing weights in a softmax function? The softmax that I know doesn't need any weights; it simply normalizes the logits by exponentiating them and dividing by the sum of all the exponentiated logits.
- What does the argument `num_tasks` to `MultitaskVariationalStrategy` mean? This also has to be the number of classes, right?
- What are the appropriate target dimensions when there are multiple classes? Should they be:
  - One-dimensional, with `num_samples` elements, each target represented by a number (e.g. 0-9 in MNIST)?
  - Two-dimensional, same as before but as a column vector, so the dimensions are `num_samples x 1`?
  - Two-dimensional with a one-hot encoding, so the dimensions would be `num_samples x num_classes`?
Thanks a lot in advance.
Top GitHub Comments
See equation 1 of the SVDKL paper. The output of the GP is a multi-output multivariate normal distribution of shape `n x f`, where `f` is `num_features`. The likelihood uses a linear mixing parameter (`A` in equation 1) to reduce this `f`-dimensional output to a `c`-dimensional output, where `c` is the number of classes, `num_classes`.

- `num_features` refers to `f=132`: the number of (independent) features that are output from the GP. If you look at the output of the GP layer, it will be a `MultitaskMultivariateNormal` distribution with event shape `n x 132`.
- `num_classes` is `c`, the number of classification classes (e.g. 10 for MNIST).
- For the mixing weights, again see equation 1 of the paper (a quick sketch of the mixing is given after this list).
- `num_tasks` here refers to `f`, the number of output dimensions in the GP (so 132).
- You are correct - the output of the likelihood is a `Categorical` distribution, which is represented by a `c`-dimensional vector (one output for each class).

For more details, please refer to the SVDKL paper.
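To make the mixing concrete, here is a minimal plain-PyTorch sketch of what equation 1 does with a single sample of the GP output. The sizes and the random tensors are purely illustrative; in GPyTorch the mixing matrix is a learned parameter inside `SoftmaxLikelihood`, not something you create by hand.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: n data points, f GP output features, c classes.
n, f, c = 8, 132, 10

gp_sample = torch.randn(n, f)       # one sample drawn from the n x f GP output
A = torch.randn(c, f)               # stand-in for the learned mixing weights ("A" in equation 1)

logits = gp_sample @ A.t()          # n x c class logits
probs = F.softmax(logits, dim=-1)   # n x c class probabilities; each row sums to 1
```

The mixing weights are what turn the `f` GP features into `c` class logits; the softmax itself is still the usual weight-free normalization.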
`n` is the number of data points, and it shouldn't influence the number of neural network features or the dimensionality of the data. `f` refers to the function that the GP learns - it is not a counting variable. `Q = J` in our case - i.e. there is one Gaussian process for each neural network feature. (According to the experiments and the authors of the paper, this is the best setup to use.)
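Putting the pieces together, here is a minimal self-contained sketch of how the arguments line up, assuming a recent GPyTorch release (where this strategy is exposed as `IndependentMultitaskVariationalStrategy`; older releases, like the one in the issue, exposed it as `MultitaskVariationalStrategy`). The class name, kernel choice, sizes, and toy data below are placeholders, not code from the SVDKL example.

```python
import torch
import gpytorch

# Illustrative sizes (not from the issue): f GP output features, c classes.
num_features, num_classes, num_inducing, input_dim = 16, 10, 32, 2


class MultioutputGP(gpytorch.models.ApproximateGP):
    """A batch of `num_features` independent GPs; the likelihood mixes them into class logits."""

    def __init__(self):
        inducing_points = torch.randn(num_features, num_inducing, input_dim)
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            num_inducing, batch_shape=torch.Size([num_features])
        )
        # num_tasks is f, the number of GP output features -- not the number of classes.
        variational_strategy = gpytorch.variational.IndependentMultitaskVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True
            ),
            num_tasks=num_features,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([num_features]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([num_features])),
            batch_shape=torch.Size([num_features]),
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))


model = MultioutputGP()
# num_features must match f above; num_classes is c (e.g. 10 for MNIST).
likelihood = gpytorch.likelihoods.SoftmaxLikelihood(num_features=num_features, num_classes=num_classes)
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=100)

x = torch.randn(8, input_dim)           # n = 8 toy inputs (a neural feature extractor would feed in here)
y = torch.randint(num_classes, (8,))    # targets: a one-dimensional vector of class indices, shape (n,)

output = model(x)                       # MultitaskMultivariateNormal with event shape 8 x num_features
loss = -mll(output, y)
loss.backward()
```

Note that the number of GP output features does not have to equal the number of classes: in the SVDKL example `f` is 132 while `c` is 10, with one GP per neural-network feature.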