
Using custom input image shape

See original GitHub issue

Can we use a custom input image shape while training? I am looking to set an input shape of (512, 512, 3), but anything other than (32, 32, 3) throws a mismatch error. Can you explain how to determine the encoder and decoder network parameters? Thanks!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
ascillitoe commented, Aug 2, 2022

@pranjal-joshi-cc has @mauicv answered your question above? If so, we shall close this issue 🙂

1 reaction
mauicv commented, Jul 6, 2022

Hey @pranjal-joshi-cc,

I’ve understood the encoder part: through the strides parameter we control the dimensionality reduction, and with encoder_net.summary() we can see the size of the last convolution operation, i.e. N x N x Filters. However, is it necessary to always map the encoder to 32 x 32 for alibi-detect to work, or is the choice of autoencoder purely arbitrary?

I’m not completely sure what you mean here. The choice of the autoencoder is arbitrary, except that:

  1. The architecture needs to be sufficient to model the data well. What I mean by this is that when it’s trained in the detector’s fit method it needs to reduce the reconstruction error well. This might not be possible if the network doesn’t have enough capacity; for example, if you don’t choose a big enough latent dimension you might have difficulty. I don’t think this should be an issue for the models defined above, though.
  2. The VAE needs to produce output of the same shape as its input. For the purposes of the VAEOutlier this really only applies to the decoder: it needs to map from the latent space of size latent_dim to the same shape as the original input image, so in your case (512, 512, 3).

In terms of the output shape of the encoder, it doesn’t really matter as long as the capacity is sufficient, i.e. you don’t reduce the dimensionality too much. For instance, for the architecture I provided above we have:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, InputLayer

IMAGE_SHAPE = (512, 512, 3)  # the custom input shape discussed above

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=IMAGE_SHAPE),
      Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
  ])

and the summary is:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 256, 256, 32)      1568      
                                                                 
 conv2d_1 (Conv2D)           (None, 128, 128, 64)      32832     
                                                                 
 conv2d_2 (Conv2D)           (None, 64, 64, 128)       131200    
                                                                 
 conv2d_3 (Conv2D)           (None, 32, 32, 256)       524544    
                                                                 
 conv2d_4 (Conv2D)           (None, 16, 16, 516)       2114052   
                                                                 
 conv2d_5 (Conv2D)           (None, 8, 8, 1024)        8455168   
                                                                 
=================================================================
Total params: 11,259,364
Trainable params: 11,259,364
Non-trainable params: 0
_________________________________________________________________

So the output shape of the encoder_net is (8, 8, 1024). Note that the VAEOutlier adds some Dense layers to the encoder_net to transform the (8, 8, 1024) output into the latent space of dimension 1024, since you’ve chosen latent_dim=1024.
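
To make that concrete, here is a rough sketch of the kind of transformation the detector applies on top of encoder_net. The variable names and layer layout are illustrative assumptions for this thread, not the library’s actual implementation:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten

# Illustrative only: flatten the (8, 8, 1024) encoder output and project it to
# the parameters of a latent distribution of size latent_dim.
latent_dim = 1024
dummy_images = tf.zeros((1, 512, 512, 3))   # stand-in batch of one image
features = encoder_net(dummy_images)        # -> (1, 8, 8, 1024)
flat = Flatten()(features)                  # -> (1, 8*8*1024)
z_mean = Dense(latent_dim)(flat)            # -> (1, 1024)
z_log_var = Dense(latent_dim)(flat)         # -> (1, 1024)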

Also, please explain how to calculate and reshape the dense layers in the decoder net, as it’s quite confusing for me. How do you determine the number of Dense units, i.e. 8*8*1024, and how do you determine the reshaping in the next layer?

The decoder_net maps from the latent space of dimension 1024 (in our case) to the output shape (512, 512, 3). So it is going to take a vector of length latent_dim. We want to transform this into a shape that can then easily be scaled up to (512, 512, 3). You can do this in a number of ways, but it’s easiest if we set up the Conv2DTranspose operations to double the height and width at each layer of the network. The reason we choose 8*8*1024 is just that this can then be reshaped into (8, 8, 1024), which we can then upscale to the output image by applying each of the transpose layers. For instance, given the architecture I suggested above:

import tensorflow as tf
from tensorflow.keras.layers import Conv2DTranspose, Dense, InputLayer, Reshape

latent_dim = 1024

decoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(latent_dim,)),
      Dense(8*8*1024),
      Reshape(target_shape=(8, 8, 1024)),
      Conv2DTranspose(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(3, 1, strides=1, padding='same', activation='sigmoid')
  ])
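
Before handing these networks to the detector, it can be worth checking that the shapes line up. A minimal sanity check, assuming the encoder_net and decoder_net defined above:

import tensorflow as tf

# The decoder should map a latent vector back to the original image shape.
dummy_latent = tf.zeros((1, latent_dim))
print(decoder_net(dummy_latent).shape)   # expected: (1, 512, 512, 3)

# And the encoder should accept images of the custom shape.
dummy_image = tf.zeros((1, 512, 512, 3))
print(encoder_net(dummy_image).shape)    # expected: (1, 8, 8, 1024)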

The latent vector of shape (1, 1024) is mapped to a vector of shape (1, 8*8*1024), which is then reshaped to (1, 8, 8, 1024) and upscaled by each of the transpose layers: (1, 8, 8, 1024) -> (1, 16, 16, 1024) -> (1, 32, 32, 516) -> (1, 64, 64, 256) -> (1, 128, 128, 128) -> (1, 256, 256, 64) -> (1, 512, 512, 32) -> (1, 512, 512, 3). So (8*8*1024) is really chosen as a convenience in order to reshape the tensor. Typically we choose image heights and widths to be powers of 2 just because it makes this operation of scaling up and down simpler, but in general this doesn’t have to be the case. The formula for the output size of a transpose convolution is documented here; in the common case of padding='same' with no output padding, the output spatial size is simply the input size multiplied by the stride.
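
Putting it together, here is a minimal sketch of passing these networks to the detector. It assumes the alibi-detect OutlierVAE class (the detector referred to as VAEOutlier above); the threshold, samples, placeholder data and training settings are illustrative values rather than recommendations from this thread:

import numpy as np
from alibi_detect.od import OutlierVAE

od = OutlierVAE(
    threshold=0.015,            # illustrative threshold on the reconstruction-based score
    encoder_net=encoder_net,    # encoder defined above, input shape (512, 512, 3)
    decoder_net=decoder_net,    # decoder defined above, output shape (512, 512, 3)
    latent_dim=latent_dim,
    samples=4                   # number of latent samples drawn per instance
)

# Placeholder data for illustration; replace with your own images scaled to [0, 1].
X_train = np.random.rand(16, 512, 512, 3).astype(np.float32)
od.fit(X_train, epochs=30, batch_size=8, verbose=True)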

