This post documents my learning process for a simple group convolution from scratch, which I worked through back in 2021 alongside the Geometric Deep Learning (GDL) course. It is based on the course's tutorial 2, which can be found here. To remind myself of the key concepts, I cleaned up my old code and summarized it here.
The key to a group equivariant neural network is to identify the input, hidden, and output spaces, because the network acts as a function mapping from one space to another. I use \(X\) and \(Y\) for the input and output spaces and \(H_i\) for the i-th hidden space. The network consists of a number of layers, transforming input to hidden to output:
\[X \to H_1 \to H_2 \to ... \to Y\]
Each layer is represented by a \(\to\) above, and depending on which spaces a layer maps from and to, its design must be considered accordingly. This is the focus of this post.
0. Dependencies
Let’s just load some dependencies for later use.
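A minimal set of imports assumed by the code sketches in this post (the original notebook may load more, e.g. plotting and dataset utilities):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # for reproducibility of the checks below
```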
1. Group
A group \((G, \circ)\) is a set \(G\) with a binary operation \(\circ: G \times G \to G\) satisfying the following (the \(\circ\) is omitted below):
Associativity: \(a (bc) = (ab) c\) for all \(a, b, c \in G\)
Identity: there exists \(e \in G\) such that \(g e = e g = g\) for all \(g \in G\)
Inverse: \(\forall g \in G\), there exists \(g^{-1} \in G\) such that \(gg^{-1} = g^{-1}g = e\)
Consider the group \(G = C_4\), the cyclic group of planar rotations by multiples of \(\pi / 2\), so \(|C_4| = 4\). Let \(X\) be the set of \(n\times n\) grayscale images. An image \(x \in X\) is a function \(x: p \mapsto x(p) \in \mathbf{R}\) which maps each pixel \(p = (h, w)\) to a real number.
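As a quick, purely illustrative sanity check (not from the original notebook), the elements of \(C_4\) can be represented by the number of \(90°\) rotations \(r \in \{0, 1, 2, 3\}\), with composition given by addition modulo 4:

```python
# C_4 as integers 0..3: composition = addition mod 4, identity = 0, inverse = (4 - r) % 4
elements = [0, 1, 2, 3]
compose = lambda a, b: (a + b) % 4
inverse = lambda a: (4 - a) % 4

assert all(compose(a, compose(b, c)) == compose(compose(a, b), c)
           for a in elements for b in elements for c in elements)   # associativity
assert all(compose(a, 0) == compose(0, a) == a for a in elements)   # identity
assert all(compose(a, inverse(a)) == 0 for a in elements)           # inverse
```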
An element \(g \in G = C_4\) transforms an image \(x \in X\) into the image \(gx \in X\) through rotation. The rotated image \(gx\) is defined by \([gx](p) = x(g^{-1}p)\), where \(g^{-1}p\) is the corresponding pixel in the unrotated image. This action can be implemented as follows:
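One possible implementation uses torch.rot90; the helper name rotate is mine and is reused by the sketches below:

```python
def rotate(x: torch.Tensor, r: int) -> torch.Tensor:
    """Apply the C_4 element g = r * 90 degrees to an image x of shape (..., H, W).

    [gx](p) = x(g^{-1} p) is realized as a counter-clockwise rotation of the pixel grid.
    """
    return x.rot90(r, dims=(-2, -1))
```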
2. Equivariant Convolution Layers
An equivariant layer (or function) \(\psi: X \to Y\) maps an input G-space \(X\) to an output G-space \(Y\) such that \(\psi(gx) = g\psi(x)\) for all \(g \in G\). Here the input space is the pixel space and we choose the same output space, \(Y = X\), so both input and output are grayscale images (they might be of different sizes / dimensions).
The candidate layer will be a convolution with a \(3 \times 3\) filter. We first verify that a convolution with a random filter is not a rotation-equivariant layer.
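A quick check along these lines (the shapes and variable names are illustrative; rotate is the helper defined above):

```python
x = torch.randn(1, 1, 33, 33)      # a random "image"
psi = torch.randn(1, 1, 3, 3)      # an unconstrained random 3x3 filter

out_of_rotated = F.conv2d(rotate(x, 1), psi, padding=1)   # psi(g x)
rotated_out = rotate(F.conv2d(x, psi, padding=1), 1)      # g psi(x)

print("Equivariant" if torch.allclose(out_of_rotated, rotated_out, atol=1e-5)
      else "Not equivariant!!")
```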
Not equivariant!!
Clearly, if the filter \(\psi\) has no constraints on its weights, in general
\[\psi(gx) \neq g\psi(x)\]
Some \(C_4\) symmetry must be baked into the filter. Let's first try an isotropic filter with 2 trainable weights: one for the middle pixel and the other shared by the surrounding ring.
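A sketch of such an isotropic filter, followed by the same check as above (the construction via two masks is my own choice):

```python
w_center, w_ring = torch.randn(1), torch.randn(1)

mask_center = torch.zeros(3, 3)
mask_center[1, 1] = 1.0
psi_iso = (w_center * mask_center + w_ring * (1.0 - mask_center)).view(1, 1, 3, 3)

out_of_rotated = F.conv2d(rotate(x, 1), psi_iso, padding=1)
rotated_out = rotate(F.conv2d(x, psi_iso, padding=1), 1)

print("Equivariant" if torch.allclose(out_of_rotated, rotated_out, atol=1e-5)
      else "Not equivariant!!")
```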
Equivariant
Let's use this filter to build an equivariant convolution layer:
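A rough sketch of such a layer; the class name IsotropicEConv2d matches the one referenced below, while the exact parameterization (one center weight and one ring weight per input/output channel pair) is my assumption:

```python
class IsotropicEConv2d(torch.nn.Module):
    """C_4-equivariant conv layer built from isotropic 3x3 filters."""

    def __init__(self, in_channels: int, out_channels: int, bias: bool = True):
        super().__init__()
        # weight[..., 0] is the center weight, weight[..., 1] the shared ring weight
        self.weight = torch.nn.Parameter(torch.randn(out_channels, in_channels, 2) * 0.1)
        self.bias = torch.nn.Parameter(torch.zeros(out_channels)) if bias else None

    def build_filter(self) -> torch.Tensor:
        center = torch.zeros(3, 3, device=self.weight.device)
        center[1, 1] = 1.0
        ring = 1.0 - center
        return (self.weight[..., 0, None, None] * center
                + self.weight[..., 1, None, None] * ring)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.build_filter(), bias=self.bias, padding=1)
```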
Since this layer maps from the input space \(X\) back to \(X\), we define the following equivariance checker.
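A possible checker (a sketch using the rotate helper; it tests all four rotations on a random input):

```python
def check_equivariance_X_to_X(layer, in_channels: int = 1, size: int = 33, atol: float = 1e-5):
    x = torch.randn(1, in_channels, size, size)
    with torch.no_grad():
        y = layer(x)
        for r in range(4):
            if not torch.allclose(layer(rotate(x, r)), rotate(y, r), atol=atol):
                print("Not equivariant!!")
                return
    print("Equivariant")
```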
Let's check if the IsotropicEConv2d is actually equivariant:
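With the hypothetical layer and checker sketched above, this could look like:

```python
check_equivariance_X_to_X(IsotropicEConv2d(in_channels=1, out_channels=8))
```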
Equivariant
Unfortunately, isotropic filters are not very expressive. Instead, we would like to use more general, unconstrained filters. To do so, we need to rely on group convolution.
Let \(X\) be the space of grayscale images, \(\psi \in X\) a filter and \(x \in X\) an input image. The group convolution is
\[[x \star \psi](g) = \sum_{p} x(p)\,[g\psi](p) = \sum_{p} x(p)\,\psi(g^{-1}p)\]
The output of this convolution is not a grayscale image in \(X\). It is now a function over the group p4 of planar rotations and translations: using the filter \(\psi\) in the group convolution maps the input space \(X\) into a new, larger space \(Y\), the space of all functions \(y: p4 \to \mathbf{R}\).
This is the lifting convolution, since it maps the space \(X\) to the more complex space \(Y\). Note that a function \(y \in Y\) can be implemented as a 4-channel image, where the i-th channel is defined as \(y_i(t) = y(t, r=i) \in \mathbf{R}\), \(i \in \{0, 1, 2, 3\}\), with \(t\) ranging over the pixels.
At the end of the day, we want a network
\[X \to H_1 \to H_2 \to ... \to Y\]
where \(X\) is the grayscale (or 3-channel) image space, the \(H_i\)'s are the group (hidden) spaces, and \(Y\) can be a pooled invariant output space or an equivariant output space.
Let's see how a rotation by \(r = 1\) (i.e. \(90°\)) acts on a p4 feature map: it rotates each channel spatially and cyclically permutes the 4 rotation channels:
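A sketch of this action on a tensor of shape (batch, channels, 4, height, width); the helper name rotate_p4 is mine:

```python
def rotate_p4(y: torch.Tensor, r: int) -> torch.Tensor:
    """Apply a rotation by r * 90 degrees to a p4 feature map of shape (B, C, 4, H, W):
    rotate the spatial grid and cyclically shift the rotation channels by r."""
    return y.rot90(r, dims=(-2, -1)).roll(r, dims=-3)
```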
Next, we will build a lifting convolution. The input is a grayscale image \(x \in X\) and the output is a function \(y \in Y\). This can be realized with the usual convolution by using 4 rotated copies of a SINGLE learnable filter: the image is convolved with each copy independently, which is implemented by stacking the 4 copies into a single filter with 4 output channels.
Finally, a convolutional layer usually includes a bias term. In a normal convolutional network, it is common to share the same bias over all pixels, i.e. the same bias is added to the features at each pixel. Similarly, in a lifting convolution we share the bias over all pixels but also over all rotations, i.e. over the 4 rotation channels.
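A rough sketch of how such a lifting layer could look (the class name LiftingConv2d matches the one referenced below; the initialization scale and other details are my assumptions):

```python
class LiftingConv2d(torch.nn.Module):
    """Lifting convolution X -> Y: one learnable filter, applied in 4 rotated copies.
    Input: (B, in_channels, H, W); output: (B, out_channels, 4, H, W)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1, bias=True):
        super().__init__()
        self.out_channels, self.padding = out_channels, padding
        self.weight = torch.nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1)
        # one bias per output channel, shared over all pixels and all 4 rotations
        self.bias = torch.nn.Parameter(torch.zeros(out_channels)) if bias else None

    def build_filter(self) -> torch.Tensor:
        # stack the 4 rotated copies of the single filter along a new rotation axis
        w = torch.stack([self.weight.rot90(r, dims=(-2, -1)) for r in range(4)], dim=1)
        return w.reshape(self.out_channels * 4, *self.weight.shape[1:])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.conv2d(x, self.build_filter(), padding=self.padding)
        out = out.view(x.shape[0], self.out_channels, 4, out.shape[-2], out.shape[-1])
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1, 1)
        return out
```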
Since this layer maps from the input space \(X\) to the group space \(Y\), we define another equivariance checker.
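A possible checker for layers mapping \(X\) to \(Y\) (a sketch; it compares the layer applied to a rotated image with the rotated p4 output):

```python
def check_equivariance_X_to_Y(layer, in_channels: int = 1, size: int = 33, atol: float = 1e-5):
    x = torch.randn(1, in_channels, size, size)
    with torch.no_grad():
        y = layer(x)
        for r in range(4):
            if not torch.allclose(layer(rotate(x, r)), rotate_p4(y, r), atol=atol):
                print("Not equivariant!!")
                return
    print("Equivariant")
```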
Let’s check if the LiftingConv2d is equivariant.
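With the sketches above, this could look like:

```python
check_equivariance_X_to_Y(LiftingConv2d(in_channels=1, out_channels=8))
```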
Equivariant
The lifting convolution is only the first piece of the \(C_4\)-equivariant neural network. Remember, we are after
\[X \to H_1 \to H_2 \to ... \to Y\]
and the lifting convolution is only the first "\(\to\)"; we still need to build equivariant layers \(H_i \to H_j\).
Compare this with the usual CNN: \(X \to X \to X \to ... \to Y\).
We will now construct a convolution acting on the OUTPUT of the lifting convolution.
We again use 4 rotated copies of a learnable filter, but now the filter itself has 4 rotation channels, matching the 4 channels produced by the lifting convolution, and rotating it also cyclically permutes those channels. The output of this group convolution again has 4 rotation channels, so we can stack group convolutions to build a deep G-CNN.
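A rough sketch of such a group convolution layer (the class name GroupConv2d matches the one referenced below; the weight initialization and reshaping details are my assumptions):

```python
class GroupConv2d(torch.nn.Module):
    """Group convolution Y -> Y. The learnable filter lives on the group, i.e. it has
    shape (out_channels, in_channels, 4, k, k). For each output rotation r, the filter
    is rotated spatially by r and its input-rotation axis is cyclically rolled by r
    (the same action as rotate_p4, applied to the filter itself).
    Input: (B, in_channels, 4, H, W); output: (B, out_channels, 4, H, W)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1, bias=True):
        super().__init__()
        self.in_channels, self.out_channels, self.padding = in_channels, out_channels, padding
        self.weight = torch.nn.Parameter(
            torch.randn(out_channels, in_channels, 4, kernel_size, kernel_size) * 0.1)
        self.bias = torch.nn.Parameter(torch.zeros(out_channels)) if bias else None

    def build_filter(self) -> torch.Tensor:
        ws = [self.weight.rot90(r, dims=(-2, -1)).roll(r, dims=-3) for r in range(4)]
        w = torch.stack(ws, dim=1)  # (out, 4, in, 4, k, k)
        return w.reshape(self.out_channels * 4, self.in_channels * 4, *self.weight.shape[-2:])

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        B, _, _, H, W = y.shape
        out = F.conv2d(y.reshape(B, self.in_channels * 4, H, W),
                       self.build_filter(), padding=self.padding)
        out = out.view(B, self.out_channels, 4, out.shape[-2], out.shape[-1])
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1, 1)
        return out
```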
Now, GroupConv2d maps from group space \(Y\) to the same group space \(Y\). We will define another equivariance checker.
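A possible checker for layers mapping the group space \(Y\) to itself (a sketch):

```python
def check_equivariance_Y_to_Y(layer, in_channels: int = 8, size: int = 33, atol: float = 1e-5):
    y = torch.randn(1, in_channels, 4, size, size)
    with torch.no_grad():
        out = layer(y)
        for r in range(4):
            if not torch.allclose(layer(rotate_p4(y, r)), rotate_p4(out, r), atol=atol):
                print("Not equivariant!!")
                return
    print("Equivariant")
```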
And check the equivariance.
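For instance, with the sketches above:

```python
check_equivariance_Y_to_Y(GroupConv2d(in_channels=8, out_channels=16))
```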
Equivariant
3. Implement A Deep Rotation Equivariant CNN
Finally, you can combine the layers implemented earlier to build a rotation equivariant CNN. Your model will take as input batches of $33 \times 33$ images with a single input channel.
The network starts with a lifting layer with $8$ output channels, followed by $4$ group convolutions with, respectively, $16$, $32$, $64$ and $128$ output channels. All convolutions have kernel size $3$, padding $1$ and stride $1$, and use a bias. Each convolution is followed by torch.nn.MaxPool3d and torch.nn.ReLU. Note that we use MaxPool3d rather than MaxPool2d since our feature tensors have $5$ dimensions (there is an additional dimension of size $4$). In all pooling layers but the last, we use a kernel of size $(1, 3, 3)$, a stride of $(1, 2, 2)$ and a padding of $(0, 1, 1)$. This ensures pooling is done only over the spatial dimensions, while the rotational dimension is preserved. The last pooling layer, however, also pools over the rotational dimension, so it uses a kernel of size $(4, 3, 3)$, stride $(1, 1, 1)$ and padding $(0, 0, 0)$.
Finally, the features extracted by the convolutional part are passed to a linear layer to classify the input into $10$ classes.
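A sketch of such a network, built from the hypothetical LiftingConv2d and GroupConv2d layers sketched above (only the hyperparameters follow the description; the exact module layout is my assumption):

```python
class C4CNN(torch.nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()

        def pool():
            # spatial-only pooling: the rotation dimension (size 4) is preserved
            return torch.nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))

        self.blocks = torch.nn.Sequential(
            LiftingConv2d(1, 8, kernel_size=3, padding=1), pool(), torch.nn.ReLU(),
            GroupConv2d(8, 16, kernel_size=3, padding=1), pool(), torch.nn.ReLU(),
            GroupConv2d(16, 32, kernel_size=3, padding=1), pool(), torch.nn.ReLU(),
            GroupConv2d(32, 64, kernel_size=3, padding=1), pool(), torch.nn.ReLU(),
            GroupConv2d(64, 128, kernel_size=3, padding=1),
            # the last pooling layer also pools over the rotation dimension
            torch.nn.MaxPool3d((4, 3, 3), stride=(1, 1, 1), padding=(0, 0, 0)),
            torch.nn.ReLU(),
        )
        self.classifier = torch.nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.blocks(x)              # (B, 128, 1, 1, 1) for 33x33 inputs
        return self.classifier(h.flatten(1))
```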
So C4CNN is invariant to \(C_4\) rotations, meaning it can recognize an image even when the image is rotated by an element of \(C_4\). Although its output is invariant, C4CNN contains many hidden features that are equivariant. In other words:
Rotated image -> rotated features -> rotated features -> … -> invariant output
Allowing equivariant hidden features makes the model more powerful and data-efficient, because the model is already restricted to respect the symmetry.
Let’s check if the C4CNN is invariant.
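A sketch of an invariance check (the model output should stay the same, up to numerical error, for all four rotated copies of the input):

```python
def check_invariance(model, size: int = 33, atol: float = 1e-4):
    x = torch.randn(1, 1, size, size)
    with torch.no_grad():
        y = model(x)
        for r in range(4):
            print(r, torch.allclose(model(rotate(x, r)), y, atol=atol))

check_invariance(C4CNN())
```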
4. Compare CNN to GCNN on rotated MNIST dataset
After building C4CNN as a \(C_4\)-invariant model with \(C_4\)-equivariant features, we want to compare it with a typical CNN on the rotated MNIST dataset.
We define functions to train and test the models: typical PyTorch loops with and without gradient computation.
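A minimal sketch of such training and testing functions (assuming a cross-entropy loss and an Adam optimizer; the hyperparameters used in the original post may differ):

```python
def train_model(model, loader, epochs: int = 10, lr: float = 1e-3, device: str = "cpu"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, t in loader:
            x, t = x.to(device), t.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), t)
            loss.backward()
            optimizer.step()


def test_model(model, loader, device: str = "cpu") -> float:
    model.to(device).eval()
    correct = total = 0
    with torch.no_grad():
        for x, t in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == t).sum().item()
            total += t.numel()
    return correct / total
```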
Next, we define a CNN model similar to C4CNN by recycling the code.
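One way such a baseline could look: a conventional CNN with the same channel progression and analogous pooling, so the comparison is roughly apples-to-apples (the exact architecture used in the original post may differ):

```python
class CNN(torch.nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()

        def block(cin, cout):
            return [torch.nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                    torch.nn.MaxPool2d(3, stride=2, padding=1),
                    torch.nn.ReLU()]

        self.blocks = torch.nn.Sequential(
            *block(1, 8), *block(8, 16), *block(16, 32), *block(32, 64),
            torch.nn.Conv2d(64, 128, kernel_size=3, padding=1),
            torch.nn.MaxPool2d(3, stride=1, padding=0),
            torch.nn.ReLU(),
        )
        self.classifier = torch.nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.blocks(x).flatten(1))
```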
Let’s finally get the models trained and report the accuracies.
5. Final Note
The performance of C4CNN is significantly higher. Over 25 repeated runs, C4CNN and CNN averaged 92% and 81% accuracy, respectively. However, C4CNN took about 5x longer to train. Considering that data augmentation would only (roughly) quadruple the training data, and that approximate equivariance tends to show up naturally in trained filters anyway, the built-in equivariance might not be worth the cost in this case.
However, built-in equivariance makes a big difference for continuous groups, where one cannot simply rotate the filters to achieve equivariance.