Training the discriminator amounts to training a good estimator of the density ratio between the model distribution and the target distribution; this ratio is also employed in implicit models for variational optimization.
In high-dimensional spaces:
- the density ratio estimated by the discriminator is often inaccurate and unstable during training, and
- generator networks can fail to learn the multimodal structure of the target distribution.

Moreover, when the supports of the model distribution and of the target distribution are disjoint, a perfect discriminator exists, and reaching it halts the training of the generator.
So we need a better weight normalization method that can stabilize the training of discriminator networks. This is what we will study now: spectral normalization.
Basics
Consider a simple neural network with input $x$ and output $f(x, \theta)$,

$$f(x,\theta)=W^{L+1}a_L\big(W^L\big(a_{L-1}\big(W^{L-1}\big(\dots a_1(W^1x)\dots\big)\big)\big)\big),$$

where $\theta = \{W^1, W^2, \dots, W^{L+1}\}$ is the set of weight matrices and the $a_l$ are the non-linear activation functions. Here we have omitted the bias terms for simplicity. The discriminator is given by

$$D(x,\theta)=\mathcal{A}(f(x,\theta)),$$

where $\mathcal{A}$ is the activation function corresponding to the divergence or distance measure being used. For GANs we have the objective

$$\min_G\max_D V(G,D),$$
with $V(G, D)$ as described in the introduction section. For a fixed generator $G$, the optimal discriminator is known to be $\displaystyle D_G^*(x)=\frac{q_{\rm data}(x)}{q_{\rm data}(x)+p_G(x)}$, which can be written as

$$D_G^*(x)={\rm sigmoid}(f^*(x)),\qquad f^*(x)=\log q_{\rm data}(x)-\log p_G(x),$$

whose derivative is

$$\nabla_x f^*(x)=\frac{1}{q_{\rm data}(x)}\nabla_x q_{\rm data}(x)-\frac{1}{p_G(x)}\nabla_x p_G(x),$$

which can be unbounded or even incomputable. So we need to impose some regularity condition on the derivative of $f(x)$. One successful approach has been to control the Lipschitz constant of the discriminator by adding regularization terms. Spectral normalization is based on a similar approach.
Spectral Normalization
Spectral Norm: $\sigma(A)$ is the largest singular value of a matrix $A$, and we can think of it as the largest factor by which $A$ can scale a vector, i.e.

$$\sigma(A)=\max_{h\neq 0}\frac{\Vert Ah\Vert_2}{\Vert h\Vert_2}=\max_{\Vert h\Vert_2\le 1}\Vert Ah\Vert_2.$$

Here the $h$ are the vectors used to measure the scaling.
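To make this concrete, here is a small NumPy sketch (the matrix is illustrative): it computes the spectral norm via SVD and checks that no vector $h$ is scaled by more than $\sigma(A)$.

```python
import numpy as np

# Illustrative matrix (not from the original text).
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Spectral norm = largest singular value of A.
sigma = np.linalg.svd(A, compute_uv=False)[0]

# Equivalent definition: sup over h != 0 of ||A h|| / ||h||.
rng = np.random.default_rng(0)
ratios = [np.linalg.norm(A @ h) / np.linalg.norm(h)
          for h in rng.standard_normal((5000, 2))]

# No sampled h is stretched by more than sigma.
assert max(ratios) <= sigma + 1e-9
```

For this $A$, $A^\top A$ has eigenvalues $45$ and $5$, so $\sigma(A)=\sqrt{45}\approx 6.71$.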
In spectral normalization we constrain the spectral norm of each layer $g:h_{\rm in}\mapsto h_{\rm out}$.
Lipschitz Norm: is the supremum of the spectral norm of the gradient of $g$,

$$\Vert g\Vert_{\rm Lip}=\sup_h\sigma(\nabla g(h)).$$
- For a linear layer we have $\vert\vert g\vert\vert_{\rm Lip}=\sup_h\sigma(W)=\sigma (W)$
- We assume that the Lipschitz norm of the activation function satisfies $\vert\vert a_l\vert\vert_{\rm Lip}=1$ (this holds, for example, for ReLU and leaky ReLU)
- We have $\vert\vert g_1\circ g_2\vert\vert_{\rm Lip}\le \vert\vert g_1\vert\vert_{\rm Lip}\cdot \vert\vert g_2\vert\vert_{\rm Lip} $
Combining these observations, we have:

$$\Vert f\Vert_{\rm Lip}\le\prod_{l=1}^{L+1}\sigma(W^l).$$
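The composition bound above can be checked numerically. The sketch below (weights are random, purely for illustration) builds a two-layer network with a ReLU in between and verifies that no pair of inputs is separated by more than $\sigma(W^2)\,\sigma(W^1)$ times their distance.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))  # first linear layer
W2 = rng.standard_normal((2, 5))  # second linear layer

def f(x):
    # Two-layer network with a 1-Lipschitz activation (ReLU).
    return W2 @ np.maximum(W1 @ x, 0.0)

# Lipschitz bound: product of the layers' spectral norms.
bound = (np.linalg.svd(W2, compute_uv=False)[0]
         * np.linalg.svd(W1, compute_uv=False)[0])

# Empirically, ||f(x) - f(y)|| / ||x - y|| never exceeds the bound.
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    ratio = np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
    assert ratio <= bound + 1e-9
```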
Finally, we normalize the spectral norm of each weight matrix $W$ so that $\sigma(\bar W_{\rm SN})=1$, i.e.

$$\bar W_{\rm SN}(W)=\frac{W}{\sigma(W)}.$$

Normalizing every layer's weights this way gives $\vert\vert f\vert\vert_{\rm Lip}\le 1$.
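A minimal sketch of this normalization step (the function name `spectral_normalize` is ours, not from the paper):

```python
import numpy as np

def spectral_normalize(W):
    # Divide W by its largest singular value so that
    # the normalized matrix has spectral norm exactly 1.
    sigma = np.linalg.svd(W, compute_uv=False)[0]
    return W / sigma

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W_sn = spectral_normalize(W)

# Largest singular value of the normalized matrix is 1.
assert np.isclose(np.linalg.svd(W_sn, compute_uv=False)[0], 1.0)
```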
Note: computing $\sigma(W)$ with a full SVD at every training step would be very inefficient; instead we can use the power iteration method.
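A sketch of power iteration for estimating $\sigma(W)$ (the helper name and iteration count are ours; in practice, as in the SN-GAN paper, a single iteration per training step suffices because $W$ changes slowly between updates):

```python
import numpy as np

def power_iteration_sigma(W, n_iters=500, seed=0):
    # Alternately apply W and W^T to approximate the leading
    # left (u) and right (v) singular vectors of W.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    # u^T W v approximates the largest singular value.
    return float(u @ W @ v)

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))
sigma_pi = power_iteration_sigma(W)
sigma_svd = np.linalg.svd(W, compute_uv=False)[0]
assert abs(sigma_pi - sigma_svd) / sigma_svd < 1e-3
```

Each iteration costs only two matrix-vector products, versus the full decomposition an SVD would compute.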