Training the discriminator amounts to training a good estimator of the density ratio between the model distribution and the target distribution. This density-ratio view is also employed in implicit models for variational optimization.

In high dimensional spaces,

  • density ratio estimation by the discriminator is often inaccurate and unstable during training
  • generator networks can fail to learn the multimodal structure of the target distribution

Moreover, when the supports of the model distribution and the target distribution are disjoint, there exists a perfect discriminator, which halts the training of the generator (the gradient it provides vanishes).

So we need a better weight normalization method that can stabilize the training of discriminator networks. This is the motivation behind the method we study now: spectral normalization.

Basics

Let us take a simple neural network with input $x$ and output $f(x, \theta)$, where $\theta = \{W^1, W^2, \dots, W^{L+1}\}$ is the set of weights and $a_1, \dots, a_L$ are the element-wise non-linear activation functions:

$$f(x,\theta)=W^{L+1}a_L\big(W^L a_{L-1}\big(W^{L-1}\cdots a_1(W^1x)\cdots\big)\big)$$

Here we have omitted the bias terms. The discriminator is then given by:

$$D(x,\theta)=\mathcal A(f(x,\theta))$$

where $\cal A$ is the activation function corresponding to the divergence or distance measure of choice (e.g. the sigmoid for the standard GAN). For a GAN we solve:

$$\min_G\max_D V(G,D)$$

with $V(G, D)$ as described in the introduction section. For a fixed generator $G$, the optimal discriminator is known to be $\displaystyle D_G^*(x)=\frac{q_{\rm data}(x)}{q_{\rm data}(x)+p_G(x)}$, which can be written as:

$$D_G^*(x)=\operatorname{sigmoid}(f^*(x)),\qquad f^*(x)=\log q_{\rm data}(x)-\log p_G(x)$$

whose derivative is:

$$\nabla_x f^*(x)=\frac{1}{q_{\rm data}(x)}\nabla_x q_{\rm data}(x)-\frac{1}{p_G(x)}\nabla_x p_G(x)$$

which can be unbounded or even incomputable. So we need to impose some regularity condition on the derivative of $f(x)$. One successful approach has been to control the Lipschitz constant of the discriminator by adding regularization terms. Spectral normalization is based on a similar idea.

Spectral Normalization

Spectral Norm: the spectral norm $\sigma(A)$ of a matrix $A$ is its largest singular value, and we can think of it as the largest factor by which $A$ can stretch a vector, i.e.

$$\sigma(A)=\max_{h:h\ne 0}\frac{\Vert Ah\Vert_2}{\Vert h\Vert_2}=\max_{\Vert h\Vert_2\le 1}\Vert Ah\Vert_2$$

Here $h$ ranges over the input vectors used to measure the stretching.
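As a quick sanity check, here is a minimal numpy sketch (matrix shapes and seed are arbitrary choices) showing that the largest singular value does bound the stretch factor $\Vert Ah\Vert/\Vert h\Vert$ over random directions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# Spectral norm via SVD: the largest singular value of A.
sigma = np.linalg.svd(A, compute_uv=False)[0]

# Largest stretch factor ||Ah|| / ||h|| over many random directions h.
stretches = [
    np.linalg.norm(A @ h) / np.linalg.norm(h)
    for h in rng.standard_normal((1000, 3))
]

# No direction can be stretched by more than sigma.
assert max(stretches) <= sigma + 1e-9
```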

In spectral normalization we constrain the spectral norm of each layer $g:h_{\rm in}\mapsto h_{\rm out}$ (for a linear layer, $g(h)=Wh$).

Lipschitz Norm: $\Vert g\Vert_{\rm Lip}$ is the supremum of the spectral norm of the gradient of $g$:

$$\Vert g\Vert_{\rm Lip}=\sup_h\sigma(\nabla g(h))$$

  • For a linear layer we have $\vert\vert g\vert\vert_{\rm Lip}=\sup_h\sigma(W)=\sigma(W)$
  • We assume that the Lipschitz norm of each activation function is $\vert\vert a_l\vert\vert_{\rm Lip}=1$ (true for ReLU and leaky ReLU)
  • For compositions we have $\vert\vert g_1\circ g_2\vert\vert_{\rm Lip}\le \vert\vert g_1\vert\vert_{\rm Lip}\cdot \vert\vert g_2\vert\vert_{\rm Lip}$
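For linear layers the composition bound specializes to $\sigma(W_2W_1)\le\sigma(W_2)\,\sigma(W_1)$, which a small numpy check (shapes and seed chosen arbitrarily) illustrates:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((5, 3))   # first linear layer
W2 = rng.standard_normal((4, 5))   # second linear layer

# Spectral norm = largest singular value.
sn = lambda M: np.linalg.svd(M, compute_uv=False)[0]

# The Lipschitz norm of the composed map h -> W2 @ W1 @ h is bounded
# by the product of the individual spectral norms.
assert sn(W2 @ W1) <= sn(W2) * sn(W1) + 1e-9
```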

So, applying the composition bound layer by layer, we have:

$$\Vert f\Vert_{\rm Lip}\le\Vert W^{L+1}\Vert_{\rm Lip}\cdot\prod_{l=1}^{L}\big(\Vert a_l\Vert_{\rm Lip}\cdot\Vert W^l\Vert_{\rm Lip}\big)=\prod_{l=1}^{L+1}\sigma(W^l)$$

Finally we normalize the spectral norm of each weight matrix $W$ so that $\sigma(\bar W_{\rm SN})=1$, i.e.

$$\bar W_{\rm SN}(W)=\frac{W}{\sigma(W)}$$

Normalizing every weight matrix this way, we finally get $\vert\vert f\vert\vert_{\rm Lip}\le 1$.
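The whole pipeline can be sketched in a few lines of numpy. This is a toy example, not the paper's training code: the layer sizes, seed, and use of ReLU are assumptions, and the Lipschitz property is checked empirically on one pair of inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
sn = lambda M: np.linalg.svd(M, compute_uv=False)[0]  # spectral norm

# Hypothetical 3-layer network with 1-Lipschitz ReLU activations.
weights = [rng.standard_normal(s) for s in [(16, 8), (16, 16), (1, 16)]]

# Spectral normalization: divide each weight by its spectral norm,
# so every layer has sigma(W_bar) = 1.
weights = [W / sn(W) for W in weights]

def f(x):
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)  # ReLU
    return weights[-1] @ x

# The normalized network is (at most) 1-Lipschitz:
# |f(x) - f(y)| <= ||x - y|| for any pair of inputs.
x, y = rng.standard_normal((2, 8))
assert abs(f(x) - f(y))[0] <= np.linalg.norm(x - y) + 1e-9
```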

Note: computing $\sigma(W)$ with a full SVD at every training step would be very inefficient; instead we can use the power iteration method.
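A minimal sketch of power iteration for the spectral norm follows; the function name, iteration count, and seed are illustrative choices. In practice a single iteration per SGD step suffices, because the weights change only slightly between updates and the singular vectors can be reused.

```python
import numpy as np

def spectral_norm_power_iter(W, n_iter=100, eps=1e-12):
    """Approximate the largest singular value of W by power iteration.

    Each step costs two matrix-vector products, which is much cheaper
    than a full SVD of W.
    """
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps   # right singular vector estimate
        u = W @ v
        u /= np.linalg.norm(u) + eps   # left singular vector estimate
    # Rayleigh-style estimate u^T W v converges to sigma(W) from below.
    return u @ W @ v

W = np.random.default_rng(3).standard_normal((6, 4))
exact = np.linalg.svd(W, compute_uv=False)[0]
approx = spectral_norm_power_iter(W)
assert approx <= exact + 1e-9
```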

References