Why I should 1: Divide by the spectral norm [ARTICLE IN PROGRESS, DON'T TAKE IT SERIOUSLY]

The idea of this series of very short articles (Why I should) is to explain common practices in machine learning: simple explanations of common tricks, backed by mathematical arguments.

Let’s dive into the spectral norm!


Why the spectral norm is useful

If you work on GANs or robustness, you have probably dealt with the spectral norm.

In both cases, the network weight matrices are divided by their spectral norm. The goal of this operation is to make the network 1-Lipschitz continuous.

Lipschitz continuous application

If $(E, d_E)$ and $(F, d_F)$ are two metric spaces and $l : E \to F$ is an application from $E$ to $F$, then $l$ is called Lipschitz continuous if:

$$\exists K > 0 \;|\; \forall x, y \in E,\quad d_F\big(l(x), l(y)\big) \leq K \, d_E(x, y)$$

In practice, $d_E$ and $d_F$ are norms and $l \in \mathcal{L}(E; F)$ is a linear application.

As $l$ is linear, the Lipschitz condition can be rewritten as:

$$\exists K > 0 \;|\; \forall x \in E,\quad \|Wx\|_F \leq K \, \|x\|_E$$

with $W$ the matrix associated with the linear application $l$.

This equivalence only holds for linear applications, and restricting ourselves to them is natural: in practice, it is only the weight matrices of the models that interest us here.

How should we interpret this property? We can view the norm of a vector as its energy $\mathcal{E}$. The Lipschitz condition can therefore be rewritten as:

$$\exists K > 0 \;|\; \forall x \in E,\quad \frac{\mathcal{E}(Wx)}{\mathcal{E}(x)} \leq K$$

In a sense, the energy ratio between the input $x$ and the output $Wx$ is bounded; this property ensures that the energy of $Wx$ does not explode, which makes the application $W$ more stable and more robust.
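As a quick sanity check, here is a minimal sketch in PyTorch (the matrix and inputs are arbitrary, chosen only for illustration) showing that the energy ratio of a linear application stays bounded whatever the input:

import torch

torch.manual_seed(0)
W = torch.randn(40, 20)  # an arbitrary linear application from R^20 to R^40

# The ratio ||Wx|| / ||x|| stays below a single constant K for every input x.
ratios = []
for _ in range(1000):
    x = torch.randn(20)
    ratios.append(((W @ x).norm() / x.norm()).item())
print(max(ratios))  # empirical upper bound on the energy ratio

The largest ratio observed is an empirical estimate of the smallest possible constant $K$, which is exactly the spectral norm introduced below.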

It is this property that is sought in GANs, or when making a network robust.

Having Lipschitz continuous applications ensures that our model is robust. The Lipschitz constant $K$ lets us control the “degree of robustness” of our application.

We would like a condition stronger than $K$-Lipschitz continuity. In practice, most neural network layers are already $K$-Lipschitz applications for some $K$. However, we would like this constant $K$ to be equal to 1, so that the output energy is smaller than or equal to the input energy.

A solution to make our layer 1-Lipschitz is to use the spectral norm.

Spectral norm

As I said, the mathematical object that will make a network 1-Lipschitz continuous is the spectral norm.

Let $W \in \mathcal{M}_{m,n}(\mathbb{R})$. The spectral norm of $W$ is defined as:

$$\sigma(W) = \sup_{\|x\|_2 \leq 1} \|Wx\|_2 = \sup_{x \neq 0} \frac{\|Wx\|_2}{\|x\|_2}$$
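To connect this definition to code, here is a minimal sketch (with an arbitrary matrix): the spectral norm is the largest singular value of $W$, and every ratio $\|Wx\|_2 / \|x\|_2$ stays below it.

import torch

torch.manual_seed(0)
W = torch.randn(40, 20)

sigma = torch.linalg.matrix_norm(W, ord=2)  # spectral norm = largest singular value
print(sigma, torch.linalg.svdvals(W)[0])    # both give the same value

x = torch.randn(20)
print((W @ x).norm() / x.norm() <= sigma)   # True: the ratio never exceeds sigma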

To transform a linear application into a 1-Lipschitz continuous application, simply divide its matrix $W$ by its spectral norm:

$$W \longmapsto \frac{W}{\sigma(W)}$$
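A minimal sketch of this normalization on a plain tensor (not a layer yet), to check that the rescaled matrix indeed has spectral norm 1:

import torch

torch.manual_seed(0)
W = torch.randn(40, 20)
W_sn = W / torch.linalg.matrix_norm(W, ord=2)  # divide by the spectral norm

print(torch.linalg.matrix_norm(W_sn, ord=2))   # ~1.0
x = torch.randn(20)
print((W_sn @ x).norm() <= x.norm())           # True: output energy <= input energy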

Proof that $W/\sigma(W)$ is 1-Lipschitz continuous

Let $z = \underset{y \neq 0}{\operatorname{argmax}} \, \frac{\|Wy\|}{\|y\|}$, so that $\sigma(W) = \frac{\|Wz\|}{\|z\|} = \sup_{x \neq 0} \frac{\|Wx\|}{\|x\|}$. Then, for all $x \neq 0$:

$$\frac{\left\|\frac{W}{\sigma(W)}x\right\|}{\|x\|} = \frac{\|Wx\|}{\sigma(W)\,\|x\|} = \frac{\|Wx\|\,\|z\|}{\|x\|\,\|Wz\|} = \frac{\;\frac{\|Wx\|}{\|x\|}\;}{\;\frac{\|Wz\|}{\|z\|}\;} \leq 1$$

since $\frac{\|Wz\|}{\|z\|}$ is the supremum of the ratios $\frac{\|Wx\|}{\|x\|}$. Hence:

$$\left\|\frac{W}{\sigma(W)}x\right\| \leq \|x\|$$

By definition, this means that $\frac{W}{\sigma(W)}$ is 1-Lipschitz continuous. ∎

Applying the spectral norm to a neural network

import torch.nn as nn

# Wrap a linear layer so that its weight is rescaled by its (estimated) spectral norm
m = nn.utils.spectral_norm(nn.Linear(20, 40))
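Under the hood, nn.utils.spectral_norm estimates $\sigma(W)$ with power iteration and re-normalizes the weight at every forward pass, so the effective weight has spectral norm close to 1 (close, not exactly, since the estimate is approximate). A quick check using the layer m defined above:

import torch

x = torch.randn(8, 20)
y = m(x)  # the forward pass updates the power-iteration estimate of sigma(W)
print(torch.linalg.matrix_norm(m.weight, ord=2))  # ~1.0, up to the power-iteration approximation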

Before going any further, I will recall a useful fact about composing Lipschitz applications.

Let $f_1 : E_1 \to E_2$, $f_2 : E_2 \to E_3$, $\ldots$, $f_N : E_N \to E_{N+1}$ all be Lipschitz continuous applications:

$$\forall i \in \{1, 2, \ldots, N\},\ \exists K_i > 0 \;|\; \forall x, y \in E_i,\quad \|f_i(x) - f_i(y)\|_{E_{i+1}} \leq K_i \, \|x - y\|_{E_i}$$

Let $F = f_N \circ f_{N-1} \circ \cdots \circ f_1$ be their composition. Then $F$ is $K$-Lipschitz with $K = \prod_{i=1}^{N} K_i$.

Proof that $F$ is $K$-Lipschitz

$$\begin{aligned}
\|f_N(f_{N-1}(\cdots f_1(x))) - f_N(f_{N-1}(\cdots f_1(y)))\| &\leq K_N \, \|f_{N-1}(f_{N-2}(\cdots f_1(x))) - f_{N-1}(f_{N-2}(\cdots f_1(y)))\| \\
&\leq \cdots \leq \left(\prod_{i=1}^{N} K_i\right) \|x - y\|
\end{aligned}$$

By definition, this means that $F$ is $\left(\prod_{i=1}^{N} K_i\right)$-Lipschitz continuous. ∎

In particular, if every layer of a network is 1-Lipschitz ($K_i = 1$ for all $i$), then the whole network $F$ is 1-Lipschitz: this is exactly why dividing each weight matrix by its spectral norm makes the network 1-Lipschitz continuous.
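Here is a small sketch (arbitrary shapes, two spectrally normalized linear maps with a ReLU in between, ReLU itself being 1-Lipschitz) checking this composition property numerically:

import torch

torch.manual_seed(0)

def normalize(W):
    # divide by the spectral norm so that the linear map is 1-Lipschitz
    return W / torch.linalg.matrix_norm(W, ord=2)

W1 = normalize(torch.randn(30, 20))
W2 = normalize(torch.randn(10, 30))

def F(x):
    # composition of 1-Lipschitz applications: f2 o relu o f1
    return W2 @ torch.relu(W1 @ x)

x, y = torch.randn(20), torch.randn(20)
ratio = (F(x) - F(y)).norm() / (x - y).norm()
print(ratio <= 1 + 1e-6)  # True: the whole composition is 1-Lipschitz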
