The Many Ways to Analyse Gradient Descent

Consider the classical gradient descent method:
$$x_{k+1} = x_k - \alpha \nabla f(x_k).$$

It’s a thing of beauty, isn’t it? While it’s not used directly in practice any more, the proof techniques used in its analysis are the building blocks behind the theory of more advanced optimization methods. I know of 8 different ways of proving its convergence rate. Each of the proof techniques is interesting in its own right, but most books on convex optimization give just a single proof of convergence, then move on to greater things. But to do research in modern convex optimization you should know them all.
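To keep things concrete, here is a minimal sketch of the method in Python/NumPy, run on an illustrative strongly convex quadratic. The matrix, step size, starting point, and iteration count are all arbitrary choices for demonstration, not part of any proof below.

```python
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 * x^T A x,
# whose Hessian A has eigenvalues in [mu, L] = [1, 10].
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x           # gradient of f

alpha = 1.0 / 10.0                 # step size alpha = 1/L
x = np.array([3.0, -2.0])          # arbitrary starting point x_0
for k in range(100):
    x = x - alpha * grad_f(x)      # x_{k+1} = x_k - alpha * grad f(x_k)

print(x)                           # approaches the minimizer x* = 0
```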

The purpose of this series of posts is to detail each of these proof techniques and the applications they have to more advanced methods. This post will cover the proofs under strong convexity assumptions, and the next post will cover the non-strongly convex case. Unlike most proofs in the literature, we will go into the details of every step, so that these proofs can be used as a reference (don’t cite this post directly though; cite the original source preferably, or the technical notes version). If you are aware of any methods I’ve not covered, please leave a comment with a reference so I can update this post.

For most of the proofs we end with a statement like $A_{k+1} \leq (1 - \gamma)A_k$, where $A_k$ is some quantity of interest, like the distance to the solution or the function value sub-optimality. A full proof requires chaining these inequalities for each $k$, giving something of the form $A_k \leq (1 - \gamma)^k A_0$. We leave this step as a given.

Basic lemmas

These hold for any $x$ and $y$. Here $\mu$ is the strong convexity constant and $L$ the Lipschitz smoothness constant. These are completely standard; see Nesterov’s book [7] for proofs. We use the notation $x^*$ for the unique minimizer of $f$ (for strongly convex problems).

$$f(y) \leq f(x) + \left\langle \nabla f(x),\, y - x \right\rangle + \frac{L}{2}\left\|x - y\right\|^2. \tag{1}$$
$$f(y) \geq f(x) + \left\langle \nabla f(x),\, y - x \right\rangle + \frac{\mu}{2}\left\|x - y\right\|^2. \tag{2}$$
$$f(y) \geq f(x) + \left\langle \nabla f(x),\, y - x \right\rangle + \frac{1}{2L}\left\|\nabla f(x) - \nabla f(y)\right\|^2. \tag{3}$$
$$f(y) \leq f(x) + \left\langle \nabla f(x),\, y - x \right\rangle + \frac{1}{2\mu}\left\|\nabla f(x) - \nabla f(y)\right\|^2. \tag{4}$$
$$\left\langle \nabla f(x) - \nabla f(y),\, x - y \right\rangle \geq \frac{1}{L}\left\|\nabla f(x) - \nabla f(y)\right\|^2. \tag{5}$$
$$\left\langle \nabla f(x) - \nabla f(y),\, x - y \right\rangle \geq \mu\left\|x - y\right\|^2. \tag{6}$$
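As a sanity check, all six inequalities can be verified numerically. Below is a rough sketch on an illustrative quadratic, where $\mu$ and $L$ are simply the extreme eigenvalues of the Hessian; the matrix, sample count, and tolerance are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 10.0])      # Hessian of f(x) = 0.5 * x^T A x
mu, L = 1.0, 10.0                  # extreme eigenvalues of A
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x                # gradient
tol = 1e-9                         # slack for floating point error

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    lin = f(x) + g(x) @ (y - x)            # first-order expansion about x
    d2 = np.sum((x - y) ** 2)              # ||x - y||^2
    gd2 = np.sum((g(x) - g(y)) ** 2)       # ||grad f(x) - grad f(y)||^2
    ip = (g(x) - g(y)) @ (x - y)           # the inner product in (5), (6)
    assert f(y) <= lin + (L / 2) * d2 + tol       # (1)
    assert f(y) >= lin + (mu / 2) * d2 - tol      # (2)
    assert f(y) >= lin + gd2 / (2 * L) - tol      # (3)
    assert f(y) <= lin + gd2 / (2 * mu) + tol     # (4)
    assert ip >= gd2 / L - tol                    # (5)
    assert ip >= mu * d2 - tol                    # (6)
print("all six inequalities hold on the sampled points")
```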

1 Function Value Descent

There is a very simple proof involving just the function values. We start by showing that the function value descent is controlled by the gradient norm:

Lemma 1. For any given α, the change in function value between steps can be bounded as follows:

$$f(x_k) - f(x_{k+1}) \geq \alpha\left(1 - \frac{1}{2}\alpha L\right)\left\|\nabla f(x_k)\right\|^2;$$

in particular, if $\alpha = \frac{1}{L}$ we have $f(x_k) - f(x_{k+1}) \geq \frac{1}{2L}\left\|\nabla f(x_k)\right\|^2$.

Proof. We start with (1), the Lipschitz upper bound about $x_k$:

$$f(x_{k+1}) \leq f(x_k) + \left\langle \nabla f(x_k),\, x_{k+1} - x_k \right\rangle + \frac{L}{2}\left\|x_{k+1} - x_k\right\|^2.$$

Now we plug in the step equation $x_{k+1} - x_k = -\alpha \nabla f(x_k)$:

$$f(x_{k+1}) \leq f(x_k) - \alpha\left\|\nabla f(x_k)\right\|^2 + \frac{\alpha^2 L}{2}\left\|\nabla f(x_k)\right\|^2.$$

Negating and rearranging gives:

$$f(x_k) - f(x_{k+1}) \geq \alpha\left(1 - \frac{1}{2}\alpha L\right)\left\|\nabla f(x_k)\right\|^2. \qquad \square$$

Now since we are considering strongly convex problems, we have actually found a bound on the gradient norm in terms of the function value. We apply (4), $f(y) \leq f(x) + \left\langle \nabla f(x),\, y - x \right\rangle + \frac{1}{2\mu}\left\|\nabla f(x) - \nabla f(y)\right\|^2$, using $x = x^*$, $y = x_k$, together with $\nabla f(x^*) = 0$:

$$f(x_k) \leq f(x^*) + \frac{1}{2\mu}\left\|\nabla f(x_k)\right\|^2,$$

$$\therefore \left\|\nabla f(x_k)\right\|^2 \geq 2\mu\left(f(x_k) - f(x^*)\right).$$

So combining these two results:

$$f(x_k) - f(x_{k+1}) \geq \frac{1}{2L}\left\|\nabla f(x_k)\right\|^2 \geq \frac{\mu}{L}\left(f(x_k) - f(x^*)\right).$$

We then negate, add and subtract $f(x^*)$, then rearrange:

$$f(x_{k+1}) - f(x_k) \leq -\frac{\mu}{L}\left(f(x_k) - f(x^*)\right),$$

$$f(x_{k+1}) - f(x^*) \leq f(x_k) - f(x^*) - \frac{\mu}{L}\left(f(x_k) - f(x^*)\right),$$

$$f(x_{k+1}) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)\left(f(x_k) - f(x^*)\right).$$

Note that this function value style proof requires the step size $\alpha = \frac{1}{L}$ or smaller, instead of $\alpha = \frac{2}{\mu + L}$, which we shall see gives the fastest convergence when using some of the other proof techniques below.
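The per-step contraction just derived is easy to observe numerically. A sketch on an illustrative quadratic, chosen so that $x^* = 0$ and $f(x^*) = 0$:

```python
import numpy as np

A = np.diag([1.0, 10.0])           # mu = 1, L = 10, minimizer x* = 0
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x      # f(x_k) - f(x*) is just f(x_k) here

x = np.array([3.0, -2.0])
for k in range(50):
    x_next = x - (1.0 / L) * (A @ x)                   # step size 1/L
    assert f(x_next) <= (1 - mu / L) * f(x) + 1e-12    # the derived rate
    x = x_next
print("function value contraction (1 - mu/L) holds on this instance")
```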

Comments

This proof (when $\alpha = \frac{1}{L}$ is used) treats gradient descent as an upper bound minimization scheme. Such methods, sometimes known under the Majorization-Minimization nomenclature [3], are quite widespread in optimization. They can even be applied to non-convex problems, although the convergence rates in that case are necessarily weak. Likewise, this proof gives the weakest convergence rate of the proof techniques presented in this post, but it is perhaps the simplest. Upper bound minimization techniques have recently seen interesting applications in 2nd order optimization, in the form of Nesterov’s cubically regularized Newton’s method [9]. For stochastic optimization, the MISO method is also an upper bound minimization scheme [6]. For non-smooth problems, an interesting application of the MM approach is in minimizing convex problems with non-convex regularizers of the form $\lambda \log(|x| + 1)$, in the form of reweighted L1 regularization [5].

2 Iterate Descent

There is also a simple proof involving just the distance of the iterates $x_k$ to the solution. Using the definition of the step, $x_{k+1} = x_k - \alpha\nabla f(x_k)$:

$$\begin{aligned} \left\|x_{k+1} - x^*\right\|^2 &= \left\|x_k - \alpha\nabla f(x_k) - x^*\right\|^2 \\ &= \left\|x_k - x^*\right\|^2 - 2\alpha\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle + \alpha^2\left\|\nabla f(x_k)\right\|^2. \end{aligned}$$

We now apply both the inner product bounds, (5) $\left\langle \nabla f(x) - \nabla f(y),\, x - y \right\rangle \geq \frac{1}{L}\left\|\nabla f(x) - \nabla f(y)\right\|^2$ and (6) $\left\langle \nabla f(x) - \nabla f(y),\, x - y \right\rangle \geq \mu\left\|x - y\right\|^2$, in the following negated forms, using $\nabla f(x^*) = 0$:

$$-\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle \leq -\frac{1}{L}\left\|\nabla f(x_k)\right\|^2,$$

$$-\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle \leq -\mu\left\|x_k - x^*\right\|^2.$$

The inner product term has a weight $2\alpha$, and we apply each of these with weight $\alpha$, giving:

$$\left\|x_{k+1} - x^*\right\|^2 \leq (1 - \alpha\mu)\left\|x_k - x^*\right\|^2 + \alpha\left(\alpha - \frac{1}{L}\right)\left\|\nabla f(x_k)\right\|^2.$$

Now if we take $\alpha = \frac{1}{L}$, then the last term cancels and we have:

$$\left\|x_{k+1} - x^*\right\|^2 \leq \left(1 - \frac{\mu}{L}\right)\left\|x_k - x^*\right\|^2.$$

This proof is not as tight as possible. Instead of splitting the inner product term and applying both bounds (5) and (6), we can apply the following stronger combined bound from Nesterov’s Book [7]:

$$\left\langle \nabla f(x) - \nabla f(y),\, x - y \right\rangle \geq \frac{\mu L}{\mu + L}\left\|x - y\right\|^2 + \frac{1}{\mu + L}\left\|\nabla f(x) - \nabla f(y)\right\|^2. \tag{7}$$

Doing so yields:

$$\left\|x_{k+1} - x^*\right\|^2 \leq \left(1 - \frac{2\alpha\mu L}{\mu + L}\right)\left\|x_k - x^*\right\|^2 + \alpha\left(\alpha - \frac{2}{\mu + L}\right)\left\|\nabla f(x_k)\right\|^2.$$

Now clearly to cancel out the gradient norm term we can take $\alpha = \frac{2}{\mu + L}$, which yields the convergence rate:

$$\left\|x_{k+1} - x^*\right\|^2 \leq \left(1 - \frac{4\mu L}{(\mu + L)^2}\right)\left\|x_k - x^*\right\|^2 \approx \left(1 - \frac{4\mu}{L}\right)\left\|x_k - x^*\right\|^2,$$

where the approximation holds in the ill-conditioned regime $L \gg \mu$.
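A sketch comparing the two step sizes empirically on an illustrative quadratic; the `contraction` helper is a hypothetical name of mine that just records the worst per-step squared-distance ratio observed:

```python
import numpy as np

A = np.diag([1.0, 10.0])           # mu = 1, L = 10, x* = 0
mu, L = 1.0, 10.0

def contraction(alpha, steps=50):
    """Worst observed ratio ||x_{k+1} - x*||^2 / ||x_k - x*||^2."""
    x, worst = np.array([3.0, -2.0]), 0.0
    for _ in range(steps):
        x_next = x - alpha * (A @ x)
        worst = max(worst, np.sum(x_next ** 2) / np.sum(x ** 2))
        x = x_next
    return worst

print(contraction(1.0 / L), "vs bound", 1 - mu / L)
print(contraction(2.0 / (mu + L)), "vs bound", 1 - 4 * mu * L / (mu + L) ** 2)
```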

Comments

This proof technique is the building block of the standard stochastic gradient descent (SGD) proof. The above proof is mostly based on Nesterov’s book; I’m not sure what the original citation is. It has a nice geometric interpretation, as the bound on the inner product term $\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle$ can easily be illustrated in 2 dimensions, say on a whiteboard. It’s effectively a statement on the angles that gradients in convex problems can take. To get the strongest bound using this technique, the complex bound in equation (7) has to be used. That stronger bound is not really straightforward, and perhaps too technical (in my opinion) to use in a textbook proof of the convergence rate.

3 Using the Second Fundamental Theorem of Calculus

Recall the second fundamental theorem of calculus:

$$f(y) = f(x) + \int_x^y f'(z)\, dz.$$

This can be applied along intervals in higher dimensions. The case we care about is applying it to the first derivatives of f, giving an integral involving the Hessian:

$$\nabla f(y) = \nabla f(x) + \int_0^1 \left\langle \nabla^2 f\left(x + \tau(y - x)\right),\, y - x \right\rangle d\tau.$$

We abuse the angle bracket notation here to apply to matrix-vector products as well as the usual dot-product. Using this result gives an interesting proof of convergence of gradient descent that doesn’t rely on the usual convexity lemmas. This proof bounds the distance to solution, just like the previous proof.
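Before using it, here is a quick numerical check of this identity, sketched on a smooth strongly convex test function of my own choosing, with the integral approximated by a midpoint rule:

```python
import numpy as np

# Test function: f(x) = 0.5 * ||x||^2 + log(1 + exp(w @ x))
w = np.array([1.0, -2.0])
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))      # logistic function
grad = lambda x: x + w * sigma(w @ x)           # gradient of f
hess = lambda x: np.eye(2) + np.outer(w, w) * (
    sigma(w @ x) * (1.0 - sigma(w @ x)))        # Hessian of f

x = np.array([0.3, -0.1])
y = np.array([-1.0, 0.7])

# Midpoint-rule approximation of the Hessian integral along [x, y].
n = 5000
taus = (np.arange(n) + 0.5) / n
integral = sum(hess(x + t * (y - x)) @ (y - x) for t in taus) / n

print(grad(y) - grad(x))   # these two should agree to quadrature accuracy
print(integral)
```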

Lemma 2. For any positive t:

$$\left\|x - y + t\left(\nabla f(y) - \nabla f(x)\right)\right\| \leq \max\left\{\left|1 - tL\right|,\, \left|1 - t\mu\right|\right\}\left\|x - y\right\|.$$

Proof. We start by applying the second fundamental theorem of calculus in the above form:

$$\begin{aligned} \left\|x - y + t\left(\nabla f(y) - \nabla f(x)\right)\right\| &= \left\|x - y + t\int_0^1 \left\langle \nabla^2 f\left(x + \tau(y - x)\right),\, y - x \right\rangle d\tau\right\| \\ &= \left\|\int_0^1 \left\langle t\nabla^2 f\left(x + \tau(y - x)\right) - I,\, y - x \right\rangle d\tau\right\| \\ &\leq \int_0^1 \left\|\left\langle t\nabla^2 f\left(x + \tau(y - x)\right) - I,\, y - x \right\rangle\right\| d\tau \\ &\leq \int_0^1 \left\|t\nabla^2 f\left(x + \tau(y - x)\right) - I\right\|\left\|x - y\right\| d\tau \\ &\leq \max_z \left\|t\nabla^2 f(z) - I\right\|\left\|x - y\right\|. \end{aligned}$$

Now we examine the eigenvalues of $\nabla^2 f(z)$: the minimum one is at least $\mu$ and the maximum at most $L$. An examination of the possible range of the eigenvalues of $t\nabla^2 f(z) - I$ gives $\max\left\{\left|1 - tL\right|,\, \left|1 - t\mu\right|\right\}$. □

Using this lemma gives a simple proof along the lines of the iterate descent proof.

First, note that xk+1 x is in the right form for direct application of this lemma after substituting in the step equation:

$$\left\|x_{k+1} - x^*\right\| = \left\|x_k - x^* + \alpha\left(\nabla f(x^*) - \nabla f(x_k)\right)\right\| \leq \max\left\{\left|1 - \alpha L\right|,\, \left|1 - \alpha\mu\right|\right\}\left\|x_k - x^*\right\|.$$

Note we introduced $\nabla f(x^*)$ for “free”, as it’s of course equal to zero. The next step is to optimize this bound in terms of $\alpha$. Note that $L$ is always larger than $\mu$, so we take the argument of the absolute value $\left|1 - \alpha L\right|$ as negative and the other as positive, and match their magnitudes:

$$-1 + \alpha L = 1 - \alpha\mu,$$

$$\alpha(L + \mu) = 2,$$

$$\therefore \alpha = \frac{2}{L + \mu}.$$

This gives the convergence rate:

$$\left\|x_{k+1} - x^*\right\| \leq \frac{L - \mu}{L + \mu}\left\|x_k - x^*\right\|.$$

Note that this rate is in terms of the distance to solution directly, rather than its square like in the previous proof. Converting to squared norm gives the same rate as before.
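The contraction factor $\max\left\{\left|1 - \alpha L\right|,\, \left|1 - \alpha\mu\right|\right\}$ is simple enough to tabulate. A sketch over a few illustrative step sizes, with $\mu$ and $L$ chosen arbitrarily:

```python
import numpy as np

mu, L = 1.0, 10.0                  # illustrative constants

for alpha in np.linspace(0.02, 0.2, 10):
    rate = max(abs(1 - alpha * L), abs(1 - alpha * mu))
    print(f"alpha = {alpha:.3f}  factor = {rate:.3f}")

# The minimum is attained at alpha = 2/(L + mu), giving (L - mu)/(L + mu).
alpha = 2.0 / (L + mu)
print(f"optimal alpha = {alpha:.4f}  factor = {(L - mu) / (L + mu):.4f}")
```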

Comments

This proof technique has a linear-algebra feel to it, and is perhaps most comfortable to people with that background. The absolute values make it ugly in my opinion though. This proof technique is the building block used in the standard proof of the convergence of the heavy ball method for strongly convex problems [10]. It doesn’t appear to have many other applications, and so is probably the least seen of the techniques in this document. The main use of this kind of argument is in lower complexity bounds, where we often do some sort of eigenvalue analysis.

4 Lyapunov Style

The above results prove convergence of either the iterates or the function value separately. There is an interesting proof involving the sum of the two quantities. First we start with the iterate convergence:

$$\begin{aligned} \left\|x_{k+1} - x^*\right\|^2 &= \left\|x_k - x^* - \alpha\nabla f(x_k)\right\|^2 \\ &= \left\|x_k - x^*\right\|^2 - 2\alpha\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle + \alpha^2\left\|\nabla f(x_k)\right\|^2. \end{aligned}$$

Now we use the function value descent equation (Lemma 1) to bound the gradient norm term: $\frac{1}{c}\left\|\nabla f(x_k)\right\|^2 \leq f(x_k) - f(x_{k+1})$, where we have defined $c = \frac{1}{\alpha\left(1 - \frac{1}{2}\alpha L\right)}$:

$$\left\|x_{k+1} - x^*\right\|^2 \leq \left\|x_k - x^*\right\|^2 + c\alpha^2\left(f(x_k) - f(x_{k+1})\right) - 2\alpha\left\langle \nabla f(x_k),\, x_k - x^* \right\rangle.$$

Now we use the strong convexity lower bound (2) in a rearranged form:

$$\left\langle \nabla f(x_k),\, x^* - x_k \right\rangle \leq f(x^*) - f(x_k) - \frac{\mu}{2}\left\|x_k - x^*\right\|^2,$$

to simplify:

$$\left\|x_{k+1} - x^*\right\|^2 \leq (1 - \alpha\mu)\left\|x_k - x^*\right\|^2 + c\alpha^2\left(f(x_k) - f(x_{k+1})\right) + 2\alpha\left(f(x^*) - f(x_k)\right).$$

Now rearranging further:

$$\left\|x_{k+1} - x^*\right\|^2 + c\alpha^2\left(f(x_{k+1}) - f(x^*)\right) \leq (1 - \alpha\mu)\left\|x_k - x^*\right\|^2 + \left(c\alpha^2 - 2\alpha\right)\left(f(x_k) - f(x^*)\right).$$

Now this equation gives a descent rate for the weighted sum of $\left\|x_k - x^*\right\|^2$ and $f(x_k) - f(x^*)$. The best rate is given by matching the two convergence rates: that of the iterate distance term,

$$1 - \alpha\mu,$$

and that of the function value term, whose coefficient changes from $c\alpha^2$ to $c\alpha^2 - 2\alpha$:

$$\frac{c\alpha^2 - 2\alpha}{c\alpha^2} = 1 - \frac{2}{c\alpha} = 1 - 2\left(1 - \frac{1}{2}\alpha L\right) = \alpha L - 1.$$

Matching these two rates:

$$1 - \alpha\mu = \alpha L - 1,$$

$$2 = \alpha(\mu + L),$$

$$\therefore \alpha = \frac{2}{\mu + L}.$$

Using this derived value for $\alpha$ gives a convergence rate of $1 - \frac{2\mu}{\mu + L}$, i.e.

$$\left\|x_{k+1} - x^*\right\|^2 + c\alpha^2\left(f(x_{k+1}) - f(x^*)\right) \leq \left(1 - \frac{2\mu}{\mu + L}\right)\left[\left\|x_k - x^*\right\|^2 + c\alpha^2\left(f(x_k) - f(x^*)\right)\right],$$

and therefore after k steps:

$$\left\|x_k - x^*\right\|^2 + c\alpha^2\left(f(x_k) - f(x^*)\right) \leq \left(1 - \frac{2\mu}{\mu + L}\right)^k\left[\left\|x_0 - x^*\right\|^2 + c\alpha^2\left(f(x_0) - f(x^*)\right)\right].$$

The constant can be simplified:

$$c\alpha^2 = \frac{\alpha^2}{\alpha\left(1 - \frac{1}{2}\alpha L\right)} = \frac{\alpha}{1 - \frac{1}{2}\alpha L} = \frac{\alpha}{1 - \frac{L}{\mu + L}} = \frac{\alpha(\mu + L)}{\mu} = \frac{2}{\mu}.$$

Now we use $f(x_0) - f(x^*) \leq \frac{L}{2}\left\|x_0 - x^*\right\|^2$ (the Lipschitz upper bound (1) about $x^*$) on the right, and we just drop the function value term altogether on the left:

$$\left\|x_k - x^*\right\|^2 \leq \left(1 - \frac{2\mu}{\mu + L}\right)^k \frac{\mu + L}{\mu}\left\|x_0 - x^*\right\|^2.$$

If we instead use the more robust step size $\alpha = \frac{1}{L}$, which doesn’t require knowledge of $\mu$, then a simple calculation shows that we instead get $c = 2L$, and so $c\alpha^2 = \frac{2}{L}$:

$$\left\|x_k - x^*\right\|^2 \leq \left(1 - \frac{\mu}{L}\right)^k\left[\left\|x_0 - x^*\right\|^2 + \frac{2}{L}\left(f(x_0) - f(x^*)\right)\right] \leq \left(1 - \frac{\mu}{L}\right)^k 2\left\|x_0 - x^*\right\|^2.$$

The constant on the right hand side is obviously much tighter than when $\alpha = \frac{2}{\mu + L}$ is used, but the geometric rate is roughly twice as slow.
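A sketch checking the Lyapunov descent numerically for $\alpha = \frac{2}{\mu + L}$, where $c\alpha^2 = \frac{2}{\mu}$, again on an arbitrary quadratic instance with $x^* = 0$:

```python
import numpy as np

A = np.diag([1.0, 10.0])               # mu = 1, L = 10, x* = 0
mu, L = 1.0, 10.0
alpha = 2.0 / (mu + L)
c_alpha2 = 2.0 / mu                    # c * alpha^2 at this step size
f = lambda x: 0.5 * x @ A @ x          # f(x*) = 0
rate = 1 - 2 * mu / (mu + L)

x = np.array([3.0, -2.0])
T = np.sum(x ** 2) + c_alpha2 * f(x)   # Lyapunov function T_k
for k in range(50):
    x = x - alpha * (A @ x)
    T_next = np.sum(x ** 2) + c_alpha2 * f(x)
    assert T_next <= rate * T + 1e-12  # per-step Lyapunov contraction
    T = T_next
print("Lyapunov contraction 1 - 2mu/(mu+L) holds on this instance")
```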

Comments

This proof technique has seen a lot of application lately. It is used for the SAGA [2] and SVRG [4] methods, and can even be applied to accelerated methods, such as in the accelerated coordinate descent theory [8]. The Lyapunov function analysis technique is of great general utility, and so it is worth studying carefully. It is covered perhaps best in Polyak’s book [10].

5 Gradient Norm Descent

In the strongly convex case, it is actually possible to show that the gradient norm decreases at least linearly, as well as the function value and iterates. This requires a fixed step size of $\alpha = \frac{1}{L}$, as it is not true when line searches are used.

Lemma 3. For $\alpha = \frac{1}{L}$:

$$\left\|x_{k+2} - x_{k+1}\right\|^2 \leq \left(1 - \frac{\mu}{L}\right)\left\|x_{k+1} - x_k\right\|^2.$$

Note that $\left\|x_{k+2} - x_{k+1}\right\|^2 = \frac{1}{L^2}\left\|\nabla f(x_{k+1})\right\|^2$ and $\left\|x_{k+1} - x_k\right\|^2 = \frac{1}{L^2}\left\|\nabla f(x_k)\right\|^2$.

Proof. We start by expanding in terms of the step equation $x_{k+1} = x_k - \alpha\nabla f(x_k)$:

$$\begin{aligned} \left\|x_{k+2} - x_{k+1}\right\|^2 &= \left\|x_{k+1} - \alpha\nabla f(x_{k+1}) - x_k + \alpha\nabla f(x_k)\right\|^2 \\ &= \left\|x_{k+1} - x_k\right\|^2 + \alpha^2\left\|\nabla f(x_{k+1}) - \nabla f(x_k)\right\|^2 + 2\alpha\left\langle \nabla f(x_k) - \nabla f(x_{k+1}),\, x_{k+1} - x_k \right\rangle. \end{aligned}$$

Now applying both inner product bounds (5) and (6):

$$\left\|x_{k+2} - x_{k+1}\right\|^2 \leq (1 - \alpha\mu)\left\|x_{k+1} - x_k\right\|^2 + \alpha\left(\alpha - \frac{1}{L}\right)\left\|\nabla f(x_{k+1}) - \nabla f(x_k)\right\|^2.$$

So for $\alpha = \frac{1}{L}$ this simplifies to:

$$\left\|x_{k+2} - x_{k+1}\right\|^2 \leq \left(1 - \frac{\mu}{L}\right)\left\|x_{k+1} - x_k\right\|^2. \qquad \square$$

Chaining this result (Lemma 3) over $k$ gives:

$$\left\|x_{k+1} - x_k\right\|^2 \leq \left(1 - \frac{\mu}{L}\right)^k\left\|x_1 - x_0\right\|^2 = \left(1 - \frac{\mu}{L}\right)^k \frac{1}{L^2}\left\|\nabla f(x_0)\right\|^2.$$

We now use $f(x_k) - f(x^*) \leq \frac{1}{2\mu}\left\|\nabla f(x_k)\right\|^2 = \frac{L^2}{2\mu}\left\|x_{k+1} - x_k\right\|^2$:

$$f(x_k) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)^k \frac{1}{2\mu}\left\|\nabla f(x_0)\right\|^2.$$
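A numerical sketch of the gradient norm contraction, on the usual illustrative quadratic:

```python
import numpy as np

A = np.diag([1.0, 10.0])      # mu = 1, L = 10
mu, L = 1.0, 10.0
alpha = 1.0 / L               # the fixed step size this proof requires
g = lambda x: A @ x           # gradient

x = np.array([3.0, -2.0])
for k in range(50):
    x_next = x - alpha * g(x)
    # ||grad f(x_{k+1})||^2 <= (1 - mu/L) * ||grad f(x_k)||^2
    assert np.sum(g(x_next) ** 2) <= (1 - mu / L) * np.sum(g(x) ** 2) + 1e-12
    x = x_next
print("gradient norm contraction (1 - mu/L) holds on this instance")
```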

Comments

This technique is probably the weirdest of those listed here. It has seen application in proving the convergence rate of MISO under some different stochastic orderings [1]. While clearly a primal result, this proof has some components normally seen in the proof for a dual method. The gradient $\nabla f(x_k)$ is effectively the dual iterate. Another interesting property is that the portion of the proof concerning the gradient’s convergence uses the strong convexity between $x_{k+1}$ and $x_k$, whereas the other proofs considered all use the degree of strong convexity between $x_k$ and $x^*$.

This proof technique can’t work when line searches are used, as bounding the inner product

$$\alpha\left\langle \nabla f(x_k) - \nabla f(x_{k+1}),\, x_{k+1} - x_k \right\rangle$$

would fail if $\alpha$ changed between steps, as it would become $\left\langle \alpha_k\nabla f(x_k) - \alpha_{k+1}\nabla f(x_{k+1}),\, x_{k+1} - x_k \right\rangle$, which is a weird expression to work with.

References

[1]    Aaron Defazio. New Optimization Methods for Machine Learning. PhD thesis, Australian National University, 2014.

[2]    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.

[3]    David R. Hunter and Kenneth Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9, 2000.

[4]    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS, 2013.

[5]    Qiang Liu and Alexander Ihler. Learning scale free networks by reweighted l1 regularization. AISTATS, 2011.

[6]    Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble Rhône-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

[7]    Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

[8]    Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical report, CORE, 2010.

[9]    Yu. Nesterov and B.T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[10]    Boris Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division., 1987.
