The previous post detailed a bunch of different ways of proving the convergence rate of gradient descent for strongly convex problems. This post considers the non-strongly convex, but still convex case.
Rehash of Basic Lemmas
These hold for any $x$ and $y$, with $L$ the Lipschitz smoothness constant. They are completely standard; see Nesterov's book [2] for proofs. We use the notation $x_*$ for an arbitrary minimizer of $f$.

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 \qquad (1)$$

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \qquad (2)$$

$$f\!\left(x - \tfrac{1}{L}\nabla f(x)\right) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2 \qquad (3)$$
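As a quick numerical sanity check of (1)–(3), the following Python sketch evaluates all three inequalities at random points for a least-squares objective. The test problem (matrix, seed, sizes) is made up purely for illustration; any $L$-smooth convex function would do.

```python
# Numerical sanity check of lemmas (1)-(3) on a least-squares objective
# f(x) = 0.5 * ||Ax - b||^2, which is convex and L-smooth with
# L = largest eigenvalue of A^T A. The test problem is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A).max()

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # (1) Lipschitz upper bound
    assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2) + 1e-8
    # (2) convexity lower bound
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-8
    # (3) guaranteed descent from a single step of size 1/L
    assert f(x - grad(x) / L) <= f(x) - np.sum(grad(x) ** 2) / (2 * L) + 1e-8
print("lemmas (1)-(3) held at all sampled points")
```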
1 Proximal Style Convergence Proof
The following argument gives a proof of convergence that is well suited to modification for proving the convergence of proximal gradient methods. We start by proving a useful lemma: for a gradient step $x^+ = x - \frac{1}{L}\nabla f(x)$ and any point $y$,

$$f(x^+) \le f(y) + \frac{L}{2}\|x - y\|^2 - \frac{L}{2}\|x^+ - y\|^2.$$
Proof. We start with the Lipschitz upper bound (1) around $x$, evaluated at $x^+$:

$$f(x^+) \le f(x) + \langle \nabla f(x), x^+ - x \rangle + \frac{L}{2}\|x^+ - x\|^2.$$

Now we bound $f(x)$ using the convexity lower bound (2) of $f$ around $x$ evaluated at $y$ (i.e. $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$). Negating and rearranging gives $f(x) \le f(y) - \langle \nabla f(x), y - x \rangle$, so:

$$f(x^+) \le f(y) + \langle \nabla f(x), x^+ - y \rangle + \frac{L}{2}\|x^+ - x\|^2.$$

Now we replace $\nabla f(x)$ using $\nabla f(x) = L(x - x^+)$:

$$f(x^+) \le f(y) + L\langle x - x^+, x^+ - y \rangle + \frac{L}{2}\|x^+ - x\|^2.$$

Now we complete the square, using the identity $\langle x - x^+, x^+ - y \rangle + \frac{1}{2}\|x^+ - x\|^2 = \frac{1}{2}\|x - y\|^2 - \frac{1}{2}\|x^+ - y\|^2$. So we have:

$$f(x^+) \le f(y) + \frac{L}{2}\|x - y\|^2 - \frac{L}{2}\|x^+ - y\|^2. \qquad \square$$
Using this lemma, the proof is quite simple. We apply it with $x = x_k$, $x^+ = x_{k+1}$ and $y = x_*$:

$$\frac{L}{2}\|x_k - x_*\|^2 - \frac{L}{2}\|x_{k+1} - x_*\|^2 \ge f(x_{k+1}) - f(x_*).$$

Now we sum this between $k = 0$ and $k = n-1$. The left hand side telescopes:

$$\frac{L}{2}\|x_0 - x_*\|^2 - \frac{L}{2}\|x_n - x_*\|^2 \ge \sum_{k=0}^{n-1}\left[ f(x_{k+1}) - f(x_*) \right].$$

Now we use the fact that gradient descent is a descent method, which implies that $f(x_n) \le f(x_{k+1})$ for all $k < n$. So:

$$\frac{L}{2}\|x_0 - x_*\|^2 - \frac{L}{2}\|x_n - x_*\|^2 \ge n\left( f(x_n) - f(x_*) \right).$$

Now we just drop the $\frac{L}{2}\|x_n - x_*\|^2$ term, since it is positive and small, leaving:

$$f(x_n) - f(x_*) \le \frac{L\|x_0 - x_*\|^2}{2n}.$$
Comments
As far as I know, this proof is fairly modern [1]. Notice that unlike the strongly convex case, the quantity we are bounding does not appear on both sides of the bound. Unfortunately, without strong convexity there is necessarily a looseness to the bounds, and this takes the form of bounding the function value suboptimality by the distance to the solution, with a large wiggle-factor. One thing that is perhaps a little confusing is the use of the distance to solution $\|x_0 - x_*\|$ when the solution is not unique, as there are potentially multiple minimizers for non-strongly convex problems. The bound in fact holds for any chosen minimizer $x_*$. I found this a little confusing at first.
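To see the final bound in action, here is a small Python sketch that runs gradient descent with step size $1/L$ on a rank-deficient least-squares problem (convex but not strongly convex, with many minimizers) and checks $f(x_n) - f(x_*) \le \frac{L\|x_0 - x_*\|^2}{2n}$ against one particular minimizer. The problem, sizes and seed are an arbitrary illustration.

```python
# Check the O(1/k) bound from the proximal-style proof on a
# rank-deficient least-squares problem (not strongly convex).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 30))        # wide A => many minimizers
b = rng.standard_normal(10)
L = np.linalg.eigvalsh(A @ A.T).max()    # = largest eigenvalue of A^T A

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x0 = np.zeros(30)
x_star = np.linalg.pinv(A) @ b           # one particular minimizer
f_star = f(x_star)
r0_sq = np.sum((x0 - x_star) ** 2)

x = x0.copy()
for k in range(1, 201):
    x = x - grad(x) / L                  # gradient descent with step 1/L
    assert f(x) - f_star <= L * r0_sq / (2 * k) + 1e-9
print("f(x_k) - f_* stayed below L||x_0 - x_*||^2 / (2k) for 200 steps")
```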
2 Older Style Proof
This proof is from Nesterov [2]. I’m not sure of the original source for it.
We start with the function value descent equation (3), using the step size $\gamma = 1/L$:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.$$

We introduce the simplified notation $\Delta_k = f(x_k) - f(x_*)$, so that we have

$$\Delta_{k+1} \le \Delta_k - \frac{1}{2L}\|\nabla f(x_k)\|^2. \qquad (4)$$

Now using the convexity lower bound around $x_k$ evaluated at $x_*$, namely:

$$f(x_*) \ge f(x_k) + \langle \nabla f(x_k), x_* - x_k \rangle,$$

and applying Cauchy-Schwarz (note the spelling! there is no “t” in Schwarz) to it:

$$\Delta_k \le \langle \nabla f(x_k), x_k - x_* \rangle \le \|\nabla f(x_k)\| \, \|x_k - x_*\| \le \|\nabla f(x_k)\| \, \|x_0 - x_*\|.$$

The last line is because the gradient descent method descends in iterate distance each step, so $\|x_k - x_*\| \le \|x_0 - x_*\|$. We now introduce the additional notation $r_0 = \|x_0 - x_*\|$. Using this notation and rearranging gives:

$$\|\nabla f(x_k)\| \ge \frac{\Delta_k}{r_0}.$$

We plug this into the function descent equation (Eq 4) above to get:

$$\Delta_{k+1} \le \Delta_k - \frac{1}{2Lr_0^2}\Delta_k^2.$$
We now divide this through by $\Delta_{k+1}$:

$$1 \le \frac{\Delta_k}{\Delta_{k+1}} - \frac{\Delta_k^2}{2Lr_0^2\,\Delta_{k+1}}.$$

Then divide through by $\Delta_k$ also:

$$\frac{1}{\Delta_k} \le \frac{1}{\Delta_{k+1}} - \frac{\Delta_k}{2Lr_0^2\,\Delta_{k+1}}.$$

Now we use the fact that gradient descent is a descent method again, which implies that $\frac{\Delta_k}{\Delta_{k+1}} \ge 1$, so:

$$\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_k} + \frac{1}{2Lr_0^2}.$$

We then chain this inequality for each $k$:

$$\frac{1}{\Delta_k} \ge \frac{1}{\Delta_0} + \frac{k}{2Lr_0^2}.$$

To get the final convergence rate we invert both sides:

$$\Delta_k \le \frac{2Lr_0^2\,\Delta_0}{2Lr_0^2 + k\,\Delta_0}.$$
This is quite a complex expression. To simplify it further, we can get rid of the $\Delta_0$ terms on the right hand side using the Lipschitz upper bound (1) about $x_*$ (where $\nabla f(x_*) = 0$):

$$\Delta_0 = f(x_0) - f(x_*) \le \frac{L}{2}\|x_0 - x_*\|^2 = \frac{L}{2}r_0^2.$$

Since the right hand side above is increasing in $\Delta_0$, plugging this in yields the following simpler convergence rate:

$$f(x_k) - f(x_*) \le \frac{2L\|x_0 - x_*\|^2}{k+4}.$$

Compared to the rate from the previous proof, $\frac{L\|x_0 - x_*\|^2}{2k}$, this is slightly better at $k=1$, and worse thereafter.
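The crossover between the two bounds is easy to verify by comparing the constants in front of $L\|x_0 - x_*\|^2$; a quick arithmetic check:

```python
# Compare the two constants in front of L ||x_0 - x_*||^2:
# 1/(2k) from the proximal-style proof versus 2/(k+4) from Nesterov's proof.
for k in range(1, 8):
    prox, nesterov = 1 / (2 * k), 2 / (k + 4)
    better = "Nesterov" if nesterov < prox else "proximal-style"
    print(f"k={k}: 1/(2k)={prox:.3f}  2/(k+4)={nesterov:.3f}  tighter: {better}")
# Nesterov's bound is tighter only at k=1; the proximal-style
# bound is tighter for every k >= 2.
```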
Comments
I don’t like this proof. It feels like a random sequence of steps when you first look at it. The way the proof uses inverse quantities like $1/\Delta_k$ is also confusing. The key equation is really the direct bound on $\Delta_{k+1}$:

$$\Delta_{k+1} \le \Delta_k - \frac{1}{2Lr_0^2}\Delta_k^2.$$
This is often the kind of equation you encounter when proving the properties of dual methods, for example. Equations of this kind can also be encountered when applying proximal methods to non-differentiable functions. It is also quite a clear statement about what is going on in terms of per-step convergence, a property that is less clear in the previous proof.
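Since this per-step recursion is where all the content of the proof lives, it is worth checking numerically. The sketch below uses the same kind of illustrative rank-deficient least-squares setup as before (an arbitrary choice, not from the argument above) and verifies $\Delta_{k+1} \le \Delta_k - \frac{1}{2Lr_0^2}\Delta_k^2$ along a gradient descent run:

```python
# Sanity check of the key per-step inequality
#   Delta_{k+1} <= Delta_k - Delta_k^2 / (2 L r_0^2)
# along a gradient descent run on an illustrative least-squares problem.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 30))
b = rng.standard_normal(10)
L = np.linalg.eigvalsh(A @ A.T).max()
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(30)
x_star = np.linalg.pinv(A) @ b           # one particular minimizer
f_star = f(x_star)
r0_sq = np.sum((x - x_star) ** 2)

delta = f(x) - f_star
for _ in range(100):
    x = x - grad(x) / L                  # gradient descent with step 1/L
    delta_next = f(x) - f_star
    assert delta_next <= delta - delta ** 2 / (2 * L * r0_sq) + 1e-9
    delta = delta_next
print("the per-step recursion held on every iteration")
```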
3 Gradient Concentration
When we don’t even have convexity, just Lipschitz smoothness, we can still prove something about convergence of the gradient norm. The Lipschitz upper bound (1) holds without the requirement of convexity:

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2.$$

Recall that by minimizing this bound with respect to $y$ we can prove equation (3), which for a gradient step of size $1/L$ reads:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.$$

Examine this equation carefully. We have a bound on each gradient encountered during the optimization in terms of the difference in function values between steps. The sequence of function values is bounded below, so in fact we have a hard bound on the sum of the encountered gradient norms. Effectively, we chain (telescope) the above inequality over $k$ steps:

$$\frac{1}{2L}\sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2 \le f(x_0) - f(x_k).$$

Now since $f(x_k) \ge f(x_*)$:

$$\sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2 \le 2L\left( f(x_0) - f(x_*) \right).$$

Now to make this bound a little more concrete, we can put it in terms of the gradient with the smallest norm seen during the minimization, so that $k \min_{i<k}\|\nabla f(x_i)\|^2 \le \sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2$, giving:

$$\min_{i<k}\|\nabla f(x_i)\|^2 \le \frac{2L\left( f(x_0) - f(x_*) \right)}{k}.$$
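The same kind of numerical check works here. The sketch below uses an arbitrary illustrative non-convex function of my own choosing, $f(x) = \sum_i \sin(x_i)$ (which is $1$-smooth and bounded below by $-d$), runs gradient descent with step $1/L$, and confirms the bound with $f(x_*)$ replaced by that lower bound:

```python
# Gradient-norm check in the non-convex case, on the toy function
# f(x) = sum_i sin(x_i): 1-smooth (|f''| <= 1) and bounded below by -d.
import numpy as np

d, L = 50, 1.0
f = lambda x: np.sum(np.sin(x))
grad = lambda x: np.cos(x)

rng = np.random.default_rng(3)
x = rng.uniform(-np.pi, np.pi, size=d)
f0, f_lower = f(x), -float(d)

best_sq_norm = np.inf
for k in range(1, 1001):
    g = grad(x)
    best_sq_norm = min(best_sq_norm, np.sum(g ** 2))
    x = x - g / L                        # gradient descent with step 1/L
    # smallest squared gradient norm among the first k iterates
    assert best_sq_norm <= 2 * L * (f0 - f_lower) / k + 1e-9
print("min_i ||grad f(x_i)||^2 <= 2L(f(x_0) - f_*)/k held for 1000 steps")
```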
Comments
Notice that the core technique used in this proof is the same as in the last two proofs. We have a single-step inequality bounding one of the quantities we care about. By summing that inequality over each step of the minimization, one side of the inequality telescopes. We get an inequality saying that the sum of the versions of that quantity (one from each step) is less than some fixed constant independent of $k$, for any $k$. The convergence rate is thus of the form $O(1/k)$, because the summation of $k$ such quantities fits within a fixed bound.
Almost any proof for an optimization method that applies in the non-convex case uses a similar proof technique. There are just not that many assumptions to work with, so the options are limited.
References
Hi Aaron, nice post. Do you know any case where the inequality of lemma 1 is sharp? I mean an example of a convex and L-smooth function f and a suitably chosen y, such that we have equality in the place of the inequality.
Essentially all of the inequalities are sharp for 1D quadratics, or higher dimensional quadratics along the eigen-direction corresponding to the largest eigenvalue for the L formulas and the smallest for the mu formulas.
Hi Aaron,
Could you provide me with a reference to the gradient concentration technique. Any paper where such a technique has been used for non-convex case. You may email me at the address.
Nesterov’s book contains the proof. “Introductory Lectures On Convex Programming”.