Variance Reduction
Stop Wasting My Gradients: Practical SVRG Another great practical paper from Mark Schmidt and the rest of his group. The techniques they describe help speed up SVRG, although you're still probably better off using SAGA.
On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants Great typesetting and plots. They show a good speedup of SVRG on a multicore machine. Would be good to have async-SAGA plots though. The speedup varies from 0.3 to 1 per additional core.
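Since both papers above build on SVRG, here is a minimal sketch of the plain SVRG update for context (not the sped-up variants from the papers; grad_i and full_grad are placeholders for the per-example and full-batch gradients of your own problem):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, step=0.01, epochs=10, m=None):
    """Minimal plain-SVRG sketch. grad_i(w, i) is the gradient of the i-th
    term, full_grad(w) the average gradient over all n terms."""
    m = m if m is not None else 2 * n   # inner-loop length, a common heuristic
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        mu = full_grad(w_snap)          # full gradient at the snapshot point
        w = w_snap.copy()
        for _ in range(m):
            i = np.random.randint(n)
            # variance-reduced gradient estimate
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w -= step * g
        w_snap = w                      # take the last inner iterate as the new snapshot
    return w_snap
```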
Local Smoothness in Variance Reduced Optimization Essentially theory and experiments for adaptive importance sampling for SVRG and SDCA. Looks good. It's hard to say how robust it is or how easy it will be to implement.
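The paper's contribution is the adaptive, local smoothness estimates, but the simpler fixed-Lipschitz version of importance sampling gives the flavour. A sketch, where L is my placeholder for per-component smoothness constants:

```python
import numpy as np

def lipschitz_sampling(L):
    """Importance sampling distribution p_i proportional to L_i over the n
    component functions, plus the 1/(n p_i) reweighting that keeps the
    stochastic gradient estimate unbiased."""
    L = np.asarray(L, dtype=float)
    p = L / L.sum()
    weights = 1.0 / (len(L) * p)
    return p, weights

# Inside an SVRG-style inner loop (sketch):
#   i = np.random.choice(n, p=p)
#   g = weights[i] * (grad_i(w, i) - grad_i(w_snap, i)) + mu
```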
Variance Reduced Stochastic Gradient Descent with Neighbors Good theory, with some improvements over the basic SAGA analysis. The N-SAGA method looks like it would be hard to implement efficiently though.
Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling An SDCA variant with a more conservative step and correspondingly improved theory. They give a proof that applies to almost any sampling scheme, which is nice. Could be faster than SDCA for sparse problems if implemented carefully.
A Universal Catalyst for First-Order Optimization Uses the accelerated proximal-point algorithm as a "meta-algorithm" to speed up any of the finite-sum methods. The double-loop construction they use introduces a logarithmic term in the convergence rate, which is unfortunate. Probably not ideal in practice due to the additional tuning required, but it's hard to say.
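Roughly, the outer loop repeatedly minimises f(x) + κ/2‖x − y‖² approximately and then extrapolates. A sketch of the strongly convex case with a constant extrapolation parameter; inner_solver is a placeholder for whatever finite-sum method you wrap, and the inner stopping criterion (which is where the log factor comes from) is glossed over:

```python
import numpy as np

def catalyst(inner_solver, x0, kappa, mu, outer_iters=50):
    """Sketch of a Catalyst-style outer loop for a mu-strongly-convex f.
    inner_solver(center, warm_start) should approximately minimise
    f(x) + kappa/2 * ||x - center||^2, e.g. by running SVRG/SAGA for a
    fixed budget starting from warm_start."""
    q = mu / (mu + kappa)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))   # Nesterov-style extrapolation
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    for _ in range(outer_iters):
        x = inner_solver(y, x_prev)              # approximate proximal-point step
        y = x + beta * (x - x_prev)              # extrapolate
        x_prev = x
    return x_prev
```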
Data science
Hidden Technical Debt in Machine Learning Systems This is based on a previous workshop paper. I'm not really impressed with the paper; it just doesn't go into enough detail to actually be useful. It's practically an extended abstract.
Precision-Recall-Gain Curves: PR Analysis Done Right I really like this paper. It correctly points out that PR curves are pretty bad, and suggests a better alternative. For the kind of problems I'm interested in, ROC curves are still the better choice though.
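For reference, the precision-gain/recall-gain transform as I understand it from the paper, where pi is the positive-class prevalence (this is just my reading, not code from the authors):

```python
def precision_recall_gain(precision, recall, pi):
    """Map a (precision, recall) point into PRG space. Assumes precision
    and recall are strictly positive; values are clipped to [0, 1], since
    anything below the always-positive baseline is not informative."""
    prec_gain = (precision - pi) / ((1 - pi) * precision)
    rec_gain = (recall - pi) / ((1 - pi) * recall)
    clip = lambda v: min(max(v, 0.0), 1.0)
    return clip(prec_gain), clip(rec_gain)

# e.g. precision_recall_gain(0.8, 0.5, pi=0.1) maps one PR-curve point into PRG space
```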
Distributed Optimization
Deep learning with Elastic Averaging SGD The basic idea here is classical: distribute an optimisation problem by splitting it into parts (one per machine), and enforce consistency between the machines using a quadratic penalty function. It's a "soft" penalty, in the sense that it doesn't enforce exact equality. It's interesting that they actually give each machine access to the whole dataset, whereas normally each machine is given a fixed subset of the data. I believe this means that on convex problems it will still converge to the exact solution, which is not true of a soft penalty otherwise. They also show how important a momentum-based update is. Good paper.
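A rough single-round sketch of the elastic-averaging update, without the momentum term they recommend (stoch_grad is a placeholder for a stochastic gradient of the full objective, since each machine sees the whole dataset):

```python
import numpy as np

def easgd_round(workers, center, stoch_grad, eta=0.01, rho=1.0):
    """One synchronous round of (non-momentum) elastic averaging SGD, sketched.
    workers: list of parameter vectors, one per machine;
    center:  the shared central variable;
    stoch_grad(x): a stochastic gradient of the objective at x."""
    alpha = eta * rho
    elastic = [w - center for w in workers]
    # local step: stochastic gradient plus an elastic pull toward the center
    new_workers = [w - eta * stoch_grad(w) - alpha * e
                   for w, e in zip(workers, elastic)]
    # the center is pulled toward the average of the workers
    new_center = center + alpha * sum(elastic)
    return new_workers, new_center
```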
Online/Stochastic Optimization
Probabilistic Line Searches for Stochastic Optimization This paper was also presented orally at the conference. They gave an excellent presentation. The basic idea is to use Bayesian optimization techniques for line searches. The details are non-trivial, and they have good plots to go along with it. I suspect, however, that their method is too slow in practice for most applications.
Beyond Convexity: Stochastic Quasi-Convex Optimization They give an interesting and non-obvious definition of local quasi-convexity that they use in their proof of convergence for the stochastic normalized gradient descent method. It's really a theory paper.
Online Gradient Boosting I'm no expert on boosting, but it looks interesting. I'll stick with offline boosting for solving practical problems for now.
Fast Rates for Exp-concave Empirical Risk Minimization Some good results here. The exp-concave subproblem of ERM is interesting; I'm a strong believer in narrowing down your problem class until you can prove something. They get an expected loss convergence rate like O(d/n) for d dimensions and n steps. This is better than the general ERM case, and better than past results. I'll probably read through the proof in detail at some point.
SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk The U-statistics terminology is new to me. They talk about direct minimisation of the AUC instead of the usual ERM losses, which is an approach I find interesting. I'm surprised there hasn't been more research on this, going by what they say in the paper. In the past I've never achieved better results from direct AUC minimisation compared to the logistic loss though.
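To make the idea concrete, here is a sketch (not the paper's algorithm verbatim) of SGD on a pairwise logistic surrogate for the AUC, where each step samples a small batch of positive/negative pairs rather than using all of them, which is the "incomplete U-statistic" part:

```python
import numpy as np

def pairwise_auc_sgd(X_pos, X_neg, dim, step=0.1, iters=1000, batch_pairs=32):
    """Sketch: minimise the logistic loss of w.(x_pos - x_neg) over sampled
    positive/negative pairs, a surrogate for maximising the AUC."""
    rng = np.random.default_rng(0)
    w = np.zeros(dim)
    for _ in range(iters):
        i = rng.integers(len(X_pos), size=batch_pairs)
        j = rng.integers(len(X_neg), size=batch_pairs)
        diffs = X_pos[i] - X_neg[j]
        margins = diffs @ w
        # gradient of the mean of log(1 + exp(-margin)) over the sampled pairs
        coeff = -1.0 / (1.0 + np.exp(margins))
        w -= step * (coeff[:, None] * diffs).mean(axis=0)
    return w
```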
Bandits
Policy Evaluation Using the Ω-Return Not really my area, so I can't comment in detail. Looks interesting though.
Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff Always nice to see improved bounds.
The Pareto Regret Frontier for Bandits
Frank-Wolfe
Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees The Frank-Wolfe method is popping up everywhere these days. Apparently it allows you to get convergence rates for Bayesian quadrature now. They appear to be applying it in function space, which is really cool. Nice visualisations as well.
A Universal Primal-Dual Convex Optimization Framework
On the Global Linear Convergence of Frank-Wolfe Optimization Variants Simon Lacoste-Julien is quickly becoming the Frank-Wolfe expert. I like the consideration of the structure of the constraint set through a notion similar to a condition number. Great diagrams as well.
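For context, plain Frank-Wolfe looks like the sketch below (the paper's linear rates need the away-step/pairwise variants, which I haven't sketched; lmo is a placeholder for the linear minimisation oracle over the constraint set):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=100):
    """Plain Frank-Wolfe, not the away-step variants the paper analyses.
    grad(x): gradient of the smooth objective;
    lmo(g):  returns the argmin over the constraint set of <g, s>."""
    x = np.array(x0, dtype=float)
    for k in range(iters):
        s = lmo(grad(x))              # call the linear oracle
        gamma = 2.0 / (k + 2.0)       # standard open-loop step size
        x = (1 - gamma) * x + gamma * s
    return x

# e.g. over the probability simplex, lmo(g) returns the basis vector e_i with i = argmin(g)
```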
Inference/Sampling
Smooth and Strong: MAP Inference with Linear Convergence I don't fully understand the novelty in this paper; it looks like an application of standard duality tricks for smoothing. If anybody has insight here, please let me know in the comments.
Reflection, Refraction, and Hamiltonian Monte Carlo A way of improving traditional HMC in some cases by better handling innate structure.
Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference
A Complete Recipe for Stochastic Gradient MCMC
Other Optimization
HONOR: Hybrid Optimization for NOn-convex Regularized problems Basically they switch between quasi-Newton and gradient steps depending on the local structure. It feels a little ad hoc. They also only compare against GIST, a method they also invented; I would like to see a more general comparison.
Convergence rates of sub-sampled Newton methods The approach they take to sub-sampled approximations is interesting. They handle noise in the Hessian estimate by setting the smaller eigenvalues not to zero but to the value of the kth largest eigenvalue, for some k. Noise should have a bigger relative effect on the smaller eigenvalues, so that looks reasonable. I'm not super convinced by the plots though; I would like to see results on a few other problems. Also, the fonts on the plots are too small.
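The eigenvalue trick is easy to sketch (my own illustration of the idea described above, not the authors' code):

```python
import numpy as np

def floor_small_eigenvalues(H_sub, k):
    """Take a Hessian estimate from a subsample, keep the top-k eigenvalues,
    and raise every smaller eigenvalue up to the k-th largest one
    (rather than zeroing them out)."""
    vals, vecs = np.linalg.eigh(H_sub)          # eigenvalues in ascending order
    floor = vals[-k]                            # the k-th largest eigenvalue
    vals_floored = np.maximum(vals, floor)
    return (vecs * vals_floored) @ vecs.T       # reassemble the approximation

# Newton-type step with the modified Hessian (sketch):
#   w -= np.linalg.solve(floor_small_eigenvalues(H_sub, k), grad)
```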
Theory
Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems Quite an interesting signal-recovery problem is solved here. It's related to compressed sensing, where you assume some kind of structure on the design matrix (i.e. entries sampled i.i.d. from a Gaussian or binomial distribution).
Newton-Stein Method: A Second Order Method for GLMs via Stein's Lemma An alternative Hessian approximation. They make use of sub-sampling in a similar way to the "Convergence rates of sub-sampled Newton methods" paper above, citing a pre-print of it. I like how they use estimates of the covariance matrix of the data to estimate the Hessian. The plot font is too small. They also use the wrong citation style for NIPS; not sure how that got through review.
Learning with Symmetric Label Noise: The Importance of Being Unhinged