- Suppose we are training g to optimise an optimisation function f .
- And there’s something especially potent about learning learning algorithms, because better learning algorithms accelerate learning…
- Casting algorithm design as a learning problem allows us to specify the class of problems we are interested in through example problem instances.
- Each function in the system model could be learned or just implemented directly with some algorithm.

What if instead of hand designing an optimising algorithm (function) we learn it instead? That way, by training on the class of problems we’re interested in solving, we can learn an optimum optimiser for the class!


By Adrian Colyer, Venture Partner, Accel.

Learning to learn by gradient descent by gradient descent, Andrychowicz et al., NIPS 2016

One of the things that strikes me when I read these NIPS papers is just how short some of them are – between the introduction and the evaluation sections you might find only one or two pages! A general form is to start out with a basic mathematical model of the problem domain, expressed in terms of functions. Selected functions are then learned, by reaching into the machine learning toolbox and combining existing building blocks in potentially novel ways. When looked at this way, we could really call machine learning ‘function learning’.

Thinking in terms of functions like this is a bridge back to the familiar (for me at least). We have function composition. For example, given a function f mapping images to feature representations, and a function g acting as a classifier mapping image feature representations to objects, we can build a system that classifies objects in images with g ○ f.
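To make the composition concrete, here’s a minimal sketch in Python (the names `featurise` and `classify` are illustrative stand-ins, not from the paper):

```python
# Sketch of building a classifier as the composition g . f
# (featurise and classify are hypothetical stand-ins).

def featurise(image):
    # f: Image -> FeatureRepresentation (here, a trivial hand-coded encoding)
    return [sum(row) for row in image]

def classify(features):
    # g: FeatureRepresentation -> Class
    return "bright" if sum(features) > 10 else "dark"

def compose(g, f):
    # Returns the function g . f
    return lambda x: g(f(x))

classifier = compose(classify, featurise)
print(classifier([[5, 5], [5, 5]]))  # -> bright
```

Either function in the composition could later be swapped for a learned implementation without changing the overall shape of the system.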

Each function in the system model could be learned or just implemented directly with some algorithm. For example, feature mappings (or encodings) were traditionally implemented by hand, but increasingly are learned…

Part of the art seems to be to define the overall model in such a way that no individual function needs to do too much (avoiding too big a gap between the inputs and the target output) so that learning becomes more efficient / tractable, and we can take advantage of different techniques for each function as appropriate. In the above example, we composed one learned function for creating good representations, and another function for identifying objects from those representations.

We can have higher-order functions that combine existing (learned or otherwise) functions, and of course that means we can also use combinators.

And what do we find when we look at the components of a ‘function learner’ (machine learning system)? More functions!

The optimiser function maps an objective function f to argminθ∈Θ f(θ). The standard approach is to use some form of gradient descent (e.g., SGD – stochastic gradient descent). A classic paper in optimisation is ‘No Free Lunch Theorems for Optimization’, which tells us that no general-purpose optimisation algorithm can dominate all others. So to get the best performance, we need to match our optimisation technique to the characteristics of the problem at hand:

Thus there has been a lot of research in defining update rules tailored to different classes of problems – within deep learning these include for example momentum, Rprop, Adagrad, RMSprop, and ADAM.
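To make ‘update rule’ concrete, here is a hedged sketch of plain gradient descent plus two of those hand-designed rules (momentum and RMSprop) on a toy 1-D quadratic; the objective, step sizes, and decay constants are common illustrative defaults, not values from the paper:

```python
import math

def grad(theta):
    # Hand-derived gradient of the toy objective f(theta) = (theta - 3)^2.
    return 2 * (theta - 3)

def sgd(theta, lr=0.1, steps=200):
    # Vanilla gradient descent: theta <- theta - lr * grad.
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum(theta, lr=0.1, beta=0.9, steps=200):
    # Momentum: accumulate a velocity and step along it.
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - lr * v
    return theta

def rmsprop(theta, lr=0.1, decay=0.9, eps=1e-8, steps=200):
    # RMSprop: scale each step by a running average of squared gradients.
    # On this deterministic toy it oscillates in a small neighbourhood of
    # the minimiser rather than converging exactly.
    s = 0.0
    for _ in range(steps):
        g = grad(theta)
        s = decay * s + (1 - decay) * g * g
        theta = theta - lr * g / (math.sqrt(s) + eps)
    return theta

# All three drive theta towards the minimiser theta = 3 from theta = 0.
```

Each rule is a different hand-designed function from gradient history to parameter update – exactly the kind of function the paper proposes to learn instead.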

But what if instead of hand designing an optimising algorithm (function) we learn it instead? That way, by training on the class of problems we’re interested in solving, we can learn an optimum optimiser for the class!

If learned representations end up performing better than hand-designed ones, can learned optimisers end up performing better than hand-designed ones too? The answer turns out to be yes!

In fact not only do these learned optimisers perform very well, but they also provide an interesting way to transfer learning across problem sets. Traditionally transfer learning is a hard problem studied in its own right. But in this context, because we’re learning how to learn, straightforward generalization (the key property of ML that lets us learn on a training set and then perform well on previously unseen examples) provides for transfer learning!

Thinking functionally, here’s my mental model of what’s going on… In the beginning, you might have hand-coded a classifier function, c, which maps from some Input to a Class:

c :: Input -> Class

With machine learning, we figured out that for certain types of functions it’s better to learn an implementation than to try and code it by hand. An optimisation function f takes some TrainingData and an existing classifier function, and returns an updated classifier function:

f :: TrainingData -> (Input -> Class) -> (Input -> Class)

What we’re doing now is saying, “well, if we can learn a function, why don’t we learn f itself?”

Let g, parameterised by ϕ, be the (to be learned) update rule for our optimiser. We need to evaluate how effective g is over a number of iterations, and for this reason g is modelled using a recurrent neural network (LSTM). The state of this network at time t is represented by ht.
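Sketching the paper’s formulation, the learned update rule produces the parameter update for the optimisee from the gradient of f (with the LSTM state ht threaded through each step):

```latex
\theta_{t+1} = \theta_t + g_t\big(\nabla f(\theta_t),\ \phi\big)
```

Compare this with a hand-designed rule like SGD, where the update would instead be the fixed function −α∇f(θt).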

Suppose we are training g to optimise a function f. Let g(ϕ) result in a learned set of parameters θ*(f, ϕ) for f. The loss function for training g(ϕ) uses as its expected loss the expected loss of f as trained by g(ϕ).
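In symbols, the objective for training the optimiser can be sketched as follows, writing θ*(f, ϕ) for the parameters produced by running g(ϕ) on f, with the expectation taken over a distribution of functions f:

```latex
\mathcal{L}(\phi) = \mathbb{E}_f\!\left[\, f\big(\theta^*(f, \phi)\big) \,\right]
```

Minimising this in ϕ by gradient descent is what gives the paper its title: gradient descent (on ϕ) to learn a gradient-descent-like rule (for θ).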

To scale to tens of thousands of parameters or more, the optimiser network m operates coordinatewise on the parameters of the objective function, similar to update rules like RMSProp and ADAM. The update rule for each coordinate is implemented using a 2-layer LSTM network using a forget-gate architecture.
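Here is a hedged sketch of what operating coordinatewise means. The real optimiser is a two-layer LSTM; it is replaced below by a simple RMSprop-style rule purely to illustrate the per-coordinate structure (all names, constants, and the toy objective are illustrative):

```python
import math

def coordinatewise_step(theta, grads, state, lr=0.1, decay=0.9, eps=1e-8):
    # Apply the *same* update rule independently to every coordinate,
    # each coordinate carrying its own small state -- the structure the
    # paper's LSTM optimiser shares with RMSProp and ADAM.
    new_theta, new_state = [], []
    for t, g, s in zip(theta, grads, state):
        s = decay * s + (1 - decay) * g * g        # per-coordinate state update
        new_theta.append(t - lr * g / (math.sqrt(s) + eps))
        new_state.append(s)
    return new_theta, new_state

# Usage: minimise the toy objective sum_i (theta_i - target_i)^2.
target = [1.0, -2.0, 0.5]
theta = [0.0, 0.0, 0.0]
state = [0.0, 0.0, 0.0]
for _ in range(300):
    grads = [2 * (t - c) for t, c in zip(theta, target)]
    theta, state = coordinatewise_step(theta, grads, state)
```

Because one small network is shared across all coordinates, the number of optimiser parameters stays fixed no matter how large the optimisee grows.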

Optimisers were trained for 10-dimensional quadratic functions, for optimising a small neural network on MNIST, and on the CIFAR-10 dataset, and on learning optimisers for neural art (see e.g. Texture Networks).

Here’s a closer look at the performance of the trained LSTM optimiser on the Neural Art task vs standard optimisers.

And because they’re pretty… here are some images styled by the LSTM optimiser!

So there you have it. It seems that in the not-too-distant future, the state-of-the-art will involve the use of learned optimisers, just as it involves the use of learned feature representations today. This appears to be another crossover point where machines can design algorithms that outperform those of the best human designers. And of course, there’s something especially potent about learning learning algorithms, because better learning algorithms accelerate learning…

In this paper, the authors explored how to build a function g to optimise a function f, such that we can write:

f' = g(f, d)

where d is some training data.

When expressed this way, it also begs the obvious question: what if I write

g' = g(g, d)

or go one step further using the Y-combinator to find a fixed point:

Food for thought…

Bio: Adrian Colyer was CTO of SpringSource, then CTO for Apps at VMware and subsequently Pivotal. He is now a Venture Partner at Accel Partners in London, working with early stage and startup companies across Europe. If you’re working on an interesting technology-related business he would love to hear from you: you can reach him at acolyer at accel dot com.

Original. Reposted with permission.