Random Evolutionary Strategies in Machine Learning

Neural networks do not learn at all the way people do. Neural network training is essentially gradient descent over some loss function $E(\theta)$, where the variables are the weights of the layers, $\theta$. This is a very powerful approach to tuning a system, also used in physics, economics, and many other fields. By now, many specific flavors of gradient descent have been proposed, but they all assume that the gradient of $E(\theta)$ behaves well: there are no cliffs where it blows up abruptly, and no plateaus where it vanishes. The first problem can be dealt with by gradient clipping, but the second one makes you stop and think. It is not trivial to replace a piecewise-constant or discrete function with a better-behaved one.


What to do in such situations?

Below the cut: lots of formulas and GIFs.

About five years ago, articles began to appear that explored a different approach to learning: instead of smoothing the objective function, let's approximate its gradient in such cases. By sampling the values of $E(\theta)$ in a suitable way, one can guess which way to move even if the derivative at the current point is exactly zero. Not that sampling approaches are particularly new - Monte Carlo has been friends with neural networks for a long time - but interesting new results have been obtained recently.

This article is based on Ferenc Huszár's blog posts.

Random Evolution Strategy (ES)


Let's start with the simplest approach, which regained popularity after the OpenAI article in 2017. Note that "evolutionary strategies" is not a particularly apt name; it is a legacy of how this family of algorithms developed. ES is only very distantly related to genetic algorithms - all they share is a random perturbation of parameters. It is much more productive to think of ES as a method of estimating the gradient by sampling. We sample several perturbation vectors $\epsilon$ from the normal distribution $\mathbb{N}(0, \sigma)$, take the expectation of the resulting values of the objective function, and assume that

$\frac{\delta}{\delta\theta}E(\theta) \approx \frac{1}{\sigma^2} \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)} [\epsilon E(\theta + \epsilon) ] $


The idea is pretty simple. With standard gradient descent, at each step we look at the slope of the surface we are standing on and move in the direction of steepest descent. In ES, we "shoot" points into a nearby neighborhood of the current position and move towards the side where the points with the greatest difference in elevation landed (and the farther a point is, the more weight it gets).
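To make the recipe concrete, here is a minimal NumPy sketch of this estimator (the function `es_gradient` and the staircase example are mine, for illustration; this is not code from the OpenAI article):

```python
import numpy as np

def es_gradient(E, theta, sigma, n_samples, rng=None):
    """ES estimate of dE/dtheta: the average of eps * E(theta + eps), divided by sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    eps = rng.normal(0.0, sigma, size=(n_samples,) + theta.shape)
    values = np.array([E(theta + e) for e in eps])                 # one scalar per sample
    weights = values.reshape((n_samples,) + (1,) * theta.ndim)     # broadcast over theta's shape
    return (weights * eps).mean(axis=0) / sigma**2

# A staircase loss: its derivative is zero almost everywhere, yet the ES
# estimate is non-zero and points away from the flat region around theta = 0.
staircase = lambda t: np.floor(np.abs(t))
print(es_gradient(staircase, theta=2.3, sigma=0.5, n_samples=20_000))
```

In practice one usually also subtracts a baseline such as $E(\theta)$ from the sampled values and/or uses antithetic pairs $\pm\epsilon$; this does not change the expectation but reduces the variance of the estimate.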


The proof is also simple. Suppose that the loss function can be expanded into a Taylor series around $\theta$:

$ E(\theta + \epsilon) = E(\theta) + E'(\theta)\epsilon + \frac{E''(\theta)\epsilon^2}{2} + O(\epsilon^3)$


Multiply both sides by $\epsilon$ :

$ E(\theta + \epsilon)\epsilon = E(\theta)\epsilon + E'(\theta)\epsilon^2 + \frac{E''(\theta)\epsilon^3}{2} + O(\epsilon^4)$


Throw away $O(\epsilon^4)$ and take the expectation from both sides:

$ \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[E(\theta + \epsilon)\epsilon] = \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[E(\theta)\epsilon + E'(\theta)\epsilon^2 + \frac{E''(\theta)\epsilon^3}{2}]$


But the normal distribution is symmetrical: $\mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[\epsilon] = 0$ , $\mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[\epsilon^2] = \sigma^2$ , $\mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[\epsilon^3] = 0$ :

$ \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[E(\theta + \epsilon)\epsilon] = E'(\theta)\sigma^2$


Divide by $\sigma^2$ and get the original statement.
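As a quick numeric sanity check of this statement (a sketch with an arbitrary smooth test function and values of my own choosing): for $E(\theta) = \theta^2$ the estimate should agree with the analytic derivative $2\theta$ up to Monte Carlo noise.

```python
import numpy as np

# Check E[eps * E(theta + eps)] / sigma^2 ~ E'(theta) for the smooth function E(t) = t^2.
rng = np.random.default_rng(0)
theta, sigma = 1.5, 0.1
eps = rng.normal(0.0, sigma, 500_000)
estimate = np.mean(eps * (theta + eps) ** 2) / sigma**2
print(estimate, 2 * theta)   # both should be close to 3.0
```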

In the case of a piecewise step function, the resulting estimate of $\frac{\delta}{\delta\theta}E(\theta)$ is the gradient of a smoothed version of the function, obtained without ever having to evaluate that smoothed function at any particular point; the article gives more rigorous statements. Also, when the loss function depends on discrete parameters $x$ (i.e. $E(\theta) = \mathbb{E}_{x}E(\theta, x)$), it can be shown that the estimate remains valid, since in the proof we can swap the order of taking the expectations:

$ \mathbb{E}_{\epsilon}\epsilon E(\theta + \epsilon) = \mathbb{E}_{\epsilon}\epsilon \mathbb{E}_{x} E(\theta + \epsilon, x) = \mathbb{E}_{x} \mathbb{E}_{\epsilon} \epsilon E(\theta + \epsilon, x)$


Which is often impossible for regular SGD.

Let's look at a function smoothed by sampling from $\mathbb{N}(0, \sigma)$. Once again: in practice there is no point in actually computing the smoothed $E(\theta)$, we only need its derivatives. Still, such visualizations help to understand what landscape the gradient descent actually runs over. So, the green graph is the original function, and the blue one is what it looks like to the optimization algorithm once the gradient is estimated by sampling:

(Animations for $\sigma = 0.35$, $\sigma = 0.2$, and $\sigma = 0.1$.)
The larger $\sigma$ is, the less the local structure of the function shows through. If the smoothing is too strong, the optimization algorithm stops seeing narrow minima and the hollows along which it could move from one good state to another. If it is too weak, gradient descent may never get going if the initialization point was chosen poorly.
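For the curious, the blue curves can be reproduced by brute force; a minimal visualization-only sketch (the staircase stands in for the article's example function, and the plotting itself is omitted):

```python
import numpy as np

def smoothed(E, xs, sigma, n_samples=2_000, seed=0):
    """Gaussian-smoothed E: for each x, average E(x + eps) over eps ~ N(0, sigma)."""
    rng = np.random.default_rng(seed)
    return np.array([np.mean(E(x + rng.normal(0.0, sigma, n_samples))) for x in xs])

xs = np.linspace(-3.0, 3.0, 400)
blue_curve = smoothed(lambda x: np.floor(np.abs(x)), xs, sigma=0.35)   # plot against xs
```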

More graphs: gradient descent automatically finds the most stable minimum in the middle of a plateau and “tunnels” through a sharp peak.

The advantages of this approach compared to conventional gradient descent:

  • As already mentioned, $E(\theta)$ is not required to be differentiable. At the same time, nothing forbids using ES even when the gradient can be computed honestly.
  • ES parallelizes trivially. Parallel versions of SGD do exist, but they require sending the weights $\theta_i$ between worker nodes, or between the nodes and a central server, which is very expensive when there are many layers. With evolutionary strategies, each worker can compute its own value $E_i = E(\theta + \epsilon_i)$. Sending the $E_i$ around is cheap - they are usually just scalars. Sending the $\epsilon_i$ would be as hard as sending $\theta$, but there is no need to: if all the nodes know the random-number sampling algorithm and each other's generator seeds, they can reconstruct each other's $\epsilon_i$ themselves (a toy sketch of this trick is given right after this list)! The performance of such a system grows almost linearly with scale, which makes the distributed version of ES simply incredibly fast when a large number of training machines is available. OpenAI reports MuJoCo training in 10 minutes on 80 machines with 1440 CPUs.
  • Sampling $\theta$ introduces noise into the gradient estimate, which makes learning more robust. Compare with dropout in ordinary neural network training.
  • ES does not depend on frame skip in RL (see, for example, here) - one hyperparameter fewer.
  • It is also argued that evolutionary strategies cope better than ordinary SGD when a lot of time can pass between an action in RL and the positive feedback for it, and in noisy settings where it is unclear which particular change improved the result (for instance, early in training, when the network cannot figure out for a long time what it is supposed to do in the virtual environment and just twitches in place).
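Here is a minimal single-process imitation of the shared-seed trick mentioned in the parallelism item (the seeding scheme, the worker loop, and the names are my own illustration, not the OpenAI implementation; `theta` is assumed to be a NumPy array of weights): each "worker" evaluates only its own perturbed copy of the weights, only the scalar returns would ever cross the network, and every worker regenerates all the $\epsilon_i$ locally from the known seeds.

```python
import numpy as np

def distributed_es_step(E, theta, sigma, lr, n_workers, step):
    # Deterministic per-worker seeds derived from the step index: every node
    # can recompute them, so the epsilon_i vectors never have to be sent.
    seeds = [step * n_workers + i for i in range(n_workers)]

    # In a real cluster, worker i computes only returns[i] and the scalars are
    # all-gathered; here we simply loop over the workers.
    returns = np.array([
        E(theta + np.random.default_rng(s).normal(0.0, sigma, theta.shape))
        for s in seeds
    ])

    # Every worker now rebuilds all perturbations locally and applies the same update.
    grad = np.zeros_like(theta)
    for s, r in zip(seeds, returns):
        grad += r * np.random.default_rng(s).normal(0.0, sigma, theta.shape)
    grad /= n_workers * sigma**2
    return theta - lr * grad
```

The per-step network traffic here is one scalar per worker, no matter how many parameters $\theta$ contains, which is exactly what lets throughput scale almost linearly with the number of machines.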

What are the disadvantages?

  • The graphs in the OpenAI article do not look quite so rosy. Learning with ES takes 2-10 times more epochs than learning with TRPO. The authors counter that when a large number of worker machines is available, the real wall-clock time is 20-60 times smaller (even though each update brings less benefit, we can make updates much more often). Great, of course, but where do you get 80 worker nodes? Also, the final results seem to be a little worse. Perhaps the network can be "finished off" with a more accurate algorithm?
  • The noise in the gradients is a double-edged sword. Even in one-dimensional optimization the estimates turn out to be slightly unstable (see the noise on the blue curves in the graphs above), let alone when $\theta$ is very high-dimensional. It is not yet clear how serious a problem this is, or what can be done about it besides increasing the number of samples.
  • It is not known whether ES can be applied effectively to ordinary supervised learning. OpenAI honestly reports that on plain MNIST, ES can be 1000 (!) times slower than ordinary optimization.

ES variations


An additional nice property of evolutionary strategies is that they only tell us how to estimate the derivative and do not force us to rewrite the backpropagation-based pipeline from scratch. Everything that can be bolted onto backprop-SGD can be bolted onto ES as well: Nesterov momentum, Adam-like algorithms, batch normalization. There are also some technique-specific additions, but more research is needed on the circumstances under which they work:

  • ES lets you estimate not only the first but also the second derivative of $E(\theta)$. To get the formula, it is enough to write out the Taylor series in the proof above up to the fifth term instead of the third and subtract from it the formula already obtained for the first derivative (a sketch is given right after this list). However, this seems to be of purely academic interest: the estimate is terribly unstable, and Adam already lets you emulate the effect of the second derivative inside the algorithm.
  • It is not necessary to use a normal distribution. Obvious candidates are the Laplace distribution and the t-distribution, but more exotic options can be devised.
  • It is not necessary to sample only around the current $\theta$. If you remember the history of movement through parameter space, you can sample the weights around a point shifted slightly in the direction of movement.
  • During training, you can periodically jump not in the direction of the gradient but simply to the point with the smallest $E(\theta + \epsilon_i)$. On the one hand this destabilizes training even further; on the other hand, the approach works well in strongly non-linear cases with many local minima and saddle points.
  • A December 2017 Uber article proposes novelty-reward-seeking evolution strategies (NSR-ES): an additional term is added to the weight-update formulas to encourage new behavioral strategies. The idea of introducing diversity into reinforcement learning is not new, so it was only a matter of time before someone tried to bolt it on here. The resulting learning algorithms give noticeably better results on some games from the Atari suite.
  • There is also an article claiming that it is not necessary to send the results from every node to every other node: a thinned-out communication graph between the workers not only runs faster but also yields better results (!). Apparently, communicating with only some of your peers acts as an extra regularizer, rather like dropout.
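For the curious, here is a sketch of the second-derivative estimate mentioned in the first item above (my own derivation along the lines described there, not a formula quoted from the cited papers). Multiply the Taylor expansion by $\epsilon^2 - \sigma^2$ instead of $\epsilon$, take the expectation, and use $\mathbb{E}[\epsilon^2] = \sigma^2$, $\mathbb{E}[\epsilon^3] = 0$, $\mathbb{E}[\epsilon^4] = 3\sigma^4$; the $E(\theta)$ and $E'(\theta)$ terms vanish and we are left with

$ \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[E(\theta + \epsilon)(\epsilon^2 - \sigma^2)] = \frac{E''(\theta)}{2}(3\sigma^4 - \sigma^4) + O(\sigma^6) \approx E''(\theta)\sigma^4$

so that

$ E''(\theta) \approx \frac{1}{\sigma^4} \mathbb{E}_{\epsilon \sim \mathbb{N}(0, \sigma)}[E(\theta + \epsilon)(\epsilon^2 - \sigma^2)]$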

See also this article, which further reflects on the difference in the behavior of ES and TRPO on different kinds of parameter landscapes; this article, which describes the relationship between ES and SGD in more detail; this article, which proves the extraterrestrial origin of the Egyptian pyramids; and this article, which compares OpenAI's ES with older classic ES.

Variational optimization (VO)


Hmm, but why haven't we said a word about changing $\sigma$ during training? It would seem logical to decrease it: the longer training goes on, the closer we are to the desired result, and the more attention we need to pay to the local landscape of $E(\theta)$. But how exactly should the standard deviation be changed? One doesn't want to invent some contrived schedule...
It turns out everything has already been invented! Evolutionary strategies are a special case of variational optimization with fixed $\sigma$.

Variational optimization rests on an auxiliary statement, the variational upper bound: the global minimum of the function $E(\theta)$ cannot be greater than the average of $E(\theta_i)$, no matter how we sample these $\theta_i$:

$\min_{\theta} E(\theta) \leq \min_{\eta} \mathbb{E}_{\theta \sim p(\theta|\eta)} [E(\theta)] = \min_{\eta} J(\eta, E) $


Where $J(\eta, E)$ is the variational functional, whose global minimum coincides with the global minimum of $E$. Note that the minimization is now performed not over the original parameters $\theta$, but over the parameters $\eta$ of the distribution from which those parameters are sampled. Also note that even if $E$ is non-differentiable, $J$ is differentiable as long as the distribution $p(\theta|\eta)$ is, and its derivative can be written as:

$ \frac{\delta}{\delta\eta} \mathbb{E}_{\theta \sim p(\theta|\eta)}[E(\theta)] = \frac{\delta}{\delta\eta} \int p(\theta|\eta) E(\theta) d\theta $


$ = \int \frac{\delta}{\delta\eta} p(\theta|\eta) E(\theta) d\theta $


$ = \int p(\theta|\eta) E(\theta) \frac{\delta}{\delta\eta} \log{p(\theta|\eta)} d\theta $


$ = \mathbb{E}_{\theta \sim p(\theta|\eta)} E(\theta) \frac{\delta}{\delta\eta} \log{p(\theta|\eta)} $


If $p(\theta|\eta)$ is a Gaussian distribution with fixed variance, this formula easily reduces to the ES formula. But what if $\sigma$ is not fixed? Then for every weight in $\theta$ we get two parameters in $\eta$: the center and the standard deviation of the corresponding Gaussian. The dimensionality of the problem doubles! This is easiest to show with a one-dimensional example. Here is a top view of the landscape obtained after this transformation for the function from the previous section:
And descent along it:
The value of this landscape at a point $(\mu, \sigma)$ is the average of values of the original function sampled around $x = \mu$ from a normal distribution with standard deviation $\sigma$. Note that:

  • Again: in real life there is no point in computing this function; as with ES, we only need the gradient.
  • We now minimize over $(\mu, \sigma)$, not over $x$ (a code sketch of one such step is given after this list).
  • The smaller $\sigma$ is, the more a cross-section of $J$ resembles the original function $E$.
  • The global minimum of $J$ is reached at $\mu = 0, \sigma \rightarrow 0$. In the end we only need $\mu$; $\sigma$ can be thrown away.
  • It is quite hard to get into the global minimum of the original one-dimensional function: it is fenced off by high walls. However, if you start from a point with a large $\sigma$, it becomes much easier, since the stronger the smoothing, the more the cross-section of the function resembles an ordinary quadratic.
  • Nevertheless, it is better to use Adam to get into the narrow hollow of the minimum.
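A minimal sketch of one such descent step, for a one-dimensional Gaussian $p(\theta|\eta) = \mathbb{N}(\mu, \sigma)$ with both parameters learned (the log-parameterization $\sigma = e^{\rho}$, the baseline, and the toy staircase loss are choices of mine for illustration, not taken from the cited posts):

```python
import numpy as np

def vo_gradient(E, mu, rho, n_samples=500, rng=None):
    """Score-function gradient of J(mu, rho) = E_{theta ~ N(mu, exp(rho))}[E(theta)]."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(rho)
    theta = rng.normal(mu, sigma, n_samples)
    values = E(theta)
    centered = values - values.mean()      # baseline subtraction to tame the variance
    # d/dmu  log p(theta | mu, sigma) = (theta - mu) / sigma^2
    # d/drho log p(theta | mu, sigma) = (theta - mu)^2 / sigma^2 - 1   (since sigma = exp(rho))
    g_mu = np.mean(centered * (theta - mu) / sigma**2)
    g_rho = np.mean(centered * ((theta - mu) ** 2 / sigma**2 - 1.0))
    return g_mu, g_rho

# One descent step on the distribution parameters (mu, rho) rather than on theta itself.
mu, rho, lr = 2.0, np.log(0.5), 0.05
g_mu, g_rho = vo_gradient(lambda t: np.floor(np.abs(t)), mu, rho)
mu, rho = mu - lr * g_mu, rho - lr * g_rho
```

If $\rho$ is frozen and $g_\rho$ is discarded, the $g_\mu$ line is (up to the baseline term) exactly the ES estimate from the first section.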

What are the benefits of introducing variational optimization? It turns out that since ES uses a fixed $\sigma$, its formula can lead us towards suboptimal minima: the larger $\sigma$ is, the less visible deep but narrow minima are on the landscape of the smoothed function.

Since in VO $\sigma$ is not fixed, we at least have a chance of getting into the true global minimum.

The disadvantages are obvious: even in the simplest case the dimensionality of the problem doubles, to say nothing of the case when we want a separate $\sigma_i$ for every direction. The gradients become even more unstable.

Remember the graph above with a “wall” in the middle. VO feels it even less:
(the former wall is the small bump at the bottom in the middle)
Although convergence to the global minimum is not guaranteed even in simple cases, we have a better chance of getting at least somewhere with an imperfect initialization:
Or even of escaping from a region of really poor initialization:


Conclusion


The OpenAI article brought to variational optimization the trick with a shared seed for the random number generators of different nodes. But so far there seem to be no articles measuring the real speed-up from it. I think we will see them in the near future. If they are encouraging, and if ES and VO can be extended to supervised learning, perhaps a paradigm shift in machine learning awaits us.

You can download the code used to build the visualizations here. You can look at other visualizations and read about how the methods above relate to genetic algorithms, Natural Evolution Strategies, and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) here. Write in the comments if you would like to see how ES and VO behave on some particular function. Thanks for your attention!