Monday 21 May 2012

Machine learning at its turning point: Non-convexity

Machine learning has spent the last decade specifying and optimising convex regularised risks. It seems that we have reached the point where we need to move on, accepting the fact that a global optimum may not be optimal after all.

To me, learning based on convex functions is desirable as long as we know what we are looking for. In many cases, however, we do not, and we are more interested in discovery. One of the most effective ways to discover is to loosely specify latent variables which, hopefully, capture the generative process of the data. Whenever we introduce latent variables, we are likely to hit a non-convex objective function. In fact, the goal now is not to find an optimal configuration in a known, well-behaved search space, but to search for reasonably effective configurations in a largely unknown space. This is evidenced in the most powerful natural living systems, such as human brains: they are very effective and time-tested, but they are hardly optimal in any sense.

This move from convex to non-convex is not linear. In fact, non-convex machine learning systems have been around since the beginning of computer-based modelling. So-called latent variable modelling has been around for decades in the statistical sciences, and the standard, non-Bayesian way to optimise the incomplete-data likelihood is usually the EM algorithm. One of the famous variants of EM is the Baum-Welch algorithm for training HMMs. Another example is neural networks: as soon as they become interesting by having one or more hidden layers, they are inherently non-convex.

The recent rise of deep architectures illustrates this point further: one of the most basic building blocks, the Restricted Boltzmann Machine, owes its representational power to the hidden layer.

Now, if the objective function (assuming we can compute it at all) is non-convex, how can we optimise it? There are two strategies: one is to approximate it by a convex surrogate function and iterate until convergence; the other is to optimise the function directly. The EM algorithm is of the first type, and it has been extremely popular. But is it the best way?

Hardly.

It is a nice way because we have some sort of guarantee that it converges to a local optimum. But after all, it is still a local climbing method. Being clever in using a convex surrogate does not guarantee that it stays away from bad optima.
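To make the "local climbing" point concrete, here is a minimal sketch of EM on a toy problem. Everything in it is an assumption made for illustration: a one-dimensional mixture of two Gaussians with equal weights and unit variances, synthetic data, and the helper names `log_likelihood`, `em_step` and `run_em`. Only the two means are estimated.

```python
import math
import random

def log_likelihood(data, mu):
    """Log-likelihood of an equal-weight, unit-variance two-component mixture."""
    norm = 0.5 / math.sqrt(2.0 * math.pi)
    return sum(
        math.log(sum(norm * math.exp(-0.5 * (x - m) ** 2) for m in mu))
        for x in data
    )

def em_step(data, mu):
    """One EM iteration that re-estimates the two means only."""
    sums = [0.0, 0.0]
    counts = [0.0, 0.0]
    for x in data:
        # E-step: posterior responsibility of each component for x.
        dens = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
        total = dens[0] + dens[1]
        for k in range(2):
            r = dens[k] / total
            counts[k] += r
            sums[k] += r * x
    # M-step: responsibility-weighted means.
    return [sums[k] / counts[k] for k in range(2)]

def run_em(data, mu, iters=50):
    """Run EM and record the log-likelihood after every iteration."""
    history = [log_likelihood(data, mu)]
    for _ in range(iters):
        mu = em_step(data, mu)
        history.append(log_likelihood(data, mu))
    return mu, history

# Synthetic data (assumed): two well-separated clusters around -2 and +2.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + \
       [rng.gauss(2.0, 1.0) for _ in range(200)]

good_mu, good_ll = run_em(data, [-0.5, 0.5])   # asymmetric start
stuck_mu, stuck_ll = run_em(data, [0.0, 0.0])  # symmetric start
```

In both runs the recorded log-likelihood never decreases, which is the convergence guarantee at work. With the symmetric start, however, both components receive identical responsibilities at every step, so the two means stay equal and EM settles at a much poorer stationary point than the run started from `[-0.5, 0.5]`.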

It turns out that the second strategy is equally good, although it is much less popular in the probabilistic world. Today we have at hand very efficient implementations of Conjugate Gradient and Limited-Memory Quasi-Newton methods for large-scale problems. Although these were perhaps intended for convex functions, they are applicable to non-convex problems as long as we accept local optima as a reasonable choice.

Outside the probabilistic world, these techniques are hardly exciting because neural network people have been using them for years. The exciting questions are (i) how we can exploit the problem structure to obtain better algorithms, and (ii) whether we need to search for global optima at all.

I intend to leave these two questions unanswered, but the answer to the second question may be: No. The rest of this post is about the first strategy, the convex-surrogate approach.

In machine learning, there is a well-known method called the Concave-Convex Procedure (CCCP), which was introduced a decade ago to deal with non-convex situations. The idea is based on the observation that a very broad class of non-convex functions (for example, any twice-differentiable function with a bounded Hessian) can be decomposed into a convex part and a concave part. The interesting point is that, although the original authors made some attempt to relate it to an existing line of work in the global optimisation community, the Difference-of-Convex functions Algorithm (DCA), machine learning people almost never cite the DCA, even though it is earlier (around 1994-1998) and its body of knowledge is much more comprehensive.
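As a rough illustration of the convex-plus-concave idea, here is a minimal sketch on a toy one-dimensional function chosen for this post, not taken from the CCCP or DCA papers. Assume we minimise f(x) = x^4 - 2x^2, split into a convex part u(x) = x^4 and a concave part v(x) = -2x^2. Each step replaces the concave part by its tangent at the current point and minimises the resulting convex surrogate, which amounts to solving u'(x_new) = -v'(x_old), that is 4 x_new^3 = 4 x_old. The function names are hypothetical.

```python
import math

def f(x):
    """Toy non-convex objective: convex part x**4 plus concave part -2*x**2."""
    return x ** 4 - 2.0 * x ** 2

def cccp_step(x):
    """Minimise the convex surrogate x_new**4 - 4*x*x_new.

    Setting its derivative to zero gives 4*x_new**3 = 4*x,
    so x_new is the real cube root of x.
    """
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def cccp(x0, iters=50):
    """Iterate the update from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(iters):
        xs.append(cccp_step(xs[-1]))
    return xs

path_right = cccp(0.1)   # positive start
path_left = cccp(-3.0)   # negative start
```

The objective never increases along either trajectory. The two starting points end up at different minimisers of f (x = 1 and x = -1), and a start at exactly x = 0 stays at 0, which is a local maximum of f: the procedure is a local method, and where it ends up depends on where it starts.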

I discovered around 2005 that the log-likelihood of log-linear models with hidden variables is actually in D.C. form. That fact is interesting but hardly a new discovery. What is more interesting is that when we apply CCCP/DCA to solve that problem, we obtain an update that is identical to the one we get by applying EM.
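A sketch of why this holds, under notation assumed here for illustration: v denotes the visible variables, h the hidden variables, f(v, h) the feature vector and w the parameters.

```latex
% Log-linear model (notation assumed):
%   p(v, h; w) = \exp\{ w^\top f(v, h) \} / Z(w)
\begin{align}
\log p(v; w) &= \log Z(v; w) - \log Z(w), \\
Z(v; w) &= \sum_{h} \exp\{ w^\top f(v, h) \}, \qquad
Z(w) = \sum_{v', h} \exp\{ w^\top f(v', h) \}.
\end{align}
% Both \log Z(v; w) and \log Z(w) are log-sum-exp functions of w, hence convex,
% so the negative log-likelihood is a difference of two convex functions:
\begin{align}
-\log p(v; w) &= \underbrace{\log Z(w)}_{\text{convex}}
               - \underbrace{\log Z(v; w)}_{\text{convex}}.
\end{align}
% CCCP/DCA linearises the second term at the current estimate w_t, using
%   \nabla_w \log Z(v; w) \big|_{w = w_t} = \mathbb{E}_{p(h \mid v; w_t)}[ f(v, h) ],
% and then solves the convex problem
\begin{align}
w_{t+1} &= \arg\min_{w} \; \log Z(w)
          - w^\top \mathbb{E}_{p(h \mid v; w_t)}\big[ f(v, h) \big].
\end{align}
```

The last line is exactly the M-step of EM, because the expected complete-data log-likelihood under p(h | v; w_t) equals w^T E[f(v, h)] - log Z(w), and maximising it is the same as minimising the expression above.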
