Machine learns as data speak: May 2012

Thursday 31 May 2012

Ratings in collaborative filtering should be treated ordinally

Collaborative filtering (CF), like its umbrella, recommender systems, is hot right now. Commercial CF systems have been implemented by major players like Amazon, Google News, TiVo, and Netflix. Within the academic world, CF research is now regularly delivered in popular venues such as information retrieval (or information filtering), information science, knowledge management, social media, data mining, machine learning, AI, statistics, human-computer interaction, drug-discovery, social sciences, marketing, and of course, the ACM RecSys annual conferences.

The basic idea behind CF is that, since people are interested in common items, and thus there must exist some shared structures, e.g., some people are really similar in their tastes or some items are really similar in their characteristics. The sharing will eventually help to predict how much an user will like an item he or she hasn't seen before.

Depending on your problem representation, you will end up with different algorithms.

One particularly appealing representation is matrix (e.g., see here). Here we are given an incomplete user-item matrix, where observed entries are often ratings by users on items they have seen. The tasks are either to predict the ratings of remaining entries, or to produce a list of unseen items for each user (or other way around, a list of new users for an item).

Under this representation, the shared structure can be captured by the low-rank decomposition, that is, the user-item matrix is really a product of an user-feature matrix and an item-feature matrix, where features are latent aspect of each user and each item. The number of latent features is often small compared to the original matrix sizes. In effect, we project both users and items onto the same latent space, and the preference is measured by some similarity between an user and an item in that space.

Another representation is bipartite graph, where users and items are nodes in the graph, and a weighted link between an user and an item indicates the level of interest. The problem is now to predict the missing links and their weights.

However, these two representations are somewhat complementary since they do not overlap entirely. These are evidenced in recent successful methods which combine both the approaches: here, here and here.

Surprisingly, both camps did not even consider the fact that (i) the ratings are indeed ordinal, (ii) we more or less interested in ranking items rather than rating them.

The second problem deserves separate posts, and here we shall focus on the first one.

Ordinal variables are those categorical variables whose relative orders between categories is the most important information, such as those in standard survey options: {very bad, bad, OK, good and very good}. You may be familiar with number of stars given to a movie, but these stars shouldn't be treated as a (real or integer) number. This is because, first, people are different in their use of rating values -- for some a 4 star movie means very excellent, but for other, it is may be just OK. Second, you can't be sure that these stars are spaced equally -- people are really bad at making consistent quantitative judgement.

On the other extreme, ordinal variables may be treated as standard categorical variable, where each time we pick only one category, assuming that the utility of that category is the highest among all categories. However, this treatment effectively ignores the fact that ordinal values are ordered. Under the categorical treatment, all mistakes are equally bad, but in a proper ordinal treatment, a mistake that is far from the true category should be considered as worse than those near by in the ordinal scale.

The absence of adequate ordinal treatment in CF might suggest some degree of ignorance in machine learning research: we sometimes do not pay attention to existing research in statistics. As for ordinal data, the statistical community has been addressed it for many decades under the umbrella of ordinal regression, and a hugely influential paper was written in 1980 by McCullagh. One may argue that the natures of the collaborative filtering and ordinal regression are different. In fact, they do not: The setting of collaborative filtering is identical to that of the so-called "multi-raters" problem, and the modelling of the raters itself has been long known as the random effect models.

Only very recently, in CF there has been some work addressing the ordinal nature of the data: here, here, here, here and here. The last two are ours, one was published in UAI'09 and the other one will appear in AAAI this year. The UAI'09 paper exploits the ordinal property in the feature construction but it doesn't really model ordinal ratings. The AAAI'12 paper addresses this shortcoming and the main contribution is the deviation from the standard McCullagh's model by adopting the sequential decision approach of Mare (1980).

To be more concrete, the McCullagh's model assumes that there exists a continuous utility that the user perceives when seeing an item. The ordinal level is determined by examining the interval to which the utility belongs. This property, when combining with the underlying assumption of the utility distribution, gives rise to several names: grouped continuous model, proportional odds models, cumulative logit model or ordered probit models. This model will work well if there exists the true interpretation of utility interval, such as income ranges, severity levels, etc.

On the other hand, the sequential decision approach assumes the decision is made progressively starting from the lowest level. If the utility is beyond the level's threshold, the next level will be considered. Otherwise, the current level is selected. This approach is attractive in the cases where stagewise progresses are made, for example, the education level, the health recovery status, etc.

The question is then, which approach is the best for CF? There are no easy answers because we do not know the true decision making behind the rating processes. For now we could only make some guess and verify on real datasets. That that is what we reveal in our forthcoming paper: The sequential approach is as good as, if not better than, the McCullagh's approach, for a number of large-scale datasets.

Updated References

Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 25th Conference on Uncertainty in Artificial Intelligence, June, 2009, Montreal, Canada.
A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 25-th Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 4th Asian Conference on Machine Learning (ACML2012), Singapore, Nov 2012.
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.
Collaborative filtering via sparse Markov random fields, Truyen Tran, Dinh Phung, Svetha Venkatesh, Information Science, doi:10.1016/j.ins.2016.06.027.

Monday 21 May 2012

Machine learning at its turning point: Non-convexity

Machine learning had enjoyed the last decade of specifying and optimising convex regularised risks. It seems that we have reached the point when we need to move on, accepting the fact that global optimum may not be optimal after all.

To me, learning based on convex functions is desirable as long as we know what we are looking for. In many cases, however, we do not, and we are more interested in discovery. And one of the most effective way is to loosely specify latent variables which, hopefully captures the generative process of the data. Whenever we introduce latent variables, it is likely that we hit a non-convex objective function. In fact, the goal now is not to find a optimal configuration given the known, well-behaved search space, but to search for reasonably effective configurations in a largely unknown space. It is actually evidenced in most powerful natural living systems, such as human brains: They are very effective and time-tested, but they are hardly optimal in any sense.

This move from convex to non-convex is not linear. In fact, non-convex machine learning systems have been around since the beginning of computer-based modelling. The so-called Latent Variable Modelling has been around for decades in statistical sciences, and the standard, non-Bayesian way to optimise the incomplete data likelihood is usually the EM algorithm. One of the famous variant of the EM is the Baum-Welch algorithm for training HMMs. Another example is neural networks: As soon as they are getting interesting by having one ore more hidden layers, they are inherently non-convex.

The recent rise of deep architectures clearly illustrates this point further: One of the most basic building block, the Restricted Boltzmann Machines, has its representation power due to the hidden layer.

Now, given the objective function, if we can compute it at all, is non-convex, then how can we optimise it? There are two strategies: One is to approximate it by a surrogate convex function and iterate until converged and another one is to optimise the function directly. The EM algorithm is actually the first type, and it has been extremely popular. But is it the best way?

Hardly.

It is a nice way because we can have some sort of guarantee for its convergence to local optimum. But after all, it is still a local climbing method. Being clever at using a convex surrogate does not guarantee that it can stay away from bad optima.

It turns out the second strategy is equally good, although it is much less popular in the probabilistic world. Today we have at hand very efficient implementations of Conjugate Gradients and Limited-Memory Quasi-Newton methods for large-scale problems. Although these were perhaps intended for convex functions, they can be applicable for non-convex problems as long as we accept local optima as a reasonable choice.

Outside the probabilistic world, these techniques are hardly anything exciting because neural network people have been using them for years. The exciting parts are (i) how can we exploit the problem structures to obtain better algorithms, and (ii) whether we need to search for global optima at all?

I intend to leave this two questions unanswered, but the answer for the second question may be: No. And the rest of this post is about the first strategy.

In machine learning, there is a well-known method called Concave-Convex Procedure (CCCP) which was introduced a decade ago to deal with the non-convex situations. The idea is based on the observation any non-convex function can be decomposed into a convex part and a concave part. The interesting point is that, although the original authors made some attempt to relate it to the existing line of work in the global optimisation community: the Difference-of-Convex function Algorithm (DCA), machine learning people almost never cite the DCA, which is the earlier (around 1994-1998) and its body of knowledge is much more comprehensive.

I discovered about in 2005 that the log-likelihood of log-linear models with hidden variables is actually in the D.C. form. That fact is interesting but hardly a new discovery. What is more interesting is that when we apply CCCP/DCA to solve that problem, we actually obtain an update that is identical to that when we apply the EM.