Machine learns as data speak: Model stability: Learn it many more times, but on different datasets

How often do you actually publish your data-derived models? Chances are you almost never do, if you are machine learning type. Only quite recently, when training a big deep net is very expensive, people start publishing models. And it helps people who come afterward significantly.

This is quite contrary to fields like medicine, where the models (often regression coefficients for a GLM) are routinely published. This is because in those fields, model is the actual finding, not the learning algorithm that produces it.

In a way, empirical sciences progress as new models are found, published and verified. One key requirement is that the models are reproducible. Empirical models, those derived from data, must be stable across different datasets by different research groups to be accepted as a rule.

But this has not been well-respected in data-driven fields.

Anyone who use decision trees to derive a prediction rule from a reasonably complex data would experience a phenomenon that trees will differ drastically if you change just few data points. Unfortunately there have been many trees published in the medicine literature, probably because trees are highly interpretable. But I doubt that anyone could ever reproduce a tree from their own data.

At a recent "Big Data" conference I asked a bioinformatics professor why people keep publishing new "findings" of genes, which are supposed to cause or worsen a medical condition. The trouble is that different groups claim different subsets, many of which do not overlap at all. Needless to say, all of those findings are statistically significant, on their own dataset. The professor did not answer my question directly. She said people had different hypotheses and thus focused on those genes whey suspected. Sometimes, the biases or resource limitations prevent people from looking elsewhere.

For the past few years I have worked on deriving simple prediction rules for healthcare from high-dimensional data. The standard method of the day is sparsity-induced techniques such as Lasso. Every time I changed the data a little bit, either by changing some patients due to different selection criteria, or changing some features (there are endless possibilites), I would have a different feature subset and their coefficients with comparable predictive power!

For those who care, stability and sparsity are not the best friends. Sparse models are known to be unstable. Same as feature selection techniques.

Model instability is a daunting issue for empirical sciences (e.g., the so-called evidence-based medicine). There are two jobs that must be done. One is quantifying the instability. The other is deriving strategies to stabilize the estimation.

The first job has been studied to a great detail in the context of confidence interval estimation. For standard GLMs, the estimation is well-known, but as soon as sparsity comes into play, the job is much harder. A solution is simulation-based, a.k.a., the one-size-fit-all bootstrap. That is, for a dataset, resample it to obtain the new set of the same size, and re-estimate the model. Parameter confidence intervals can then be calculated from multiple estimates, says B times, where B is usually in the order of thousands. While this method is straightforward with modern computer, its theoretical properties still need further investigation.

The second job is much less studied. At PRaDA (Deakin University), we have attempted to solve the problem from several directions, and for several GLM instances such as logistic regression, ordinal regression and Cox's model. The main idea is to exploit the domain knowledge or statistics, so that the degree of freedom is limited. Some of the recent works are listed in the References below.

In subsequent posts, we will cover some specific techniques. Stay tuned.

Updated references

Preterm Birth Prediction: Deriving Stable and Interpretable Rules from High Dimensional Data, Truyen Tran, Wei Luo, Dinh Phung, Jonathan Morris, Kristen Rickard, Svetha Venkatesh, Conference on Machine Learning in Healthcare, LA, USA Aug 2016.
Stabilizing Linear Prediction Models using Autoencoder, Shivapratap Gopakumara, Truyen Tran, Dinh Phung, Svetha Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
Stabilizing Sparse Cox Model using Statistic and Semantic Structures in Electronic Medical Records. Shivapratap Gopakumar, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, and Svetha Venkatesh, PAKDD'15, HCM City, Vietnam, May 2015.
Stabilizing high-dimensional prediction models using feature graphs, Shivapratap Gopakumar, Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh, IEEE Journal of Biomedical and Health Informatics, 2014 DOI:10.1109/JBHI.2014.2353031S
Stabilizing sparse Cox model using clinical structures in electronic medical records, S Gopakumar, Truyen Tran, D Phung, S Venkatesh, 2nd International Workshop on Pattern Recognition for Healthcare Analytics, August 2014
Stabilized sparse ordinal regression for medical risk stratification, Truyen Tran, Dinh Phung, Wei Luo, and Svetha Venkatesh, Knowledge and Information Systems, 2014, DOI: 10.1007/s10115-014-0740-4.

Machine learns as data speak

Monday 19 December 2016

Model stability: Learn it many more times, but on different datasets

1 comment: