Question

Can limma/other DE algorithms be used for differential analysis of chemical descriptors?

perdedorium • 0

If I have two groups of small molecules with different properties, e.g. one can penetrate membranes and the other cannot, and I have an m × n matrix of m small molecules and n descriptors (e.g. polar surface area, molecular weight), can I use, say, limma to identify descriptors that differ between these two groups? In other words, I would be running limma on molecular descriptors instead of gene expression data. If it makes any difference, some of these descriptors are discrete (e.g. number of rotatable bonds) and some are continuous (e.g. weight).

If so, would I have to prepare the data in any special way? If not, what algorithms would be best for this sort of task?

limma edger deseq2 cheminformatics • 695 views

updated 5.7 years ago by Aaron Lun ★ 28k • written 5.7 years ago by perdedorium • 0

score 3 · Accepted Answer · 2018-08-23

I can foresee a number of problems. The biggest one is that the different molecules are not replicates of each other in a statistical sense. There is no sampling or uncertainty, because your descriptor values are known quantities. In fact, there is no need for testing at all! Your "null hypothesis", such as it is, is that the means of these two groups of molecules are the same. Just check whether the means of the two groups differ, and this will tell you directly whether your null hypothesis is true or false. And that is an exact answer: if you "repeat" your "experiment" (i.e., analytically re-compute the weights and rotatable bonds, presumably by looking at the structure), you will get the same answer, as the descriptor values won't change.
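As a rough illustration of "just compare the means", here is a minimal Python sketch. The descriptor values are made up for illustration, and the names `penetrating`, `non_penetrating` and `mean_differences` are hypothetical, not part of any package.

```python
from statistics import mean

# Hypothetical descriptor values, invented purely for illustration.
penetrating = {
    "polar_surface_area": [40.0, 55.0, 60.0],
    "molecular_weight": [300.0, 350.0, 280.0],
}
non_penetrating = {
    "polar_surface_area": [120.0, 140.0, 110.0],
    "molecular_weight": [420.0, 380.0, 500.0],
}


def mean_differences(group_a, group_b):
    """Return mean(group_a) - mean(group_b) for every shared descriptor."""
    return {
        name: mean(group_a[name]) - mean(group_b[name])
        for name in group_a
        if name in group_b
    }


diffs = mean_differences(penetrating, non_penetrating)
for name, diff in diffs.items():
    # A non-zero difference answers the question for these molecules exactly;
    # there is no sampling noise to test against.
    print(f"{name}: {diff:+.2f}")
```

With these made-up numbers both differences come out negative, and recomputing them from the same structures would give the same values every time.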

If you want to make meaningful statements about statistical significance, then you need to think about distributions and randomness. I do not see how it is possible to do that here. Perhaps one could assume that the penetrating molecules are randomly "sampled" from the entire space of possible molecules that can penetrate membranes, in which case you could make general inferences about the differences between penetrating and non-penetrating molecules. But I doubt that the molecules were randomly chosen from whatever the space of possible molecules might be.

Going into the more technical issues: limma uses a normal model, which will not be appropriate for at least some of the variables you've described (e.g., discrete counts such as the number of rotatable bonds). I can also anticipate problems with empirical Bayes shrinkage and modelling of the mean-variance trend, which probably won't make any sense when you have a bag of very different variables, each with its own scale and variance.
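To make the scale problem concrete, the sketch below computes the variance of two descriptors on made-up values (the numbers and the name `descriptors` are assumptions for illustration only). Shrinking such variances towards a common value, as one would for genes measured on a shared scale, has no obvious interpretation here.

```python
from statistics import pvariance

# Hypothetical values for six molecules, invented for illustration.
descriptors = {
    "molecular_weight": [300.0, 350.0, 280.0, 420.0, 380.0, 500.0],  # continuous
    "rotatable_bonds": [2, 6, 3, 5, 4, 7],                            # discrete counts
}

# Population variance of each descriptor on its own, unrelated scale.
variances = {name: pvariance(values) for name, values in descriptors.items()}

for name, var in variances.items():
    print(f"{name}: variance = {var:.2f}")

# The ratio is driven by units, not by anything about the molecules.
ratio = variances["molecular_weight"] / variances["rotatable_bonds"]
print(f"variance ratio: {ratio:.0f}")
```

Standardizing each descriptor removes the unit dependence, but it does not turn counts into normally distributed values.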

It seems like what you really want is an algorithm that tells you which variables are most likely to distinguish between penetrating and non-penetrating molecules. This is a classic application of LASSO (see the glmnet package), where you can use a logistic model to fit the various descriptors to the response (i.e., penetrating or not). You will then get an ordered set of descriptors in terms of their importance for distinguishing between penetrating and non-penetrating molecules. The same approach can be used for any classifier (e.g., SVMs, random forests) but LASSO makes it easy to interpret the results. Of course, it means that you can't say that "the penetrating molecules are significantly heavier", but this doesn't really make sense in the first place, as I mentioned above.
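The idea can be sketched without any dependencies. The code below is not glmnet: it is a small, hand-written L1-penalized logistic regression fitted by proximal gradient descent, run on made-up toy data in which the third descriptor is uninformative by construction. The data, the function names and the penalty `lam=0.1` are all assumptions for illustration; in a real analysis one would use glmnet (e.g. `cv.glmnet` with `family = "binomial"`) and choose the penalty by cross-validation.

```python
import math

# Toy, made-up data: (polar surface area, molecular weight, penetrates?).
BASE = [
    (40.0, 300.0, 1), (55.0, 350.0, 1), (60.0, 280.0, 1),
    (120.0, 420.0, 0), (140.0, 380.0, 0), (110.0, 500.0, 0),
]
NAMES = ["polar_surface_area", "molecular_weight", "rotatable_bonds"]

# Each base molecule appears twice, once with 2 and once with 6 rotatable
# bonds, so this descriptor carries no information about the label.
X = [[psa, mw, rot] for psa, mw, _label in BASE for rot in (2.0, 6.0)]
y = [label for _psa, _mw, label in BASE for _ in range(2)]


def standardize(rows):
    """Scale every column to mean 0 and population standard deviation 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_logistic(rows, labels, lam, step=0.5, n_iter=2000):
    """Minimise mean logistic loss + lam * sum(|w_j|) by proximal gradient descent."""
    n, p = len(rows), len(rows[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) under the current model.
        resid = []
        for row, label in zip(rows, labels):
            eta = intercept + sum(w * v for w, v in zip(weights, row))
            resid.append(1.0 / (1.0 + math.exp(-eta)) - label)
        grad = [sum(r * row[j] for r, row in zip(resid, rows)) / n for j in range(p)]
        # Gradient step, then soft-thresholding; the intercept is not penalised.
        weights = [soft_threshold(w - step * g, step * lam) for w, g in zip(weights, grad)]
        intercept -= step * sum(resid) / n
    return weights, intercept


weights, intercept = lasso_logistic(standardize(X), y, lam=0.1)
coef = dict(zip(NAMES, weights))
selected = [name for name, w in coef.items() if w != 0.0]

# Rank descriptors by absolute coefficient (comparable because of standardization).
for name in sorted(coef, key=lambda k: -abs(coef[k])):
    print(f"{name}: {coef[name]:+.3f}")
```

Descriptors whose coefficient is shrunk exactly to zero are dropped from the model, which is what makes the result easy to read. Standardizing first matters because the penalty acts on coefficient size, which otherwise depends on each descriptor's units.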