Question: Using camera on a filtered expression matrix
2.8 years ago, giovanni.dario wrote:

Dear all,

I have a question on the correct way to use camera with a filtered gene expression matrix.

Let's assume that I have an n * m expression matrix (array or RNA-Seq, it doesn't matter) and that, after filtering out the features below a certain intensity/cpm, I am left with an n' * m matrix, where k = n - n' is the number of features removed. Presumably most of these filtered features will not be significantly associated with the phenotype.

Now, if I understand correctly, when I use 'ids2indices' to map the elements of a given gene set to the features of the expression matrix, the elements with no match will contain an NA and will be excluded from the rest of the analysis. This means that if I have a gene set where only 10% of the genes are present in the filtered expression matrix, the gene set actually tested will consist of only that 10%. In my (very possibly incorrect) understanding, this makes perfect sense if the non-matching features are genuinely not testable (for example, if the array contains no probe sets mapping to them). However, in the case of filtered features I am a bit confused. In the example above, if the 10% of genes retained were associated with the phenotype and the remaining 90% were removed, I would probably see a significant association of the gene set with the phenotype. If, instead, I kept in the gene set the 90% of genes that are not significantly associated with the phenotype, I would probably obtain a non-significant result. My questions therefore are:

1. Is my understanding correct?
2. If yes, what would be the best way to retain the information of the k weak, (mostly) non-significant features in the analysis?

Apologies for the somewhat lengthy question, and many thanks in advance.
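
The mapping step described in the question can be pictured with a small sketch. The Python below is not limma code: `ids_to_indices` is a hypothetical stand-in for the behaviour the question attributes to `ids2indices`, and the gene IDs are invented. It assumes that IDs with no match in the filtered matrix simply drop out of the set, and that a set with no matches at all is discarded. Positions are 0-based here, whereas R indices are 1-based.

```python
def ids_to_indices(gene_sets, identifiers):
    """Map each gene set to row positions (0-based) in `identifiers`.

    IDs that are absent from `identifiers` are silently dropped, and sets
    left with no matching rows are removed altogether.
    """
    indices = {}
    for name, genes in gene_sets.items():
        members = set(genes)
        rows = [i for i, gene in enumerate(identifiers) if gene in members]
        if rows:  # skip sets that no longer map to any row
            indices[name] = rows
    return indices


# Row names of the filtered expression matrix (hypothetical IDs)
filtered_ids = ["g3", "g11", "g12", "g13", "g14"]

gene_sets = {
    "setA": [f"g{i}" for i in range(1, 11)],  # g1..g10; only g3 passed the filter
    "setB": ["g11", "g12", "g13"],            # fully present
    "setC": ["g20"],                          # nothing present
}

indices = ids_to_indices(gene_sets, filtered_ids)
print(indices)
```

In this toy case `setA` shrinks from ten genes to one, `setB` is unchanged, and `setC` disappears, which is the 10% situation raised in the question.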

2.8 years ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

Part of the justification for filtering non-expressed or very low expressed genes is that they are very unlikely to be DE, in the sense that they are unlikely to contain enough statistical information to achieve statistical significance. Filtered genes are therefore essentially not testable, much like probes that are not on the array at all.

For this reason, we treat non-expressed genes the same as probes not found on the array, by excluding non-expressed genes from the testing universe for camera and other gene set tests.
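
To make the idea of a testing universe concrete, here is a minimal Python sketch. It is not camera: camera additionally accounts for inter-gene correlation, whereas this toy only contrasts the mean statistic inside a set with the mean over the remaining genes. The function name `competitive_shift` and all values are hypothetical. The only point is that a filtered gene appears neither in the set nor in the background the set is compared against.

```python
from statistics import mean


def competitive_shift(stats, gene_set):
    """Mean statistic of the set's genes minus the mean of all other genes.

    `stats` holds one statistic per gene in the testing universe, i.e. only
    genes that passed filtering. Set members missing from `stats` are ignored.
    """
    in_set = [s for gene, s in stats.items() if gene in gene_set]
    rest = [s for gene, s in stats.items() if gene not in gene_set]
    if not in_set or not rest:
        return None  # nothing to compare
    return mean(in_set) - mean(rest)


# Hypothetical moderated t-statistics for the genes that passed filtering
stats = {"g1": 3.0, "g2": 2.0, "g3": 0.5, "g4": -0.5, "g5": 0.0}

# g8 and g9 were filtered out, so they play no role on either side
gene_set = {"g1", "g2", "g8", "g9"}

shift = competitive_shift(stats, gene_set)
print(shift)
```

Here the set is judged on g1 and g2 alone, against a background of g3, g4 and g5.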


Dear Gordon, thank you so much for your kind and clear answer. I was asking because I recently had a gene set that, by design, was significantly associated with a data set (it consisted of genes found to be up-regulated in an independent, identical experiment under the same conditions). Although this set ranked near the top of the camera output, it ranked below another gene set in which fewer than 20% of the genes could be mapped to the expression matrix and which was, from a biological point of view, quite unlikely to be associated with the phenotype. In such cases, would you recommend filtering out gene sets below a certain fraction of mappable IDs, or rather keeping all the gene sets and interpreting the results a posteriori?

Many thanks again for your answer and for your code! 
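
One way to inspect the fraction of mappable IDs raised in this comment is to compute it per gene set before running the test. The Python sketch below is purely illustrative: `mapped_fraction`, the gene IDs, and the 0.2 cutoff (borrowed from the 20% figure in the comment) are assumptions, not a recommended threshold.

```python
def mapped_fraction(gene_sets, identifiers):
    """Fraction of each set's unique IDs that are present in `identifiers`."""
    universe = set(identifiers)
    fractions = {}
    for name, genes in gene_sets.items():
        unique = set(genes)
        if unique:  # avoid dividing by zero for empty sets
            fractions[name] = len(unique & universe) / len(unique)
    return fractions


# Row names of the filtered expression matrix (hypothetical IDs)
filtered_ids = ["g3", "g11", "g12", "g13", "g14"]

gene_sets = {
    "setA": [f"g{i}" for i in range(1, 11)],  # 1 of 10 IDs present
    "setB": ["g11", "g12", "g13", "g99"],     # 3 of 4 IDs present
}

fractions = mapped_fraction(gene_sets, filtered_ids)

cutoff = 0.2  # hypothetical; chosen only to mirror the 20% mentioned above
low_coverage = [name for name, frac in fractions.items() if frac < cutoff]
print(fractions, low_coverage)
```

Whether to drop the flagged sets or merely annotate them in the results table is exactly the judgment call the comment asks about; the sketch only makes the coverage visible.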

Reply written 2.8 years ago by giovanni.dario