Question

Using camera on a filtered expression matrix

giovanni.dario • 0

@giovannidario-7265

Switzerland

Dear all,

I have a question on the correct way to use camera with a filtered gene expression matrix.

Let's assume that I have an n × m expression matrix (array or RNA-Seq, it doesn't matter) and that, after filtering out the features below a certain intensity/cpm, I am left with an n' × m matrix, where k = n - n' is the number of features removed. Presumably most of these filtered features will not be significantly associated with the phenotype.

Now, if I understand correctly, when I use 'ids2indices' to map the elements of a given gene set to the features of the expression matrix, the elements with no match will contain an NA and will be excluded from the rest of the analysis. This means that if I have a gene set where only 10% of the genes are present in the filtered expression matrix, the gene set that is actually tested will consist of that 10% only. In my (very possibly incorrect) understanding, this makes perfect sense if the non-matching features are genuinely not testable (for example, if the array contains no probe sets mapping to them). In the case of filtered features, however, I am a bit confused. In the example above, if that 10% of the genes in the gene set were associated with the phenotype and the remaining 90% were removed, I would probably see a significant association of the gene set with the phenotype. If, instead, I kept in the gene set the 90% of genes that are not significantly associated with the phenotype, I would probably obtain a non-significant result. My questions therefore are:

1. Is my understanding correct?
2. If yes, what would be the best way to retain the information of the k weak (mostly) non-significant features in the analysis?

Apologies for the somewhat lengthy question, and many thanks in advance.

camera limma gene set analysis • 1.3k views

score 0 · Answer 1 · 2015-01-28

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

Part of the justification for filtering non-expressed or very low expressed genes is that they are very unlikely to be DE, in the sense that they are unlikely to contain enough statistical information to achieve statistical significance. Filtered genes are therefore essentially not testable, much like probes that are not on the array at all.

For this reason, we treat non-expressed genes the same as probes not found on the array, by excluding non-expressed genes from the testing universe for camera and other gene set tests.
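As an illustration of the point above, here is a minimal sketch of filtering first and then mapping gene sets onto the filtered matrix only, so that the filtered genes drop out of the testing universe. The data are simulated, and the cutoff, the gene set names and the object names (`y.filt`, `gene.sets`) are assumptions made for this example, not part of the original answer.

```r
library(limma)

set.seed(1)

# Simulated log-expression matrix: 1000 genes x 6 samples (toy data)
y <- matrix(rnorm(1000 * 6), nrow = 1000,
            dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

# Two groups of three samples; the second coefficient is the group effect
design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))

# Hypothetical expression filter: keep genes above an arbitrary mean cutoff
keep <- rowMeans(y) > -0.2
y.filt <- y[keep, , drop = FALSE]

# Hypothetical gene sets defined on the unfiltered identifiers
gene.sets <- list(setA = paste0("gene", 1:50),
                  setB = paste0("gene", 51:150))

# Map the sets onto the rows of the FILTERED matrix only;
# identifiers that were filtered out simply do not appear in the index
index <- ids2indices(gene.sets, rownames(y.filt))

# camera tests each set against the remaining genes in y.filt
res <- camera(y.filt, index, design, contrast = 2)
res
```

The `NGenes` column of the result reports how many genes of each set were actually present in the filtered matrix and therefore tested.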

Dear Gordon, thank you so much for your kind and clear answer. I was asking because I recently had a gene set that, by design, was significantly associated with a data set (it consisted of genes found to be up-regulated in an independent, identical experiment under the same conditions). Although it ranked near the top of the camera output, it was ranked below another gene set in which fewer than 20% of the genes were mappable to the expression matrix and which, from a biological point of view, was quite unlikely to be associated with the phenotype. In such cases, would you recommend filtering out the gene sets below a certain fraction of mappable ids, or rather keeping all the gene sets and interpreting the results a posteriori?

Many thanks again for your answer and for your code!
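For reference, the fraction of mappable ids mentioned above can be computed per gene set with base R, whichever of the two approaches is eventually chosen. This is only a sketch: `universe`, `gene.sets` and the column names are hypothetical placeholders for the filtered identifiers and the gene set collection.

```r
# Hypothetical identifiers that survived filtering (the testing universe)
universe <- c("g1", "g2", "g3", "g4", "g5")

# Hypothetical gene sets, defined before any filtering
gene.sets <- list(
  mostly_kept     = c("g1", "g2", "g3", "g9"),
  mostly_filtered = c("g5", "g6", "g7", "g8", "g10")
)

# Remove duplicated ids so that each gene is counted once per set
gene.sets <- lapply(gene.sets, unique)

# Original size, number of ids found in the universe, and their ratio
coverage <- data.frame(
  SetSize = lengths(gene.sets),
  NMapped = sapply(gene.sets, function(ids) sum(ids %in% universe))
)
coverage$FractionMapped <- coverage$NMapped / coverage$SetSize
coverage
```

A table like this can be placed next to the gene set test results, either to apply a cutoff beforehand or to flag poorly covered sets when interpreting the ranking afterwards.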

ADD REPLY • link 9.3 years ago giovanni.dario • 0