Hi all / Michael Love,
Is there a way in the DESeq2 package to help with this kind of problem? For example, I can think of tens of variables that may affect gene expression (genotypes of a couple of genes, gender, age, PMI, RIN, RIN^2, mapping rate, batch, ...). Obviously, I should include only a limited number of them.
How should I choose these variables? And how does the number of samples limit my selection if I want robust estimates?
In the literature, I have seen people use PC1 as the response variable and then apply an ANOVA model to test the contribution of each candidate variable to PC1. This is reasonable, but it has obvious limitations. In particular, when PC1 is dominated by a single factor (such as batch), the later PCs (PC2, PC3, etc.) may also be needed to identify other factors.
Any suggestions? (I know this is not purely a DESeq2 question, but I guess Michael would have some ideas about this :-) )
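To make the PC-versus-covariate idea concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an illustrative assumption rather than anything from DESeq2: the toy matrix `expr` stands in for already transformed expression values (genes in rows, samples in columns), the helper names `pc_scores`, `r2_numeric` and `r2_categorical` are hypothetical, and power iteration is used only as a simple substitute for a proper SVD. It reports, for each PC, the fraction of variance (R^2) explained by each covariate, using a one-way ANOVA decomposition for categorical covariates and simple linear regression for numeric ones.

```python
import math
import random

def pc_scores(expr, n_pcs=2, n_iter=500, seed=0):
    """Per-sample scores on the first n_pcs principal components.

    expr: list of genes, each a list of transformed values, one per sample.
    Uses power iteration with deflation on the sample-by-sample Gram matrix.
    """
    centered = []
    for gene in expr:
        mean = sum(gene) / len(gene)
        centered.append([v - mean for v in gene])  # center each gene across samples
    n = len(centered[0])
    gram = [[sum(g[i] * g[j] for g in centered) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pcs):
        vec = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(n_iter):
            new = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in new))
            vec = [x / norm for x in new]
        eigval = sum(vec[i] * gram[i][j] * vec[j] for i in range(n) for j in range(n))
        scores.append([math.sqrt(max(eigval, 0.0)) * x for x in vec])
        # Deflate: remove this component before extracting the next one.
        gram = [[gram[i][j] - eigval * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return scores

def r2_numeric(y, x):
    """R^2 from simple linear regression of y on a numeric covariate x."""
    my, mx = sum(y) / len(y), sum(x) / len(x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def r2_categorical(y, groups):
    """R^2 from one-way ANOVA: between-group sum of squares over total."""
    my = sum(y) / len(y)
    total = sum((b - my) ** 2 for b in y)
    between = 0.0
    for level in set(groups):
        vals = [b for b, g in zip(y, groups) if g == level]
        between += len(vals) * (sum(vals) / len(vals) - my) ** 2
    return between / total

# Made-up toy data: 4 genes x 6 samples, with a strong batch effect in 3 genes.
expr = [
    [5.0, 5.2, 4.9, 9.1, 8.8, 9.0],
    [3.1, 2.9, 3.0, 6.9, 7.2, 7.0],
    [8.0, 8.1, 7.9, 4.1, 3.9, 4.0],
    [2.0, 2.3, 1.8, 2.1, 1.9, 2.2],
]
covariates = {
    "batch": ["A", "A", "A", "B", "B", "B"],
    "RIN": [7.1, 8.0, 7.5, 7.9, 7.2, 8.3],
}

for k, scores in enumerate(pc_scores(expr, n_pcs=2), start=1):
    for name, values in covariates.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        r2 = r2_numeric(scores, values) if numeric else r2_categorical(scores, values)
        print(f"PC{k} ~ {name}: R^2 = {r2:.3f}")
```

In this toy data batch is built to dominate PC1, so other covariates can only show up on later PCs, which is the limitation described above. With real data one would normally also attach a significance test and correct for the number of PC-by-covariate comparisons.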
Thanks in advance,
Raymond
Hi Michael,
Thank you so much for your answer to the question. I have a similar question about QC metrics and whether to include them in our design matrix. We performed a PCA association analysis of the normalized counts against all the variables we have. We found that average insert size, intronic rate, exonic rate, intergenic rate and duplication rate of mapped reads are all significantly associated with the variation captured by PC1. We are thinking about including one of these metrics (exonic rate) in our design matrix for the differential gene expression analysis, as a control for technical variability from sequencing. Would this be valid, or are methods like SVA, which you mentioned above, preferred?
Thanks again and I really appreciate it!
I've answered this elsewhere on the support site. I prefer SVA or RUV, since the factors they estimate capture this kind of technical variation and are orthogonal.