Hi - is it possible to correlate gene expression patterns with a linear numerical factor in DESeq? Here's the background: I have used DESeq to analyse gene expression in 40 RNAseq samples; three sample groups.
I have noticed on the PCA that gender is a very strong determinant (=PC1) and one of the other components (PC3) also separates the samples relatively well, into 2 rough clusters, but this clustering pattern or sample grouping does not correlate with any of the known factors I have for the samples. I have an image which shows this pattern - not sure how to upload/attach it though; the image button wants a URL but this is a PNG file (not on the web).
I was thinking of either splitting the samples into two groups (low and high) based on an arbitrary cutoff roughly half way along the PC, or perhaps using the principle component factor value (from the pca$x table) to find genes that correlate their expression with this value across the samples, to find the genes responsible for this stratification in PC3.
This is kind of a long shot as my main factor(s) didn't show any significant changes, and I'm wondering whether this factor could show some biological significance in itself (depending on the types of genes changed) or by accounting for it in the design model matrix, and removing it's individual effects which could be muddying the waters for my main factors, e.g. ~gender+PC3_factor+condition.
As always, here's my disclaimer: I'm learning this as I go along, no formal training in maths or stats, just a biologist moving toward bioinformatics as a set of skills essential for my field. I hope I've asked the question in a clear and concise way and not broken too many rules. As I don't have a specific problem or error with a package there is no code or sessionInfo() to include (I don't think).
Cheers for any help or input!
Matt

here's a pointer for sva + DESeq2
http://master.bioconductor.org/help/workflows/rnaseqGene/#batch
thanks - i'll look into that. I wasn't using PCA for batch error correction, just to examine sample relationships. Just to clarify: I don't think this is a batch effect per se, as there were no batches: all samples were processed together. no lane effect as all samples were spread equally across 8 lanes of single flowcell. it's weird, and that's why I think it may correlate with a as-yet-unknown sample characteristic. Samples were collected years ago, so I'm not sure how easy it will be to obtain extra info about them (to find a factor that correlates with this PCA clustering pattern). I suppose you could classify it as Unwanted Variation (à la RUVSeq), so I will give that a shot. Thanks again! Matt
just in case you are interested in my outcome with this? (might be helpful for others looking to do same thing)...
Firstly, I tried the sva method, followed the code exactly as Michael's guide on the RNAseqGene page, and here is my head(results(dds),4): ranked on pvalue
> head(resSva.O, 4) log2 fold change (MAP): condition Non-ASD vs ASD Wald test p-value: condition Non-ASD vs ASD DataFrame with 4 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> TAS2R13 53.167524 -0.4453818 0.11195912 -3.978075 6.947541e-05 0.7090342 KANTR 149.987511 -0.3153299 0.08045521 -3.919323 8.879820e-05 0.7090342 POU4F1 1.210617 0.2682640 0.06850487 3.915984 9.003608e-05 0.7090342 SLPI 33.024806 0.4240727 0.11039906 3.841271 1.223988e-04 0.7229178not great padj values...
however, when I look at different contrasts, as I have three groups under 'condition'...
> head(resSva.Low.non.O)log2 fold change (MAP): condition Low Total Cast vs Non-ASD Wald test p-value: condition Low Total Cast vs Non-ASD DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> FCRL1 1098.90425 0.5110747 0.09597991 5.324809 1.010592e-07 0.001317307 MIR600HG 352.42715 0.4183308 0.08300525 5.039811 4.659913e-07 0.003037098 PTPRK 142.01688 0.4933695 0.10028901 4.919478 8.677551e-07 0.003462373 KIAA0125 407.44631 0.4316904 0.08846663 4.879698 1.062485e-06 0.003462373 TSTD3 22.97953 0.5605473 0.11698908 4.791450 1.655803e-06 0.004316678 AFF3 541.36563 0.4043529 0.08737233 4.627929 3.693415e-06 0.008023943 > head(resSva.ASD.Low.O) log2 fold change (MAP): condition ASD vs Low Total Cast Wald test p-value: condition ASD vs Low Total Cast DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> FOLR3 341.296934 0.6076173 0.10955879 5.546039 2.922144e-08 0.0004840532 COL19A1 420.415327 -0.4130494 0.08880545 -4.651172 3.300545e-06 0.0200067582 TPD52 309.831782 -0.3071897 0.06632051 -4.631896 3.623319e-06 0.0200067582 ANKRD20A11P 7.549347 0.4524846 0.10032811 4.510048 6.481296e-06 0.0245543233 AFF3 541.365633 -0.3407873 0.07604289 -4.481515 7.411507e-06 0.0245543233 FCRL1 1098.904245 -0.3645985 0.08498879 -4.289960 1.787052e-05 0.0451427384...I get great padj! because certain contrasts give more significant results, I guess this is because the surrogate variables only affect some of the groups/contrasts and not all. which is interesting.
So thanks very much, Ryan and Michael: you've saved my day.
Cheers,
matt