Hi - is it possible to correlate gene expression patterns with a linear numerical factor in DESeq? Here's the background: I have used DESeq to analyse gene expression in 40 RNAseq samples; three sample groups.
I have noticed on the PCA that gender is a very strong determinant (=PC1) and one of the other components (PC3) also separates the samples relatively well, into 2 rough clusters, but this clustering pattern or sample grouping does not correlate with any of the known factors I have for the samples. I have an image which shows this pattern - not sure how to upload/attach it though; the image button wants a URL but this is a PNG file (not on the web).
I was thinking of either splitting the samples into two groups (low and high) based on an arbitrary cutoff roughly half way along the PC, or perhaps using the principle component factor value (from the pca$x table) to find genes that correlate their expression with this value across the samples, to find the genes responsible for this stratification in PC3.
This is kind of a long shot as my main factor(s) didn't show any significant changes, and I'm wondering whether this factor could show some biological significance in itself (depending on the types of genes changed) or by accounting for it in the design model matrix, and removing it's individual effects which could be muddying the waters for my main factors, e.g. ~gender+PC3_factor+condition.
As always, here's my disclaimer: I'm learning this as I go along, no formal training in maths or stats, just a biologist moving toward bioinformatics as a set of skills essential for my field. I hope I've asked the question in a clear and concise way and not broken too many rules. As I don't have a specific problem or error with a package there is no code or sessionInfo() to include (I don't think).
Cheers for any help or input!
Matt
here's a pointer for sva + DESeq2
http://master.bioconductor.org/help/workflows/rnaseqGene/#batch
thanks - i'll look into that. I wasn't using PCA for batch error correction, just to examine sample relationships. Just to clarify: I don't think this is a batch effect per se, as there were no batches: all samples were processed together. no lane effect as all samples were spread equally across 8 lanes of single flowcell. it's weird, and that's why I think it may correlate with a as-yet-unknown sample characteristic. Samples were collected years ago, so I'm not sure how easy it will be to obtain extra info about them (to find a factor that correlates with this PCA clustering pattern). I suppose you could classify it as Unwanted Variation (à la RUVSeq), so I will give that a shot. Thanks again! Matt
just in case you are interested in my outcome with this? (might be helpful for others looking to do same thing)...
Firstly, I tried the sva method, followed the code exactly as Michael's guide on the RNAseqGene page, and here is my head(results(dds),4): ranked on pvalue
not great padj values...
however, when I look at different contrasts, as I have three groups under 'condition'...
> head(resSva.Low.non.O)
...I get great padj! because certain contrasts give more significant results, I guess this is because the surrogate variables only affect some of the groups/contrasts and not all. which is interesting.
So thanks very much, Ryan and Michael: you've saved my day.
Cheers,
matt