I am currently using DESeq2 in conjunction with the SVA package in an effort to remove latent factors from RNA-Seq data. I have questions about the best way to process our dataset. We have samples from plant tissue grown in field (uncontrolled) conditions in two separate growing seasons and, as such, we expect to see signals from many environmental factors that are unaccounted for. There are 4 genotypes sampled at 5 timepoints in each of 2 years and 2 biological replicates (80 samples total). The timepoints in each year correspond to each other as they were sampled at the same intervals beginning at the same time after seeding. Currently I am using the following to test for differential expression:
dds = DESeqDataSetFromHTSeqCount(sampleTable,directory = htseqDirectory,design = ~ genotype + time + genotype:time)
Where genotype is a factor with 4 levels and time is a factor with 5 levels and all 80 samples are included in the analysis. Then SVA is applied to reveal 10 surrogate variables:
mod = model.matrix(~ time + genotype + genotype:time,data=colData(dds))
mod0 = model.matrix(~1,data=colData(dds))
dds = estimateSizeFactors(dds)
countDat = counts(dds,normalized=TRUE)
countDat = countDat[rowMeans(countDat) > 1,]
svseq = svaseq(countDat,mod,mod0)
ddssva = dds
After checking the surrogate variables for correlation with soil temperature the first surrogate variable was discovered to have high correlation. Because we are interested to find DE genes with respect to genotype and temperature, the following changes were made to the design:
design(ddssva) = ~ SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 +SV9 + SV10 + genotype + time + temperature + genotype:time + genotype:temperature
where "temperature" is the surrogate variable with high correlation with soil temperature. To find DE genes of interest we perform the likelihood ratio test with a reduced design formula:
ddssva = DESeq( ddssva, test = "LRT", reduced = ~ SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + SV10 + time + temperature )
The aim of the likelihood ratio test is to test for any differential expression due to genotype differences.
The resulting expression data after filtering for sig. diff. expression are used downstream in a correlation analysis and the results are in line with some biological expectations. What are any potential hazards/caveats in the approach I've just outlined? Thank you.
The approach looks fine to me overall, but I think you might want to detail the question(s) you are interested in a little bit.
With you current design and test, you find genes that either:
1.) change overall because of genotype differences
2.) change over time because of genotype differences
3.) change because of the interaction of soil temperature and genotypes
So basically your resulting list might contain potentially very different kinds of genes, their common
denominator being that they are affected by genotype changes in some way. So you will capture any change related to genotype.
If this is what you want, it's fine but I still would make more explicit what you are trying to test.