I am currently using DESeq2 in conjunction with the SVA package in an effort to remove latent factors from RNA-Seq data. I have questions about the best way to process our dataset. We have samples from plant tissue grown in field (uncontrolled) conditions in two separate growing seasons and, as such, we expect to see signals from many environmental factors that are unaccounted for. There are 4 genotypes sampled at 5 timepoints in each of 2 years and 2 biological replicates (80 samples total). The timepoints in each year correspond to each other as they were sampled at the same intervals beginning at the same time after seeding. Currently I am using the following to test for differential expression:
dds = DESeqDataSetFromHTSeqCount(sampleTable,directory = htseqDirectory,design = ~ genotype + time + genotype:time)
Where genotype is a factor with 4 levels and time is a factor with 5 levels and all 80 samples are included in the analysis. Then SVA is applied to reveal 10 surrogate variables:
mod = model.matrix(~ time + genotype + genotype:time,data=colData(dds)) mod0 = model.matrix(~1,data=colData(dds)) dds = estimateSizeFactors(dds) countDat = counts(dds,normalized=TRUE) countDat = countDat[rowMeans(countDat) > 1,] svseq = svaseq(countDat,mod,mod0)ddssva = dds
After checking the surrogate variables for correlation with soil temperature the first surrogate variable was discovered to have high correlation. Because we are interested to find DE genes with respect to genotype and temperature, the following changes were made to the design:
design(ddssva) = ~ SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 +SV9 + SV10 + genotype + time + temperature + genotype:time + genotype:temperature
where "temperature" is the surrogate variable with high correlation with soil temperature. To find DE genes of interest we perform the likelihood ratio test with a reduced design formula:
ddssva = DESeq( ddssva, test = "LRT", reduced = ~ SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + SV10 + time + temperature )
The aim of the likelihood ratio test is to test for any differential expression due to genotype differences.
The resulting expression data after filtering for sig. diff. expression are used downstream in a correlation analysis and the results are in line with some biological expectations. What are any potential hazards/caveats in the approach I've just outlined? Thank you.