Hi all, I have time-course data with time points at 0, 2, 4, 6, and 8 hours. The 0 hr time point has 6 control samples; every other time point has 6 control and 6 treated samples, for 54 samples in total. (The 6 replicates are 6 different donors.)
Here is what I used to compute DE at the different time points using contrasts:
    library(DESeq2)

    dds <- DESeqDataSetFromMatrix(countData = counts, colData = samples, design = ~ group + donor)

    # Pre-filter: keep genes with a normalized count >= 10 in at least 6 samples
    dds <- estimateSizeFactors(dds)
    nc <- counts(dds, normalized = TRUE)
    filter <- rowSums(nc >= 10) >= 6
    dds <- dds[filter, ]

    dds <- DESeq(dds)

    # Treated vs control at 8 hr
    res <- results(dds, contrast = c('group', 'T8', 'C8'), alpha = 0.05)
    summary(res)
I am pre-filtering only to get rid of genes with low counts; I am assuming that independent filtering will then select the genes with enough power. Note: one of my samples in the treated group happens to have far more reads than the other samples in the same group.
In my results, many genes that are reported as highly differentially expressed have counts that don't make sense to me (I would have assumed they would be filtered out). For example, a few of the genes reported as highly differentially expressed are given in this link: genes with read counts
Below are my concerns:
Should these genes not be filtered out? My guess is that they are reported as DE because they have a very high count in 1 sample of the treated group, which raises the average expression of that gene in the treated group. Is that right?
How should I filter my data so that genes like these are not reported as differentially expressed? If I were comparing only 2 groups at 1 particular time point (control 8 hr vs treated 8 hr), I would keep only genes for which at least 6 of my 12 samples have at least 10 read counts or 1 CPM (this would exclude the genes I saw in my analysis). But with a design that includes all time points, what should I do to exclude genes like the ones above?
For this time-course data, I am particularly interested in the differences between the two groups at each time point. Should I still compute DE using a single model (a design that includes all 54 samples together), or is it fine to compute DE separately using 4 different models, one per time point? What are the pros and cons of each approach?
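To make the two-group filter from my second question concrete, here is a minimal sketch in base R. The matrix `toy_counts` and the gene names are made up purely for illustration (they are not my real data), and the filter is applied to raw rather than normalized counts to keep the example self-contained.

```r
# Toy count matrix (made-up numbers, NOT real data): 3 genes x 12 samples,
# columns 1-6 = control 8 hr, columns 7-12 = treated 8 hr.
toy_counts <- rbind(
  geneA = c(25, 30, 18, 22, 27, 31, 40, 38, 45, 36, 41, 39),  # expressed in all samples
  geneB = c(0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 950),            # one extreme treated sample
  geneC = c(2, 3, 1, 0, 4, 2, 15, 18, 12, 20, 14, 16)         # expressed in treated only
)
colnames(toy_counts) <- c(paste0("C8_", 1:6), paste0("T8_", 1:6))

# Keep genes with at least `min_count` reads in at least `min_samples` samples
min_count <- 10
min_samples <- 6
keep <- rowSums(toy_counts >= min_count) >= min_samples

filtered <- toy_counts[keep, , drop = FALSE]
print(rownames(filtered))
```

With this rule, a gene like `geneB`, whose signal comes from a single sample, is dropped, while a gene expressed in all 6 samples of one group (`geneC`) is kept. What I am unsure about is how to choose the equivalent of `min_samples` when all 54 samples are in one design.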