Hi,
I am analyzing time-course RNA-seq data.
My design looks like this:
ID | Genotype | SA.treatment |
1 | jc66 | 0h |
2 | jc66 | 0h |
3 | jc66 | 0h |
4 | dsr | 0h |
5 | dsr | 0h |
6 | dsr | 0h |
7 | comp | 0h |
8 | comp | 0h |
9 | comp | 0h |
10 | jc66 | 0.5h |
11 | jc66 | 0.5h |
12 | jc66 | 0.5h |
13 | dsr | 0.5h |
14 | dsr | 0.5h |
15 | dsr | 0.5h |
16 | comp | 0.5h |
17 | comp | 0.5h |
18 | comp | 0.5h |
20 | jc66 | 3h |
21 | jc66 | 3h |
22 | dsr | 3h |
23 | dsr | 3h |
24 | dsr | 3h |
25 | comp | 3h |
26 | comp | 3h |
27 | comp | 3h |
28 | jc66 | 6h |
29 | jc66 | 6h |
30 | jc66 | 6h |
31 | dsr | 6h |
32 | dsr | 6h |
33 | dsr | 6h |
34 | comp | 6h |
35 | comp | 6h |
36 | comp | 6h |
To sum up: 3 Genotypes, 4 time points.
I mapped the reads with the STAR aligner and extracted the raw counts from the BAM files with GenomicAlignments.
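For reference, the count extraction step with GenomicAlignments typically looks like the sketch below (following the rnaseqGene workflow; `bam.files` and `exons.by.gene` are placeholders for a vector of BAM paths and a GRangesList of exons grouped by gene, e.g. from `exonsBy(txdb, by = "gene")`):

```r
library(GenomicAlignments)
library(Rsamtools)

# bam.files: character vector of BAM paths; exons.by.gene: GRangesList of
# exons per gene -- both are assumptions for this sketch
bamfiles <- BamFileList(bam.files, yieldSize = 2e6)
se <- summarizeOverlaps(features = exons.by.gene,
                        reads = bamfiles,
                        mode = "Union",        # count reads hitting exactly one gene
                        singleEnd = FALSE,     # paired-end data
                        ignore.strand = TRUE,  # unstranded library prep
                        fragments = TRUE)      # also count reads with unmapped mates
# assay(se) then holds the raw count matrix passed to DESeqDataSet(se, ...)
```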
I ran DESeq2 on it with mixed success. My design formula looks like this:
ddstc <- DESeqDataSet(se, design = ~ Genotype + SA.treatment + Genotype:SA.treatment)
ddstc <- DESeq(ddstc, test = "LRT", reduced = ~ Genotype + SA.treatment)
If I look at the p-value histograms of the individual comparisons, I can see that things are not going all right; here is an example p-value histogram:
(I switched the filters off, because there is a heavy expression induction upon treatment and the filters remove all of these genes.)
res <- results(ddstc, name = "Genotype_dsr_vs_jc66", test = "Wald",
               independentFiltering = FALSE, cooksCutoff = FALSE)
hist(res$pvalue, col = "lavender", main = n, xlab = "p-values")
I found that this could be caused by the Wald test, and that I can run a false discovery rate correction, as explained here:
and it indeed improves my p-value histogram, and I can add a corrected adjusted p-value (not perfect, but already a lot better):
library(fdrtool)
res <- res[!is.na(res$padj), ]
res <- res[!is.na(res$pvalue), ]
res <- res[, -which(names(res) == "padj")]   # drop the old padj column
FDR.res <- fdrtool(res$stat, statistic = "normal", plot = TRUE)
FDR.res$param[1, "sd"]   # estimated null standard deviation
res[, "padj"] <- p.adjust(FDR.res$pval, method = "BH")
hist(FDR.res$pval, col = "royalblue4", main = paste0(n, " - corrected"),
     xlab = "CORRECTED p-values")
This is all fine and good, but can I apply the same technique for the whole time course experiment?
I took the time-course section from the rnaseqGene RNA-seq workflow:
https://www.bioconductor.org/help/workflows/rnaseqGene/#time-course-experiments
library(magrittr)   # for %>%
library(pheatmap)
resTC <- results(ddstc)
betas <- coef(ddstc)
colnames(betas)
topGenes <- head(order(resTC$padj), 40)
mat <- na.omit(betas[topGenes, -1])   # drop the intercept; na.omit after subsetting keeps rows aligned with resTC
thr <- 3
mat[mat < -thr] <- -thr   # clip coefficients at +/- thr for plotting
mat[mat > thr] <- thr
pheatmap(mat, breaks = seq(from = -thr, to = thr, length = 101),
         cluster_col = FALSE)
If I understand it correctly, it takes the results of the last resultsName and identifies the most significant genes by ordering them by their adjusted p-value (which is the same in all results, as it is globally Benjamini-Hochberg corrected), then extracts the coefficients for these genes and plots them as a heatmap.
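As a side note, the Benjamini-Hochberg adjustment is a monotone transform of the ranked p-values, so ordering genes by padj gives the same ranking as ordering by the raw p-value; a small base-R illustration:

```r
# BH: the adjusted value for the i-th smallest p-value is
# min over j >= i of p_(j) * m / j, capped at 1
p <- c(0.001, 0.010, 0.020, 0.800, 0.900)
padj <- p.adjust(p, method = "BH")
padj          # c(0.005, 0.025, 0.0333..., 0.9, 0.9)
order(padj)   # same gene ranking as order(p)
```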
But can I run fdrtool on this result, and will it correct the p-values for all result contrasts? My assumption is that fdrtool will correct the p-values only for the last called result (resTC) and adjust them based on that alone, without taking any other contrast into account. That would in turn mean that I get different adjusted p-values in different contrasts, which in turn means that the top genes I want to extract are not the real top genes.
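To check which columns actually differ between contrasts on the LRT fit, one can compare two results tables directly (the coefficient names below are assumed to match the output of resultsNames(ddstc); levels beginning with a digit may be renamed, so check first):

```r
# compare two coefficients from the same LRT fit: does the pvalue column
# differ between them, or only the log2FoldChange column?
res.a <- results(ddstc, name = "Genotypedsr.SA.treatment6h")
res.b <- results(ddstc, name = "Genotypedsr.SA.treatment3h")
all.equal(res.a$pvalue, res.b$pvalue)
all.equal(res.a$log2FoldChange, res.b$log2FoldChange)
```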
So, can I do an FDR correction on the whole experiment? Does it make sense to do it?
And I have a second question, about the interaction terms:
If I understand correctly the results of our example here:
results(ddsTC , name='Genotypedsr.SA.treatment6h')
this shows the results for Genotype dsr at 6h compared to Genotype dsr at 0h, minus the corresponding effect in the control genotype at the same time points?
Is there a way to call the results without reducing it by the control levels ?
at least that's my interpretation of:
# the condition effect for genotype II
# this is, by definition, the main effect *plus* the interaction term
# (the extra condition effect in genotype II compared to genotype I)
results(dds, list(c("condition_B_vs_A", "genotypeII.conditionB")))
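Applied to the design above (assuming jc66 and 0h are the reference levels, so that resultsNames(ddstc) contains a main-effect coefficient like "SA.treatment_6h_vs_0h"; check the exact names on your own object), the analogous call for the full 6h vs 0h effect within genotype dsr, not reduced by the control genotype, would be something like:

```r
# main time effect (in the reference genotype jc66) plus the dsr-specific
# interaction term = the full 6h vs 0h effect within genotype dsr;
# test = "Wald" requests Wald p-values on an object fit with test = "LRT"
res.dsr.6h <- results(ddstc,
                      contrast = list(c("SA.treatment_6h_vs_0h",
                                        "Genotypedsr.SA.treatment6h")),
                      test = "Wald")
```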
Cheers
Jakob
Hi Jakob,
Can you confirm that, in the original p-value histogram, the spike at p=1 is not simply genes with very low counts?
Can you try using the filter from the beginning, i.e. before running DESeq():
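E.g. the minimal pre-filter from the DESeq2 vignette (the exact threshold is up to you):

```r
# drop rows with almost no reads before fitting, rather than relying on
# independent filtering at the results() stage
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
dds <- DESeq(dds, test = "LRT", reduced = ~ Genotype + SA.treatment)
```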
You don't show the pvalue distribution for the original LRT. Can you please show this? E.g.
hist(results(dds)$pvalue, ...)
Hi Michael,
the histogram of the original LRT looks like this:
part of the spike at p=1 will be from very low count genes.
Do you mean that I should run the filter before using results()? (I assume so, because I cannot get normalized counts before running DESeq().)
I tried your filter suggestion with some adjustments:
It reduces the spike at p=1, because it removes all genes that would have had an adjusted p-value of NA due to filtering.
The problematic part is that, as a consequence, fdrtool has nothing left to correct,
and that it again removes induced genes: genes you normally cannot detect, but which show up upon treatment.
I wrote a custom filter that does not remove high-variance genes but still filters on row sums:
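The idea, as a sketch (illustrative, with an arbitrary threshold; not my exact code):

```r
# keep genes with a minimum total raw count across all samples; unlike
# independent filtering on the mean, this retains genes that are silent at 0h
# but strongly induced at later time points (threshold of 10 is arbitrary)
keep <- rowSums(counts(ddstc)) >= 10
ddstc.filt <- ddstc[keep, ]
```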
and FDR works again
But I don't know if that's a reasonable approach or not.
And back to my original question: is it statistically sound to run fdrtool on the LRT?
And question two: can I extract fold changes without having them reduced by the main effect?
Cheers
Jakob