Hi,
I recently performed RNASeq using two different protocols on 2 strains in triplicates. I anticipate similar results using the two protocols and wanted to compare the DE genes between two protocols, when comparing strain2 to strain1. I used DESeq2 to analyze differential expression, and for protocol-1, strain-2 versus strain-1 gave 53 DE genes and for protocol-2, strain-2 versus strain-1 gave 46 DE genes. 20 genes are DE in protocol-1 only, 33 DE in both, 13 genes DE in protocol-2 only. For the genes DE in only one protocol (see table below), i looked at the fold-changes (FC), and although fold-changes are very similar, adjusted p-values (bonferroni) are significant for one protocol and not in the other (protocol). I wonder what contributes to higher adj-pvalues and NA values for following genes in protocol-2 ?
FC(protocol-1) | FC(protocol-2) | adj-p(protocol-1) | adj-p(protocol-2) | |
gene1 | 1.8 | 1.7 | 0.000135209 | 0.381779936 |
gene2 | 1.6 | 1.5 | 2.30819E-05 | 0.20413359 |
gene3 | 1.6 | 1.7 | 0.000446964 | 0.058520602 |
gene4 | 1.6 | 1.2 | 0.000199375 | 1 |
gene5 | 1.5 | 1.4 | 0.004449978 | 0.962902825 |
gene6 | -1.8 | -1.5 | 0.009131046 | 0.200042882 |
gene7 | -1.8 | -1.7 | 2.38294E-08 | 0.279792112 |
gene8 | -1.5 | -1.4 | 0.000163779 | 0.172985361 |
gene9 | 1.7 | 1.2 | 0.000674295 | NA |
gene10 | 1.7 | 1.5 | 0.028869571 | 0.202590227 |
gene11 | 1.5 | 1.2 | 1.01927E-19 | 1 |
gene12 | -1.9 | -1.4 | 1.41247E-10 | 1 |
gene13 | 1.7 | -1.1 | 0.011046489 | NA |
gene14 | 40.0 | 4.7 | 0 | NA |
gene15 | 43.5 | 3.3 | 0 | NA |
gene16 | 6.8 | 7.9 | 6.66528E-63 | NA |
gene17 | 7.2 | 4.6 | 1.4105E-79 | NA |
gene18 | -1.5 | -1.3 | 0.001077303 | 1 |
gene19 | -1.6 | -1.2 | 0.006786376 | NA |
gene20 | -1.5 | -1.3 | 5.78916E-17 | 0.003348031 |
Priyanka
Hi Micheal,
I looked at the PCA plots for 5 strains using the two protocols and they look fairly overlapping to me (PCA plots and other files in the shared folder link below). I also looked at the normalized counts, fold change,lfcSE and it appears that for some genes a slightly higher lfcSE in one protocol lowers wald's-stat and hence the significance (e.g. gene 1 in the attached spreadsheet). Is that right ? While for other genes, e.g. genes 14, 15, 16, the fold changes seem over-shrunk and high cooks-distance for protocol-2, explaining the NA's in adjusted-pvalues. I wonder what is the reason behind such shrinkage for those genes and high cooks distance.
For files and plots click: https://drive.google.com/drive/folders/0B3ZNauEYziEHMW90LU1QU3ZWWmM?usp=sharing
Priyanka
The easiest way for me to follow is if you could show the plotCounts() plot for the specific genes which you suspect something is different across protocol but it should not be.
Below are the plotCounts for gene 1 and gene 14 using the two protocols. Both genes are differentially expressed (fold change 1.5 or more and adj-pval<0.05) between strain2 (S2) versus strain 1(S1) in protocol 1 but not in protocol2.
Thank you, this is helpful in seeing what might be going on. Can you re-run on both protocols with DESeq(dds, betaPrior=FALSE) and results(dds, cooksCutoff=FALSE)? The first suggestion, betaPrior=FALSE for the inference, is going to be default in the next version of DESeq2 anyway, with LFC shrinkage split into a second step for visualization. Gene 1 might still not pass, because the observed difference is within the range of the within-group dispersion and the number of replicates is only 3 (giving a small p-value but not small enough for multiple test correction), but it should make the results more concordant for the other genes.
After re-running with DESeq(dds, betaPrior=FALSE) and results(dds, cooksCutoff=FALSE), I looked at the DE genes and this time both protocol called more DE genes, genes for protocol2 with adjpavlue as NA (e.g. gene 14) were now differentially expressed, but gene1, as you thought, did not pass the threshold. Overall, the overlap between the two protocols (common DE genes) for the strains I looked at increased from 50 to 55%.
That makes more sense. You can expect variation in the DE sets which reflects the natural variation of performing experiments (and only having 3 vs 3 samples per protocol).
Thanks Micheal. I made a PCA plot of expression data from 5 strains done using both protocols and although each protocol retains the relationship between strains, expression data generated using two protocol appears distinct (PC1). I wonder what is contributing to this ? The spread of the counts, using boxplots before and after normalization look similar.
I don't know what the source of the difference is. It could be a shift in counts for enough genes that it contributes such a percent of variance when performing PCA on the top 500 genes by variance (see ?plotPCA).