I was trying to get some data from published RNA-Sequencing data, but to my surprise, the data analyses often seemed awfully incorrect to me. I would take the following paper published in Cell Stem Cell for example:
- Sun D, Luo M, Jeong M, Rodriguez B et al. Epigenomic profiling of young and aged HSCs reveals concerted changes during aging that reinforce self-renewal. Cell Stem Cell 2014 May 1;14(5):673-88. PMID: 24792119
The associated RNA-Seq data can be found here.
They sequenced HSCs from two 4-month old mice and two 24-month old mice (each mouse done in technical duplicates), and tried to identify differentially expressed genes (DEGs). For starters, technical replicates are not independent replicates and thus they only had 2 replicates for each age group. To my experience, such low number of replicates could hardly generate any significant results, especially after controlling for the huge number of genes analyzed (i.e., get adjusted p-values or FDR).
However, they did claim that they identified many DEGs. So, I looked into their data. Take one gene for example, they claim a very low FDR of 1.83E-09, but even if I counted their technical replicates as true replicates, I still could only get a p-value of 0.004 or 0.0002 by t-test, depending on whether I normalize each data to the mean of all genes. I believe FDR should not be smaller than p-values.
|geneSymbol||m04_hsc_l1.fpkm||m04_hsc_l2.fpkm||m04_hsc_l3.fpkm||m04_hsc_l4.fpkm||m24_hsc_l1.fpkm||m24_hsc_l2.fpkm||m24_hsc_l3.fpkm||m24_hsc_l4.fpkm||log2FC||FDR||ttest pvalues1||ttest pvalues2|
I am not sure how they get their results, but they claim to have used DESeq for data analyses in their method description. I am not familiar with DESeq. Can someone familiar with it help to double-check their results and explain to me how they could get such significant results? Why do we get conflicting results?
I also saw some unlikely significant results (p values < 1e-20) for many genes in an shRNA library used in just triplicates after a collaborator of our lab analyzing the data by DESeq. So, I am wondering if it is true that DESeq has become a black-box tool that could easily trick people who do not understand exactly what is going on inside?