Hi everyone,
I have a count matrix of FPKM values and I want to estimate differentially expressed genes between two conditions. First I used DESeq2, but I realized that this is not good for FPKM values. I then transformed the counts using voom() in the limma package and then used:
fit <- lmFit(myVoomData,design) fit <- eBayes(fit) options(digits=3) writefile = topTable(fit,n=Inf,sort="none", p.value=0.01) write.csv(writefile, file="file.csv")
My problem is that all of the 6156 genes are differentially expressed (p-value 0.01). Only a few hundred were differentially expressed using DESe2, but I guess that can't be trusted.
I am new to this type of analysis, and to R, but is it ok to simply transform the data by voom()? Can I use the transformed data in DESeq2? Any other ways I can use FPKM counts to estimate differentially expressed genes?
Thank you,
Jon Bråte
> sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] C attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.18.3 cummeRbund_2.4.0 Gviz_1.6.0 rtracklayer_1.22.0 GenomicRanges_1.14.3 XVector_0.2.0 [7] IRanges_1.20.6 fastcluster_1.1.11 reshape2_1.2.2 ggplot2_0.9.3.1 RSQLite_0.11.4 DBI_0.2-7 [13] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.24.0 BSgenome_1.30.0 Biobase_2.22.0 Biostrings_2.30.1 Formula_1.1-1 [6] GenomicFeatures_1.14.2 Hmisc_3.13-0 MASS_7.3-29 RColorBrewer_1.0-5 RCurl_1.95-4.1 [11] Rsamtools_1.14.2 XML_3.95-0.2 biomaRt_2.18.0 biovizBase_1.10.4 bitops_1.0-6 [16] cluster_1.14.4 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.3 gtable_0.1.2 [21] labeling_0.2 lattice_0.20-24 latticeExtra_0.6-26 munsell_0.4.2 plyr_1.8 [26] proto_0.3-10 scales_0.2.3 splines_3.0.2 stats4_3.0.2 stringr_0.6.2 [31] survival_2.37-4 tools_3.0.2 zlibbioc_1.8.0
----------------------------------------------------------------
Jon Bråte
Microbial Evolution Research Group (MERG)
Department of Biosciences
University of Oslo
P.B. 1066 Blindern
N-0316, Norway
Email: jon.brate@ibv.uio.no
Phone: 922 44 582
Web: mn.uio.no/ibv/english/people/aca/jonbra/index.html
Hello Gordon:
Are there any results that describe what happens when one switches from CPM to RPKM, TMM etc, when using limma.
Regards,
Nik
Hey Gordon,
I found your post very informative- enough so that I need to resurrect this thread :P.
Similar to the original post, I'm working solely with a matrix of FPKM values, and no access to raw counts ... The main issue I'm running into is how to deal with FPKM values that are equal to zero. When I try running the raw values through limma following your guidelines of using eBayes(fit, trend=TRUE), I get the warning: "Warning message: Partial NA coefficients for 586 probe(s)". The resultant list of significant genes is relatively small, with only 32 when using BH correction (compared to >223 from the original DE analysis)
A workaround suggested on several threads is to add a small constant to all FPKM values in the matrix. However, the suggestions I've seen seem rather trite, and I haven't found any posts where the ramifications of constant selection are explained. For example, I tried using a published suggestion of calculating the constant for FPKM as the 1st percentile of all non-zero values (0.003 in our dataset); therefore logFPKM = log2(FPKM + 0.003). However, this gave me 7 significant genes.
After testing 10 different constants, ranging from the most infinitesimal value in R (around 1e-320) to 0.1, I get a wide range of results (from 0 to 1300 genes being significant, in a seemingly random fashion). Half of the time, an intermittent warning pops up that says: "Zero sample variances detected, have been offset away from zero".
Can anyone please explain to me where I'm going wrong and whether there's a better solution to running FPKM values through limma?
Thanks again for this helpful post and for your great work with DE analysis for RNA-seq data!
What you're doing wrong is trying to do a DE analysis using FPKMs in the first place. There is no good answer to what the FPKM offset should be to avoid taking the log of zero, which is one of the problems with using FPKMs. I would personally use a larger offset than any you have used, certainly not less than 0.1, but then I wouldn't do it at all.
The way to choose a tuning parameter like this is to plot the data to see that it looks vaguely reasonable rather than to count the number of DE genes.
I assume you have filtered the data so that you are not trying to analyse genes for which all, or almost all, the FPKM are zero. Having lots of FPKM=0 would lead to the warnings about zero variances.
I also suggest you used
eBayes
withrobust=TRUE
as well astrend=TRUE
. That will reduce the sensitivity to zero variances.Greatly appreciated, Gordon!!!
Given all the complications of trying to use FPKMs, I'm perplexed why our biostats core solely provided us FPKMs along with our original DE analysis (performed with edgeR using the raw data)... FPKMs seem to be a dead end, aside from serving as a rough qualitative "visual" to accompany the stats :/
We often give FPKMs to our biological collaborators when we analyse their data, but the FPKM are only for plotting purposes, one can't reproduce the DE analysis from them. We don't give our collaborators the raw data because we do the analyses ourselves. It may be that your biostats core think they are playing the same role -- they think they are doing all the analysis for you, rather than providing you the data to perform the analysis yourself.
Gordon - thanks for the helpful answer here. Is there another non-DE approach that can be used to calculate statistical differences from FPKM values (we've misplaced the original fastq files, which were generated in a pre-GEO era)? Much appreciated.