Question: Differential expression of FPKM from RNA-seq data using limma and voom()
2
5.7 years ago by
Jon Bråte160
Norway
Jon Bråte160 wrote:

Hi everyone,

I have a count matrix of FPKM values and I want to estimate differentially expressed genes between two conditions. First I used DESeq2, but I realized that this is not good for FPKM values. I then transformed the counts using voom() in the limma package and then used:

fit <- lmFit(myVoomData,design)
fit <- eBayes(fit)
options(digits=3)
writefile = topTable(fit,n=Inf,sort="none", p.value=0.01)
write.csv(writefile, file="file.csv")

My problem is that all of the 6156 genes are differentially expressed (p-value 0.01). Only a few hundred were differentially expressed using DESe2, but I guess that can't be trusted.

I am new to this type of analysis, and to R, but is it ok to simply transform the data by voom()? Can I use the transformed data in DESeq2? Any other ways I can use FPKM counts to estimate differentially expressed genes?

Thank you,

Jon Bråte

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils  datasets  methods   base

other attached packages:
[1] limma_3.18.3         cummeRbund_2.4.0     Gviz_1.6.0 rtracklayer_1.22.0   GenomicRanges_1.14.3 XVector_0.2.0
[7] IRanges_1.20.6       fastcluster_1.1.11   reshape2_1.2.2 ggplot2_0.9.3.1      RSQLite_0.11.4       DBI_0.2-7
[13] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.24.0   BSgenome_1.30.0        Biobase_2.22.0 Biostrings_2.30.1      Formula_1.1-1
[6] GenomicFeatures_1.14.2 Hmisc_3.13-0           MASS_7.3-29 RColorBrewer_1.0-5        RCurl_1.95-4.1
[11] Rsamtools_1.14.2       XML_3.95-0.2           biomaRt_2.18.0 biovizBase_1.10.4      bitops_1.0-6
[16] cluster_1.14.4         colorspace_1.2-4       dichromat_2.0-0 digest_0.6.3          gtable_0.1.2
[21] labeling_0.2           lattice_0.20-24        latticeExtra_0.6-26 munsell_0.4.2     plyr_1.8
[26] proto_0.3-10           scales_0.2.3           splines_3.0.2 stats4_3.0.2            stringr_0.6.2
[31] survival_2.37-4        tools_3.0.2            zlibbioc_1.8.0 

----------------------------------------------------------------
Jon Bråte

Microbial Evolution Research Group (MERG)
Department of Biosciences
University of Oslo
P.B. 1066 Blindern
N-0316, Norway
Email: jon.brate@ibv.uio.no
Phone: 922 44 582
Web: mn.uio.no/ibv/english/people/aca/jonbra/index.html

limma • 13k views
modified 3.1 years ago by Gordon Smyth38k • written 5.7 years ago by Jon Bråte160
Answer: Differential expression of RNA-seq data using limma and voom()
7
5.7 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

Dear Jon,

No, it is absolutely not ok to input FPKM values to voom. The results will be nonsense. Same goes for edgeR and DESeq, for the reasons explained in the documentation for those packages.

Note that a matrix of FPKM is not a matrix of counts.

It seems to me that you have three options in decreasing order of desirability:

1. Get the actual integer counts from which the FPKM were computed and do a proper analysis, for example using voom or edgeR.
2. Get the gene lengths and library sizes used to compute the FPKM and convert the FPKM back to counts.
3. If FPKM is really all you have, then convert the values to a log2 scale (y = log2(FPKM+0.1) say) and do an ordinary limma analysis as you would for microarray data, using eBayes() with trend=TRUE. Do not use voom, do not use edgeR, do not use DESeq. (Do not pass go and do not collect \$200.) This isn't 100% ideal, but is probably the best analysis available.

The third option is similar to the "limma-trend" analysis described in the limma preprint, except that it is applied to the logFPKM instead of logCPM. Statistically this will not perform as well as it would applied to the logCPM.

Best wishes
Gordon

Hello Gordon:

Are there any results that describe what happens when one switches from CPM to RPKM, TMM etc, when using limma.

Regards,

Nik

Hey Gordon,

I found your post very informative- enough so that I need to resurrect this thread :P.

Similar to the original post, I'm working solely with a matrix of FPKM values, and no access to raw counts ... The main issue I'm running into is how to deal with FPKM values that are equal to zero. When I try running the raw values through limma following your guidelines of using eBayes(fit, trend=TRUE), I get the warning: "Warning message: Partial NA coefficients for 586 probe(s)". The resultant list of significant genes is relatively small, with only 32 when using BH correction (compared to >223 from the original DE analysis)

A workaround suggested on several threads is to add a small constant to all FPKM values in the matrix. However, the suggestions I've seen seem rather trite, and I haven't found any posts where the ramifications of constant selection are explained. For example, I tried using a published suggestion of calculating the constant for FPKM as the 1st percentile of all non-zero values (0.003 in our dataset); therefore logFPKM = log2(FPKM + 0.003). However, this gave me 7 significant genes.

After testing 10 different constants, ranging from the most infinitesimal value in R (around 1e-320) to 0.1, I get a wide range of results (from 0 to 1300 genes being significant, in a seemingly random fashion). Half of the time, an intermittent warning pops up that says: "Zero sample variances detected, have been offset away from zero".

Can anyone please explain to me where I'm going wrong and whether there's a better solution to running FPKM values through limma?

Thanks again for this helpful post and for your great work with DE analysis for RNA-seq data!

What you're doing wrong is trying to do a DE analysis using FPKMs in the first place. There is no good answer to what the FPKM offset should be to avoid taking the log of zero, which is one of the problems with using FPKMs. I would personally use a larger offset than any you have used, certainly not less than 0.1, but then I wouldn't do it at all.

The way to choose a tuning parameter like this is to plot the data to see that it looks vaguely reasonable rather than to count the number of DE genes.

I assume you have filtered the data so that you are not trying to analyse genes for which all, or almost all, the FPKM are zero. Having lots of FPKM=0 would lead to the warnings about zero variances.

I also suggest you used eBayes with robust=TRUE as well as trend=TRUE. That will reduce the sensitivity to zero variances.

Greatly appreciated, Gordon!!!

Given all the complications of trying to use FPKMs, I'm perplexed why our biostats core solely provided us FPKMs along with our original DE analysis (performed with edgeR using the raw data)... FPKMs seem to be a dead end, aside from serving as a rough qualitative "visual" to accompany the stats :/

We often give FPKMs to our biological collaborators when we analyse their data, but the FPKM are only for plotting purposes, one can't reproduce the DE analysis from them. We don't give our collaborators the raw data because we do the analyses ourselves. It may be that your biostats core think they are playing the same role -- they think they are doing all the analysis for you, rather than providing you the data to perform the analysis yourself.

Gordon - thanks for the helpful answer here. Is there another non-DE approach that can be used to calculate statistical differences from FPKM values (we've misplaced the original fastq files, which were generated in a pre-GEO era)? Much appreciated.

Answer: Differential expression of RNA-seq data using limma and voom()
1
5.7 years ago by
Michael Love24k
United States
Michael Love24k wrote:
hi Jon, If you are new to R and Bioconductor, please take some time to read over the vignettes that accompany every software package on Bioconductor. We try to pack them full of useful information! In addition, there is a lot more technical information available in the reference manuals. from the command line you can type, e.g. browseVignettes(package="DESeq2") On the first page of the DESeq2 vignette, we discuss why you should only use raw counts as input to our software, not rounded normalized values or FPKM values. Also to help with discussion, by 'count matrix', we refer to a matrix of non-negative integers 0,1,2,..., which were produced by counting reads or fragments, which are the units of evidence of expression in RNA-Seq. So we would avoid referring to a 'count matrix of FPKM values', because these counts have been divided by gene length and library size. Mike On Tue, Nov 26, 2013 at 4:37 PM, Jon BrÃ¥te <jon.brate@ibv.uio.no> wrote: > Hi everyone, > > I have a count matrix of FPKM values and I want to estimate differentially > expressed genes between two conditions. First I used DESeq2, but I realized > that this is not good for FPKM values. > I then transformed the counts using voom() in the limma package and then > used: > > fit <- lmFit(myVoomData,design) > fit <- eBayes(fit) > options(digits=3) > writefile = topTable(fit,n=Inf,sort="none", p.value=0.01) > write.csv(writefile, file="file.csv") > > My problem is that all of the 6156 genes are differentially expressed > (p-value 0.01). Only a few hundred were differentially expressed using > DESe2, but I guess that can't be trusted. > > I am new to this type of analysis, and to R, but is it ok to simply > transform the data by voom()? Can I use the transformed data in DESeq2? Any > other ways I can use FPKM counts to estimate differentially expressed genes? > > Thank you, > > Jon BrÃ¥te > > > > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] C > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] limma_3.18.3 cummeRbund_2.4.0 Gviz_1.6.0 > rtracklayer_1.22.0 GenomicRanges_1.14.3 XVector_0.2.0 > [7] IRanges_1.20.6 fastcluster_1.1.11 reshape2_1.2.2 > ggplot2_0.9.3.1 RSQLite_0.11.4 DBI_0.2-7 > [13] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.24.0 BSgenome_1.30.0 Biobase_2.22.0 > Biostrings_2.30.1 Formula_1.1-1 > [6] GenomicFeatures_1.14.2 Hmisc_3.13-0 MASS_7.3-29 > RColorBrewer_1.0-5 RCurl_1.95-4.1 > [11] Rsamtools_1.14.2 XML_3.95-0.2 biomaRt_2.18.0 > biovizBase_1.10.4 bitops_1.0-6 > [16] cluster_1.14.4 colorspace_1.2-4 dichromat_2.0-0 > digest_0.6.3 gtable_0.1.2 > [21] labeling_0.2 lattice_0.20-24 latticeExtra_0.6-26 > munsell_0.4.2 plyr_1.8 > [26] proto_0.3-10 scales_0.2.3 splines_3.0.2 > stats4_3.0.2 stringr_0.6.2 > [31] survival_2.37-4 tools_3.0.2 zlibbioc_1.8.0 > > > ---------------------------------------------------------------- > Jon BrÃ¥te > > Microbial Evolution Research Group (MERG) > Department of Biosciences > University of Oslo > P.B. 1066 Blindern > N-0316, Norway > Email: jon.brate@ibv.uio.no > Phone: 922 44 582 > Web: mn.uio.no/ibv/english/people/aca/jonbra/index.html > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]