Differential expression of FPKM from RNA-seq data using limma and voom()

7

Entering edit mode

Jon Bråte ▴ 260

@jon-brate-6263

Last seen 4 months ago

Norway

Hi everyone,

I have a count matrix of FPKM values and I want to estimate differentially expressed genes between two conditions. First I used DESeq2, but I realized that this is not good for FPKM values. I then transformed the counts using voom() in the limma package and then used:

fit <- lmFit(myVoomData,design)
fit <- eBayes(fit)
options(digits=3)
writefile = topTable(fit,n=Inf,sort="none", p.value=0.01)
write.csv(writefile, file="file.csv")

My problem is that all of the 6156 genes are differentially expressed (p-value 0.01). Only a few hundred were differentially expressed using DESe2, but I guess that can't be trusted.

I am new to this type of analysis, and to R, but is it ok to simply transform the data by voom()? Can I use the transformed data in DESeq2? Any other ways I can use FPKM counts to estimate differentially expressed genes?

Thank you,

Jon Bråte

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils  datasets  methods   base

other attached packages:
 [1] limma_3.18.3         cummeRbund_2.4.0     Gviz_1.6.0 rtracklayer_1.22.0   GenomicRanges_1.14.3 XVector_0.2.0
 [7] IRanges_1.20.6       fastcluster_1.1.11   reshape2_1.2.2 ggplot2_0.9.3.1      RSQLite_0.11.4       DBI_0.2-7
[13] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.24.0   BSgenome_1.30.0        Biobase_2.22.0 Biostrings_2.30.1      Formula_1.1-1
 [6] GenomicFeatures_1.14.2 Hmisc_3.13-0           MASS_7.3-29 RColorBrewer_1.0-5        RCurl_1.95-4.1
[11] Rsamtools_1.14.2       XML_3.95-0.2           biomaRt_2.18.0 biovizBase_1.10.4      bitops_1.0-6
[16] cluster_1.14.4         colorspace_1.2-4       dichromat_2.0-0 digest_0.6.3          gtable_0.1.2
[21] labeling_0.2           lattice_0.20-24        latticeExtra_0.6-26 munsell_0.4.2     plyr_1.8
[26] proto_0.3-10           scales_0.2.3           splines_3.0.2 stats4_3.0.2            stringr_0.6.2
[31] survival_2.37-4        tools_3.0.2            zlibbioc_1.8.0

----------------------------------------------------------------
Jon Bråte

Microbial Evolution Research Group (MERG)
Department of Biosciences
University of Oslo
P.B. 1066 Blindern
N-0316, Norway
Email: jon.brate@ibv.uio.no
Phone: 922 44 582
Web: mn.uio.no/ibv/english/people/aca/jonbra/index.html

limma • 40k views

ADD COMMENT • link written 10.9 years ago by Jon Bråte ▴ 260

0

Entering edit mode

Hey! I know it has been a long time but can you show the model matrix and the contrast you created? Might be the case that the model matrix wasn’t week created, resulting in that high number of high expressed genes. Best regards,

ADD REPLY • link 18 months ago manuelsokolov • 0

23

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 30 minutes ago

WEHI, Melbourne, Australia

Dear Jon,

No, it is absolutely not ok to input FPKM values to voom. The results will be nonsense. Same goes for edgeR and DESeq, for the reasons explained in the documentation for those packages.

Note that a matrix of FPKM is not a matrix of counts.

It seems to me that you have three options in decreasing order of desirability:

Get the actual integer counts from which the FPKM were computed and do a proper analysis, for example using voom or edgeR.
Get the gene lengths and library sizes used to compute the FPKM and convert the FPKM back to counts.
If FPKM is really all you have, then convert the values to a log2 scale (y = log2(FPKM+0.1) say) and do an ordinary limma analysis as you would for microarray data, using eBayes() with trend=TRUE. Do not use voom, do not use edgeR, do not use DESeq. (Do not pass go and do not collect $200.) This isn't 100% ideal, but is probably the best analysis available. You make this method somewhat better by using arrayWeights() as well which, in this context, will attempt to estimate the library sizes the FPKMs were originally computed from. Nevertheless, the mean-variance trend estimated by limma from the logFPKMs will never be as smooth or as informative as the trend that would have been estimated had you had the real counts.

The third option is similar to the "limma-trend" analysis described in the limma preprint, except that it is applied to the logFPKM instead of logCPM. Statistically this will not perform as well as it would applied to the logCPM.

Best wishes
Gordon

ADD COMMENT • link 10.9 years ago • updated 2.2 years ago Gordon Smyth 51k

0

Entering edit mode

Hello Gordon:

Are there any results that describe what happens when one switches from CPM to RPKM, TMM etc, when using limma.

Regards,

Nik

ADD REPLY • link 9.0 years ago Nik Tuzov ▴ 90

0

Entering edit mode

Hey Gordon,

I found your post very informative- enough so that I need to resurrect this thread :P.

Similar to the original post, I'm working solely with a matrix of FPKM values, and no access to raw counts ... The main issue I'm running into is how to deal with FPKM values that are equal to zero. When I try running the raw values through limma following your guidelines of using eBayes(fit, trend=TRUE), I get the warning: "Warning message: Partial NA coefficients for 586 probe(s)". The resultant list of significant genes is relatively small, with only 32 when using BH correction (compared to >223 from the original DE analysis)

A workaround suggested on several threads is to add a small constant to all FPKM values in the matrix. However, the suggestions I've seen seem rather trite, and I haven't found any posts where the ramifications of constant selection are explained. For example, I tried using a published suggestion of calculating the constant for FPKM as the 1st percentile of all non-zero values (0.003 in our dataset); therefore logFPKM = log2(FPKM + 0.003). However, this gave me 7 significant genes.

After testing 10 different constants, ranging from the most infinitesimal value in R (around 1e-320) to 0.1, I get a wide range of results (from 0 to 1300 genes being significant, in a seemingly random fashion). Half of the time, an intermittent warning pops up that says: "Zero sample variances detected, have been offset away from zero".

Can anyone please explain to me where I'm going wrong and whether there's a better solution to running FPKM values through limma?

Thanks again for this helpful post and for your great work with DE analysis for RNA-seq data!

ADD REPLY • link 5.5 years ago scideas • 0

0

Entering edit mode

What you're doing wrong is trying to do a DE analysis using FPKMs in the first place. There is no good answer to what the FPKM offset should be to avoid taking the log of zero, which is one of the problems with using FPKMs. I would personally use a larger offset than any you have used, certainly not less than 0.1, but then I wouldn't do it at all.

The way to choose a tuning parameter like this is to plot the data to see that it looks vaguely reasonable rather than to count the number of DE genes.

I assume you have filtered the data so that you are not trying to analyse genes for which all, or almost all, the FPKM are zero. Having lots of FPKM=0 would lead to the warnings about zero variances.

I also suggest you used eBayes with robust=TRUE as well as trend=TRUE. That will reduce the sensitivity to zero variances.

ADD REPLY • link 5.5 years ago Gordon Smyth 51k

0

Entering edit mode

Greatly appreciated, Gordon!!!

Given all the complications of trying to use FPKMs, I'm perplexed why our biostats core solely provided us FPKMs along with our original DE analysis (performed with edgeR using the raw data)... FPKMs seem to be a dead end, aside from serving as a rough qualitative "visual" to accompany the stats :/

ADD REPLY • link 5.5 years ago scideas • 0

1

Entering edit mode

We often give FPKMs to our biological collaborators when we analyse their data, but the FPKM are only for plotting purposes, one can't reproduce the DE analysis from them. We don't give our collaborators the raw data because we do the analyses ourselves. It may be that your biostats core think they are playing the same role -- they think they are doing all the analysis for you, rather than providing you the data to perform the analysis yourself.

ADD REPLY • link 5.5 years ago Gordon Smyth 51k

0

Entering edit mode

Gordon - thanks for the helpful answer here. Is there another non-DE approach that can be used to calculate statistical differences from FPKM values (we've misplaced the original fastq files, which were generated in a pre-GEO era)? Much appreciated.

ADD REPLY • link 5.3 years ago lauren_mifflin • 0

0

Entering edit mode

Hi Gordon,

I am using RSEM normalized data. To identify DEGs, I used the voom or "limma-voom" function. Do you think that's the right approach?

I input my RSEM normalized data, then calculated the normalized factors using calNormFactors and proceeded with the voom analysis. Since I already have the normalized data and calNormFactors doesn't normalize the data again, I thought this was a good option.

I am new to this analysis, and if you or anyone could share their opinion would be a great help.

Thank you

ADD REPLY • link 2.7 years ago Ety • 0

3

Entering edit mode

Michael Love 42k

@mikelove

Last seen 17 hours ago

United States

hi Jon, If you are new to R and Bioconductor, please take some time to read over the vignettes that accompany every software package on Bioconductor. We try to pack them full of useful information! In addition, there is a lot more technical information available in the reference manuals. from the command line you can type, e.g. browseVignettes(package="DESeq2") On the first page of the DESeq2 vignette, we discuss why you should only use raw counts as input to our software, not rounded normalized values or FPKM values. Also to help with discussion, by 'count matrix', we refer to a matrix of non-negative integers 0,1,2,..., which were produced by counting reads or fragments, which are the units of evidence of expression in RNA-Seq. So we would avoid referring to a 'count matrix of FPKM values', because these counts have been divided by gene length and library size. Mike On Tue, Nov 26, 2013 at 4:37 PM, Jon BrÃ¥te <jon.brate@ibv.uio.no> wrote: > Hi everyone, > > I have a count matrix of FPKM values and I want to estimate differentially > expressed genes between two conditions. First I used DESeq2, but I realized > that this is not good for FPKM values. > I then transformed the counts using voom() in the limma package and then > used: > > fit <- lmFit(myVoomData,design) > fit <- eBayes(fit) > options(digits=3) > writefile = topTable(fit,n=Inf,sort="none", p.value=0.01) > write.csv(writefile, file="file.csv") > > My problem is that all of the 6156 genes are differentially expressed > (p-value 0.01). Only a few hundred were differentially expressed using > DESe2, but I guess that can't be trusted. > > I am new to this type of analysis, and to R, but is it ok to simply > transform the data by voom()? Can I use the transformed data in DESeq2? Any > other ways I can use FPKM counts to estimate differentially expressed genes? > > Thank you, > > Jon BrÃ¥te > > > > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] C > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] limma_3.18.3 cummeRbund_2.4.0 Gviz_1.6.0 > rtracklayer_1.22.0 GenomicRanges_1.14.3 XVector_0.2.0 > [7] IRanges_1.20.6 fastcluster_1.1.11 reshape2_1.2.2 > ggplot2_0.9.3.1 RSQLite_0.11.4 DBI_0.2-7 > [13] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.24.0 BSgenome_1.30.0 Biobase_2.22.0 > Biostrings_2.30.1 Formula_1.1-1 > [6] GenomicFeatures_1.14.2 Hmisc_3.13-0 MASS_7.3-29 > RColorBrewer_1.0-5 RCurl_1.95-4.1 > [11] Rsamtools_1.14.2 XML_3.95-0.2 biomaRt_2.18.0 > biovizBase_1.10.4 bitops_1.0-6 > [16] cluster_1.14.4 colorspace_1.2-4 dichromat_2.0-0 > digest_0.6.3 gtable_0.1.2 > [21] labeling_0.2 lattice_0.20-24 latticeExtra_0.6-26 > munsell_0.4.2 plyr_1.8 > [26] proto_0.3-10 scales_0.2.3 splines_3.0.2 > stats4_3.0.2 stringr_0.6.2 > [31] survival_2.37-4 tools_3.0.2 zlibbioc_1.8.0 > > > ---------------------------------------------------------------- > Jon BrÃ¥te > > Microbial Evolution Research Group (MERG) > Department of Biosciences > University of Oslo > P.B. 1066 Blindern > N-0316, Norway > Email: jon.brate@ibv.uio.no > Phone: 922 44 582 > Web: mn.uio.no/ibv/english/people/aca/jonbra/index.html > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.9 years ago Michael Love 42k

Login before adding your answer.