Question

Question about genas function in the limma package

0

Entering edit mode

Eduard Casas • 0

@eduard-casas-12072

Last seen 7.2 years ago

Dear Gordon Smyth,

I am facing a biological problem where I am comparing two different treatment conditions (plus a mutant) versus the control condition. I have RNA-Seq expression in triplicates for each condition. What I want to find out is the true (biological) correlation of fold changes in our two conditions when comparing each to the control. To do so, we came across your function genas from the limma package. However, I have some questions to make sure I am doing things right and that would help me understand the results of the function.

I provide the code I am using for you to find out what I am doing exactly:

condition <- c("MUT", "MUT", "MUT", "A", "A", "A", "B", "B", "B", "Control", "Control", "Control")
y <- RawCounts[which(rowMedians(as.matrix(RawCounts))>=1),] ### RawCounts is the matrix of gene expression
group <- factor(condition, levels=c("Control","MUT", "A", "B"))
design <- model.matrix(~group)
colnames(design) <- c("Control", "MUT", "A", "B")
dge <- DGEList(counts=y)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=T, prior.count=1)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
print(genas(fit, coef=c(2,3), plot=TRUE, subset="all"))

. When subsetting==Fpval, are only 'significantly changing genes' used to estimate biological correlation?
. Are the genes used to estimate biological correlation the only ones that show biological change (not technical correlation) in any of the two comparisons?
. Why is technical correlation always 0.5? Is this based on some assumption besides that "technical correlations between coefficients are the same for all genes" ?
. In line with this last question, when using RNA-Seq data, is it right to use previously filtered and scaled data (using for instance calcnormfactors) ?
. Is it correct to run genas on subsets of low number of genes (from tens to hundreds) to estimate biological correlations for different sets of genes?

### session info:

> sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
[1] LC_CTYPE=C LC_NUMERIC=C
[3] LC_TIME=C LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] ellipse_0.3-8 edgeR_3.16.2
[3] pheatmap_1.0.8 BiocParallel_1.8.1
[5] genefilter_1.56.0 stringr_1.1.0
[7] scales_0.4.1 ggplot2_2.2.0
[9] RColorBrewer_1.1-2 DESeq2_1.14.0
[11] SummarizedExperiment_1.4.0 Biobase_2.34.0
[13] GenomicRanges_1.26.1 GenomeInfoDb_1.10.1
[15] IRanges_2.8.1 S4Vectors_0.12.0
[17] BiocGenerics_0.20.0 limma_3.30.3

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 splines_3.3.1 lattice_0.20-34
[4] colorspace_1.3-0 htmltools_0.3.5 chron_2.3-47
[7] survival_2.40-1 XML_3.98-1.5 foreign_0.8-67
[10] DBI_0.5-1 plyr_1.8.4 zlibbioc_1.20.0
[13] munsell_0.4.3 gtable_0.2.0 latticeExtra_0.6-28
[16] knitr_1.15 geneplotter_1.52.0 AnnotationDbi_1.36.0
[19] htmlTable_1.7 Rcpp_0.12.7 acepack_1.4.1
[22] xtable_1.8-2 Hmisc_4.0-0 annotate_1.52.0
[25] XVector_0.14.0 gridExtra_2.2.1 digest_0.6.10
[28] stringi_1.1.2 grid_3.3.1 tools_3.3.1
[31] bitops_1.0-6 magrittr_1.5 lazyeval_0.2.0
[34] RCurl_1.95-4.8 tibble_1.2 RSQLite_1.0.0
[37] Formula_1.2-1 cluster_2.0.5 Matrix_1.2-7.1
[40] data.table_1.9.6 assertthat_0.1 rpart_4.1-10
[43] nnet_7.3-12

limma genas • 1.2k views

ADD COMMENT • link updated 7.3 years ago by Gordon Smyth 50k • written 7.3 years ago by Eduard Casas • 0

score 2 · Accepted Answer · 2016-12-25

Why is technical correlation always 0.5?

This follows from the fact that you have equal numbers of replicates in each group. To see why this is the case, suppose that A, B and C are independent random variables with the same variance. Then the correlation between A-C and B-C is always 0.5.

If you ran a simulation to generate random data with the same design matrix as your experiment but no true differential expression, you would find that cor( fit$coef[,2], fit$coef[,3] ) would be approximately 0.5.

When subsetting==Fpval, are only 'significantly changing genes' used to estimate biological correlation?

Yes and no. With this option, genas() tries to estimate the total number of genes that are DE for either coefficient, no matter how small the change in expression is. In other words, it figures out how many genes would need to be DE so that the remaining genes could have uniformly distributed p-values. So yes, it restricts to genes that seem likely to be DE but it will include lots of genes are that are not significantly DE at conventional significant levels. Roughly speaking, genes are included in the analysis if there is a better than an even chance that they are DE.

Are the genes used to estimate biological correlation the only ones that show biological change (not technical correlation) in any of the two comparisons?

genas() assumes that a large number of genes are DE to some extent and that biological correlation between profiles is a global phenomenon or at least one involving a large number of genes. genas() uses all the genes, at least those not subsetted out, to estimate the biological correlation.

In line with this last question, when using RNA-Seq data, is it right to use previously filtered and scaled data (using for instance calcnormfactors) ?

RNA-seq data always have to be normalized for any analysis. There is also no point in including genes with consistently low counts because such genes can't show any useful signal. Again, this is the same with any analysis, not just genas.

Is it correct to run genas on subsets of low number of genes (from tens to hundreds) to estimate biological correlations for different sets of genes?

Yes, and we have done this ourselves in this article: https://www.ncbi.nlm.nih.gov/pubmed/20445021. However genas() may not be very stable on small numbers of genes. Ten or 20 may be too few.