Question about genas function in the limma package
1
0
Entering edit mode
@eduard-casas-12072
Last seen 7.2 years ago

Dear Gordon Smyth,

I am facing a biological problem where I am comparing two different treatment conditions (plus a mutant) versus the control condition. I have RNA-Seq expression in triplicates for each condition. What I want to find out is the true (biological) correlation of fold changes in our two conditions when comparing each to the control. To do so, we came across your function genas from the limma package. However, I have some questions to make sure I am doing things right and that would help me understand the results of the function. 

I provide the code I am using for you to find out what I am doing exactly:

condition <- c("MUT", "MUT", "MUT", "A", "A", "A", "B", "B", "B", "Control", "Control", "Control")
y <- RawCounts[which(rowMedians(as.matrix(RawCounts))>=1),]    ### RawCounts is the matrix of gene expression
group <- factor(condition, levels=c("Control","MUT", "A", "B"))
design <- model.matrix(~group)
colnames(design) <- c("Control", "MUT", "A", "B")
dge <- DGEList(counts=y)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=T, prior.count=1)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
print(genas(fit, coef=c(2,3), plot=TRUE, subset="all"))


. When subsetting==Fpval, are only 'significantly changing genes' used to estimate biological correlation?
. Are the genes used to estimate biological correlation the only ones that show biological change (not technical correlation) in any of the two comparisons?
. Why is technical correlation always 0.5? Is this based on some assumption besides that "technical correlations between coefficients are the same for all genes" ?
. In line with this last question, when using RNA-Seq data, is it right to use previously filtered and scaled data (using for instance calcnormfactors) ?
. Is it correct to run genas on subsets of low number of genes (from tens to hundreds) to estimate biological correlations for different sets of genes?

 

### session info:

> sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C              
 [3] LC_TIME=C                  LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ellipse_0.3-8              edgeR_3.16.2              
 [3] pheatmap_1.0.8             BiocParallel_1.8.1        
 [5] genefilter_1.56.0          stringr_1.1.0             
 [7] scales_0.4.1               ggplot2_2.2.0             
 [9] RColorBrewer_1.1-2         DESeq2_1.14.0             
[11] SummarizedExperiment_1.4.0 Biobase_2.34.0            
[13] GenomicRanges_1.26.1       GenomeInfoDb_1.10.1       
[15] IRanges_2.8.1              S4Vectors_0.12.0          
[17] BiocGenerics_0.20.0        limma_3.30.3              

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       splines_3.3.1        lattice_0.20-34     
 [4] colorspace_1.3-0     htmltools_0.3.5      chron_2.3-47        
 [7] survival_2.40-1      XML_3.98-1.5         foreign_0.8-67      
[10] DBI_0.5-1            plyr_1.8.4           zlibbioc_1.20.0     
[13] munsell_0.4.3        gtable_0.2.0         latticeExtra_0.6-28 
[16] knitr_1.15           geneplotter_1.52.0   AnnotationDbi_1.36.0
[19] htmlTable_1.7        Rcpp_0.12.7          acepack_1.4.1       
[22] xtable_1.8-2         Hmisc_4.0-0          annotate_1.52.0     
[25] XVector_0.14.0       gridExtra_2.2.1      digest_0.6.10       
[28] stringi_1.1.2        grid_3.3.1           tools_3.3.1         
[31] bitops_1.0-6         magrittr_1.5         lazyeval_0.2.0      
[34] RCurl_1.95-4.8       tibble_1.2           RSQLite_1.0.0       
[37] Formula_1.2-1        cluster_2.0.5        Matrix_1.2-7.1      
[40] data.table_1.9.6     assertthat_0.1       rpart_4.1-10        
[43] nnet_7.3-12         

 

limma genas • 1.2k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 16 minutes ago
WEHI, Melbourne, Australia

Why is technical correlation always 0.5?

This follows from the fact that you have equal numbers of replicates in each group. To see why this is the case, suppose that A, B and C are independent random variables with the same variance. Then the correlation between A-C and B-C is always 0.5.

If you ran a simulation to generate random data with the same design matrix as your experiment but no true differential expression, you would find that cor( fit$coef[,2], fit$coef[,3] ) would be approximately 0.5.

When subsetting==Fpval, are only 'significantly changing genes' used to estimate biological correlation?

Yes and no. With this option, genas() tries to estimate the total number of genes that are DE for either coefficient, no matter how small the change in expression is. In other words, it figures out how many genes would need to be DE so that the remaining genes could have uniformly distributed p-values. So yes, it restricts to genes that seem likely to be DE but it will include lots of genes are that are not significantly DE at conventional significant levels. Roughly speaking, genes are included in the analysis if there is a better than an even chance that they are DE.

Are the genes used to estimate biological correlation the only ones that show biological change (not technical correlation) in any of the two comparisons?

genas() assumes that a large number of genes are DE to some extent and that biological correlation between profiles is a global phenomenon or at least one involving a large number of genes. genas() uses all the genes, at least those not subsetted out, to estimate the biological correlation.

In line with this last question, when using RNA-Seq data, is it right to use previously filtered and scaled data (using for instance calcnormfactors) ?

RNA-seq data always have to be normalized for any analysis. There is also no point in including genes with consistently low counts because such genes can't show any useful signal. Again, this is the same with any analysis, not just genas.

Is it correct to run genas on subsets of low number of genes (from tens to hundreds) to estimate biological correlations for different sets of genes?

Yes, and we have done this ourselves in this article: https://www.ncbi.nlm.nih.gov/pubmed/20445021. However genas() may not be very stable on small numbers of genes. Ten or 20 may be too few.

ADD COMMENT
0
Entering edit mode

Dear Gordon Smyth,

Thank you for your fast and comprehensive answer. I have been playing around with different parametres and now I have a different doubt:

When changing the 'subset' parameter the results change completely. I am using a group of 4263 genes and when running genas with:

subset="all" I get a biological correlation of -1 (although with a p.value of 0.875) using all genes.

subset=="Fpval" I get a biological correlation of 0.25 with a p.value e-05. while using 593 of these genes.

subset=="p.union" I get a biological correlation of 0.86 with a p.value e-44. while using 678 of these genes.

subset=="logFC"  I get a biological correlation of 0.51, p. val e-26, 757 genes.

subset=="predFC" I get a biological correlation of  0.53, p.val e-30, 760 genes.

Following the genas function recommendation from the manual ("Ideally, only genes that are truly differentially expressed for one or both of the contrasts should be used estimate the biological correlation") we can exclude the option subset=all. Also, the p.value is not significant for such analysis. Then, the good part is that 403 genes are common between the 593 of Fpval and the 678 of p.union. Using subset="p.int" gives me this error: "Error in squeezeVar(sigma^2, df.residual, covariate = covariate, robust = robust,  : 
 Could not estimate prior df"

Both spearman and pearson correlation of the contrast fold changes of these 4263 genes show a high positive correlation, but what I want is to exclude the technical correlation. This makes me think that probably the 'p.union' subset option is the most accurate in such case, but how do I make sure I am using the right subsetting value in this case? Is there any quality control value that I can use (besides p.values) to assess which is the best method (and how reliable it is) ?

Thank you again for your time,

Eduard

ADD REPLY

Login before adding your answer.

Traffic: 582 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6