Bioconductor Digest, Vol 124, Issue 5
1
0
Entering edit mode
@w-evan-johnson-5447
Last seen 2.0 years ago
United States
David, ComBat assumes a normal-normal mixture when doing the adjustment. It should work very well if your log-data are fairly normally distributed (within each gene) and if your sample size is relatively large (say ~15-30). If you have a small sample size or if your data (within each gene) are not normally distributed then I would worry that ComBat might not be appropriate. Also, there is the non-parametric prior approach to ComBat. It assumes normal data, but lets the empirical prior distribution be unspecified. Try this and see if the results are different--this will tell you if you are robust to the prior specification. Hope this helps, Evan On Jun 5, 2013, at 6:00 AM, <bioconductor-request at="" r-project.org=""> wrote: > Message: 18 > Date: Tue, 4 Jun 2013 15:40:20 -0400 > From: "David O'Brien" <dobrie10 at="" jhmi.edu=""> > To: bioconductor at r-project.org > Subject: [BioC] Remove batch effects from RNA-seq data using edgeR and > sva/ComBat > Message-ID: > <capruzndvhypd1dsg5m=ce_ddw1nm9fv1cs3c3kxgmec16nzsgq at="" mail.gmail.com=""> > Content-Type: text/plain; charset="iso-8859-1" > > I'm trying to analyze an RNA-seq experiment where the PCA plot shows better > clustering by the day the experiment was done rather than treatment type. > Using edgeR to determine differentially expressed genes resulted in less > than 5 genes with an FDR under 5%. Creating a GLM model to remove batch > effects for day of experiment as stated in the edgeR manual resulted in 42 > genes with an FDR less than 5%. An improvement, but still not good. So I > tried using ComBat and the result was 986 genes with an FDR under 5%. > Looking at the GO enrichment, the differentially expressed genes seem to > make sense, but since ComBat was developed for microarrays, I'm concerned > that there may be some caveats with this approach that I'm missing. Looking > at the top genes below, the log2 fold change is really low and generally > this just seems too good to be true. So my question is: Are there any > reasons why using ComBat with RNA-seq data is not legit? And if so, can you > see any problems with the approach below? > > mean_control mean_treatment logFC pval padj > Gene5727 51.224797 45.371919 -0.1750427 3.474361e-08 0.0003224554 > Gene3059 8.998311 5.740828 -0.6483954 1.056473e-07 0.0003268376 > Gene11899 35.044302 39.027842 0.1553238 7.398559e-08 0.0003268376 > Gene11724 2.556712 3.684178 0.5270535 1.959058e-07 0.0003636404 > Gene12218 30.852989 23.702209 -0.3803888 1.908726e-07 0.0003636404 > > Gene4952 26.122068 30.466346 0.2219474 3.346424e-07 0.0005176360 > > > My code is below. I've attached a file, dge.Rdata, that contains the > counts info that is output from readDGE, so you can have the initial > counts info. > > > require(edgeR) > require(sva) > source('code/annotate_edgeR.R') > files = data.frame(files=c('counts.control0', 'counts.control1', > 'counts.control2', 'counts.control3', 'counts.treatment0', > 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'), > group=c('control', 'control', 'control', 'control', > 'treatment', 'treatment', 'treatment', 'treatment'), > day=rep(0:3,2) > ) > labels <- paste0(files$group, files$day) > dge <- readDGE(files=files, path='data/HTSeq/', labels=labels) > rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene > names to anonymize data > ################################ > # save(dge, file='objs/dge.Rdata') > # SEE ATTACHED FILE # > ############################### > > ## filter out the no_feature etc. rows > dge <- dge[1:(nrow(dge)-5), ] > ## This mitochondrial rRNA gene takes up a massive portion of my libraries > dge <- dge[!rownames(dge)%in%'Gene13515', ] > ## filter out lowly expressed genes > keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm > is > 1 > dge <- dge[keep, ] > ## Recompute library sizes > dge$samples$lib.size <- colSums(dge$counts) > ## Normalize for lib size > dge <- calcNormFactors(dge) > > ## ComBat > mod <- model.matrix(~as.factor(group), data=dge$sample) > mod0 <- model.matrix(~1, data=dge$sample) > batch <- dge$sample$day > > combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod) > pval_combat = f.pvalue(combat, mod, mod0) > padj_combat = p.adjust(pval_combat, method="BH") > mean_control <- rowMeans(combat[, 1:4]) > mean_treatment <- rowMeans(combat[, 5:8]) > logFC <- log2(mean_treatment/mean_control) > > res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat, > padj=padj_combat) > res <- res[order(res$padj), ] > > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] biomaRt_2.16.0 sva_3.6.0 mgcv_1.7-23 corpcor_1.6.6 > edgeR_3.2.3 limma_3.16.4 > > loaded via a namespace (and not attached): > [1] grid_3.0.1 lattice_0.20-15 Matrix_1.0-12 nlme_3.1-109 > RCurl_1.95-4.1 tools_3.0.1 > [7] XML_3.96-1.1
GO Clustering edgeR GO Clustering edgeR • 770 views
0
Entering edit mode
Last seen 7.7 years ago
Hi David and Evan, In our experience careful use of ComBat yields good results even for relatively small RNAseq datasets. For instance, we have modified ComBat to not estimate a scale adjustment for batch as this is where the heteroscedasticity of log-count data is really problematic. If you want to see how we deal with this you can checkout the package for RNAseq analysis we use with our collaborators: http://github.com/kokrah/cbcbSEQ Cheers, Hector Hector Corrada Bravo Center for Bioinformatics and Computational Biology University of Maryland, College Park hcorrada@umiacs.umd.edu On Wed, Jun 5, 2013 at 8:08 AM, Johnson, William Evan <wej@bu.edu> wrote: > David, > > ComBat assumes a normal-normal mixture when doing the adjustment. It > should work very well if your log-data are fairly normally distributed > (within each gene) and if your sample size is relatively large (say > ~15-30). If you have a small sample size or if your data (within each gene) > are not normally distributed then I would worry that ComBat might not be > appropriate. > > Also, there is the non-parametric prior approach to ComBat. It assumes > normal data, but lets the empirical prior distribution be unspecified. Try > this and see if the results are different--this will tell you if you are > robust to the prior specification. > > Hope this helps, > > Evan > > > > On Jun 5, 2013, at 6:00 AM, <bioconductor-request@r-project.org> > wrote: > > > Message: 18 > > Date: Tue, 4 Jun 2013 15:40:20 -0400 > > From: "David O'Brien" <dobrie10@jhmi.edu> > > To: bioconductor@r-project.org > > Subject: [BioC] Remove batch effects from RNA-seq data using edgeR and > > sva/ComBat > > Message-ID: > > <capruzndvhypd1dsg5m=> Ce_Ddw1Nm9fV1Cs3c3kXgMec16NZSGQ@mail.gmail.com> > > Content-Type: text/plain; charset="iso-8859-1" > > > > I'm trying to analyze an RNA-seq experiment where the PCA plot shows > better > > clustering by the day the experiment was done rather than treatment type. > > Using edgeR to determine differentially expressed genes resulted in less > > than 5 genes with an FDR under 5%. Creating a GLM model to remove batch > > effects for day of experiment as stated in the edgeR manual resulted in > 42 > > genes with an FDR less than 5%. An improvement, but still not good. So I > > tried using ComBat and the result was 986 genes with an FDR under 5%. > > Looking at the GO enrichment, the differentially expressed genes seem to > > make sense, but since ComBat was developed for microarrays, I'm concerned > > that there may be some caveats with this approach that I'm missing. > Looking > > at the top genes below, the log2 fold change is really low and generally > > this just seems too good to be true. So my question is: Are there any > > reasons why using ComBat with RNA-seq data is not legit? And if so, can > you > > see any problems with the approach below? > > > > mean_control mean_treatment logFC pval padj > > Gene5727 51.224797 45.371919 -0.1750427 3.474361e-08 > 0.0003224554 > > Gene3059 8.998311 5.740828 -0.6483954 1.056473e-07 > 0.0003268376 > > Gene11899 35.044302 39.027842 0.1553238 7.398559e-08 > 0.0003268376 > > Gene11724 2.556712 3.684178 0.5270535 1.959058e-07 > 0.0003636404 > > Gene12218 30.852989 23.702209 -0.3803888 1.908726e-07 > 0.0003636404 > > > > Gene4952 26.122068 30.466346 0.2219474 3.346424e-07 > 0.0005176360 > > > > > > My code is below. I've attached a file, dge.Rdata, that contains the > > counts info that is output from readDGE, so you can have the initial > > counts info. > > > > > > require(edgeR) > > require(sva) > > source('code/annotate_edgeR.R') > > files = data.frame(files=c('counts.control0', 'counts.control1', > > 'counts.control2', 'counts.control3', 'counts.treatment0', > > 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'), > > group=c('control', 'control', 'control', 'control', > > 'treatment', 'treatment', 'treatment', 'treatment'), > > day=rep(0:3,2) > > ) > > labels <- paste0(files$group, files$day) > > dge <- readDGE(files=files, path='data/HTSeq/', labels=labels) > > rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene > > names to anonymize data > > ################################ > > # save(dge, file='objs/dge.Rdata') > > # SEE ATTACHED FILE # > > ############################### > > > > ## filter out the no_feature etc. rows > > dge <- dge[1:(nrow(dge)-5), ] > > ## This mitochondrial rRNA gene takes up a massive portion of my > libraries > > dge <- dge[!rownames(dge)%in%'Gene13515', ] > > ## filter out lowly expressed genes > > keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where > cpm > > is > 1 > > dge <- dge[keep, ] > > ## Recompute library sizes > > dge$samples$lib.size <- colSums(dge$counts) > > ## Normalize for lib size > > dge <- calcNormFactors(dge) > > > > ## ComBat > > mod <- model.matrix(~as.factor(group), data=dge$sample) > > mod0 <- model.matrix(~1, data=dge$sample) > > batch <- dge$sample$day > > > > combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod) > > pval_combat = f.pvalue(combat, mod, mod0) > > padj_combat = p.adjust(pval_combat, method="BH") > > mean_control <- rowMeans(combat[, 1:4]) > > mean_treatment <- rowMeans(combat[, 5:8]) > > logFC <- log2(mean_treatment/mean_control) > > > > res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat, > > padj=padj_combat) > > res <- res[order(res$padj), ] > > > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > LC_TIME=en_US.UTF-8 > > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > > LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > > LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] biomaRt_2.16.0 sva_3.6.0 mgcv_1.7-23 corpcor_1.6.6 > > edgeR_3.2.3 limma_3.16.4 > > > > loaded via a namespace (and not attached): > > [1] grid_3.0.1 lattice_0.20-15 Matrix_1.0-12 nlme_3.1-109 > > RCurl_1.95-4.1 tools_3.0.1 > > [7] XML_3.96-1.1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]