Question: RNA-Seq, generate batch-free count matrix
3
gravatar for Brian Haas
4.8 years ago by
Brian Haas90
Brian Haas90 wrote:

Greetings all,

I've been researching ways to remove batch effects from RNA-Seq count matrices. Basically, I'm starting with a counts matrix that includes batch effects, and want to generate a new matrix of counts that has the batch effects removed.

I'm looking to apply this to sets of RNA-Seq samples (~100 samples) that were sequenced in batches on different days (factor) and for which I also have other metadata with continuous values (covariates such as total sequenced reads in each sample, quality metrics, etc). I want to study all these samples in an unsupervised manner, and don't have a model for anything but the various batch effects that I want removed (ie. no cancer vs. normal labeling, instead they're all 'normal' and I'd like to see if they form clusters based on natural variation in the population, and perhaps identify subtypes).

From what I've read thus far, methods like sva (and the included Combat) require that you provide a model for the covariates that you do not want removed (biological factors) in addition to the ones you do want removed (batch effects). Is it not possible to use these methods in my scenario, where I don't have factors other than the specified batch effects?

In searching the bioconductor mailing list archive, I found:

      edgeR package, removeBatchEffect() function

which seems to do exactly what I want, and I'll experiment with it shortly. I'm mostly curious about what other methods might be available to do this, and whether the SVA or other libraries contain functions that I should explore.

Many thanks in advance for any advice!

~brian

edger sva • 7.5k views
ADD COMMENTlink modified 4.5 years ago by Gordon Smyth37k • written 4.8 years ago by Brian Haas90
Answer: RNA-Seq, generate batch-free count matrix
7
gravatar for Gordon Smyth
4.7 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Hi Brian,

In a literal sense, getting a matrix of batch corrected counts is not possible. Once the batch effects have been removed, the values will no longer be counts.

To batch correct, it is necessary to first transform the counts to a pseudo-continuous scale. Then you can use batch correction methods developed for microarrays. This is how we usually do it.

First, put the counts in a DGEList object:

 library(edgeR)
 y <- DGEList(counts=counts)

Filter non-expressed genes:

 A <- aveLogCPM(y)
 y2 <- y2[A>1,]

Then normalize and compute log2 counts-per-million with an offset:

 y2 <- calcNormFactors(y2)
 logCPM <- cpm(y2, log=TRUE, prior.count=5)

Then remove batch correct:

 logCPMc <- removeBatchEffect(y2, batch)

Here batch is a vector or factor taking a different value for each batch group. You can input two batch vectors.

Now you can cluster the samples, for example by:

 plotMDS(logCPMc)

Variations on this would be use rpkm() instead of cpm(), or to give removeBatchEffect() a design matrix of known groups that are not batch effects.

Best wishes
Gordon

ADD COMMENTlink modified 4.5 years ago • written 4.7 years ago by Gordon Smyth37k

Originally I mistyped the argument to removeBatchEffect() as y instead of y2, now corrected.

ADD REPLYlink modified 4.5 years ago • written 4.7 years ago by Gordon Smyth37k

Thanks, Gordon!

If I want to also remove effects related to total read counts or coverage biases or other library quality metrics, which sometimes end up being highly correlated with principal component 1, do I include these continuous values as covariates in the removeBatchEffect() method? I was looking for some examples on how to use it, but couldn't find any.

Thx
-Brian

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 4.8 years ago by Brian Haas90

Yes, just input continuous variables as covariates.

Gordon

ADD REPLYlink modified 4.2 years ago • written 4.7 years ago by Gordon Smyth37k

Dear Gordon,

I know this post is from a year ago but I've only just had the need to use batch correction on my data. Thank you so much for this post, it has been very helpful to me. 

I have a couple of minor clarifications to pursue...

First - in the filtering step above, y2 is a subset of y where the average log CPM > 1 - right ? So should it say

y2 <- y[A>1,]

rather than what it says below ?

A <- aveLogCPM(y)
y2 <- y2[A>1,]

Assuming my first comment isn't absolute gibberish, y2 being a subset of y means it is as yet un-logged. This means that logCPM is on a logged scale but logCPMc is on an unlogged scale. I only noticed this as I plotted before and after plots using plotMDS and noted the scales differed quite vastly.

I can think of two potential solutions here - (1) to turn 'log' off in the cpm function (2) to pass logCPM into the 'removeBatchEffect' function. However, having read the 'removeBatchEffect' manual, the 2nd option seems the more correct option as it assumes one is giving it log-expression values.

I am writing to check if my line of thought is correct and to hopefully help others who might be toying with a similar notion, however minor that group might be.

Thank you for your time. 

Kind regards, 

Manasa

 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by emm.rox0

Thank you for this!

ADD REPLYlink written 21 months ago by jol.espinoz20
Answer: RNA-Seq, generate batch-free count matrix
2
gravatar for davide risso
4.7 years ago by
davide risso810
Weill Cornell Medicine
davide risso810 wrote:
Hi Brian, as mentioned by Dario the RUVSeq package has a way to deal with your situation. The typical application of RUVSeq is differential expression, hence "adding" the batch effect in the generalized linear model rather than correcting the counts for the batch effects. However, the RUVSeq functions return a matrix of normalized counts that are "batch effect free". The main risk is that you remove (part of) the signal of interest when removing the batch effects (especially if the two are correlated). On the other hand, if the batch effect and the signal of interest are "not too correlated" RUVSeq will give you exactly what you want. If you have replicate samples, we found that in practice the "replicate method" (function RUVs) works much better than the "negative controls" method (function RUVg) when dealing with unsupervised problems. I hope this helps. Best, davide On Sat, Jul 19, 2014 at 8:01 PM, Brian Haas <bhaas at="" broadinstitute.org=""> wrote: > Greetings all, > > I've been researching ways to remove batch effects from RNA-Seq count > matrices. Basically, I'm starting with a counts matrix that includes batch > effects, and want to generate a new matrix of counts that has the batch > effects removed. > > I'm looking to apply this to sets of RNA-Seq samples (~100 samples) that > were sequenced in batches on different days (factor) and for which I also > have other metadata with continuous values (covariates such as total > sequenced reads in each sample, quality metrics, etc). I want to study > all these samples in an unsupervised manner, and don't have a model for > anything but the various batch effects that I want removed (ie. no cancer > vs. normal labeling, instead they're all 'normal' and I'd like to see if > they form clusters based on natural variation in the population, and > perhaps identify subtypes). > > >From what I've read thus far, methods like sva (and the included Combat) > require that you provide a model for the covariates that you do not want > removed (biological factors) in addition to the ones you do want removed > (batch effects). Is it not possible to use these methods in my scenario, > where I don't have factors other than the specified batch effects? > > In searching the bioconductor mailing list archive, I found: > > edgeR package, removeBatchEffect() function > > which seems to do exactly what I want, and I'll experiment with it shortly. > I'm mostly curious about what other methods might be available to do this, > and whether the SVA or other libraries contain functions that I should > explore. > > Many thanks in advance for any advice! > > ~brian > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Davide Risso, PhD Post Doctoral Scholar Department of Statistics University of California, Berkeley 344 Li Ka Shing Center, #3370 Berkeley, CA 94720-3370 E-mail: davide.risso at berkeley.edu
ADD COMMENTlink written 4.7 years ago by davide risso810
Answer: RNA-Seq, generate batch-free count matrix
0
gravatar for Steve Lianoglou
4.8 years ago by
Denali
Steve Lianoglou12k wrote:
Hi Brian, One comment: On Sat, Jul 19, 2014 at 8:01 PM, Brian Haas <bhaas at="" broadinstitute.org=""> wrote: [clip] > In searching the bioconductor mailing list archive, I found: > > edgeR package, removeBatchEffect() function Beware: this function is part of limma and not edgeR and therefore not really meant for (raw) count data. It does handle "voom"ed data, ie. data with weights, but to be technically/theoretically correct, your downstream analysis must also be one that deals with observational weights, which I'm guessing is unlikely. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech
ADD COMMENTlink written 4.8 years ago by Steve Lianoglou12k
Answer: RNA-Seq, generate batch-free count matrix
0
gravatar for Dario Strbenac
4.7 years ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:
Hello, RUVSeq is available in the Bioconductor project. -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia
ADD COMMENTlink written 4.7 years ago by Dario Strbenac1.4k
Answer: C: RNA-Seq, generate batch-free count matrix
0
gravatar for Jeff Leek
4.7 years ago by
Jeff Leek550
United States
Jeff Leek550 wrote:
Hi Brian, You could also use svaseq in the devel branch of sva. It gives very similar results to ruvseq. Here are some analysis files with examples: http://jtleek.com/svaseq/simulateData.html http://jtleek.com/svaseq/zebrafish.html http://jtleek.com/svaseq/geuvadis.html As Gordon mentioned, you should be careful about doing downstream differential expression analysis on "batch cleaned" data - since the degrees of freedom will not be correct you often end up with overly optimistic inference. There isn't currently a good statistical method for dealing with post batch removed data in differential expression or clustering analyses. Best, Jeff On Mon, Jul 21, 2014 at 12:36 PM, davide risso <risso.davide@gmail.com> wrote: > Hi Brian, > > as mentioned by Dario the RUVSeq package has a way to deal with your > situation. > The typical application of RUVSeq is differential expression, hence > "adding" the batch effect in the generalized linear model rather than > correcting the counts for the batch effects. > > However, the RUVSeq functions return a matrix of normalized counts > that are "batch effect free". The main risk is that you remove (part > of) the signal of interest when removing the batch effects (especially > if the two are correlated). On the other hand, if the batch effect and > the signal of interest are "not too correlated" RUVSeq will give you > exactly what you want. > > If you have replicate samples, we found that in practice the > "replicate method" (function RUVs) works much better than the > "negative controls" method (function RUVg) when dealing with > unsupervised problems. > > I hope this helps. > > Best, > davide > > On Sat, Jul 19, 2014 at 8:01 PM, Brian Haas <bhaas@broadinstitute.org> > wrote: > > Greetings all, > > > > I've been researching ways to remove batch effects from RNA-Seq count > > matrices. Basically, I'm starting with a counts matrix that includes > batch > > effects, and want to generate a new matrix of counts that has the batch > > effects removed. > > > > I'm looking to apply this to sets of RNA-Seq samples (~100 samples) that > > were sequenced in batches on different days (factor) and for which I also > > have other metadata with continuous values (covariates such as total > > sequenced reads in each sample, quality metrics, etc). I want to study > > all these samples in an unsupervised manner, and don't have a model for > > anything but the various batch effects that I want removed (ie. no cancer > > vs. normal labeling, instead they're all 'normal' and I'd like to see if > > they form clusters based on natural variation in the population, and > > perhaps identify subtypes). > > > > >From what I've read thus far, methods like sva (and the included Combat) > > require that you provide a model for the covariates that you do not want > > removed (biological factors) in addition to the ones you do want removed > > (batch effects). Is it not possible to use these methods in my > scenario, > > where I don't have factors other than the specified batch effects? > > > > In searching the bioconductor mailing list archive, I found: > > > > edgeR package, removeBatchEffect() function > > > > which seems to do exactly what I want, and I'll experiment with it > shortly. > > I'm mostly curious about what other methods might be available to do > this, > > and whether the SVA or other libraries contain functions that I should > > explore. > > > > Many thanks in advance for any advice! > > > > ~brian > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Davide Risso, PhD > Post Doctoral Scholar > Department of Statistics > University of California, Berkeley > 344 Li Ka Shing Center, #3370 > Berkeley, CA 94720-3370 > E-mail: davide.risso@berkeley.edu > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 4.7 years ago by Jeff Leek550
Many thanks, everyone! I'll see how far I can get with exploring these various methods and I'll post a summary. -- -- Brian J. Haas The Broad Institute http://broad.mit.edu/~bhaas [[alternative HTML version deleted]]
ADD REPLYlink written 4.7 years ago by Brian Haas90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 136 users visited in the last hour