You could also use svaseq in the devel branch of sva. It gives very
similar results to RUVSeq. Here are some analysis files with examples:
As Gordon mentioned, you should be careful about doing downstream
differential expression analysis on "batch cleaned" data - since the
degrees of freedom will not be correct you often end up with overly
optimistic inference. There isn't currently a good statistical method for
dealing with batch-corrected data in differential expression or
On Mon, Jul 21, 2014 at 12:36 PM, davide risso
> Hi Brian,
> as mentioned by Dario the RUVSeq package has a way to deal with your
> The typical application of RUVSeq is differential expression, hence
> "adding" the batch effect in the generalized linear model rather than
> correcting the counts for the batch effects.
> However, the RUVSeq functions return a matrix of normalized counts
> that are "batch effect free". The main risk is that you remove (part
> of) the signal of interest when removing the batch effects
> (if the two are correlated). On the other hand, if the batch effect and
> the signal of interest are "not too correlated" RUVSeq will give you
> exactly what you want.
> If you have replicate samples, we found that in practice the
> "replicate method" (function RUVs) works much better than the
> "negative controls" method (function RUVg) when dealing with
> unsupervised problems.
> I hope this helps.
> On Sat, Jul 19, 2014 at 8:01 PM, Brian Haas
> > Greetings all,
> > I've been researching ways to remove batch effects from RNA-Seq
> > matrices. Basically, I'm starting with a counts matrix that has batch
> > effects, and want to generate a new matrix of counts that has those
> > effects removed.
> > I'm looking to apply this to sets of RNA-Seq samples (~100 samples) that
> > were sequenced in batches on different days (factor) and for which I
> > have other metadata with continuous values (covariates such as
> > sequenced reads in each sample, quality metrics, etc). I want to
> > all these samples in an unsupervised manner, and don't have a
> > anything but the various batch effects that I want removed (ie. no
> > vs. normal labeling; instead they're all 'normal' and I'd like to see if
> > they form clusters based on natural variation in the population,
> > perhaps identify subtypes).
> > From what I've read thus far, methods like sva (and the included ComBat)
> > require that you provide a model for the covariates that you do not want
> > removed (biological factors) in addition to the ones you do want removed
> > (batch effects). Is it not possible to use these methods in my case,
> > where I don't have factors other than the specified batch effects?
> > In searching the bioconductor mailing list archive, I found:
> > edgeR package, removeBatchEffect() function
> > which seems to do exactly what I want, and I'll experiment with it.
> > I'm mostly curious about what other methods might be available to
> > and whether the SVA or other libraries contain functions that I should
> > explore.
> > Many thanks in advance for any advice!
> > ~brian
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> Davide Risso, PhD
> Post Doctoral Scholar
> Department of Statistics
> University of California, Berkeley
> 344 Li Ka Shing Center, #3370
> Berkeley, CA 94720-3370
> E-mail: firstname.lastname@example.org
Originally I mistyped the argument to removeBatchEffect() as y instead of y2, now corrected.
If I want to also remove effects related to total read counts or coverage biases or other library quality metrics, which sometimes end up being highly correlated with principal component 1, do I include these continuous values as covariates in the removeBatchEffect() method? I was looking for some examples on how to use it, but couldn't find any.
Yes, just input continuous variables as covariates.
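For anyone else looking for an example, here is a minimal sketch of passing a continuous variable through the `covariates` argument of limma's removeBatchEffect(). The data are simulated, and the names `lib_quality` and `batch_shift`, along with the effect sizes, are made up for illustration; substitute your own log-expression matrix and sample metadata.

```r
library(limma)

set.seed(1)
n_genes   <- 200
n_samples <- 12

# Hypothetical metadata: a sequencing-day factor and one continuous quality metric
batch       <- factor(rep(c("day1", "day2", "day3"), each = 4))
lib_quality <- rnorm(n_samples)

# Simulated log-expression matrix (genes x samples) with a batch shift and
# a linear dependence on the quality metric added to every gene
batch_shift <- c(day1 = 0, day2 = 1.5, day3 = -1)[as.character(batch)]
logCPM <- matrix(rnorm(n_genes * n_samples, mean = 5), n_genes, n_samples)
logCPM <- sweep(logCPM, 2, batch_shift + 0.8 * lib_quality, "+")

# Remove the batch factor and the continuous covariate in one call
logCPMc <- removeBatchEffect(logCPM, batch = batch, covariates = lib_quality)
```

Several continuous variables can be supplied as the columns of a numeric matrix with one row per sample. As noted elsewhere in the thread, the corrected matrix is meant for clustering and visualisation rather than for downstream differential expression testing.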
I know this post is from a year ago but I've only just had the need to use batch correction on my data. Thank you so much for this post, it has been very helpful to me.
I have a couple of minor clarifications to pursue...
First, in the filtering step above, y2 is the subset of y where the average log CPM > 1, right? So should it say
rather than what it says below?
Assuming my first comment isn't absolute gibberish, y2 being a subset of y means it is as yet un-logged. This means that logCPM is on a log scale but logCPMc is on an unlogged scale. I only noticed this because I made before-and-after plots with plotMDS and saw that the scales differed vastly.
I can think of two potential solutions: (1) turn 'log' off in the cpm function, or (2) pass logCPM into the removeBatchEffect function. Having read the removeBatchEffect manual, the second option seems the more correct one, since the function assumes it is given log-expression values.
I am writing to check if my line of thought is correct and to hopefully help others who might be toying with a similar notion, however minor that group might be.
Thank you for your time.
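A minimal sketch of the second option, for readers who want to compare scales themselves. The counts and batch labels are simulated and the variable names simply follow the ones used in this thread; this is not a reproduction of the original post's code, and the filtering threshold and `prior.count` value are assumptions.

```r
library(edgeR)   # provides DGEList, aveLogCPM, calcNormFactors, cpm
library(limma)   # provides removeBatchEffect

set.seed(2)
# Simulated count matrix (500 genes x 8 samples) and hypothetical batch labels
counts <- matrix(rnbinom(500 * 8, mu = 50, size = 5), nrow = 500)
batch  <- factor(rep(c("A", "B"), each = 4))

y    <- DGEList(counts = counts)
keep <- aveLogCPM(y) > 1        # filter on average log2 CPM
y2   <- y[keep, ]
y2   <- calcNormFactors(y2)

# Convert to log2 CPM first, then remove the batch effect on that scale
logCPM  <- cpm(y2, log = TRUE, prior.count = 5)
logCPMc <- removeBatchEffect(logCPM, batch = batch)
```

With this ordering both logCPM and logCPMc are on the log2 scale, so before-and-after plotMDS plots should be directly comparable.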
Thank you for this!