Question: Estimating and removing batch effects from rna-seq dataset
gravatar for AB
2.5 years ago by
United States
AB10 wrote:

Hi everyone,

I'm working on RNA-Seq datasets containing 22 samples from 3 batches. I used mm10 as the reference genome and generated the count table from the BAM files using GenomicAlignments package and then used the rpkm() function to get the rpkm data. Now I want to perform batch correction using either the count matrix or the rpkm data. How do I use the sva package for this purpose ? Is there any other way I can perform batch estimation and correction with the data I have ? Can i use removeBatchEffects() function or is it only for microarray data ?



ADD COMMENTlink modified 2.5 years ago by Jeff Leek510 • written 2.5 years ago by AB10

As Ryan suggested, you should read through the vignettes for sva and RUVSeq, which were written to address this type of general question that you are asking.

Once you have tried to apply these procedures to your data and run into specific problems, you are welcome to come back to the support site and ask a more specific question that describes your problem and includes a code snippet that outlines the steps you've taken to get to the point you might be asking about.

At that point, you'll likely receive useful help.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Steve Lianoglou12k

Thank you Dr.Lianoglu. I tried to implement sva but I am getting this error.

Error in dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod)) : 
  requires numeric/complex matrix/vector arguments.

this is my code

data <- read.csv("data.csv")
type<-factor(c("ctrl","ctrl","ctrl","ctrl","mut","mut","mut","mut", "ctrl","ctrl","ctrl","ctrl","mut","mut","mut","mut", "ctrl","ctrl","mut", "mut","mut","mut"))

mod <- model.matrix(~type, colData=coldata)
mod0 <- model.matrix(~ 1, colData=coldata)
svseq <- svaseq(data, mod, mod0,


ADD REPLYlink written 2.5 years ago by AB10

You should post a new question on the support site for specific issues, but before you do that make sure that you are passing the correct data into the svaseq function.

In particular, the dat argument (your data variable) needs to be a matrix of counts (integers). A call to read.csv will return a data.frame, so you should check what types of things are stored in the columns of it (sapply(data, class)). These all need to be integer (or numeric), otherwise when you naively convert it (data) to a matrix, it will be upcast to whatever the most general column type is (maybe you have a character column in there?)

Also, I don't think your call to model.matrix is doing what you think it's doing, ie. I'm not sure what the colData argument is doing here (likely nothing), you just happen to get lucky and have the type variable defined in your global workspace. Please inspect those values (mod and mod0) to make sure they are what they should be, because the error you are getting suggest otherwise.


ADD REPLYlink written 2.5 years ago by Steve Lianoglou12k

How do you know you have 3 batches if you don't have any known batches? The sva documentation is fairly straightforward. What part of the documentation are you having difficulty with?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Ryan C. Thompson6.5k
gravatar for Gordon Smyth
2.5 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

removeBatchEffects() is not just for microarray data but (as the documentation tells you) it is intended to be used in conjunction with data exploration like a heatmap or MDS plot. It is not intended to be used in conjunction with a differential expression analysis. For a DE analysis, you can get the same effect as removeBatchEffects() just by adding the batches as an additive factor to the linear model.

ADD COMMENTlink written 2.5 years ago by Gordon Smyth33k
gravatar for Jeff Leek
2.5 years ago by
Jeff Leek510
United States
Jeff Leek510 wrote:


sva is mostly useful for estimating batch effects when you don't know what the batches/artifacts are. if you know the batch effects, you can build a model matrix including them as terms in the matrix and then use limma/voom, DESeq, edgeR, etc. 

If you are discovering batch effects from sequencing data you need to use something like svaseq (or just take the log of the counts plus a small constant and apply sva) 



ADD COMMENTlink written 2.5 years ago by Jeff Leek510
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 299 users visited in the last hour