(batch) corrections of RNA-seq data: integrating LIMMA and SVA
Entering edit mode
Bogdan ▴ 620
Last seen 14 days ago
Palo Alto, CA, USA

Dear all,

i am aiming to integrate limma and sva for RNAseq analysis, and i would appreciate please an advice on the following R code (it is below, is does it look correct ?), and your insights on 2 questions :

1 -- could SVASEQ work on logCPM values ? It works with raw counts, however I am receiving errors when using logCPM or CPM in svaseq function.

2 -- if SVA computes the surrogate variable based on RAW counts, and LIMMA (typically) uses logCPM, is it OK to do : fit <- lmFit(logCPM, modSv); where modSV has been estimated based on RAW counts ?

The R code that I am using is below :


pheno <- read.delim("pheno.data", row.names="Samples")

#> pheno
#       sample subject experiment
#siCTL1      1       1      siCTL
#siCTL2      2       2      siCTL
#siCTL3      3       3      siCTL
#siX1        4       1        siX
#siX2        5       2        siX
#siX3        6       3        siX

eset <- as.matrix(read.delim("eset.data", row.names="Gene"))

#> head(eset)
#       siCTL1 siCTL2 siCTL3 siX1 siX2 siX3
#Xkr4       51      0      0   50    0    0
#Rp1         6      0      0    1    0    0
#Sox17     373      0      0  206    0    0
#Mrpl15    973   2158   1251 1079 1749 1293
#Lypla1   6419   4172    882 5016 3001 1398
#Tcea1    2681   1942    717 2213 1404  929

### keeping only non-zero values in eset
nonzero_rows = apply(eset, 1, function(x) !all(x==0) )
eset_nonzero <- eset[nonzero_rows,]

### setting up the FACTOR GROUP
group <- factor(pheno$experiment)
group <- relevel(group, ref="siCTL")

### setting up the FACTOR SUBJECT
subject <- factor(pheno$subject)
#subject <- factor(c(1,2,3,1,2,3))

## setting up the DESIGN MATRIX for EDGER/LIMMA
design <- model.matrix(~group+subject, data=pheno)

## estimating size factors on raw counts
## running svaseq on normalized counts

### slightly more filtering based on CPM values.

y <- DGEList(counts=eset_nonzero, group=group)

keep <- rowSums(cpm(y) > 0.5) >= 6
y <- y[keep,,keep.lib.sizes=FALSE]

y <- calcNormFactors(y)
logCPM <- cpm(y,log=TRUE,prior.count=3)

### the code related to SVA ....

mod = model.matrix(~group+subject, data=pheno)
mod0 = model.matrix(~1, data=pheno)

n.sv = num.sv(eset_nonzero, mod, method="leek")
## n.sv = num.sv(logCPM, mod, method="leek")
## n.sv = num.sv(eset_nonzero, mod, method="be)

svobj = svaseq(eset_nonzero, mod, mod0, n.sv=n.sv)

modSv = cbind(mod, svobj$sv)

# mod0Sv = cbind(mod0, svobj$sv)
# pValuesSv = f.pvalue(eset_nonzero,modSv,mod0Sv)
# qValuesSv = p.adjust(pValuesSv,method="fdr")

## instead of :
## fit <- lmFit(logCPM, design)

fit <- lmFit(logCPM, modSv)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)

results <- topTable(fit, coef=2, adjust="fdr", number=Inf)
write.table(results, file="zz.be.results.edgeR.txt",
                           sep="\t", eol="\n", row.names=TRUE, col.names=TRUE)




limma sva svaseq • 1.8k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 8 hours ago
The city by the bay

1) You don't mention the type of error you're getting, but if you look at the documentation for svaseq, it says that the supplied data matrix dat will be log-transformed after adding constant. Clearly, giving the function log-values to be log-transformed again would be silly.

2) Well, it's probably okay because svaseq operates internally on log-counts. As long as the surrogate variables are correctly identified, it shouldn't matter how you get there. I haven't used sva but I suspect you may need to divide each count by the (effective) library size before running the function, otherwise the first surrogate variable would just represent the differences in library size.

There are, however, other issues with your code. The first is why you are using limma without voom to analyse RNA-seq data. There had better be good reasons for using the limma-trend approach as it does not perform as well when your library sizes are very different. The second point is that you should be ensuring that pheno$subject is a factor when constructing the design matrix. In your code, only subject is guaranteed to be a factor, not pheno$subject. Thirdly, your second filtering step requires all 6 libraries to have CPMs above 0.5. This will remove strongly DE genes that are silent in one condition.

Entering edit mode

Dear Aaron, thank you again for your comments ! i will modify the R code, and will re-post the questions. Thanks a lot !


Login before adding your answer.

Traffic: 395 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6