Dear all,
we have 12 RNA-seq samples of pairs "CSC" - "noncsc" ("csc" stands from cancer stem cells; and each pair CSC-noncsc comes from a different mouse). I write the following code for computing the differential expression with limma on the paired samples. Please could you let me know if it is correct. Thank you !
eset <- read.delim("mm10.genes.FPKM.for.Erica.and.Ariel.integrated.with.edgeR.version8feb2016.for-PCA.renaming",row.names="Symbol")
group <- factor(c("csc","csc","csc","csc","csc","csc","non","non","non","non","non","non"))
subject <- factor(c(1,2,3,4,5,6,1,2,3,4,5,6))
design <- model.matrix(~0+group+subject)
contrast.matrix <- makeContrasts(groupcsc-groupnon, levels=design)
fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results <- topTable(fit, coef=1, adjust="fdr", number=24073)
write.table(results, file="results", sep="\t", eol="\n", row.names=TRUE, col.names=TRUE)
sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_3.26.9
loaded via a namespace (and not attached):
[1] tools_3.2.4
Looks mostly fine to me. However, you might want to do some filtering to remove low-abundance genes at the start (i.e., before running
calcNormFactors
, and definitely before runningvoom
, otherwise you'll end up with a strange mean-variance trend). Also, for future reference, modify your original question rather than posting a new answer.thanks, Aaron. will open a new question about edgeR soon. On the same topic of limma though, a little additional question please (on a tiny detail): when I set up the design, it looks that the word "group" is added automatically to the names of column1 and column2 : is it supposed to be this way ?
> group <- factor(c("csc","csc","csc","csc","csc","csc","non","non","non","non","non","non"))
> subject <- factor(c(1,2,3,4,5,6,1,2,3,4,5,6))
> design <- model.matrix(~0+group+subject)
> design
groupcsc groupnon subject2 subject3 subject4 subject5 subject6
1 1 0 0 0 0 0 0
2 1 0 1 0 0 0 0
3 1 0 0 1 0 0 0
4 1 0 0 0 1 0 0
5 1 0 0 0 0 1 0
6 1 0 0 0 0 0 1
7 0 1 0 0 0 0 0
8 0 1 1 0 0 0 0
9 0 1 0 1 0 0 0
10 0 1 0 0 1 0 0
11 0 1 0 0 0 1 0
12 0 1 0 0 0 0 1
and we do :
contrast.matrix <- makeContrasts(groupcsc-groupnon, levels=design)
dge <- DGEList(counts=eset,group=group)
dge <- calcNormFactors(dge)
v <- voom(dge,design,plot=TRUE)
fit <- lmFit(v, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
Yes, that's meant to happen. You could imagine if you had two factors, both of which have the level "X"; if you didn't put the factor name in front, you wouldn't be able to figure out which factor a column named "X" would be referring to.