I am analyzing an RNA-seq experiment in which I compare a control library (high coverage) with a treatment one (low coverage). There is some sample-specific length bias across samples and I am trying to correct this with the EDASeq package.
data <- newSeqExpressionSet(counts=as.matrix(reads[keep,]), featureData=featureData, phenoData=data.frame( conditions=conds, row.names=colnames(reads))) dataOffset <- withinLaneNormalization(data,"length", which="full",offset=TRUE) dataOffset <- betweenLaneNormalization(dataOffset, which="full",offset=TRUE)
After normalization, I do DGE analysis with edgeR:
a <- DGEList(counts=counts(dataOffset), group=conds, genes=as.character(rownames(reads[keep,]))) a$offset <- -offst(dataOffset) a <- estimateDisp(a, design) fit <- glmQLFit(a,design,robust=TRUE)
If we look at the volcano plot from the DGE analysis, there's a weird cluster of genes that are clearly separated from the "volcano":
These genes are the ones that have 0 (or very low) read counts in the treatment library (which has a much lower coverage than the control one). If we look at the raw and normalized counts of these genes, I see that often they have originally 0 reads, but after normalization they get assigned with a descrete number of reads. This number is usually relatively different between replicates, and I guess that's why edgeR (correctly?) assign a high FDR to these genes.
I am however a bit confused by how can a gene with 0 reads be assigned any number of reads after normalization. Is this something that is expected? I don't recall ever seeing a volcano plot with a separated cluster of genes as I am seeing now.