EDASeq normalization: genes with 0 counts
1
0
Entering edit mode
gprezza • 0
@611515a7
Last seen 3 months ago
Germany

I am analyzing an RNA-seq experiment in which I compare a control library (high coverage) with a treatment one (low coverage). There is some sample-specific length bias across samples and I am trying to correct this with the EDASeq package.

data <- newSeqExpressionSet(counts=as.matrix(reads[keep,]),
featureData=featureData,
phenoData=data.frame(
conditions=conds,

dataOffset <- withinLaneNormalization(data,"length", which="full",offset=TRUE)
dataOffset <- betweenLaneNormalization(dataOffset, which="full",offset=TRUE)


After normalization, I do DGE analysis with edgeR:

a <- DGEList(counts=counts(dataOffset), group=conds, genes=as.character(rownames(reads[keep,])))
a\$offset <- -offst(dataOffset)
a <- estimateDisp(a, design)
fit <- glmQLFit(a,design,robust=TRUE)


If we look at the volcano plot from the DGE analysis, there's a weird cluster of genes that are clearly separated from the "volcano":

These genes are the ones that have 0 (or very low) read counts in the treatment library (which has a much lower coverage than the control one). If we look at the raw and normalized counts of these genes, I see that often they have originally 0 reads, but after normalization they get assigned with a discrete number of reads. This number is usually relatively different between replicates, and I guess that's why edgeR (correctly I guess) assigns a high FDR to these genes.

I am however a bit confused by how can a gene with 0 reads be assigned any number of reads after normalization. Is this something that is expected? I don't recall ever seeing a volcano plot with a separated cluster of genes as I am seeing now.

edgeR EDASeq • 287 views
2
Entering edit mode
davide risso ▴ 870
@davide-risso-5075
Last seen 3 months ago

Hi,

you are using full-quantile normalization to normalize the data. This method matches the empirical distribution of all samples to that of a reference sample (usually taken as the median across all samples). If, as in your case, samples have very different coverages, it may happen that the zeroes are transformed in non-zero values.

As a (somewhat extreme) toy example, consider if you only had two samples and 6 genes that looked like this:

> x = matrix(c(0, 0, 1, 2, 4, 5, 10, 12, 12, 13, 14, 15), ncol=2)
> x
[,1] [,2]
[1,]    0   10
[2,]    0   12
[3,]    1   12
[4,]    2   13
[5,]    4   14
[6,]    5   15


Because the median of 0 and 10 is 5, the first 0 will be transformed to 5 (and the second to 6).

> betweenLaneNormalization(x, which="full")
[,1] [,2]
[1,]    5    5
[2,]    6    6
[3,]    6    6
[4,]    8    8
[5,]    9    9
[6,]   10   10


If you want to preserve the zeros in the normalized data, you have to use one of the many scaling normalizations available, for instance with upper-quartile scaling, the same toy example will look like this

> betweenLaneNormalization(x, which="upper")
[,1] [,2]
[1,]    0    6
[2,]    0    8
[3,]    2    8
[4,]    5    8
[5,]   10    9
[6,]   12    9
attr(,"scaled:scale")
[1] 0.4057971 1.5942029


I hope this helps.

0
Entering edit mode

This is very clear and it definitely helps. Thanks for taking the time to reply!