I'm using voom and limma to analyse RNAseq data. I'm comparing the percentage distribution of reads among different categories of transcripts within a sample with the aim of making a stacked percentage bar plot (or similar), for instance: counts for protein-coding (97%) vs pre-miRNA (1%) vs snoRNAs (0.1%) etc. I'm wondering whether I should be using the raw counts output from read summarisation (eg featureCounts output) or say the "E" matrix of normalised log2 counts from the EList object after voom transformation, or if it makes little difference? I'm unsure if the normalised counts can be used for purposes other than in the limma pipeline or if they should be multiplied by the normalisation factors first?

If you want to examine the percentage distribution of reads between different types of transcripts, then it makes sense to use the original counts, e.g., those that you get out of featureCounts. For example, if you have a matrix of counts and a vector specifying the type of transcript for each row of the matrix, you could do something like

y <- aggregate(counts ~ type, FUN=sum)
barplot(as.matrix(y[,-1]))

The normalized log-counts from voom don't seem suitable for this application. It doesn't much sense to add log-counts from different genes, because that'd be equivalent to multiplying the original counts together.

Your second question would be easier to answer if you specify what non-limma application you want to use the normalized (log-)counts for. Nonetheless, here's some general advice. voom will automatically incorporate the normalization factors into the normalized expression values, so long as the input is a DGEList object. For example, you could create your DGEList from your count data, run calcNormFactors on that DGEList (or input your own normalization factors), and then supply the resulting object to voom.

Yeah I have used the original counts (incidentally the results are very similar using either original or voom transformed counts... as you might expect).

Re the voom normalised expression values:

I have noticed a curiosity. When I sum the columns by colSums(2^(v$E) they do not add up to 1,000,000 ie they do not appear to be log2cpm exactly? However, when I multiply by the norm factors colSums(2^(v$E))*v$targets$norm.factors I get basically 1,000,000. Is this the expected behavior? Minimal code example with the numbers below.

Yes, this makes sense. Remember that voom computes (log-)counts per million using the effective library size, i.e., the product of the library size and its normalization factor. Each CPM is a ratio of the count to the effective library size (after dividing the latter by 1 million). Summing across CPM values for each library would yield an expression where the numerator is the sum of counts across all genes, and the denominator is the effective library size divided by a million.

As the numerator is equal to the total library size, the value of the sum becomes equal to 1 million divided by the normalization factor for each library. So, if you multiply by the normalization factor, you'll end up with 1 million, as you have observed. Note that the numbers are actually higher as a continuity correction of 0.5 is added to the counts in voom prior to log-transformation; so, the numerator will be slightly greater than the library size.

Hi Aaron and Steve,

Thanks for the replies.

Yeah I have used the original counts (incidentally the results are very similar using either original or voom transformed counts... as you might expect).

Re the voom normalised expression values:

I have noticed a curiosity. When I sum the columns by

`colSums(2^(v$E)`

they do not add up to 1,000,000 ie they do not appear to be log2cpm exactly? However, when I multiply by the norm factors`colSums(2^(v$E))*v$targets$norm.factors`

I get basically 1,000,000. Is this the expected behavior? Minimal code example with the numbers below.Best,

Peter

Yes, this makes sense. Remember that

`voom`

computes (log-)counts per million using theeffectivelibrary size, i.e., the product of the library size and its normalization factor. Each CPM is a ratio of the count to the effective library size (after dividing the latter by 1 million). Summing across CPM values for each library would yield an expression where the numerator is the sum of counts across all genes, and the denominator is the effective library size divided by a million.As the numerator is equal to the total library size, the value of the sum becomes equal to 1 million divided by the normalization factor for each library. So, if you multiply by the normalization factor, you'll end up with 1 million, as you have observed. Note that the numbers are actually higher as a continuity correction of 0.5 is added to the counts in

`voom`

prior to log-transformation; so, the numerator will be slightly greater than the library size.Great, thanks for the clarification!