limma voom accessing normalised counts tables
2
0
Entering edit mode
Peter Crisp ▴ 10
@peter-crisp-6309
Last seen 6.9 years ago
Australia

Hi all,

I'm using voom and limma to analyse RNAseq data.  I'm comparing the percentage distribution of reads among different categories of transcripts within a sample with the aim of making a stacked percentage bar plot (or similar), for instance: counts for protein-coding (97%) vs pre-miRNA (1%) vs snoRNAs (0.1%) etc.   I'm wondering whether I should be using the raw counts output from read summarisation (eg featureCounts output) or say the "E" matrix of normalised log2 counts from the EList object after voom transformation, or if it makes little difference? I'm unsure if the normalised counts can be used for purposes other than in the limma pipeline or if they should be multiplied by the normalisation factors first?

Thanks,

Peter

limma rnaseq geneexpression voom counts • 5.9k views
3
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 4 hours ago
The city by the bay

If you want to examine the percentage distribution of reads between different types of transcripts, then it makes sense to use the original counts, e.g., those that you get out of featureCounts. For example, if you have a matrix of counts and a vector specifying the type of transcript for each row of the matrix, you could do something like

y <- aggregate(counts ~ type, FUN=sum)
barplot(as.matrix(y[,-1]))


The normalized log-counts from voom don't seem suitable for this application. It doesn't much sense to add log-counts from different genes, because that'd be equivalent to multiplying the original counts together.

Your second question would be easier to answer if you specify what non-limma application you want to use the normalized (log-)counts for. Nonetheless, here's some general advice. voom will automatically incorporate the normalization factors into the normalized expression values, so long as the input is a DGEList object. For example, you could create your DGEList from your count data, run calcNormFactors on that DGEList (or input your own normalization factors), and then supply the resulting object to voom.

0
Entering edit mode

Hi Aaron and Steve,

Thanks for the replies.

Yeah I have used the original counts (incidentally the results are very similar using either original or voom transformed counts... as you might expect).

Re the voom normalised expression values:

I have noticed a curiosity.  When I sum the columns by colSums(2^(v$E) they do not add up to 1,000,000 ie they do not appear to be log2cpm exactly? However, when I multiply by the norm factors colSums(2^(v$E))*v$targets$norm.factors I get basically 1,000,000.  Is this the expected behavior? Minimal code example with the numbers below.

Best,

Peter

#get counts
columns=c(1,7),
group=sample.groups,
labels=as.character(samples)
)

#calculate normfactor using edgeR
dge <- calcNormFactors(dge, method="TMM")

#sample summaries:
dge$samples lib.size norm.factors Sample_Control_1 30625879 0.9154784 Sample_Control_2 27570152 0.9213957 Sample_Control_3 30879502 0.9195541 Sample_TestGroupA_4 26584750 1.0313922 Sample_TestGroupA_5 28578391 0.9911917 Sample_TestGroupA_6 24986334 1.0279975 Sample_TestGroupB_7 27158791 0.9029378 Sample_TestGroupB_8 30399430 1.1605661 Sample_TestGroupB_9 36884178 1.1706503 #create model design <- model.matrix(~0+factor(keyfile$Group))
colnames(design) <- unique(keyfile$Group) colnames(design) [1] "Control" "TestGroupA" "TestGroupB" #voom transform v <- voom(dge,design,plot=TRUE) #sample summaries post voom data.frame(colSums(2^(v$E)))

colSums.2..v.E..
Sample_Control_1              1092675.4
Sample_Control_2              1085696.6
Sample_Control_3              1087829.4
Sample_TestGroupA_4         969921.5
Sample_TestGroupA_5        1009233.3
Sample_TestGroupA_6         973147.4
Sample_TestGroupB_7            1107896.5
Sample_TestGroupB_8             861926.8
Sample_TestGroupB_9             854453.5

#sample summaries post voom counts log2 cpm multiplied by norm.factors
data.frame(colSums(2^(v$E))*v$targets\$norm.factors)

colSums.2..v.E.....v.targets.norm.factors
Sample_Control_1                                         1000321
Sample_Control_2                                         1000356
Sample_Control_3                                         1000318
Sample_TestGroupA_4                                   1000369
Sample_TestGroupA_5                                   1000344
Sample_TestGroupA_6                                   1000393
Sample_TestGroupB_7                                       1000362
Sample_TestGroupB_8                                       1000323
Sample_TestGroupB7_9                                       1000266
0
Entering edit mode

Yes, this makes sense. Remember that voom computes (log-)counts per million using the effective library size, i.e., the product of the library size and its normalization factor. Each CPM is a ratio of the count to the effective library size (after dividing the latter by 1 million). Summing across CPM values for each library would yield an expression where the numerator is the sum of counts across all genes, and the denominator is the effective library size divided by a million.

As the numerator is equal to the total library size, the value of the sum becomes equal to 1 million divided by the normalization factor for each library. So, if you multiply by the normalization factor, you'll end up with 1 million, as you have observed. Note that the numbers are actually higher as a continuity correction of 0.5 is added to the counts in voom prior to log-transformation; so, the numerator will be slightly greater than the library size.

0
Entering edit mode

Great, thanks for the clarification!

0
Entering edit mode
@steve-lianoglou-2771
Last seen 3 days ago
Denali

I agree with what Aaron is saying: for the purposes you are asking for, the original counts are most appropriate.