I am planning to use DESeq2 to compare gene expression in samples from two different species of mammals with sequenced genomes. My plan is to use the method mentioned in https://support.bioconductor.org/p/97936/, so to align the reads of each species to its genome using Salmon and to normalize the counts wrt library size as well as AvgTxLength.
(1) Do you think this approach is valid? i.e. would it be valid to compare read counts from two different species after they are normalized wrt library size and average transcript length as is done in DESeq2?
(2) I wanted to understand the normalization that DESeq2 does in more detail, so I dug through the DESeq2 source code and separately implemented the underlying steps that the DESeq2 wrapper function calls. What I was trying to reproduce is highlighted in the following two lines, where dds
is a Tximported Salmon dataset:
dds_deseq <- DESeq2(dds)
normalizationFactors(dds_deseq)
I was able to get the exact same values that normalizationFactors
gives with the following steps:
counts_dds <- counts(dds)
#from estimateSizeFactors.DESeqDataSet:
transcript_lengths <- assays(dds)[["avgTxLength"]] #transcript length matrix
norm.mat <- transcript_lengths / exp(rowMeans(log(transcript_lengths))) #divide out geometric mean
#from estimateNormFactors:
sf <- estimateSizeFactorsForMatrix(counts_dds/norm.mat) #correcting for library size
nf <- t( t(norm.mat) * sf ) #transcript length normalization factors and size factors are multiplied
normalization_factors <- nf / exp(rowMeans(log(nf))) # normalization factors are still normalized
normalized_counts <- counts_dds/normalization_factors
Questions:
(2a) What is the purpose of nf <- t( t(norm.mat) * sf)
? The code multiplies sf
by the transpose of norm.mat
, even though norm.mat
was used to define sf
. Why is norm.mat
taken into account twice?
(2b) What is the purpose of normalization_factors <- nf / exp(rowMeans(log(nf)))
? It seems that dividing nf
by exp(rowMeans(log(nf))
doesn't change the values much
Thanks!