Question

Cross-species analysis and normalization with DESeq2

0

Entering edit mode

rmurray2050 • 0

@rmurray2050-23617

Last seen 5.7 years ago

I am planning to use DESeq2 to compare gene expression in samples from two different species of mammals with sequenced genomes. My plan is to use the method mentioned in https://support.bioconductor.org/p/97936/, so to align the reads of each species to its genome using Salmon and to normalize the counts wrt library size as well as AvgTxLength.

(1) Do you think this approach is valid? i.e. would it be valid to compare read counts from two different species after they are normalized wrt library size and average transcript length as is done in DESeq2?

(2) I wanted to understand the normalization that DESeq2 does in more detail, so I dug through the DESeq2 source code and separately implemented the underlying steps that the DESeq2 wrapper function calls. What I was trying to reproduce is highlighted in the following two lines, where dds is a Tximported Salmon dataset:

dds_deseq <- DESeq2(dds)
normalizationFactors(dds_deseq)

I was able to get the exact same values that normalizationFactors gives with the following steps:

counts_dds <- counts(dds)

#from estimateSizeFactors.DESeqDataSet:
transcript_lengths <- assays(dds)[["avgTxLength"]] #transcript length matrix
norm.mat <- transcript_lengths / exp(rowMeans(log(transcript_lengths)))  #divide out geometric mean

#from estimateNormFactors:
sf <- estimateSizeFactorsForMatrix(counts_dds/norm.mat) #correcting for library size
nf <- t( t(norm.mat) * sf ) #transcript length normalization factors and size factors are multiplied
normalization_factors <- nf / exp(rowMeans(log(nf))) # normalization factors are still normalized

normalized_counts <- counts_dds/normalization_factors

Questions:

(2a) What is the purpose of nf <- t( t(norm.mat) * sf)? The code multiplies sf by the transpose of norm.mat, even though norm.mat was used to define sf. Why is norm.mat taken into account twice?

(2b) What is the purpose of normalization_factors <- nf / exp(rowMeans(log(nf)))? It seems that dividing nf by exp(rowMeans(log(nf)) doesn't change the values much

Thanks!

normalization deseq2 across species rnaseq • 1.9k views

ADD COMMENT • link updated 5.7 years ago by Michael Love 43k • written 5.7 years ago by rmurray2050 • 0

score 2 · Answer 1 · 2020-05-29

I don't have any particular comments on (1) beyond what's been said in the other thread. I don't do much cross species work. Seems reasonable to correct for average transcript length but I can't say if that is sufficient to deal with all the aspects that would affect different counts beside expression level. That's beyond scope of what I can provide feedback on.

2a - the size factors calculated here are to adjust for systematic differences across average transcript length normalized counts, so norm.mat is divided out to estimate sf then multiplied back in. Seems straightforward to me.

2b - if we didn't scale the rows of nf we could have the case that normalized counts are very far from the scale of original counts. Because we use normalized counts at various points in the pipeline, we want the normalization factors to be roughly around 1.