Question

normalisation of count data


sara.beier • 0

@sarabeier-10890

Last seen 6.0 years ago

Hello

I am using both the normMatrix and the controlGenes options in DESeq2 to create size factors for an RNA-Seq time series experiment, and I test for changes relative to T0 using the likelihood ratio test (LRT).

I work with a full dataset and a second dataset that contains count data for only a subset of genes; the same genes are passed to the controlGenes option in both cases. As expected, the normalization factors are the same for the genes present in both the full dataset and the subset. The baseMean in the results is also equal for genes present in both datasets. However, I was surprised to see that the log2FoldChange values against T0 change slightly. Shouldn't these values be constant for a given gene if both its raw counts and its normalization factors are constant, regardless of which other genes are in the dataset? Can somebody explain this to me?

Thanks in advance,

Sara

deseq2 • 527 views

updated 6.0 years ago by Ryan C. Thompson ★ 7.9k • written 6.0 years ago by sara.beier • 0

score 0 · Answer 1 · 2018-04-13


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

If you were fitting a linear model, the log fold changes would be identical. However, in a negative binomial GLM, the estimated log fold changes depend on the estimated dispersion parameter, and each gene's dispersion estimate depends on the other genes in the dataset. You might consider subsetting your genes after dispersion estimation, which ensures that the same dispersion estimates are used in both cases. This is usually the better approach anyway, since using more genes for dispersion estimation yields a more robust trend. (The exception would be genes that you discard for being outliers, since they might distort the dispersion trend.)
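To make the dependence on dispersion concrete, here is a minimal numerical sketch in plain Python. It is not DESeq2 code, and it simplifies the model to two groups (T0 and one later time point) with no shrinkage. The counts, size factors, dispersion values and function names are all made up for illustration. For one group with counts y_i, size factors s_i and dispersion alpha, the maximum-likelihood expression level q solves sum_i (y_i - s_i*q) / (1 + alpha*s_i*q) = 0. When the size factors differ between samples, the weights 1 / (1 + alpha*s_i*q) differ too, so the solution, and therefore the log2 fold change, moves with alpha.

```python
import math

def nb_mle_level(counts, size_factors, alpha, lo=1e-8, hi=1e8, iters=200):
    """MLE of the normalized level q for one group, assuming
    count_i ~ NB(mean = s_i * q, dispersion = alpha).

    Solves the score equation sum_i (y_i - s_i*q) / (1 + alpha*s_i*q) = 0
    by bisection on a log scale; the left-hand side is strictly
    decreasing in q, so the root is unique when any count is positive.
    """
    def score(q):
        return sum((y - s * q) / (1.0 + alpha * s * q)
                   for y, s in zip(counts, size_factors))

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def log2_fold_change(y_a, s_a, y_b, s_b, alpha):
    """log2(level in group B / level in group A) at a fixed dispersion."""
    q_a = nb_mle_level(y_a, s_a, alpha)
    q_b = nb_mle_level(y_b, s_b, alpha)
    return math.log2(q_b / q_a)

# Hypothetical data for a single gene: three T0 and three T1 samples.
Y_T0, S_T0 = [10, 40, 25], [0.5, 2.0, 1.0]
Y_T1, S_T1 = [80, 30, 150], [1.5, 0.6, 1.2]

# Same counts and size factors, two different dispersion estimates.
for alpha in (0.01, 1.0):
    lfc = log2_fold_change(Y_T0, S_T0, Y_T1, S_T1, alpha)
    print(f"alpha={alpha}: log2FoldChange={lfc:.4f}")
```

With equal size factors in every sample, the weights cancel and the estimate reduces to the ratio of group means for any alpha. That matches the linear-model intuition, but it does not hold once per-sample normalization factors are in play.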



I agree with Ryan. One detail is that dispersion outliers shouldn’t affect the dispersion trend, because the trend is fit iteratively while excluding genes that are outliers. This procedure in DESeq2 goes back to the DESeq method for fitting the trend using a gamma GLM. If it doesn’t converge after 10 iterations, it quits and uses loess.

Michael Love 41k • 6.0 years ago