Question

Accounting for 5'/3' Bias in DESeq2

Jakub ▴ 50

I didn't find an answer to this by searching the forums. I have RNA-seq samples with 5'/3' biases that are unevenly distributed amongst the samples: some of my conditions have more samples with the bias, some fewer. The reason for these biases is almost certainly different levels of RNA sample fragmentation or other differences in sample prep (PS: I realise that this is not a good start). As a result, DESeq2 calls genes as differentially expressed because of the differences in bias rather than the conditions themselves.

What is the best method for accounting for this variation in an objective way: the RUV package, adding 5'/3' calculated bias ratios to the GLM (e.g. from Picard), using residuals? Any opinions would be greatly appreciated.

Many thanks, J

Written 8.1 years ago by Jakub • updated 8.1 years ago by Ryan C. Thompson

score 3 · Answer 1 · 2016-03-30

Ryan C. Thompson ★ 7.9k

I've recently seen a paper that presents a metric meant to account for the integrity of each transcript/gene in each sample: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0922-z

They describe an adjustment that removes the dependency between this "transcript integrity number" and the logCPM of each gene by fitting a loess curve and then subtracting that curve out (see figure 6), and they demonstrate that their adjustment reduces the number of (presumed) false positives in a differential expression test. However, in the paper they implement their adjustment by modifying the counts directly. Instead, I would recommend you use the adjustment to compute an offset matrix, since it's important for edgeR and DESeq2 to have access to the raw counts so they can accurately account for the counting uncertainty.

Written 8.1 years ago • updated 8.0 years ago by Ryan C. Thompson

Thanks!

I've done exactly this: I computed TINs for each gene and performed the loess regression, so I now have the raw log counts and the corrected log counts. What I'm not clear on is which of the following values is best to use in a normFactor offset matrix, before normalising each row to a geometric mean of 1 as described in the vignette:

difference in absolute values, i.e. 10^corrected-10^raw
difference in log values, i.e. (corrected-raw)
difference in % abs values, i.e. 10^corrected/10^raw

PS: I used the % values (the ratio), because absolute differences can be negative and the matrix has to be positive, and because the package authors explicitly warn against using log differences.

Reply from Jakub • 8.0 years ago

Looking at the documentation, I see DESeq2 uses a matrix of "normalization factors" on the scale of the raw counts rather than a GLM offset matrix. The raw counts are divided by the normalization factors to get the normalized counts. So if normcounts = rawcounts / normfactors, then normfactors = rawcounts / normcounts. So compute that, then normalize the geometric mean of each row to 1 as described in the DESeq2 manual, and store these norm factors in the DESeqDataSet object. Finally, you'll need to run estimateSizeFactors, since the TIN normalization only normalizes within samples and you still need to normalize between samples. After that, you should be able to run through your standard DESeq2 pipeline and have it use your TIN-derived normalization factors.

(Mike, please correct me if I got anything wrong here.)
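The arithmetic in this reply (normfactors = rawcounts / normcounts, then scaling each row to a geometric mean of 1) can be sketched in a few lines. This is only an illustration in plain Python, not DESeq2 code: the function name `norm_factors` and the toy numbers are made up for the example, and it assumes every raw and corrected value is strictly positive and on the count scale (not the log scale).

```python
import math

def norm_factors(raw, normalized):
    """Per-gene, per-sample factors: raw / normalized, rescaled so that
    each row (gene) has a geometric mean of 1.

    Hypothetical helper for illustration; assumes all values are > 0.
    """
    factors = []
    for raw_row, norm_row in zip(raw, normalized):
        # normfactors = rawcounts / normcounts
        ratios = [r / n for r, n in zip(raw_row, norm_row)]
        # geometric mean of the row, computed on the log scale
        geo_mean = math.exp(sum(math.log(x) for x in ratios) / len(ratios))
        factors.append([x / geo_mean for x in ratios])
    return factors

# Toy data: 2 genes x 3 samples (values invented for the example).
raw = [[100.0, 200.0, 400.0], [10.0, 10.0, 40.0]]
tin_corrected = [[100.0, 100.0, 100.0], [10.0, 20.0, 20.0]]

nf = norm_factors(raw, tin_corrected)
```

Because each row is rescaled to a geometric mean of 1, the factors only redistribute counts between samples within a gene; they carry no library-size information, which is why a between-sample normalization step is still needed afterwards.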

Reply from Ryan C. Thompson • 8.0 years ago

Sounds right.

If you want to correct for library size on top of normalization factors, pass the normFactors matrix (with row-wise geometric means around 1) to the normMatrix argument of estimateSizeFactors:

normMatrix: optional, a matrix of normalization factors which do not
          control for library size.... Providing ‘normMatrix’ will estimate
          size factors on the count matrix divided by ‘normMatrix’ and
          store the product of the size factors and ‘normMatrix’ as
          ‘normalizationFactors’.
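What the help text describes can be sketched as follows: divide the counts by the matrix, estimate size factors on the result, and store the product of the size factors and the matrix. This is a simplified plain-Python illustration, not DESeq2's implementation. The helper names `size_factors` and `normalization_factors` are made up, the size factors use a basic median-of-ratios rule restricted to genes that are positive in every sample, and the toy numbers are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Simplified median-of-ratios size factors, one per sample (column).

    Hypothetical helper; only genes with positive values in all samples
    are used.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(v > 0 for v in row)]
    # log geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in usable]
    return [
        math.exp(median(math.log(row[j]) - lgm
                        for row, lgm in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]

def normalization_factors(counts, norm_matrix):
    """Size factors estimated on counts / norm_matrix, then multiplied
    back into norm_matrix, mirroring the description above."""
    adjusted = [[c / m for c, m in zip(c_row, m_row)]
                for c_row, m_row in zip(counts, norm_matrix)]
    sf = size_factors(adjusted)
    return [[sf[j] * m for j, m in enumerate(m_row)] for m_row in norm_matrix]

# Toy data: 3 genes x 2 samples; only the first gene has a non-trivial factor.
counts = [[50.0, 400.0], [50.0, 100.0], [30.0, 60.0]]
norm_matrix = [[0.5, 2.0], [1.0, 1.0], [1.0, 1.0]]

final = normalization_factors(counts, norm_matrix)
```

In this toy case the second sample is twice as deep as the first once the gene-specific factors are divided out, so the resulting matrix combines a library-size ratio of 2 between the columns with the per-gene factors supplied in `norm_matrix`.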

Reply from Michael Love • 8.0 years ago

score 2 · Answer 2 · 2016-03-30

It's hard to predict how the 5'/3' bias will affect the counts, although it's reasonable to expect that it will.

I'd recommend either svaseq (from the sva package) or RUVSeq, either of which will be able to pick up on systematic differences (including this bias) that affect the counts across many rows.

The only situation where these packages can't help you -- and I'm not sure any computational method can -- is if the 5'/3' bias is perfectly confounded with the condition.