Question: Accounting for 5'/3' Bias in DESeq2
2.7 years ago by
Jakub30
United Kingdom
Jakub30 wrote:

I didn't find an answer to this when searching the forums. I have RNA-seq samples with 5'/3' biases that are unevenly distributed amongst the samples. Some of my conditions have more samples with the bias, some fewer. The reason for these biases is almost certainly different levels of RNA fragmentation or other differences in sample prep (PS: I realise that this is not a good start). As a result, DESeq2 calls genes as DE between groups that differ in the bias, not just in the condition.

What is the best method for accounting for this variation in an objective way: the RUV package, adding calculated 5'/3' bias ratios (e.g. from Picard) to the GLM, or using residuals? Any opinions would be greatly appreciated.

Many thanks, J

modified 2.7 years ago by Ryan C. Thompson7.1k • written 2.7 years ago by Jakub30
2.7 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.1k wrote:

I've recently seen a paper that presents a metric meant to account for the integrity of each transcript/gene in each sample: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0922-z

They describe an adjustment that removes the dependency between this "transcript integrity number" and the logCPM of each gene by fitting a loess curve and then subtracting that curve out (see figure 6), and they demonstrate that their adjustment reduces the number of (presumed) false positives in a differential expression test. However, in the paper they implement their adjustment by modifying the counts directly. Instead, I would recommend you use the adjustment to compute an offset matrix, since it's important for edgeR and DESeq2 to have access to the raw counts so they can accurately account for the counting uncertainty.

Thanks!

I've done exactly this and computed TINs for each gene, and performed the loess regression. I now have the raw logcounts and corrected logcounts. I guess I am not clear in my head which value is best to use in a normFactor offset matrix, before normalising each row to a geometric mean of 1 as described in the vignette.

• difference in absolute values, i.e. 10^corrected-10^raw
• difference in log values, i.e. (corrected-raw)
• difference in % abs values, i.e. 10^corrected/10^raw

PS: I used the % values, since absolute differences can be negative and the matrix has to be positive, and the package authors explicitly warn against using log differences.

Looking at the documentation, I see DESeq2 uses a matrix of "normalization factors" on the scale of the raw counts rather than a GLM offset matrix. The raw counts are divided by the normalization factors to get the normalized counts. So if normcounts = rawcounts / normfactors, then normfactors = rawcounts / normcounts. So compute that, then normalize the geometric mean of each row to 1 as described in the DESeq2 manual, and store these norm factors in the DESeqDataSet object. Finally, you'll need to run estimateSizeFactors, since the TIN normalization only normalizes within samples and you still need to normalize between samples. After that, you should be able to run through your standard DESeq2 pipeline and have it use your TIN-derived normalization factors.
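To make the arithmetic above concrete, here is a minimal NumPy sketch of the two steps (normfactors = rawcounts / normcounts, then scaling each row to a geometric mean of 1). It is not DESeq2 code; the function name `tin_norm_factors` and the toy matrices are made up for illustration, and it assumes every count is strictly positive (genes with zeros would need separate handling).

```python
import numpy as np

def tin_norm_factors(raw_counts, corrected_counts):
    """Gene-by-sample factors on the count scale, each row with geometric mean 1.

    Hypothetical helper: assumes both matrices are genes x samples and
    strictly positive.
    """
    raw = np.asarray(raw_counts, dtype=float)
    corrected = np.asarray(corrected_counts, dtype=float)
    # normcounts = rawcounts / normfactors  =>  normfactors = rawcounts / normcounts
    factors = raw / corrected
    # Geometric mean of each row (gene), computed on the log scale.
    row_geo_mean = np.exp(np.mean(np.log(factors), axis=1, keepdims=True))
    # After this division every row has geometric mean 1.
    return factors / row_geo_mean

# Toy data: 2 genes x 3 samples (made-up numbers).
raw = np.array([[100.0, 200.0, 400.0],
                [50.0, 50.0, 50.0]])
corrected = np.array([[100.0, 100.0, 100.0],
                      [25.0, 50.0, 100.0]])
nf = tin_norm_factors(raw, corrected)
```

If the raw and corrected values are held as log10 counts, the same ratio is 10^(raw - corrected), which is always positive.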

(Mike, please correct me if I got anything wrong here.)

Sounds right.

If you want to correct for library size on top of normalization factors, pass the normFactors matrix (with row-wise geometric means around 1) to the normMatrix argument of estimateSizeFactors:

normMatrix: optional, a matrix of normalization factors which do not
control for library size.... Providing ‘normMatrix’ will estimate
size factors on the count matrix divided by ‘normMatrix’ and
store the product of the size factors and ‘normMatrix’ as
‘normalizationFactors’.
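As a rough illustration of what that description amounts to, the sketch below re-implements the idea in NumPy: estimate size factors by the median-of-ratios method on counts divided by the matrix, then multiply the matrix by those size factors column-wise. This is only my reading of the help text, not DESeq2's source; the function names and toy numbers are made up, and in practice you would just call estimateSizeFactors with normMatrix in R.

```python
import numpy as np

def median_ratio_size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Log geometric mean of each gene; -inf if any sample has a zero count.
    log_geo_means = log_counts.mean(axis=1)
    usable = np.isfinite(log_geo_means)
    # Per-sample median of log(count / gene geometric mean) over usable genes.
    log_ratios = log_counts[usable] - log_geo_means[usable, None]
    return np.exp(np.median(log_ratios, axis=0))

def normalization_factors(counts, norm_matrix):
    """Size factors on counts / norm_matrix, then their product with norm_matrix."""
    counts = np.asarray(counts, dtype=float)
    norm_matrix = np.asarray(norm_matrix, dtype=float)
    size_factors = median_ratio_size_factors(counts / norm_matrix)
    return norm_matrix * size_factors[None, :], size_factors

# Toy data: 3 genes x 2 samples; sample 2 is sequenced twice as deeply.
norm_matrix = np.array([[0.5, 2.0],
                        [1.0, 1.0],
                        [2.0, 0.5]])   # row-wise geometric means are 1
base = np.array([10.0, 20.0, 40.0])
depth = np.array([1.0, 2.0])
counts = base[:, None] * depth[None, :] * norm_matrix

nf_full, sf = normalization_factors(counts, norm_matrix)
normalized = counts / nf_full  # each gene should now be flat across samples
```

In this toy case the depth difference ends up in the size factors and the gene-specific distortion stays in the matrix, so dividing by the product removes both.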
2.7 years ago by
Michael Love20k
United States
Michael Love20k wrote:

It's hard to predict how the 5'/3' bias will affect the counts, although it's reasonable to expect that it will.

I'd recommend either svaseq (from the sva package) or RUVSeq, either of which will be able to pick up on systematic differences (including this bias) that affect the counts across many rows.

The only situation where these packages can't help you -- and I'm not sure any computational method can -- is if the 5'/3' bias is perfectly confounded with the condition.
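One way to see why perfect confounding is a dead end is to look at it from the design-matrix side: if a bias indicator is identical to the condition indicator, the two columns are collinear and their effects cannot be separated. The NumPy sketch below checks this with a rank test. The helper name `is_estimable` and the sample labels are invented for illustration, and it models the bias as a simple 0/1 covariate, which is a simplification of what svaseq or RUVSeq estimate.

```python
import numpy as np

def is_estimable(condition, bias):
    """True if intercept, condition and bias columns are linearly independent."""
    design = np.column_stack([np.ones(len(condition)), condition, bias])
    return np.linalg.matrix_rank(design) == design.shape[1]

# Six hypothetical samples, three per condition.
condition = np.array([0, 0, 0, 1, 1, 1])

# Bias present in some samples of each condition: effects can be separated.
partly_confounded = np.array([0, 1, 0, 1, 1, 0])

# Bias present in exactly the treated samples: the columns coincide.
perfectly_confounded = condition.copy()

separable = is_estimable(condition, partly_confounded)
not_separable = is_estimable(condition, perfectly_confounded)
```

With the partly confounded labels the design has full rank; with the perfectly confounded labels it does not, so no model can say how much of the difference is bias and how much is condition.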