Question: Using duplication rate as a covariate
9 weeks ago by rbutler

rbutler wrote:

I'm working with a workflow that uses fastp -> Salmon -> DESeq2.

Is it generally considered good practice to control for fastp's read duplication rate and/or Salmon's percent mapped (from meta_info.json) when doing a DESeq2 DE analysis? I have noticed a fair amount of variability across a set of samples from the same prep batch and sequencing run (duplication rate, 22-52%; percent mapped, 82-92%). Duplication rate in particular seems relevant, as I didn't expect it to be that variable, and previous workflows I had done with STAR had me remove duplicate reads altogether.

I mean, it would be easy enough to use ~ read_dups + trt or ~ read_dups + map_rate + trt, but are there arguments against doing this (e.g., overfitting or removing true biological variation)?
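For concreteness, a minimal sketch of what that design would look like in R, assuming a tximport object `txi` and a `coldata` data frame with the column names from the post (`read_dups` and `map_rate` numeric, `trt` a factor); centering and scaling the continuous covariates keeps the coefficients interpretable:

```r
## Hypothetical sketch: adding technical covariates to a DESeq2 design.
## Assumes `txi` (tximport output) and `coldata` already exist.
library(DESeq2)

## Center and scale continuous covariates so their coefficients are
## per-SD effects and the intercept stays interpretable.
coldata$read_dups <- scale(coldata$read_dups)[, 1]
coldata$map_rate  <- scale(coldata$map_rate)[, 1]

## Put the variable of interest last; results() tests it by default.
dds <- DESeqDataSetFromTximport(txi, colData = coldata,
                                design = ~ read_dups + map_rate + trt)
dds <- DESeq(dds)
res <- results(dds)  # tests trt, adjusted for the technical covariates
```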

deseq2 salmon fastp • 90 views
modified 9 weeks ago by Michael Love • written 9 weeks ago by rbutler
Answer: Using duplication rate as a covariate
9 weeks ago by Michael Love (United States)

Michael Love wrote:

I don't typically add covariates like RIN, TIN, duplication rate, or mapping rate.

My preferred approach to controlling for technical variation is either through Salmon's bias terms (GC, positional, etc.), or otherwise with RUV or SVA, providing these packages with the condition variable.
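For reference, the bias terms mentioned above are enabled with flags at quantification time; a sketch of a Salmon call (index path, sample file names, and output directory are placeholders):

```shell
# Enable Salmon's sequence-specific, fragment-GC, and positional
# bias models at quantification time (placeholder paths).
salmon quant -i salmon_index -l A \
    -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    --seqBias --gcBias --posBias \
    -p 8 -o quants/sample
```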

Follow-up question:

The vignettes have examples that use 2 SVs. Do you ever use more than 2? Using svaseq to estimate the number of factors with num.sv gives me a very high number. I tried sequentially plotting SV1, SV1+SV2, SV1+SV2+SV3, etc. using cleaned matrices, but I don't know what I am looking for other than the lowest number of SVs at which the batch effect disappears.

In that example there are three known batches, so I know a priori to look for 2 SVs.
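A sketch of that workflow, along the lines of the DESeq2 vignette's sva section, assuming a `dds` object with a `condition` factor in its colData; the condition is supplied via the full model matrix, and the estimated SVs are then added to the design:

```r
## Hypothetical sketch: estimating 2 surrogate variables with svaseq
## and adding them to a DESeq2 design. Assumes `dds` already exists
## with a `condition` factor in colData(dds).
library(sva)
library(DESeq2)

dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]   # drop near-empty rows

mod  <- model.matrix(~ condition, colData(dds))  # supply the condition
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)        # ask for 2 SVs

dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + condition
dds <- DESeq(dds)
```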

I would reach out to the sva developers for advice on the number of SVs. Maybe make a new post and tag the sva package.