Working with a workflow that uses fastp -> Salmon -> DESeq2
Is it generally considered good practice to control for fastp's read duplication rate and/or Salmon's percent mapped (from meta_info.json) when doing a DESeq2 DE analysis? I have noticed a fair amount of variability across a set of samples in the same prep batch and sequencing run (duplication rate, 22-52%; percent mapped, 82-92%). Duplication rate in particular seems relevant, as I didn't expect it to be that variable, and previous workflows I had run with STAR had me remove duplicate reads altogether.
I mean, it would be easy enough to use a design like ~ read_dups + trt or ~ read_dups + map_rate + trt, but are there arguments against doing this (e.g., overfitting or removing true variation)?
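For what it's worth, this is roughly what I had in mind (a sketch only, assuming txi is the tximport object built from the Salmon quants, coldata is my sample table with a trt factor, and read_dups / map_rate are pulled from the fastp report and Salmon's meta_info.json):

```r
library(DESeq2)

## center and scale the continuous QC covariates before putting them in the design
coldata$read_dups <- as.numeric(scale(coldata$read_dups))
coldata$map_rate  <- as.numeric(scale(coldata$map_rate))

dds <- DESeqDataSetFromTximport(txi, colData = coldata,
                                design = ~ read_dups + map_rate + trt)
dds <- DESeq(dds)
res <- results(dds)  # trt is last in the design, so this returns the trt contrast
```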
The vignettes have examples that use 2 SVs. Do you ever use more than 2? Using svaseq and estimating the number of factors with num.sv gets me a very high number. I tried sequentially plotting SV1, SV1+SV2, SV1+SV2+SV3, etc., using cleaned matrices, but I don't know what I am looking for other than the lowest number of SVs at which the batch effect disappears. In the vignette example there are three known batches, so I know a priori to look for 2 SVs.
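For reference, this is roughly the pattern I was following (adapted from the svaseq example in the DESeq2 vignette; dds and trt as above, and the number of SVs is the part I'm unsure about):

```r
library(sva)

dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]            # drop near-zero rows as in the vignette

mod  <- model.matrix(~ trt, colData(dds))  # full model with the variable of interest
mod0 <- model.matrix(~ 1,   colData(dds))  # null model

n.sv  <- num.sv(dat, mod, method = "be")   # this is the estimate that comes back very high
svseq <- svaseq(dat, mod, mod0, n.sv = 2)  # vs. the 2 SVs used in the vignette

## add the SVs to the design ahead of the treatment term
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + trt
```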
I would reach out to the sva developers for advice on the number of SVs. Maybe make a new post and tag the sva package.