Question

Salmon and DESeq2 - analysis of libraries saturated with ncRNAs



Mikołaj • 0

@184b2bb2

Last seen 3.8 years ago

Poland

Hello,

I'm analyzing bacterial RNA-seq data and I wonder whether I should remove ncRNA transcripts (both rRNA and other ncRNAs) from my reference. The libraries were ribodepleted, so I don't have many reads mapping to rRNA, but due to the experimental design some of my samples are saturated with reads mapping to little-known ncRNAs (80-90% of reads map to these ncRNAs). I proceeded with my pipeline (fastp -> Salmon quantification with reference transcripts including rRNA and ncRNAs -> DESeq2 parametric with tximport), and despite this saturation, DESeq2 reported more than 500 DEGs (adjusted p-value < 0.001).

My libraries contain on average ~7 million read pairs, so around 500k reads map to cDNA. Is this number of reads enough? In the literature I found the following statement: "Similarly, even with only 25,000-30,000 non-rRNA fragments per sample we were able to identify 184 annotated genes in EDL933 whose abundance differed more than 2-fold between late exponential and early stationary phases (P < 1×10^-5)." (https://doi.org/10.1186/1471-2164-13-734). Despite this, I wonder whether this kind of saturation can bias read quantification and, later, the DESeq2 results. I am interested in both ncRNA DEGs and cDNA DEGs. Maybe I should analyse them separately?

RNASeq Salmon DESeq2 • 1.3k views

updated 3.8 years ago by Michael Love 43k • written 3.8 years ago by Mikołaj • 0

score 2 · Accepted Answer · 2021-05-20

Sorry, I don't have any specific guidance here. If measuring expression within one of the biological conditions of interest involves higher ncRNA abundance in a way that is mostly driven by technical factors, that is an unfortunate confounding variable, and it's not easy to address with statistical methods. You probably want to consult with others on how to avoid this problem in future experiments. A very conservative approach would be to subset to coding RNA only, and then down-sample the samples that are not ncRNA-saturated to the low sequencing depth of the ncRNA-saturated samples, so as not to have extreme differences in sequencing depth perfectly confounded with the condition. I don't have any scripts for doing this; you could do it yourself or perhaps collaborate with someone. This does start to become an exercise in attempting to recover a small signal from a perfectly technically confounded design.
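For illustration only, here is a minimal sketch of the subset-then-down-sample idea, working on per-gene counts rather than on the raw reads. It is not a script from the answer's author. Everything named here is an assumption made for the example: the function names `downsample_counts` and `subset_and_downsample`, the toy sample and gene names, and the choice to round Salmon's estimated counts to integers before sampling. Down-sampling the FASTQ files before quantification would be a different, possibly more faithful, way to do the same thing.

```python
import random
from collections import Counter

def downsample_counts(counts, target_depth, rng):
    """Draw target_depth reads without replacement from one sample's counts.

    counts: dict mapping gene id -> non-negative integer count.
    Uses random.sample(..., counts=...), which needs Python 3.9 or newer.
    """
    genes = list(counts)
    weights = [counts[g] for g in genes]
    if target_depth > sum(weights):
        raise ValueError("target depth exceeds the sample's total count")
    drawn = Counter(rng.sample(genes, k=target_depth, counts=weights))
    # Genes that received no reads in the draw are reported as zero.
    return {g: drawn.get(g, 0) for g in genes}

def subset_and_downsample(samples, coding_genes, seed=0):
    """Keep coding genes only, then thin every sample to the shallowest depth.

    samples: dict mapping sample name -> {gene id: estimated count}.
    coding_genes: list of gene ids treated as coding (hypothetical annotation).
    """
    rng = random.Random(seed)  # fixed seed so the thinning is reproducible
    coding = {
        name: {g: int(round(c.get(g, 0))) for g in coding_genes}
        for name, c in samples.items()
    }
    # Target depth = coding-read total of the most ncRNA-saturated sample.
    target = min(sum(c.values()) for c in coding.values())
    thinned = {name: downsample_counts(c, target, rng) for name, c in coding.items()}
    return thinned, target

# Toy data: the second sample is dominated by a (hypothetical) ncRNA.
samples = {
    "ctrl_1": {"geneA": 900.0, "geneB": 600.0, "ncRNA1": 100.0},
    "treat_1": {"geneA": 60.4, "geneB": 39.6, "ncRNA1": 9000.0},
}
downsampled, target = subset_and_downsample(samples, ["geneA", "geneB"])
```

After this step every sample has the same total coding count, so depth is no longer confounded with condition in the coding-only matrix. The price is that most of the reads from the deeper samples are discarded, which is why the approach is described as very conservative.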

The "enough reads" question depends on the expression level of the genes for which one wants to be able to detect DE.