Hello,
I'm analyzing bacterial RNA-seq data and I wonder whether I should remove ncRNA transcripts (both rRNA and other ncRNAs) from my reference. The libraries were ribodepleted, so few reads map to rRNA, but because of the experimental design some of my samples are saturated by reads mapping to little-known ncRNAs (80-90% of reads map to these ncRNAs). I proceeded with my pipeline (fastp -> Salmon quantification against reference transcripts including rRNA and ncRNAs -> tximport -> DESeq2 with the parametric fit), and despite this saturation DESeq2 reported more than 500 DEGs (adjusted p-value < 0.001).
My libraries contain on average ~7 million paired-end reads, so around 500k reads map to cDNA. Is this number of reads enough? In the literature I found the following statement: "Similarly, even with only 25,000-30,000 non-rRNA fragments per sample we were able to identify 184 annotated genes in EDL933 whose abundance differed more than 2-fold between late exponential and early stationary phases (P < 1×10^-5)." (https://doi.org/10.1186/1471-2164-13-734). Even so, I wonder whether this kind of saturation can bias read quantification and, downstream, the DESeq2 results. I am interested in both ncRNA DEGs and cDNA DEGs. Should I perhaps analyse them separately?
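
To make the concern concrete, here is a minimal Python sketch with made-up counts (not my data; the gene names and numbers are hypothetical). It compares median-of-ratios size factors, which as far as I understand is the idea behind DESeq2's default normalization, computed with and without the ncRNA row, against a plain total-read-count ratio. It is a simplified reimplementation for illustration only, and it ignores the transcript-length offsets that tximport passes to DESeq2.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a small count table.

    counts: dict mapping feature ID -> list of raw counts, one per sample.
    Features with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        # Log of the geometric mean of this feature across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Per sample: median of the log ratios, back-transformed.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy counts for two samples; sample 2 is saturated by one ncRNA.
toy = {
    "ncRNA_1": [1000, 90000],
    "gene_a": [100, 50],
    "gene_b": [200, 100],
    "gene_c": [400, 200],
    "gene_d": [300, 150],
}
NCRNA_IDS = {"ncRNA_1"}  # assumed annotation of which features are ncRNA

coding_only = {g: c for g, c in toy.items() if g not in NCRNA_IDS}

sf_all = size_factors(toy)             # ncRNA kept in the table
sf_coding = size_factors(coding_only)  # ncRNA removed before normalization

# Naive alternative: ratio of total counts, sample 2 over sample 1.
library_size_ratio = (
    sum(row[1] for row in toy.values()) / sum(row[0] for row in toy.values())
)

print("size factors, all features:", sf_all)
print("size factors, coding only: ", sf_coding)
print("total-count ratio (sample 2 / sample 1):", library_size_ratio)
```

With these toy numbers the size factors come out essentially the same with or without the ncRNA row, while the total-count ratio is about 45-fold, because a median over features is not moved by a single dominant one. What the sketch cannot show is the loss of depth for the coding genes, or whether the same holds when many ncRNAs dominate, which is really what I am asking about.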