As this concerns bioinformatics in general, I also posted here.
I am working on RNA-seq data. I made my count table using kallisto and then tximport, to work with DESeq2.
My reference is a set of cDNAs (supposed to correspond to all the genes of my species), but the annotation is quite poor: when I align to these cDNAs, only 60% of reads map, instead of 95% against the whole genome.
I have 2 conditions (A and B) and 3 replicates in each condition.
My fear is this: if a gene is over-expressed in A, not expressed in B, and not in my cDNA list, I expect to have fewer mapped reads in A than in B. When DESeq2 normalizes the counts, could this create a bias?
Example (each number is one read, labelled by the gene it comes from):
A: 1 1 1 1 2 2 2 2 3 3
B: 1 1 1 1 2 3 3 3 3 3
Gene 3 is not annotated, so after normalization by DESeq2:
A: 1 1 1 1 1 2 2 2 2 2
B: 1 1 1 1 1 1 1 1 2 2
Gene 1 now looks over-expressed in B, but that is not true.
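The arithmetic behind this toy example can be sketched in a few lines of Python. This is only an illustration of the worry, under the assumption that normalization means rescaling each sample to the same total number of mapped reads (here 10). DESeq2's default size factors are based on a median of ratios rather than on the total, so this is not what DESeq2 itself would compute. The names `UNANNOTATED`, `TARGET_DEPTH` and `scaled_counts` are made up for the illustration.

```python
from collections import Counter

# Reads from the toy example, labelled by the gene they come from.
reads = {
    "A": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "B": [1, 1, 1, 1, 2, 3, 3, 3, 3, 3],
}
UNANNOTATED = {3}   # gene 3 is missing from the cDNA set (hypothetical name)
TARGET_DEPTH = 10   # assumption: rescale every sample to 10 mapped reads


def scaled_counts(sample_reads):
    """Count reads per annotated gene, then rescale to TARGET_DEPTH."""
    counts = Counter(g for g in sample_reads if g not in UNANNOTATED)
    total = sum(counts.values())
    return {gene: n * TARGET_DEPTH / total for gene, n in sorted(counts.items())}


normalized = {sample: scaled_counts(r) for sample, r in reads.items()}
for sample, counts in normalized.items():
    print(sample, counts)
```

Gene 1 has 4 reads in both samples before scaling, but comes out at 5 in A and 8 in B afterwards, because the reads lost to the unannotated gene 3 shrink the mapped total of B more than that of A.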
How can I deal with this kind of problem?
Should I add a line in my table with "unmapped reads" to have a better normalization?
Do you expect or observe that the proportion of unmapped reads is different across groups or samples?
Yes indeed, I map 65% of my reads in condition B and 55% in condition A.
And how about at the genomic level?
90% in condition B and 93% in condition A.