Hello again! I'm running an RNA-seq analysis on some old, unpublished Roche 454 data, using the RnaSeqGeneEdgeRQL workflow. However, Roche 454 reads aren't ideal for this kind of analysis: they are longer than Illumina reads, and the technology itself is quite outdated. So, in my opinion, it's likely that the alignments between the indexed reference genome and my .fastq RNA-seq data have been underestimated.
This is the R documentation page for the align function:
https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/align
I wonder whether the options of the featureCounts function could matter too...
https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts
In your opinion, which options should I work on to improve the results of my analysis? The vignette of the workflow itself suggests:
"Ideally, the proportion of mapped reads should be above 80%"
But my samples are far from that result (the best one reaches about 74%), as you can see:
propmapped(all.bam)
Any suggestion is more than welcome! Thanks in advance!
I have no idea whether the alignment step could be improved for 454 data, but I have seen a few data sets with a low proportion of mapped reads, and I just want to highlight that there are many plausible explanations that should perhaps be explored before looking at the alignment/counting approach taken.
Hello, thanks for your answer and suggestion! Even after setting a higher mismatch threshold (7 instead of the default 3) and widening the tolerance range for read length (the latter made no difference at all), I was only able to obtain an average of 64% mapped reads rather than the original 58%... So it wasn't really worth it... Contamination is possible, since my samples are plant extracts, so there may be some yeasts... or even human DNA if the samples were not handled properly. The genome of the species is not fully sequenced yet, and many parts of it remain unassembled because of its high heterozygosity... So both of these explanations are actually possible. I'm pretty sure, though, that there was little rRNA.
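For anyone who wants to repeat the mismatch experiment described above, here is a minimal sketch of how it might look with Rsubread. It assumes single-end reads and an index that has already been built; the function name realign_relaxed and the file names in the example call are hypothetical placeholders, not taken from the original analysis. Only the maxMismatches argument of align() (default 3) is changed, and propmapped() is then used to check the result.

```r
# Sketch only: realign_relaxed and all file names are hypothetical.
realign_relaxed <- function(index, fastq, out_bam, max_mismatches = 7) {
  if (!requireNamespace("Rsubread", quietly = TRUE)) {
    stop("The Rsubread package is required for this sketch.")
  }
  # maxMismatches defaults to 3 in align(); a higher value lets reads
  # with more mismatches against the reference still be reported.
  Rsubread::align(
    index = index,
    readfile1 = fastq,
    input_format = "FASTQ",
    output_format = "BAM",
    output_file = out_bam,
    maxMismatches = max_mismatches
  )
  # Proportion of mapped reads in the resulting BAM file.
  Rsubread::propmapped(out_bam)
}

# Example call (not run here; the index and paths are placeholders):
# realign_relaxed("my_index", "sample1.fastq", "sample1.bam")
```

Comparing the propmapped() output before and after the change, sample by sample, shows whether the extra tolerance is worth it for a given data set.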
Without a complete reference genome, these figures are not in any way extreme in my eyes. It is not great that a large fraction of the reads are not useful, but the reads that do map may well be okay to use. In my experience, both 454 and SOLiD data are seldom worth spending many hours on today.
I've thought it over some more. The genome is complete, at least in its coding parts, which are the ones used for aligning RNA reads. So an incomplete reference in non-coding regions shouldn't really affect the result... Instead, the high heterozygosity of the genome of the organism I'm working on may be the cause of my lower-than-expected results.
I'm analysing the results of the workflow now and they are coherent, so I guess I'll use my data anyway! I'll be working with Illumina reads in the future, but I had these old ones to process and am trying to make sense of them.
Anyway, thanks again for your answer!