Question: Improving the performance of the align function from Rsubread for old Roche454 reads
3 months ago by
Raito9240
Italy
Raito9240 wrote:

Hello again! I'm running an RNA-seq analysis on some old, unpublished Roche 454 data, using the RnaSeqGeneEdgeRQL workflow. However, Roche 454 reads are not ideal for this kind of analysis: they are longer than Illumina reads, and the technology itself is quite outdated. So I suspect that the number of reads aligned between the indexed reference genome and my .fastq RNA-seq data has been underestimated.

That's the R documentation page on the align function:

I wonder whether the options of the featureCounts function could matter too...

In your opinion, which options should I work on to improve the results of my analysis? The vignette of the workflow itself suggests:

"Ideally, the proportion of mapped reads should be above 80%"

But my samples are far from that result (the best one reaches about 74%), as you can see:

propmapped(all.bam)


Any suggestion is more than welcome! Thanks in advance!

rnaseq rsubread align roche454 • 174 views
modified 12 weeks ago by Wei Shi3.1k • written 3 months ago by Raito9240

I do not know whether the alignment step could be improved for 454 data, but I have seen a few data sets with a low proportion of mapped reads, and I just want to highlight that there are several plausible explanations that should perhaps be explored before looking at the alignment/counting approach taken.

• Large amounts of rRNA in the RNA extraction
• Incomplete genome reference and DNA contamination in the RNA extraction
• Contamination from other species. I would align the reads to plausible contaminants and check whether the reads that do not map to the target species map to these instead.
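
The contaminant check suggested in the last bullet could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `screen_contaminants`, the output directory and the index basenames are all hypothetical, and the indexes are assumed to have been built beforehand with `buildindex()`. For simplicity it aligns the whole read file to each candidate reference rather than only the reads left unmapped by the target genome.

```r
# Hypothetical helper: align one read file against several candidate
# contaminant (or rRNA) indexes and report the proportion mapped to each.
# `indexes` is a named character vector of index basenames previously
# created with Rsubread::buildindex(); all names and paths are placeholders.
screen_contaminants <- function(read_file, indexes, out_dir = "contam_bam") {
  dir.create(out_dir, showWarnings = FALSE)
  bam_files <- file.path(out_dir, paste0(names(indexes), ".bam"))
  for (i in seq_along(indexes)) {
    Rsubread::align(
      index       = indexes[[i]],
      readfile1   = read_file,
      output_file = bam_files[i]
    )
  }
  # One row per BAM file, with total, mapped and proportion mapped.
  Rsubread::propmapped(bam_files)
}

# Example call (not run; the index basenames are made up):
# screen_contaminants("sample1.fastq",
#                     c(yeast = "idx/yeast", human = "idx/human"))
```

A high mapped proportion against one of these references would point to contamination rather than to a problem with the alignment settings.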

Hello, thanks for your answer and suggestions! Even after setting a higher mismatch threshold (7 instead of the default 3) and widening the read-length tolerance range (the latter made no difference at all), I only obtained an average of 64% mapped reads, up from the original 58%, so it wasn't really worth it. Contamination is possible: my samples are plant extracts, so there may be some yeast, or even human DNA if they were not handled properly. The genome of the species is not fully sequenced yet, and many parts remain unassembled because of high heterozygosity. So both of these explanations are plausible. I'm fairly sure, though, that there was little rRNA.
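
For readers who want to repeat the mismatch comparison described above, a minimal sketch follows. It assumes the threshold in question is the `maxMismatches` argument of `align()`; the helper names, the index basename and the file names are hypothetical placeholders, not the poster's actual code.

```r
# Hypothetical helper: build an output name that records the threshold used,
# e.g. "reads/sample1.fastq" with threshold 7 gives "sample1.mm7.bam".
mismatch_bam_name <- function(read_file, max_mismatches) {
  stem <- tools::file_path_sans_ext(basename(read_file))
  sprintf("%s.mm%d.bam", stem, as.integer(max_mismatches))
}

# Hypothetical wrapper around Rsubread::align() that varies only the
# mismatch threshold and returns the path of the BAM file it wrote.
align_with_mismatches <- function(index, read_file, max_mismatches) {
  bam <- mismatch_bam_name(read_file, max_mismatches)
  Rsubread::align(
    index         = index,
    readfile1     = read_file,
    output_file   = bam,
    maxMismatches = max_mismatches
  )
  bam
}

# Example comparison (not run; "idx/ref" and "sample1.fastq" are made up):
# bams <- vapply(c(3, 7),
#                function(m) align_with_mismatches("idx/ref", "sample1.fastq", m),
#                character(1))
# Rsubread::propmapped(bams)
```

Keeping every other argument fixed makes it easier to attribute any change in the mapped proportion to the threshold alone.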


Without a complete reference genome, these figures are not in any way extreme in my eyes. It is not great that a large fraction of the reads are not useful, but the reads that do map may well be fine to use. In my experience, both 454 and SOLiD data are seldom worth spending many hours on today.


I have thought it over some more. The genome is complete, at least in its coding parts, which are the ones that matter for aligning RNA reads. So an incomplete reference in non-coding areas shouldn't really affect the result. Instead, the high heterozygosity of the genome of the organism I'm working on may be the cause of my lower-than-expected results.

I'm analysing the results of the workflow now and they are coherent, so I guess I'll use my data anyway! I'll be working with Illumina reads in the future, but for now I just had these old ones to process and make sense of.

Answer: Improving the performance of the align function from Rsubread for old Roche454 r
12 weeks ago by
Wei Shi3.1k
Australia
Wei Shi3.1k wrote:

Your mapping percentages sound reasonable to me. You would typically expect a mapping rate of 80 percent or higher for Illumina reads, but 454 reads definitely have a lower mapping rate. Besides the reasons mentioned by @thokall, the higher sequencing error rate (particularly indel errors) is another factor contributing to the low mapping percentage.
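
Since indel errors are named as a contributing factor, one way to probe their effect is to raise the number of indels that `align()` tolerates. The sketch below is an illustration only, not a recommendation from this answer: the wrapper name and the value 10 are hypothetical, and whether a larger allowance recovers more 454 reads is untested here.

```r
# Hypothetical wrapper: same alignment, but with a larger indel allowance.
# `max_indels` is passed to the `indels` argument of Rsubread::align();
# the default of 10 used here is an arbitrary illustration.
align_indel_tolerant <- function(index, read_file, output_file,
                                 max_indels = 10) {
  Rsubread::align(
    index       = index,
    readfile1   = read_file,
    output_file = output_file,
    indels      = max_indels
  )
  output_file
}

# Example (not run; the index basename and file names are made up):
# bam <- align_indel_tolerant("idx/ref", "sample1.fastq", "sample1.indel10.bam")
# Rsubread::propmapped(bam)
```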