Question

Improving the performance of the align function from Rsubread for old Roche454 reads

0

Raito92 ▴ 60

@raito92-20399

Last seen 2.8 years ago

Italy

Hello again! I'm running an RNASeq analysis on some old, unpublished Roche454 data, using the RnaSeqGeneEdgeRQL workflow. However, Roche454 reads aren't ideal for this kind of analysis: they are longer than Illumina reads, and the sequencing method itself is quite outdated. So I suspect that the number of reads aligning to the indexed reference genome from my .fastq RNASeq data has been underestimated.

This is the R documentation page for the align function:

https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/align

I wonder whether the options of the featureCounts function could matter too...

https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts

In your opinion, which options should I work on to improve the results of my analysis? The vignette of the workflow itself suggests:

"Ideally, the proportion of mapped reads should be above 80%"

But my samples are far from that result (the best one reaches about 74%), as you can see:

propmapped(all.bam)

[Screenshot of the propmapped output; image not available]

Any suggestion is more than welcome! Thanks in advance!

align rsubread roche454 rnaseq • 1.9k views

ADD COMMENT • link updated 5.9 years ago by Wei Shi ★ 3.6k • written 6.0 years ago by Raito92 ▴ 60

1

I have no idea whether the alignment step could be improved for 454 data, but I have seen a few data sets with a low proportion of mapped reads, and I just want to highlight that there are many plausible explanations that should perhaps be explored before looking at the alignment/counting approach taken.

Large amounts of rRNA in the RNA extraction
Incomplete genome reference and DNA contamination in the RNA extraction
Contamination from other species: I would align to plausible contaminants and see whether the reads that do not map to the target species map to these instead.
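
The contamination check in the last point could be sketched roughly as follows with Rsubread. This is only an illustration under assumptions: the file names (`contaminants.fa`, `sample1.fastq`, `sample2.fastq`) and the index basename are hypothetical placeholders, and for simplicity it aligns all reads to the contaminant reference rather than only the reads that failed to map to the target species.

```r
# Sketch only: every file name below is a hypothetical placeholder.
contaminant_fasta <- "contaminants.fa"      # e.g. yeast and/or human sequences
contaminant_index <- "contaminant_index"    # basename for the Subread index files
fastq_files <- c("sample1.fastq", "sample2.fastq")

# One output BAM per input file, e.g. "sample1.contaminant.bam"
contaminant_bam <- paste0(
  tools::file_path_sans_ext(basename(fastq_files)),
  ".contaminant.bam"
)

# Only run the alignment when Rsubread and the input files are available.
if (requireNamespace("Rsubread", quietly = TRUE) &&
    file.exists(contaminant_fasta) &&
    all(file.exists(fastq_files))) {
  # Build an index for the suspected contaminant sequences.
  Rsubread::buildindex(basename = contaminant_index,
                       reference = contaminant_fasta)

  # Align the reads against the contaminant index.
  Rsubread::align(index = contaminant_index,
                  readfile1 = fastq_files,
                  input_format = "FASTQ",
                  output_file = contaminant_bam)

  # A high mapped proportion here would hint at contamination.
  print(Rsubread::propmapped(contaminant_bam))
}
```

A stricter version would first extract the reads left unmapped against the target genome and align only those, which is closer to what is described above.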

ADD REPLY • link 6.0 years ago thokall ▴ 160

0

Hello, thanks for your answer and suggestions! Even after setting a higher mismatch threshold (7 instead of the default 3) and widening the read-length tolerance range even further (the latter making no difference at all), I was only able to obtain an average of 64% mapped reads rather than the original 58%... So it wasn't really worth it... Contamination is possible: my samples are plant extracts, so there may be some yeasts... or even human DNA if they were not handled properly. The genome of the species is not fully sequenced yet, and many parts remained unassembled because of its high heterozygosity... So both of these explanations are actually possible. I'm pretty sure, though, that there was little rRNA.
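
For readers who want to try the same kind of comparison, here is a minimal sketch that runs `align` once with the default `maxMismatches = 3` and once with `maxMismatches = 7`, then compares the `propmapped` output. The index basename, the FASTQ file names, and the helper names `bam_name` and `run_align` are hypothetical placeholders, not taken from the actual analysis.

```r
# Sketch only: the index basename and file names are hypothetical placeholders.
index_base <- "my_index"
fastq_files <- c("sample1.fastq", "sample2.fastq")

# Derive an output BAM name that records which setting produced it.
bam_name <- function(fastq, tag) {
  sub("\\.fastq$", paste0(".", tag, ".bam"), fastq)
}

# Align all samples with a given mismatch threshold and return the
# per-sample mapping statistics reported by propmapped().
run_align <- function(max_mismatches, tag) {
  out <- bam_name(fastq_files, tag)
  Rsubread::align(index = index_base,
                  readfile1 = fastq_files,
                  input_format = "FASTQ",
                  output_file = out,
                  maxMismatches = max_mismatches)
  Rsubread::propmapped(out)
}

# Only run when Rsubread and the input files are available.
if (requireNamespace("Rsubread", quietly = TRUE) &&
    all(file.exists(fastq_files))) {
  default_run <- run_align(3, "mm3")   # default threshold
  relaxed_run <- run_align(7, "mm7")   # relaxed threshold
  print(data.frame(sample  = fastq_files,
                   default = default_run$PropMapped,
                   relaxed = relaxed_run$PropMapped))
}
```

Writing each setting to its own BAM file keeps the runs side by side, so the mapped proportions can be compared without re-aligning.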

ADD REPLY • link 6.0 years ago Raito92 ▴ 60

1

Without a complete reference genome, these figures are not in any way extreme in my eyes. It is not great that a large fraction of the reads are not useful, but the reads that do map may be okay to use. In my experience, both 454 and SOLiD data are seldom worth spending many hours on today.

ADD REPLY • link 6.0 years ago thokall ▴ 160

1

I've thought it over some more. The genome is complete, at least in its coding parts, which are the ones used for aligning RNA reads. So having an incomplete reference in non-coding areas shouldn't really affect the result... Instead, the high heterozygosity of the genome of the organism I'm working on may be the cause of my lower-than-expected results.

I'm analysing the results of the workflow now and they are coherent, so I guess I'll use my data anyway! I'll be working with Illumina reads in the future, but I just had these old ones to process and make sense of.

Anyway, thanks again for your answer!

ADD REPLY • link 5.9 years ago Raito92 ▴ 60

score 2 · Answer 1 · 2019-04-28

2

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

Your mapping percentages sound reasonable to me. You would typically expect a mapping rate of 80 percent or higher for Illumina reads, but 454 reads definitely have a lower mapping rate. Besides the reasons mentioned by @thokall, the higher sequencing error rate (particularly indel errors) is another factor contributing to the low mapping percentage.