Question: Improving the performance of the align function from Rsubread for old Roche454 reads
6 months ago, Raito92 (Italy) wrote:

Hello again! I'm running an RNA-seq analysis on some old, unpublished Roche454 data, using the RnaSeqGeneEdgeRQL workflow. However, Roche454 reads aren't ideal for this kind of analysis: they are longer than Illumina reads, and the technology itself is quite outdated. So I suspect that the number of reads from my .fastq RNA-seq data that align to the indexed reference genome has been underestimated.

This is the R documentation page for the align function:

https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/align

I wonder whether the options of the featureCounts function could matter too...

https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts

In your opinion, which options should I work on to improve the results of my analysis? The vignette of the workflow itself suggests:

"Ideally, the proportion of mapped reads should be above 80%"

But my samples are far from that result (the best one reaches about 74%), as you can see:

propmapped(all.bam)

[Image not preserved: output of propmapped(all.bam)]

Any suggestion is more than welcome! Thanks in advance!
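
As a small aid for reading the propmapped output, the sketch below flags samples that fall under the 80% guideline quoted from the vignette. The helper name `flag_low_mapping` and the example proportions are made up for illustration; with real data the input would be the `PropMapped` column of the data frame returned by `propmapped`.

```r
# Hypothetical helper: TRUE for samples whose mapped proportion is below the
# guideline (0.8 by default, following the vignette's "above 80%" advice).
flag_low_mapping <- function(prop_mapped, guideline = 0.8) {
  prop_mapped < guideline
}

# Illustrative values only, not the poster's actual results.
example_props <- c(sampleA = 0.58, sampleB = 0.74, sampleC = 0.85)
flag_low_mapping(example_props)

# With real data (requires Rsubread and the BAM files; not run here):
# stats <- Rsubread::propmapped(all.bam)
# stats$Samples[flag_low_mapping(stats$PropMapped)]
```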

rnaseq rsubread align roche454

I have no idea whether the alignment step could be improved for 454 data, but I have seen a few data sets with a low proportion of mapped reads, and I just want to point out that there are several plausible explanations that should perhaps be explored before revisiting the alignment/counting approach:

  • Large amounts of rRNA in the RNA extraction
  • Incomplete genome reference and DNA contamination in the RNA extraction
  • Contamination from other species. I would align to plausible contaminants and see whether the reads that do not map to the target species map to these.
Written 6 months ago by thokall
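
The contamination check suggested above could be sketched in R roughly as follows. This is only a rough screen under stated assumptions: the helper `screen_contaminant` and the file names in the example call are hypothetical, uncompressed FASTQ input is assumed, and the sketch aligns all reads to the contaminant rather than only those that failed to map to the target species.

```r
# Hypothetical helper: align one FASTQ file to a candidate contaminant genome
# and report the fraction of reads that map to it.
# Requires the Rsubread package; all file names are placeholders.
screen_contaminant <- function(contaminant_fasta, fastq, prefix = "contaminant") {
  index_name <- paste0(prefix, "_index")
  bam_file   <- paste0(prefix, "_", basename(fastq), ".bam")

  # Build a Subread index for the contaminant reference.
  Rsubread::buildindex(basename = index_name, reference = contaminant_fasta)

  # Align the reads against it (uncompressed FASTQ input assumed).
  Rsubread::align(index = index_name, readfile1 = fastq,
                  input_format = "FASTQ", output_file = bam_file)

  # propmapped() returns one row per BAM file, including a PropMapped column.
  Rsubread::propmapped(bam_file)
}

# Example call (not run here):
# screen_contaminant("yeast.fa", "sample1.fastq", prefix = "yeast")
```

A high mapped proportion against a contaminant would only be suggestive, since conserved genes can map to more than one species.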

Hello, thanks for your answer and suggestions! Even after raising the mismatch threshold (7 instead of the default 3) and widening the tolerance range for read length (the latter made no difference at all), I was only able to reach an average of 64% mapped reads, up from the original 58%. So it wasn't really worth it. Contamination is possible: my samples are plant extracts, so there may be some yeast, or even human DNA if they weren't handled properly. The genome of the species is not fully sequenced yet, and many parts of it remain unassembled because of high heterozygosity. So both of these explanations are plausible. I'm fairly sure, though, that there was little rRNA.

Written 6 months ago (modified 6 months ago) by Raito92
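
A comparison like the one described here (the default mismatch threshold versus a relaxed one) might look like the sketch below. `maxMismatches` is the `align` argument that sets the mismatch threshold; the helper name `compare_mismatches`, the index name, and the FASTQ file are hypothetical placeholders, and uncompressed FASTQ input is assumed.

```r
# Hypothetical helper: align the same reads under several mismatch thresholds
# and collect the proportion of mapped reads for each setting.
# Requires the Rsubread package and an index built with buildindex().
compare_mismatches <- function(index, fastq, thresholds = c(3, 7)) {
  results <- lapply(thresholds, function(mm) {
    # One BAM file per setting, so that runs do not overwrite each other.
    bam_file <- sprintf("%s.mm%d.bam", basename(fastq), mm)
    Rsubread::align(index = index, readfile1 = fastq,
                    input_format = "FASTQ", output_file = bam_file,
                    maxMismatches = mm)
    stats <- Rsubread::propmapped(bam_file)
    data.frame(maxMismatches = mm, PropMapped = stats$PropMapped)
  })
  do.call(rbind, results)
}

# Example call (not run here):
# compare_mismatches("my_index", "sample1.fastq")
```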

Without a complete reference genome, these figures are not in any way extreme in my eyes. It's not great that a large fraction of the reads is not useful, but the reads that do map may well be fine to use. In my experience, both 454 and SOLiD data are seldom worth spending many hours on today.

Written 6 months ago by thokall

I've thought it over some more. The genome is complete, at least in its coding regions, which are the ones that matter for aligning RNA reads. So an incomplete reference in non-coding regions shouldn't really affect the result. Instead, the high heterozygosity of the organism I'm working on may be the cause of my lower-than-expected results.

I'm analysing the results of the workflow now and they are coherent, so I guess I'll use my data anyway! I'll be working with Illumina reads in the future, but I had these old ones to process and was trying to make sense of them.

Anyway, thanks again for your answer!

Written 5 months ago by Raito92
Answer: Improving the performance of the align function from Rsubread for old Roche454 r
5 months ago, Wei Shi (Australia) wrote:

Your mapping percentages sound reasonable to me. You would typically expect a mapping rate of 80 percent or higher for Illumina reads, but 454 reads definitely have a lower mapping rate. Besides the reasons mentioned by @thokall, the higher sequencing error rate (particularly indel errors) is another factor contributing to the low mapping percentage.
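
If indel errors are the concern, one setting that could be experimented with is the `indels` argument of `align`, which the Rsubread documentation describes as the maximum number of insertions/deletions allowed during mapping (5 by default). Whether raising it helps for 454 data is not established in this thread; the value 10, the helper name `align_with_indels`, and the file names are assumptions for illustration only.

```r
# Hypothetical helper: run align() with a more permissive indel setting and
# return the mapping statistics for the resulting BAM file.
# Requires the Rsubread package; uncompressed FASTQ input is assumed.
align_with_indels <- function(index, fastq, indels = 10) {
  bam_file <- sprintf("%s.indels%d.bam", basename(fastq), indels)
  Rsubread::align(index = index, readfile1 = fastq,
                  input_format = "FASTQ", output_file = bam_file,
                  indels = indels)
  Rsubread::propmapped(bam_file)
}

# Example call (not run here):
# align_with_indels("my_index", "sample1.fastq")
```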


Thanks for your answer!

Written 5 months ago by Raito92