Search
Question: Perfectly aligning sequences not mapping to worm genome with Subread-align?
0
gravatar for rogangrant
7 weeks ago by
rogangrant0
rogangrant0 wrote:

I am currently trying to optimize RNAseq analysis for stress conditions, in which splicing dysregulation (among other issues) is a major obstacle to accurate mapping. I therefore thought to try genome/split-read alignment over direct transcriptome mapping. Using the subread package with existing heat-shock data, I obtained very good mapping (~95%) for all of my control samples, but significantly lower mapping (~83%) under stress conditions. I understand that 83% mapping is not bad, but I'm interested to know why these mapping results are so consistently different from control conditions.

The interesting bit: when I manually search the unmapped sequences using BLAT, most seem to be stress-related genes (small sample size, so far). More perplexingly, they have 100% sequence identity with a single gene. My question, then, is why featureCounts is failing to map these sequences? I've thought to change the stringency settings, but I doubt that will make any difference for perfect alignments.

Background:

  • C. elegans RNAseq data, aligned to the WBcel215 (ce10) genome assembly
  • Trimmed using Trimmomatic
    • ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:2

Subread code:

subread-buildindex -o WBcel215_index *.fa

for i in *.qc.fq;
do
  subread-align -t 0 -T 4 -d 25 -i WBcel215_index -r $i -o $i.sr.bam;
done

 

Example sequences (picked randomly by hand, believe it or not):

  • GTCAGCATATGTAGTGAATGTTTTGCAGGCTTTTGTTGGAATTCGAGTG
  • GGAATTCGAGTGTTTCGATCAATCAGATTGGTCATCACACCTCCAGCTG
  • ATTGTTAACAATGATCAAAAGTTTGCCATAAATCTCAATGTCTCGCAGT
  • GTTGAAGTTGATTCACTGTTTGAAGGAATTGATCTATGCACAAAGATCA
  • CTCAACTGGTTCCAGAGTTTTCCGGAATAGATCAGCACACAGTTCTTCG
ADD COMMENTlink modified 7 weeks ago by Gordon Smyth35k • written 7 weeks ago by rogangrant0
2
gravatar for Gordon Smyth
7 weeks ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

Since you've used BLAT, you would know already that all the sequences that you list are multi-mapping, i.e., they map perfectly to more than one location in the C. Elegans genome. By default, Subread does not report multi-mapping reads, so these reads will not be included in your results.

If you want to include them, use the --multMapping and -B options. For example

subread-align --multiMapping -B 3 etc

will report up to 3 best options for each mult-mapping reads.

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by Gordon Smyth35k

I re-ran the alignment with STAR yesterday, and the output confirms that multi-mapping is the issue, thank you! Still left to wonder why multi-mapping is higher in data from stress conditions, but that's probably more of a biological question. Now I just need to find a way to rescue some of these multi-mapped reads, which are all from genes of interest.

ADD REPLYlink written 7 weeks ago by rogangrant0

I already showed you how to rescue the multi-mapped reads. It's all explained in the Subread documentation.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Gordon Smyth35k
2
gravatar for Wei Shi
7 weeks ago by
Wei Shi2.9k
Australia
Wei Shi2.9k wrote:

First let me point out that featureCounts is not a read aligner. It is a read counting tool. Programs included in Subread package for read alignment are subread-align and subjunc.

You may try the following to see if they can improve the mapping of your stress condition samples:

(1) You do not need to perform trimming before the alignment as subread-align does the trimming itself.

(2) You can provide an annotation to subread-align to assist the mapping. This should be particularly useful for the mapping of exon-spanning reads.  See the '-a' option for more details.

(3) You may include multi-mapping reads in your mapping output. We found that this is beneficial for the quantification of genes and exons.

(4) You may try the subjunc program - it performs full alignment of reads including exon-spanning reads and it might work better for your stress condition data.

(5) I suggest you to use the Rsubread package (http://bioconductor.org/packages/release/bioc/html/Rsubread.html). It is easier to use in most cases.

Hope this helps.

Wei

 

ADD COMMENTlink written 7 weeks ago by Wei Shi2.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 441 users visited in the last hour