Question

Perfectly aligning sequences not mapping to worm genome with Subread-align?

0

Entering edit mode

rogangrant • 0

@rogangrant-17017

Last seen 3.7 years ago

United States

I am currently trying to optimize RNAseq analysis for stress conditions, in which splicing dysregulation (among other issues) is a major obstacle to accurate mapping. I therefore thought to try genome/split-read alignment over direct transcriptome mapping. Using the subread package with existing heat-shock data, I obtained very good mapping (~95%) for all of my control samples, but significantly lower mapping (~83%) under stress conditions. I understand that 83% mapping is not bad, but I'm interested to know why these mapping results are so consistently different from control conditions.

The interesting bit: when I manually search the unmapped sequences using BLAT, most seem to be stress-related genes (small sample size, so far). More perplexingly, they have 100% sequence identity with a single gene. My question, then, is why featureCounts is failing to map these sequences? I've thought to change the stringency settings, but I doubt that will make any difference for perfect alignments.

Background:

C. elegans RNAseq data, aligned to the WBcel215 (ce10) genome assembly
Trimmed using Trimmomatic
- ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:2

Subread code:

subread-buildindex -o WBcel215_index *.fa

for i in *.qc.fq;
do
  subread-align -t 0 -T 4 -d 25 -i WBcel215_index -r $i -o $i.sr.bam;
done

Example sequences (picked randomly by hand, believe it or not):

GTCAGCATATGTAGTGAATGTTTTGCAGGCTTTTGTTGGAATTCGAGTG
GGAATTCGAGTGTTTCGATCAATCAGATTGGTCATCACACCTCCAGCTG
ATTGTTAACAATGATCAAAAGTTTGCCATAAATCTCAATGTCTCGCAGT
GTTGAAGTTGATTCACTGTTTGAAGGAATTGATCTATGCACAAAGATCA
CTCAACTGGTTCCAGAGTTTTCCGGAATAGATCAGCACACAGTTCTTCG

featurecounts rnaseq subread • 2.2k views

ADD COMMENT • link updated 7.5 years ago by Gordon Smyth 53k • written 7.5 years ago by rogangrant • 0

2

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 27 days ago

Australia/Melbourne

First let me point out that featureCounts is not a read aligner. It is a read counting tool. Programs included in Subread package for read alignment are subread-align and subjunc.

You may try the following to see if they can improve the mapping of your stress condition samples:

(1) You do not need to perform trimming before the alignment as subread-align does the trimming itself.

(2) You can provide an annotation to subread-align to assist the mapping. This should be particularly useful for the mapping of exon-spanning reads. See the '-a' option for more details.

(3) You may include multi-mapping reads in your mapping output. We found that this is beneficial for the quantification of genes and exons.

(4) You may try the subjunc program - it performs full alignment of reads including exon-spanning reads and it might work better for your stress condition data.

(5) I suggest you to use the Rsubread package (http://bioconductor.org/packages/release/bioc/html/Rsubread.html). It is easier to use in most cases.

Hope this helps.

Wei

ADD COMMENT • link 7.5 years ago Wei Shi ★ 3.6k

score 2 · Accepted Answer · 2018-08-23

2

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

Since you've used BLAT, you would know already that all the sequences that you list are multi-mapping, i.e., they map perfectly to more than one location in the C. Elegans genome. By default, Subread does not report multi-mapping reads, so these reads will not be included in your results.

If you want to include them, use the --multMapping and -B options. For example

subread-align --multiMapping -B 3 etc

will report up to 3 best options for each mult-mapping reads.

ADD COMMENT • link 7.5 years ago Gordon Smyth 53k

0

Entering edit mode

I re-ran the alignment with STAR yesterday, and the output confirms that multi-mapping is the issue, thank you! Still left to wonder why multi-mapping is higher in data from stress conditions, but that's probably more of a biological question. Now I just need to find a way to rescue some of these multi-mapped reads, which are all from genes of interest.

ADD REPLY • link 7.5 years ago rogangrant • 0

0

Entering edit mode

I already showed you how to rescue the multi-mapped reads. It's all explained in the Subread documentation.

ADD REPLY • link 7.5 years ago Gordon Smyth 53k