How to use subjunc for differential splicing analysis?
1
0
Entering edit mode
Assa Yeroslaviz ★ 1.5k
@assa-yeroslaviz-1597
Last seen 8 weeks ago
Germany

I would like to test the mapping using the subread::subjunc aligner, but I'm not sure what feature and attributes to use when adding a gtf file. Usually one summarize the data on the gene level using the gene_id and exon parameters of the gtf file.

But I would like to work on the transcripts. Would it than make more sense to summarize the reads on the transcript_id and exon instead in order to better estimate the differences in transcript levels?

Or just running the normal parameters to get the counts on a gene level and later using the diffSpliceDGE command on the normalized object (like shown in the user's guide).

thanks

subjunc edgeR • 2.2k views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 55 minutes ago
WEHI, Melbourne, Australia

If you want to do test for differential gene expression, then you need gene counts.

If you want to test for differential exon usage (aka differential splicing) then you need exon counts and exon-junction counts.

If you want to test for differential transcript expression, then you need transcript counts.

Which do you want to do? The second approach, using diffSpliceDGE on exon counts and exon-junctions, is a more powerful way of detecting differential splicing than a transcript-level analysis, and also makes fewer assumptions.

ADD COMMENT
0
Entering edit mode

thanks again. so How do I get the exon counts?

This is the command I use now?

subjunc -T 10 -i $subjuncIndex/Cel.WBcel235 -r $rawdata/$base\_R1.fastq.gz -R $rawdata/$base\_R2.fastq.gz \
    --multiMapping -B 10 --sortReadsByCoordinates --allJunctions \
    -a $gtf --gtfFeature 'exon' --gtfAttr 'gene_id' \
    -o subjunc/$base.bam

Would this give me the counts per exon or per gene?

I'm working with a gtf file from C. elegans. In the third column I have the exon feature and in the 9th column the gene_id attribute (or transcript_id).

ADD REPLY
1
Entering edit mode

Annotation to the subjunc aligner helps the aligner to better map the reads. The aligner generates BAM files, which contains mapped reads (and also the reads than cannot be mapped). However the aligner itself doesn't give you the read counts to genes or exons.

You need to run featureCounts for having the read counts. FeatureCounts can assign reads on the meta-feature level (i.e. genes) or the feature level (i.e. exons). Please find the users guides and tutorials on http://subread.sourceforge.net/. There is also a case study for the entire workflow of read mapping, counting and DE analysis on http://subread.sourceforge.net/RNAseqCaseStudy.html. Although the case study uses Rsubread, which runs in the R environment, the R functions in the Rsubread package and the command-line programs in the Subread package are well mapped, including the parameters. Hence the case study can also provide useful information for running the read aligner and featureCounts in the command-line Subread package.

ADD REPLY
0
Entering edit mode

I assume, that it would make more sense to quantify the reads based on the gene-level and not the exon-level, following by using the diffSpliceDGE test for differential splicing analysis.

How does the quantification on the eon level works on overlapped genes? Are these reads being discarded or assigned randomly. I guess both of these options are not really perfect.

ADD REPLY
1
Entering edit mode

Read alignment and read counting are separate steps. subjunc only aligns and is the same regardless of whether you are planning on a gene or exon analysis. You have to use featureCounts to get gene or exon counts.

You are perhaps getting mislead by the fact that subjunc optionally uses GTF annotation, but it does so only to assist the alignment, not to count reads.

ADD REPLY
0
Entering edit mode

Thank you both Gordon Smyth and @Yang Liao.

I know there is a difference between the mapping step (subjunc) and the quantification (featureCounts).

I would like to understand how to quantify the data after the mapping. From the answers I got here, I understand that it is best to quantify on the exon-level something like that:

featureCounts -p -f -t exon -g exon_id -a genes.gtf -o counts_output.txt Bam.bam

Would these next steps make sense:

  1. counting on exon-level (using both -g and -t with exons)
  2. load the data set into edgeR and create a DEGList object 3.run the data through the diffSpliceDGE command. (Can this command also work with exon-level based counting, or should I use gene-level counts here?)

thanks again

ADD REPLY

Login before adding your answer.

Traffic: 668 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6