First things first, I know that this is the data and if the results are not significant, it is how it is. But to make sure that I understood the workflow i would like to ask for an advice. Unfortunately, the gff files from the User manual are not to be found on ncbi, so I can't reproduce the code in it (D. melanogaster gff release 5_48).
I want to run a differential Splicing analysis on my data sets. I have data from C. elegans. This is the workflow I used:
- mapping the data using the
subjunc
aligner.
subjunc -T 10 -i Cel.WBcel235.index -r R1.fastq.gz -R R2.fastq.gz
--multiMapping -B 10 --sortReadsByCoordinates --allJunctions -a Cel.gtf
--gtfFeature 'exon' --gtfAttr 'gene_id'
- quantify the reads based on the exons
featureCounts -a $gtf -t exon -g exon_id -p -T 12 -o featureCounts.exonLevel.txt subjunc/*.bam
- upload the data into R, create a
DGEList
object with the counts and run a DE analysis usingedgeR
After creating the DGEList
I get an object like this one:
An object of class "DGEList"
$counts
L4_Input_1 L4_Input_2 L4_Input_3 L4_IP_1 L4_IP_2 L4_IP_3
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 3 0 0 0 4 2
5 0 0 0 0 0 0
181052 more rows ...
$samples
group lib.size norm.factors
L4_Input_1 1 34905462 1
L4_Input_2 1 40341042 1
L4_Input_3 1 36104053 1
L4_IP_1 1 40379665 1
L4_IP_2 1 41997525 1
L4_IP_3 1 36977558 1
$genes
Geneid Chr Start End Strand Length
1 cTel3X.2.e1 V 180 329 + 150
2 cTel3X.3.e1 V 180 329 - 150
3 B0348.5a.1.e1 V 1663 1782 + 120
4 B0348.5a.1.e2 V 2851 3019 + 169
5 B0348.5a.1.e3 V 5538 5966 + 429
181052 more rows ...
But in my objects I have the exons as geneID, so when I run the diffSpliceDGE
command I get the following error:
> sp <- diffSpliceDGE(fit, coef=2, geneid="Geneid", exonid="Start")
Total number of exons: 34861
Total number of genes: 34861
Number of genes with 1 exon: 34861
Mean number of exons in a gene: 1
Max number of exons in a gene: 1
Error in diffSpliceDGE(fit, coef = 2, geneid = "Geneid", exonid = "Start") :
No genes with more than one exon
The solution I used here was to remove the suffix .e*
(e.g. .e1, .e2, .e10, etc.) from the gene name But nowI am not sure, if I quantified the reads correctly. So, is it correct to do the quantification of the reads, with both the feature and the attribute assigned to exon?
I'm adding here a few lines of my gtf file used in the mapping and quantification for a better understanding of the annotations file:
V WormBase gene 1663 6632 . + . gene_id "WBGene00015153"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding";
V WormBase transcript 1663 6632 . + . gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
V WormBase exon 1663 1782 . + . gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "1"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "B0348.5a.1.e1";
V WormBase exon 2851 3019 . + . gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "B0348.5a.1.e2";
V WormBase CDS 2851 3019 . + 0 gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "B0348.5a.1";
V WormBase start_codon 2851 2853 . + 0 gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
Or should I take the transcript_id
as attribute instead?
thanks
Assa
Thanks James, but this is exactly my question. I would like to understand how they quantify it in such a way, that they have the genes are in the GeneID column, but separated by exons.
I know they quantify on the exons. This I did as well. But in their data the IDs are genes.
Did they just add them afterwards via the annotations? This is not mentioned anywhere.
I guess I don't understand your question. In the example, each row is the quantification for an exon for a given gene. Just like you have. The only difference is that there is an extra column in the example that just has the individual gene ID. So there is a column of gene IDs and a column of exon IDs. And as in your example, the gene IDs are part of the exon IDs, so it's a simple matter of extracting that part out into the gene ID column.
Sorry for not being clear enough. Your remark is exactly my question.
When I do the quantification on the exon-level, the primary IDs are those of the exons. I was just not sure, if that is the correct way to quantify the reads, as in their example their main ID is that of the gene name. So maybe they just manually added the gene names later on without mentioning it in the manual. This is the only way I can imagine, how the table was created.