Question

Differential Splicing using edgeR

0

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 6 days ago

Germany

First things first, I know that this is the data and if the results are not significant, it is how it is. But to make sure that I understood the workflow i would like to ask for an advice. Unfortunately, the gff files from the User manual are not to be found on ncbi, so I can't reproduce the code in it (D. melanogaster gff release 5_48).

I want to run a differential Splicing analysis on my data sets. I have data from C. elegans. This is the workflow I used:

mapping the data using the subjunc aligner.

subjunc -T 10 -i Cel.WBcel235.index -r R1.fastq.gz -R R2.fastq.gz 
                    --multiMapping -B 10 --sortReadsByCoordinates --allJunctions -a Cel.gtf 
                    --gtfFeature 'exon' --gtfAttr 'gene_id'

quantify the reads based on the exons

featureCounts -a $gtf -t exon -g exon_id -p -T 12 -o featureCounts.exonLevel.txt subjunc/*.bam

upload the data into R, create a DGEList object with the counts and run a DE analysis using edgeR

After creating the DGEList I get an object like this one:

An object of class "DGEList"
$counts
  L4_Input_1 L4_Input_2 L4_Input_3 L4_IP_1 L4_IP_2 L4_IP_3
1          0          0          0       0       0       0
2          0          0          0       0       0       0
3          0          0          0       0       0       0
4          3          0          0       0       4       2
5          0          0          0       0       0       0
181052 more rows ...

$samples
           group lib.size norm.factors
L4_Input_1     1 34905462            1
L4_Input_2     1 40341042            1
L4_Input_3     1 36104053            1
L4_IP_1        1 40379665            1
L4_IP_2        1 41997525            1
L4_IP_3        1 36977558            1

$genes
         Geneid Chr Start  End Strand Length
1   cTel3X.2.e1   V   180  329      +    150
2   cTel3X.3.e1   V   180  329      -    150
3 B0348.5a.1.e1   V  1663 1782      +    120
4 B0348.5a.1.e2   V  2851 3019      +    169
5 B0348.5a.1.e3   V  5538 5966      +    429
181052 more rows ...

But in my objects I have the exons as geneID, so when I run the diffSpliceDGE command I get the following error:

> sp <- diffSpliceDGE(fit, coef=2, geneid="Geneid", exonid="Start")
Total number of exons:  34861 
Total number of genes:  34861 
Number of genes with 1 exon:  34861 
Mean number of exons in a gene:  1 
Max number of exons in a gene:  1 
Error in diffSpliceDGE(fit, coef = 2, geneid = "Geneid", exonid = "Start") : 
  No genes with more than one exon

The solution I used here was to remove the suffix .e* (e.g. .e1, .e2, .e10, etc.) from the gene name But nowI am not sure, if I quantified the reads correctly. So, is it correct to do the quantification of the reads, with both the feature and the attribute assigned to exon?

I'm adding here a few lines of my gtf file used in the mapping and quantification for a better understanding of the annotations file:

V   WormBase    gene    1663    6632    .   +   .   gene_id "WBGene00015153"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding";
V   WormBase    transcript  1663    6632    .   +   .   gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
V   WormBase    exon    1663    1782    .   +   .   gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "1"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "B0348.5a.1.e1";
V   WormBase    exon    2851    3019    .   +   .   gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "B0348.5a.1.e2";
V   WormBase    CDS 2851    3019    .   +   0   gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "B0348.5a.1";
V   WormBase    start_codon 2851    2853    .   +   0   gene_id "WBGene00015153"; transcript_id "B0348.5a.1"; exon_number "2"; gene_name "B0348.5"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";

Or should I take the transcript_id as attribute instead?

thanks

Assa

diffSpliceDGE edgeR DifferentialSplicing • 2.1k views

ADD COMMENT • link 3.4 years ago Assa Yeroslaviz ★ 1.5k

score 0 · Answer 1 · 2022-01-26

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 11 hours ago

United States

I believe the help page for diffSpliceDGE accurately represents what you should be using as input.

 ## after running 
> example("diffSpliceDGE")
> y
An object of class "DGEList"
$counts
  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
1     127      82     102       2      19       7
2      87      78      70      86     105     134
3      60      89      90      85      86     108
4      56      87     149     123     114      99
5      79      93      71     104      97     108
995 more rows ...

$samples
        group lib.size norm.factors
Sample1     1    1e+06            1
Sample2     1    1e+06            1
Sample3     1    1e+06            1
Sample4     1    1e+06            1
Sample5     1    1e+06            1
Sample6     1    1e+06            1

$genes
  GeneID Gene.Exon
1  Gene1 Gene1.Ex1
2  Gene1 Gene1.Ex2
3  Gene1 Gene1.Ex3
4  Gene1 Gene1.Ex4
5  Gene1 Gene1.Ex5
995 more rows ...

>

Which is not what you have in the 'genes' list item of your DGEList.

ADD COMMENT • link 3.4 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks James, but this is exactly my question. I would like to understand how they quantify it in such a way, that they have the genes are in the GeneID column, but separated by exons.

I know they quantify on the exons. This I did as well. But in their data the IDs are genes.

Did they just add them afterwards via the annotations? This is not mentioned anywhere.

ADD REPLY • link 3.4 years ago Assa Yeroslaviz ★ 1.5k

1

Entering edit mode

I guess I don't understand your question. In the example, each row is the quantification for an exon for a given gene. Just like you have. The only difference is that there is an extra column in the example that just has the individual gene ID. So there is a column of gene IDs and a column of exon IDs. And as in your example, the gene IDs are part of the exon IDs, so it's a simple matter of extracting that part out into the gene ID column.

ADD REPLY • link 3.4 years ago James W. MacDonald 68k

0

Entering edit mode

Sorry for not being clear enough. Your remark is exactly my question.

When I do the quantification on the exon-level, the primary IDs are those of the exons. I was just not sure, if that is the correct way to quantify the reads, as in their example their main ID is that of the gene name. So maybe they just manually added the gene names later on without mentioning it in the manual. This is the only way I can imagine, how the table was created.

ADD REPLY • link 3.4 years ago Assa Yeroslaviz ★ 1.5k