Question

How should I handle read counts derived from SGSeq when they were passed to limma/edgeR

0

Entering edit mode

zliu • 0

@zliu-23226

Last seen 3.5 years ago

I am planning to do alternative splicing analyses with SGSeq and limma/edgeR, which was suggested in SGSeq vignette. But I could not find any literature to do it this way. So, I am trying to figure it out how to do it, and I am using a Chr22 RNA-seq data from an experiment of our lab to demonstrate it.

I obtained the SGVariant Count object, and got the count and variant information as shown below. I created a DGEList with these data. When I observed these data, I found one EventID always corresponds to 2 or more variantIDs. And these 2 variantIDs have the same genomic ranges. However, they have very different read counts. Usually, one variant has very low counts, and the other has fair counts. My understanding is that the counts from different variants, which are according to different paths of a splice graph. Am I right?

When I used these data to do alternative splicing analyses with limma/edgeR, I could not decide how to handle these data. First, the low count variants are always filtered out. Also, to my understanding, the diffSplice() function compared an exon with other exons in the same gene to determine whether that exon is differentially expressed. However, in this count table, the exon information for a single gene is incomplete, although there are many variant information inside one gene. I tried to use "gene_id" and "EventID" showed in the anno data frame to define geneid in the diffSplice() function, and they gave different results. But when I check the results with IGV, none of them seems correct.

Another way is to treat every variant as a single gene. However, in this way, without the whole gene correction, we cannot distinguish differential variant expression from differential gene expression.

So, I wonder what is the count-read engine behind? Is it possible to do read counting of every exon of the whole genes using the same engine? This may be quite important for the alternative splicing quantification analysis.

Anyone has done this before? Any suggestions would be helpful. Thanks!

I didn't encounter any errors, so I did not include the session information here.

> ## Get count table
> counts <- counts(sgvc_pred)
> head(counts, n =30)
      shCTRL1 shCTRL2 shCTRL3 shPRPF8_1_1 shPRPF8_1_2 shPRPF8_1_3
 [1,]       0       0       0           3           1           4
 [2,]       0       0       0           0           0           1
 [3,]       0       0       0           1           4           3
 [4,]       0       0       0           0           0           1
 [5,]       0       0       0           1           0           0
 [6,]       0       0       0           1           0           0
 [7,]       2       2       4           0           8           2
 [8,]       2       3      16           8          10          11
 [9,]       0       0       0           0           1           0
[10,]       0       3       0           2           0           2
[11,]       0       3       0           1           1           5
[12,]       0       0       2           2           2           2
[13,]       0       0       0           1           0           0
[14,]       7       5       9          10          13          35
[15,]       1       0       1           2           0           3
[16,]       0       3       1           1           0           3
[17,]     104     117      99          62          91          88
[18,]      13      12       9           4           9           3
[19,]       1       0       0           0           0           0
[20,]      14       4      18           5          14           9
[21,]       0       2       0           0           0           0
[22,]     203     220     189         114         183         161
[23,]     152     163     147          82         149         127
[24,]       1       4       5           0           0           1
[25,]       0       0       0           0           0           2
[26,]      48      30      48          13          36          36
[27,]       0       0       0           1           0           0
[28,]      30      19      22           6          13          15
[29,]       1       0       0           0           0           0
[30,]      19      18      14           5          12           5
> ## Obtain geneName, txName, featureID, geneID, ect. for exon annotations.
> gene.name <- sapply(sapply(geneName(sgvc_pred), unlist), function(x) {paste(x,collapse = "+")})
> seqnames<- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 2)
> starts <- as.numeric(sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 3))
> strands <- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 4)
> ends <- as.numeric(sapply(stri_split_fixed(to(sgvc_pred),":"), "[[", 3))
> 
> anno <- data.frame(VariantID = variantID(sgvc_pred), 
+                    Chr = seqnames, 
+                    Start = starts,
+                    End = ends,
+                    Strand = strands,
+                    Length = abs(starts - ends),
+                    gene_id = geneID(sgvc_pred),
+                    gene_name = gene.name,
+                    type = variantName(sgvc_pred),
+                    EventID = eventID(sgvc_pred))
> 

> head(anno, n = 30)
   VariantID Chr    Start      End Strand Length gene_id       gene_name           type EventID
1         39  22 11910219 11910989      +    770       8                  8_13_1/2_A3SS      13
2         40  22 11910219 11910989      +    770       8                  8_13_2/2_A3SS      13
3         41  22 11910219 11911025      +    806       8                  8_14_1/2_A3SS      14
4         42  22 11910219 11911025      +    806       8                  8_14_2/2_A3SS      14
5         43  22 11910219 11913040      +   2821       8                    8_15_1/2_SE      15
6         44  22 11910219 11913040      +   2821       8                    8_15_2/2_SE      15
7         88  22 11916034 11922926      +   6892      10                 10_7_1/2_OTHER      31
8         89  22 11916034 11922926      +   6892      10                 10_7_2/2_OTHER      31
9         90  22 11917076 11918800      +   1724      11                 11_1_1/2_OTHER      32
10        91  22 11917076 11918800      +   1724      11                 11_1_2/2_OTHER      32
11        95  22 11932586 11933775      +   1189       8                 8_25_1/2_OTHER      33
12        96  22 11932586 11933775      +   1189       8                 8_25_2/2_OTHER      33
13        97  22 11937533 11938702      +   1169       8                 8_26_1/2_OTHER      34
14        98  22 11937533 11938702      +   1169       8                 8_26_2/2_OTHER      34
15        99  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_1/2_A5SS      35
16       100  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_2/2_A5SS      35
17       247  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_1/2_RI      52
18       253  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_2/2_RI      52
19       254  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_1/2_RI      55
20       255  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_2/2_RI      55
21       279  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_1/2_RI      63
22       280  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_2/2_RI      63
23       281  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_1/2_RI      64
24       282  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_2/2_RI      64
25       492  22 15816282 15816370      +     88      22                   22_61_1/2_RI      99
26       493  22 15816282 15816370      +     88      22                   22_61_2/2_RI      99
27       494  22 15816566 15816697      +    131      26                    26_1_1/2_RI     100
28       495  22 15816566 15816697      +    131      26                    26_1_2/2_RI     100
29       496  22 15816589 15816654      +     65      26                    26_2_1/2_RI     101
30       497  22 15816589 15816654      +     65      26                    26_2_2/2_RI     101

SGSeq limma edgeR • 1.1k views

ADD COMMENT • link updated 3.5 years ago by Leonard Goldstein ▴ 170 • written 3.5 years ago by zliu • 0

score 0 · Answer 1 · 2020-10-09

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 48 minutes ago

WEHI, Melbourne, Australia

I am not familiar at all with SGSeq and I don't know, for example, why it returns multiple rows for the same exon. If you use diffSplice in limma you could set geneid = gene_id and leave exonid unset. The SGSeq authors would need to chime in here if anything else is required.

ADD COMMENT • link 3.5 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks for your response, Dr. Smyth. It was a typo, I used "gene_id" and "EventID" in the anno data frame to define geneid in the diffSplice() function. I corrected it. Appreciate it!

ADD REPLY • link 3.5 years ago zliu • 0

score 0 · Answer 2 · 2020-10-13

0

Entering edit mode

Leonard Goldstein ▴ 170

@leonard-goldstein-6845

Last seen 14 months ago

Australia

The SGVariantCounts object contains read counts for splice variants. Each splice variant belongs to a splice event, which is made up of two or more variants. The vignette section on differential splicing is only brief – the idea is to test for differential variant usage within a splice event, analogous to testing for differential exon usage within a gene. For this type of analysis diffSplice arguments geneid and exonid correspond to SGSeq event_id and variant_id, respectively. For a reference that uses DEXSeq, please see Srinivasan et al. Cell Reports 2016 https://www.nature.com/articles/ncomms11295.