I am planning to do alternative splicing analyses with SGSeq and limma/edgeR, which was suggested in SGSeq vignette. But I could not find any literature to do it this way. So, I am trying to figure it out how to do it, and I am using a Chr22 RNA-seq data from an experiment of our lab to demonstrate it.
I obtained the SGVariant Count object, and got the count and variant information as shown below. I created a DGEList with these data. When I observed these data, I found one EventID always corresponds to 2 or more variantIDs. And these 2 variantIDs have the same genomic ranges. However, they have very different read counts. Usually, one variant has very low counts, and the other has fair counts. My understanding is that the counts from different variants, which are according to different paths of a splice graph. Am I right?
When I used these data to do alternative splicing analyses with limma/edgeR, I could not decide how to handle these data. First, the low count variants are always filtered out. Also, to my understanding, the diffSplice() function compared an exon with other exons in the same gene to determine whether that exon is differentially expressed. However, in this count table, the exon information for a single gene is incomplete, although there are many variant information inside one gene. I tried to use "gene_id" and "EventID" showed in the anno data frame to define geneid in the diffSplice() function, and they gave different results. But when I check the results with IGV, none of them seems correct.
Another way is to treat every variant as a single gene. However, in this way, without the whole gene correction, we cannot distinguish differential variant expression from differential gene expression.
So, I wonder what is the count-read engine behind? Is it possible to do read counting of every exon of the whole genes using the same engine? This may be quite important for the alternative splicing quantification analysis.
Anyone has done this before? Any suggestions would be helpful. Thanks!
I didn't encounter any errors, so I did not include the session information here.
> ## Get count table
> counts <- counts(sgvc_pred)
> head(counts, n =30)
      shCTRL1 shCTRL2 shCTRL3 shPRPF8_1_1 shPRPF8_1_2 shPRPF8_1_3
 [1,]       0       0       0           3           1           4
 [2,]       0       0       0           0           0           1
 [3,]       0       0       0           1           4           3
 [4,]       0       0       0           0           0           1
 [5,]       0       0       0           1           0           0
 [6,]       0       0       0           1           0           0
 [7,]       2       2       4           0           8           2
 [8,]       2       3      16           8          10          11
 [9,]       0       0       0           0           1           0
[10,]       0       3       0           2           0           2
[11,]       0       3       0           1           1           5
[12,]       0       0       2           2           2           2
[13,]       0       0       0           1           0           0
[14,]       7       5       9          10          13          35
[15,]       1       0       1           2           0           3
[16,]       0       3       1           1           0           3
[17,]     104     117      99          62          91          88
[18,]      13      12       9           4           9           3
[19,]       1       0       0           0           0           0
[20,]      14       4      18           5          14           9
[21,]       0       2       0           0           0           0
[22,]     203     220     189         114         183         161
[23,]     152     163     147          82         149         127
[24,]       1       4       5           0           0           1
[25,]       0       0       0           0           0           2
[26,]      48      30      48          13          36          36
[27,]       0       0       0           1           0           0
[28,]      30      19      22           6          13          15
[29,]       1       0       0           0           0           0
[30,]      19      18      14           5          12           5
> ## Obtain geneName, txName, featureID, geneID, ect. for exon annotations.
> gene.name <- sapply(sapply(geneName(sgvc_pred), unlist), function(x) {paste(x,collapse = "+")})
> seqnames<- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 2)
> starts <- as.numeric(sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 3))
> strands <- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 4)
> ends <- as.numeric(sapply(stri_split_fixed(to(sgvc_pred),":"), "[[", 3))
> 
> anno <- data.frame(VariantID = variantID(sgvc_pred), 
+                    Chr = seqnames, 
+                    Start = starts,
+                    End = ends,
+                    Strand = strands,
+                    Length = abs(starts - ends),
+                    gene_id = geneID(sgvc_pred),
+                    gene_name = gene.name,
+                    type = variantName(sgvc_pred),
+                    EventID = eventID(sgvc_pred))
> 
> head(anno, n = 30)
   VariantID Chr    Start      End Strand Length gene_id       gene_name           type EventID
1         39  22 11910219 11910989      +    770       8                  8_13_1/2_A3SS      13
2         40  22 11910219 11910989      +    770       8                  8_13_2/2_A3SS      13
3         41  22 11910219 11911025      +    806       8                  8_14_1/2_A3SS      14
4         42  22 11910219 11911025      +    806       8                  8_14_2/2_A3SS      14
5         43  22 11910219 11913040      +   2821       8                    8_15_1/2_SE      15
6         44  22 11910219 11913040      +   2821       8                    8_15_2/2_SE      15
7         88  22 11916034 11922926      +   6892      10                 10_7_1/2_OTHER      31
8         89  22 11916034 11922926      +   6892      10                 10_7_2/2_OTHER      31
9         90  22 11917076 11918800      +   1724      11                 11_1_1/2_OTHER      32
10        91  22 11917076 11918800      +   1724      11                 11_1_2/2_OTHER      32
11        95  22 11932586 11933775      +   1189       8                 8_25_1/2_OTHER      33
12        96  22 11932586 11933775      +   1189       8                 8_25_2/2_OTHER      33
13        97  22 11937533 11938702      +   1169       8                 8_26_1/2_OTHER      34
14        98  22 11937533 11938702      +   1169       8                 8_26_2/2_OTHER      34
15        99  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_1/2_A5SS      35
16       100  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_2/2_A5SS      35
17       247  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_1/2_RI      52
18       253  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_2/2_RI      52
19       254  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_1/2_RI      55
20       255  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_2/2_RI      55
21       279  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_1/2_RI      63
22       280  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_2/2_RI      63
23       281  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_1/2_RI      64
24       282  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_2/2_RI      64
25       492  22 15816282 15816370      +     88      22                   22_61_1/2_RI      99
26       493  22 15816282 15816370      +     88      22                   22_61_2/2_RI      99
27       494  22 15816566 15816697      +    131      26                    26_1_1/2_RI     100
28       495  22 15816566 15816697      +    131      26                    26_1_2/2_RI     100
29       496  22 15816589 15816654      +     65      26                    26_2_1/2_RI     101
30       497  22 15816589 15816654      +     65      26                    26_2_2/2_RI     101

Thanks for your answer, Dr. Goldstein!
Dear Dr. Goldstein,
I was checking the paper that you mentioned. From the DEXSeq normally we have Padj values and log2foldchanges. In this paper, they used the final result from DEXSeq and in addition, they showed delta PSI for each event as they mentioned: "The variant frequencies are what is called PSI ("percent spliced-in") by some authors. We also provide the statistic deltaVF, which means "delta variant frequency": the difference in variant frequencies between the two groups (called delta-PSI by some authors). This statistic may be more easily interpreted than the fold-change statistic."
But I am not sure how to calculate psi per group and get the delta psi; then use it in the final result from DEXSeq for each event and variant.
here after getting analyzeVariants
I've got the Variantfrequency:
then I set the psi from the variant frequency as below:
then I set the variantNames as the row names for psi:
but then here is my problem, I have a single psi value for every event and each sample. But how to calculate the psi for controls and cases; to have one psi value for the control group, and one for the case group. consider:
controls are: N1, N2,N3,N4 cases are: T1,T2,T3,T4
Now, I have psi values for each event per sample, how to calculate to have one psi value per group (control and patient).
For this variant 79791_1_1/2_SE, my goal is to present something like below:
should I get the mean of individual psi values for N1, N2,N3,N4 samples and say the mean is the psi value for the control group? I am not sure if it is the right way.
we have the psi values for each event and each sample but at the end, the goal is to have one psi value for each variant per group, how do we calculate it?
Thank you in advance, Looking forward to hearing from you!