clusterProfiler, AnnotationHub, and discrepancies in GO Terms
1
0
Entering edit mode
Filipe • 0
@ff4bfcd3
Last seen 5 weeks ago
Norway

I am currently running data analysis on QuantSeq data in Atlantic salmon.

I initially worked with part of this data in 2022, and having sequenced more samples in the meantime, I reanalyzed everything. The issue I ran into while doing Over Representation Analysis is that I do not get the same results as in 2022.

Now, in 2022, the reference genome assembly was ICSASG_v2, while in 2024, the reference is Ssal_v3.1.

I used clusterProfiler to perform ORA in R, and I used AnnotationHub to fetch Atlantic salmon's most recent annotation file at the time (AH107424) and, as an example, I obtained this plot, comparing downregulated (A) vs upregulated (B) GO Terms: dnavaccine_vs_conu_4wpc_2022

While re-analyzing the (same) data this year, the annotation file on AnnotationHub was updated (AH114250), and I get a different plot: dnavaccine_vs_conu_4wpc_2024

The point of the post is if this sort of discrepancy is normal and, if it is, how do you control for it? I was wondering if, since Ssal_v3.1 is relatively recent, if the annotation of GO Terms to the reference genome is still 'catching up'? Does that make sense?

Summarizing, 2022 plot used ICSASG_v2 and AH107424, while 2024 plot used Ssal_v3.1 and AH114250. The sequencing data input is the same for both plots.

Atlanticsalmon clusterPro Salmosalar AnnotationHub GeneOntology • 320 views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

There are some differences, but it's not extensive.

## run this code on current release
> library(AnnotationHub)
> hub <- AnnotationHub()
> z <- hub[["AH114250"]]
> zz <- mapIds(z, keys(z), "GOALL", "ENTREZID", multiVals = "list")

## then did the same on an R-4.2.2 install, using AH107424, and saved as 'salmo_old.Rds'
## read those results in
> oldzz <- readRDS("salmo_old.Rds")
> all.equal(names(zz), names(oldzz))
[1] "Lengths (65492, 71043) differ (string compare on first 65492)"
[2] "16390 string mismatches"                                      
> length(zz)
[1] 65492
> length(oldzz)
[1] 71043
## for simplicity we'll just use the intersection, but do note that we have ~5500 fewer genes in the new OrgDb
> int <- intersect(names(zz), names(oldzz))
> zzz <- mapply(function(x,y) length(setdiff(x,y)), zz[int], oldzz[int])

> table(zzz)
zzz
    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
51593  9103  1237   333   665   355   216   195   304   105    85    95    68    90   197    79   190    41    50 
   19    20    21    22    23    24    25    26    27    28    29    30    31    32    33    34    35    36    38 
   37    29    30    19    36    24    22    16     5    17     7    14     3    21     4    15     9     7    11 
   39    40    41    42    43    44    45    46    47    49    50    51    53    54    55    56    57    58    59 
    2     1     4    13    11     2    21     3     5     5     4    11    15     2     1     7     4     1     5 
   60    61    62    64    69    71    73    77    80    83    87    91    95    97    99   101   102   111   119 
    2     1     2     9     1     5     1     1     2     1     1     1     1     2     1     2     7     7     1 

## the vast majority (~60,000 genes) only differ by one or fewer GO terms
## here's the biggest difference:
> oldzz <- oldzz[int]
> zz[zzz == 119L]
$`106601220`
  [1] "GO:1904888" "GO:0048484" "GO:0050935" "GO:0030318" "GO:0022011" "GO:0050768" "GO:0001755" "GO:0021772"
  [9] "GO:0048709" "GO:0071600" "GO:0048840" "GO:0010628" "GO:0048752" "GO:0050936" "GO:0008150" "GO:0032502"
 [17] "GO:0048856" "GO:0007275" "GO:0007399" "GO:0032501" "GO:0048483" "GO:0048731" "GO:0009987" "GO:0030154"
 [25] "GO:0043473" "GO:0048066" "GO:0048869" "GO:0050931" "GO:0007272" "GO:0007422" "GO:0008366" "GO:0010001"
 [33] "GO:0014037" "GO:0014044" "GO:0021782" "GO:0022008" "GO:0032292" "GO:0042063" "GO:0042552" "GO:0048468"
 [41] "GO:0010721" "GO:0045595" "GO:0045596" "GO:0048519" "GO:0048523" "GO:0050767" "GO:0050789" "GO:0050793"
 [49] "GO:0050794" "GO:0051093" "GO:0051239" "GO:0051241" "GO:0051960" "GO:0051961" "GO:0060284" "GO:0065007"
 [57] "GO:2000026" "GO:0001667" "GO:0009888" "GO:0014032" "GO:0014033" "GO:0016477" "GO:0048513" "GO:0048762"
 [65] "GO:0048863" "GO:0048864" "GO:0048870" "GO:0060485" "GO:0090497" "GO:0007417" "GO:0007420" "GO:0021537"
 [73] "GO:0021988" "GO:0030900" "GO:0060322" "GO:0002009" "GO:0007423" "GO:0009653" "GO:0009790" "GO:0009887"
 [81] "GO:0035239" "GO:0035295" "GO:0042471" "GO:0042472" "GO:0043583" "GO:0048562" "GO:0048568" "GO:0048598"
 [89] "GO:0048729" "GO:0048839" "GO:0060429" "GO:0060562" "GO:0071599" "GO:0090596" "GO:0008152" "GO:0009058"
 [97] "GO:0009059" "GO:0009889" "GO:0009891" "GO:0009893" "GO:0010467" "GO:0010468" "GO:0010556" "GO:0010557"
[105] "GO:0010604" "GO:0019222" "GO:0031323" "GO:0031325" "GO:0031326" "GO:0031328" "GO:0043170" "GO:0044237"
[113] "GO:0044249" "GO:0048518" "GO:0048522" "GO:0060255" "GO:0071704" "GO:1901576" "GO:0060872" "GO:0005634"
[121] "GO:0005575" "GO:0005622" "GO:0043226" "GO:0043227" "GO:0043229" "GO:0043231" "GO:0110165" "GO:0003677"
[129] "GO:0003674" "GO:0003676" "GO:0005488" "GO:0097159" "GO:1901363"

> oldzz[zzz == 119L]
$`106601220`
 [1] "GO:0005634" "GO:0005575" "GO:0005622" "GO:0043226" "GO:0043227" "GO:0043229" "GO:0043231" "GO:0110165"
 [9] "GO:0003677" "GO:0003674" "GO:0003676" "GO:0005488" "GO:0097159" "GO:1901363"

So I might expect some small differences, but maybe not as many as you see. Although you will have to delve deeper yourself to see what causes the changes.

Login before adding your answer.

Traffic: 716 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6