Question

Exon probesets as "core" probsets in pd.mogene.2.0.st

0

Entering edit mode

captainfats • 0

@captainfats-7180

Last seen 11.1 years ago

United States

Using Oligo/limma for analysis of the Mouse Gene 2.0 ST array, I noticed a few un annotated results that lack any gene annotation, upon inspecting some of these IDs on Netaffx they turned out to be the "exon" probesetids not the transcript_ids. I noticed that for some genes both the Transcript_cluster_id and the individual probeset_ids (exon?) were present in my output.

ID	symbol	name	Ensembl	adj.P	logFC
17219448	Copa	coatomer...	ENSMUSG00000026553	0.2239127307	0.1570203949
17219450	NA	NA	NA	0.3382192058	0.148945971
17219451	NA	NA	NA	0.2129264336	0.2089252306
17219452	NA	NA	NA	0.3294590243	0.1762376171
17219453	NA	NA	NA	0.0730023042	0.2049750825
17219454	NA	NA	NA	0.4965835277	0.0907620885

Looking into pd.mogene.2.0.st it does appear that this happens a times, where the transcript_cluster_id and the individual exon id are both in the "core" set. Is this an affymetrix annotation issue/feature and does anyone have a suggestion on what to do.

con <- db(pd.mogene.2.0.st)
head(dbGetQuery(con, "select * from pmfeature inner join 
+ core_mps on pmfeature.fsetid=core_mps.fsetid where 
+ core_mps.meta_fsetid='17219448';"))
      fid   fsetid  atom    x    y meta_fsetid transcript_cluster_id   fsetid
1  216932 17219449 16055  923  134    17219448              17219448 17219449
2 1015902 17219450 16056  341  630    17219448              17219448 17219450
3 2307453 17219451 16057  680 1431    17219448              17219448 17219451
4  164045 17219452 16058 1232  101    17219448              17219448 17219452
5 1692866 17219453 16059  265 1050    17219448              17219448 17219453
6  811295 17219454 16060  458  503    17219448              17219448 17219454

For each fsetid of meta_fesetid 17219448 each of the fsetid (which I presume is an exon probeset) also appears as a core_mps

dbGetQuery(con, "select * from pmfeature inner join 
+ core_mps on pmfeature.fsetid=core_mps.fsetid where 
+ core_mps.meta_fsetid='17219449';")
     fid   fsetid  atom   x   y meta_fsetid transcript_cluster_id   fsetid
1 216932 17219449 16055 923 134    17219449              17219449 17219449

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.mogene.2.0.st_2.14.1 oligo_1.30.0            Biobase_2.26.0          oligoClasses_1.28.0     RSQLite_1.0.0          
 [6] DBI_0.3.1               Biostrings_2.34.1       XVector_0.6.0           IRanges_2.0.1           S4Vectors_0.4.0        
[11] BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] affxparser_1.38.0     affyio_1.34.0         BiocInstaller_1.16.1  bit_1.1-12            codetools_0.2-9       ff_2.2-13            
 [7] foreach_1.4.2         GenomeInfoDb_1.2.3    GenomicRanges_1.18.3  iterators_1.0.7       preprocessCore_1.28.0 splines_3.1.2        
[13] tools_3.1.2           zlibbioc_1.12.0

oligo pd.mogene.2.0.st • 1.5k views

ADD COMMENT • link updated 11.1 years ago by James W. MacDonald 68k • written 11.1 years ago by captainfats • 0

score 1 · Answer 1 · 2014-12-16

This is one of the fun aspects of the newer generation of Affymetrix arrays; they tend to reuse the exon level probesets as controls when you summarize at the transcript level.

There does however appear to be a problem with the pd.mogene.2.0.st package:

> dbGetQuery(con, "select fsetid, type, meta_fsetid from featureSet inner join core_mps using(fsetid) where meta_fsetid in ('17219448','17219450','17219451','17219452');")
     fsetid type meta_fsetid
1  17219449    1    17219448
2  17219450    1    17219448
3  17219451    1    17219448
4  17219452    1    17219448
5  17219453    1    17219448
6  17219454    1    17219448
7  17219455    1    17219448
8  17219456    1    17219448
9  17219457    1    17219448
10 17219459    1    17219448
11 17219460    1    17219448
12 17219461    1    17219448
13 17219462    1    17219448
14 17219463    1    17219448
15 17219464    1    17219448
16 17219465    1    17219448
17 17219466    1    17219448
18 17219469    1    17219448
19 17219470    1    17219448
20 17219471    1    17219448
21 17219472    1    17219448
22 17219473    1    17219448
23 17219474    1    17219448
24 17219475    1    17219448
25 17219476    1    17219448
26 17219477    1    17219448
27 17219478    1    17219448
28 17219479    1    17219448
29 17219480    1    17219448
30 17219481    1    17219448
31 17219482    1    17219448
32 17219483    1    17219448
33 17219484    1    17219448
34 17219486    1    17219448
35 17219487    1    17219448
36 17219488    1    17219448
37 17219450    1    17219450
38 17219451    1    17219451
39 17219452    1    17219452

But the last three lines there should have a 'type' of 6, rather than 1:

> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2              control->affx
3     3              control->chip
4     4  control->bgp->antigenomic
5     5      control->bgp->genomic
6     6             normgene->exon
7     7           normgene->intron
8     8   rescue->FLmRNA->unmapped
9     9   control->affx->bac_spike
10   10             oligo_spike_in
11   11            r1_bac_spike_at
12   12 control->affx->polya_spike
13   13        control->affx->ercc
14   14  control->affx->ercc->step

If we look at the netaffxTranscript.rda data that come with this package:

> load(paste0(path.package("pd.mogene.2.0.st"), "/extdata/netaffxTranscript.rda"))
> pd <- pData(netaffxTranscript)
> pd[pd$transcriptclusterid %in% c("17219448","17219450","17219451","17219452"),c(1:2,18)]
         transcriptclusterid probesetid       category
17219448            17219448   17219448           main
17219450            17219450   17219450 normgene->exon
17219451            17219451   17219451 normgene->exon
17219452            17219452   17219452 normgene->exon

Ideally you would be able to excise these control probesets so you don't have this stuff popping up in your top hits, but evidently the normgene->exon controls are slipping through:

> table(dbGetQuery(con, "select type from featureSet;")[,1])

     1      2      4      7      9     12
263551     18     23   5331     18     39

> table(pd[,18])

             control->affx   control->affx->bac_spike
                        18                         18
 control->affx->ercc-spike control->affx->polya_spike
                        92                         39
 control->bgp->antigenomic                       main
                        23                      33793
            normgene->exon           normgene->intron
                      1352                       5331
                  reporter                     rescue
                        82                        558
          rescue->flmrna->                       rrna
                        14                         25

There are evidently lots of control probes that are sneaking through the cracks, so pdInfoBuilder needs some updating to handle these arrays.