Question

Annotation for Affymetrix GeneChip PrimeView.

1

Entering edit mode

ammasakshay • 0

@ammasakshay-23464

Last seen 3.3 years ago

Affymetrix GeneChip PrimeView.

Which Annotation Package do I use? Their main website allows you to download their annotation as a .csv, but the getSYMBOL function requires a package. Is there any way I can annotate from a CSV or a package I can use?

annotation • 4.1k views

ADD COMMENT • link updated 4.0 years ago by Kevin Blighe ★ 3.9k • written 4.0 years ago by ammasakshay • 0

score 4 · Accepted Answer · 2020-05-02

4

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 15 days ago

Republic of Ireland

Hey,

I am pretty sure that the PrimeView does not have an official annotation package in Bioconductor - it is certainly not the most commonly used of the Affymetrix chips.

You could retrieve the current NetAffx annotation CSV from Affymetrix's web-site ( PrimeView™ Human Gene Expression Array Plate - Support Materials ), though, and build your own package:

1, create the SQlite database and build the new package

require(AnnotationForge)
makeDBPackage('HUMANCHIP_DB',
  affy = TRUE,
  prefix = 'primeview',
  fileName = 'PrimeView.na36.annot.csv', 
  baseMapType = 'eg',
  outputDir = '.',
  author = 'Bioconductor',
  version = '0.99.1',
  manufacturer = 'Affymetrix',
  manufacturerUrl = 'http://www.affymetrix.com')

2, install and load it

install.packages('primeview.db', repos = NULL, type = 'source')
require(primeview.db)

3, perform various lookup operations

probes <- c('11715262_at', '1715107_s_at',
  '11715112_at', '11715113_x_at', '11715115_s_at',
  '11715116_s_at', '11715247_s_at', '11715248_s_at',
  '11715258_s_at', '11715269_x_at', '11715264_s_at')

keytypes(primeview.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIGENE"      "UNIPROT"

mapIds(primeview.db, keys = probes,
  column = c('ENSEMBL'), keytype = 'PROBEID')
'select()' returned 1:1 mapping between keys and columns
      11715262_at      1715107_s_at       11715112_at     11715113_x_at 
"ENSG00000168412"                NA "ENSG00000178965" "ENSG00000158483" 
    11715115_s_at     11715116_s_at     11715247_s_at     11715248_s_at 
"ENSG00000278588" "ENSG00000276966" "ENSG00000100150" "ENSG00000124490" 
    11715258_s_at     11715269_x_at     11715264_s_at 
"ENSG00000159335" "ENSG00000189366" "ENSG00000181408" 


mapIds(primeview.db, keys = probes,
  column = c('SYMBOL'), keytype = 'PROBEID')
'select()' returned 1:1 mapping between keys and columns
  11715262_at  1715107_s_at   11715112_at 11715113_x_at 11715115_s_at 
     "MTNR1A"            NA      "ERICH3"     "FAM86C1"      "H2BC10" 
11715116_s_at 11715247_s_at 11715248_s_at 11715258_s_at 11715269_x_at 
       "H4C5"      "DEPDC5"      "CRISP2"        "PTMS"       "ALG1L" 
11715264_s_at 
      "UTS2R" 

head(select(primeview.db, keys = probes,
  columns = c('PROBEID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'GO')))
      PROBEID SYMBOL              GENENAME         ENSEMBL         GO EVIDENCE
1 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0004930      IBA
2 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0005515      IPI
3 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0005886      IBA
4 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0005886      TAS
5 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0005887      IDA
6 11715262_at MTNR1A melatonin receptor 1A ENSG00000168412 GO:0007186      TAS
  ONTOLOGY
1       MF
2       MF
3       CC
4       CC
5       CC
6       BP


annotate::getSYMBOL(probes, 'primeview.db')
  11715262_at  1715107_s_at   11715112_at 11715113_x_at 11715115_s_at 
     "MTNR1A"            NA      "ERICH3"     "FAM86C1"      "H2BC10" 
11715116_s_at 11715247_s_at 11715248_s_at 11715258_s_at 11715269_x_at 
       "H4C5"      "DEPDC5"      "CRISP2"        "PTMS"       "ALG1L" 
11715264_s_at 
      "UTS2R"

There's a lot of useful information here: AnnotationDbi: Introduction To Bioconductor Annotation Packages

Kevin

ADD COMMENT • link 4.0 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

This fixed it. Thank you so much.

ADD REPLY • link 4.0 years ago ammasakshay • 0

0

Entering edit mode

Some questions. Do the author, version number, manufacturer and manufacture ID have any relevance to the code? By that I mean when creating the SQL if I put something else there would it change it?

Second, what are the lookup operators for? (Or from what I understand is it you are trying the annotation with a small subset of the genes and seeing if it works [basically testing purposes])

ADD REPLY • link 4.0 years ago ammasakshay • 0

0

Entering edit mode

Final question. What exactly is the schema. For example, both HumanDB and Humanchip db are listed as available schema. How do you select it?

ADD REPLY • link 4.0 years ago ammasakshay • 0

0

Entering edit mode

As the package is just created locally on your computer, I do not think that the values for author, version number, etc. are too important. If anything they are just for your own benefit so that you can keep track of (and version control) your annotation databases. You also need to be wary about the package name, because you would not want it to clash with a pre-existing official Bioconductor or CRAN package.

By showing the different lookup operations, I was merely providing some examples for you in order to see how you can use the information contained in the database. Hope that they helped!

Regarding the schema, I am pretty sure that it just pre-defines a template of, e.g., expected columns in your input data / file, and, ultimately, the columns that will be stored in the SQlite databsae that's created. There are 3 types of DBs that can be created:

ChipDb
OrgDb
TxDb

Perhaps reading the first few sections here will help:

https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

ADD REPLY • link 4.0 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Hi Kevin, I run your code to create preview.db package:

library(AnnotationForge)
makeDBPackage('HUMANCHIP_DB',
              affy = TRUE,
              prefix = 'primeview',
              fileName = 'D:/Tania_2/PhD/microarray_platform_anno/PrimeView-na36-annot-csv/PrimeView.na36.annot.csv', 
              baseMapType = 'eg',
              outputDir = 'D:/Tania_2/PhD/microarray_platform_anno/primeview.db',
              author = 'Bioconductor',
              version = '0.99.1',
              manufacturer = 'Affymetrix',
              manufacturerUrl = 'http://www.affymetrix.com')

# install and load:
install.packages('primeview.db', repos = NULL, type = 'source')
library(primeview.db)

but when I try to do this:

rawData <- read.celfiles(pathCELfiles)

gives the error:

Package 'pd.primeview' was not found in the BioConductor repository.
The 'pdInfoBuilder' package can often be used in situations like this.
Error in read.celfiles(pathCELfiles) : 
  The annotation package, pd.primeview, could not be loaded.

Do you know what am I doing wrong?

ADD REPLY • link 3.7 years ago salamandra ▴ 20

0

Entering edit mode

Hello again. Can you show how you ran the first section of code (marked '1')? In this example, 'primeview.db' should be a directory in your current working directory.

With this code, we are essentially creating our own local copy of a new package called 'primeview.db'

ADD REPLY • link 3.7 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Now I did:

library(AnnotationForge)
makeDBPackage('HUMANCHIP_DB',
              affy = TRUE,
              prefix = 'primeview',
              fileName = 'D:/Tania_2/PhD/microarray_platform_anno/PrimeView-na36-annot-csv/PrimeView.na36.annot.csv', 
              baseMapType = 'eg',
              outputDir = '.',
              author = 'Bioconductor',
              version = '0.99.1',
              manufacturer = 'Affymetrix',
              manufacturerUrl = 'http://www.affymetrix.com')
# install and load:
install.packages('primeview.db', repos = NULL, type = 'source')
library(primeview.db)

It created a package automatically in directory where other anno packages are. I think the package is installed correctly, because library(primeview.db) doesn't throw an error. But when I try to read cell files gives the error:

Package 'pd.primeview' was not found in the BioConductor repository.
The 'pdInfoBuilder' package can often be used in situations like this.
Error in read.celfiles(pathCELfiles) : 
  The annotation package, pd.primeview, could not be loaded.

I though it could be because the name of the package 'primeview.db' is different what expected 'pd.primeview' so I set package name when reading cell files:

rawData <- read.celfiles(pathCELfiles, pkgname = 'primeview.db')

but gave another error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘kind’ for signature ‘"ChipDb"’

ADD REPLY • link 3.7 years ago salamandra ▴ 20

2

Entering edit mode

You are mixing up 2 things: The primeview.db package you generated is indeed an annotation package. However, oligo requires a PlatformDesign (PdInfo) package for this array (to map the probes to probesets), called pd.primeview. In principle you should be able to generate such PdInfo package using the library pdInfoBuilder. This is what is reported in the error message (2nd block of code in your post).

Complicating factor may be that because of multi-mapped probes this may not be possible... At least, that was the status a couple of years ago as you can read in this thread here. Pay special attention to James MacDonald's answer/post here: https://support.bioconductor.org/p/53302/#53318 (this is the last one in the thread).

ADD REPLY • link 3.7 years ago Guido Hooiveld ★ 3.9k

1

Entering edit mode

The issue in the thread you link to had to do with the makecdfenv package, which in general doesn't like multi-mapping probes. IIRC, Ben Bolstad added some code to allow that, but the original Affy arrays never had such things, so they ended up getting ignored.

The pdInfoBuilder package has no such constraints though, as from the Exon series onward Affy has made use of many multi-mapping probes (the miRNA arrays being a hilarious example, given that an miRNA is shorter than the usual 25-mer, and Affy stuck with the multiple probes per probeset idea).

So yeah, the OP just needs to generate a pd.primeview package using pdInfoBuilder.

ADD REPLY • link 3.7 years ago James W. MacDonald 65k

0

Entering edit mode

where can we find the primeview .cdf file to use pdInfoBuilder. do you know?

ADD REPLY • link 3.7 years ago salamandra ▴ 20

2

Entering edit mode

I'd use the AffyCompatible package. It's a pain to find things on Fisher's website. You will need to have a username/password for Affy though. If you don't have one, go to netaffx.com and try to get something - It will send you to a login page.

> rsrc <- NetAffxResource("jmacdon@med.umich.edu", pwd)
> grep("primeview", names(rsrc), ignore.case = TRUE, value = TRUE)
[1] "Human PrimeView"
> cdf <- readAnnotation(rsrc, annotation = rsrc[["Human PrimeView","CDF Library File"]])
trying URL 'http://media.affymetrix.com/analysis/downloads/lf/ivt/PrimeView/PrimeView.cdf.zip'
Content type 'application/zip' length 3341427 bytes (3.2 MB)
downloaded 3.2 MB

returning path to file of affxType 'CDF'
> probetab <- readAnnotation(rsrc, annotation = rsrc[["Human PrimeView","Probe Sequences, tabular format"]], content = FALSE)
## Note the extra 'content' argument - I don't want to read that file into R.
## unzip both
> unzip(cdf)
> unzip(probetab)
## You need a CEL file as well - I don't have one, so got a random one from GEO
> getGEOSuppFiles("GSM2157164")
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2157nnn/GSM2157164/suppl//GSM2157164_IN_6391_B_II_PrimeView_.CEL.gz?tool=geoquery'
Content type 'application/x-gzip' length 2331172 bytes (2.2 MB)
downloaded 2.2 MB

## can't be gzipped either
> gunzip("C:/Users/jmacdon/Desktop/GSM2157164/GSM2157164_IN_6391_B_II_PrimeView_.CEL.gz")

> seed <- new("AffyExpressionPDInfoPkgSeed", cdfFile = "PrimeView.cdf", celFile = "C:/Users/jmacdon/Desktop/GSM2157164/GSM2157164_IN_6391_B_II_PrimeView_.CEL", tabSeqFile ="PrimeView.probe_tab" , author = "Me", email = "me@mine.org", organism = "Human", species = "Homo sapiens")
> makePdInfoPackage(seed)
================================================================================
Building annotation package for Affymetrix Expression array
CDF...............:  PrimeView.cdf 
CEL...............:  GSM2157164_IN_6391_B_II_PrimeView_.CEL 
Sequence TAB-Delim:  PrimeView.probe_tab 
================================================================================
Parsing file: PrimeView.cdf... OK
Parsing file: GSM2157164_IN_6391_B_II_PrimeView_.CEL... OK
Parsing file: PrimeView.probe_tab... OK
Getting information for featureSet table... OK
Getting information for pm/mm feature tables... OK
Combining probe information with sequence information... OK
Getting PM probes and sequences... OK
Done parsing.
Creating package in ./pd.primeview 
Inserting 49395 rows into table featureSet... OK
Inserting 609663 rows into table pmfeature... OK
Inserting 180 rows into table mmfeature... OK
Counting rows in featureSet
Counting rows in mmfeature
Counting rows in pmfeature
Creating index idx_pmfsetid on pmfeature... OK
Creating index idx_pmfid on pmfeature... Error: UNIQUE constraint failed: pmfeature.fid
In addition: Warning messages:
1: In parseCdfCelProbe(object@cdfFile, object@celFile, object@tabSeqFile,  :
  Probe sequences were not found for all PM probes. These probes will be removed from the pmSequence object.
2: In parseCdfCelProbe(object@cdfFile, object@celFile, object@tabSeqFile,  :
  Probe sequences were not found for all MM probes. These probes will be removed from the mmSequence object.
3: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
4: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
5: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
6: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
7: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
8: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().

## ignore those warnings.
> install.packages("pd.primeview/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/4.0'
(as 'lib' is unspecified)
* installing *source* package 'pd.primeview' ...
** using staged installation
** R
<other stuff="" happens="" here="">

ADD REPLY • link 3.7 years ago James W. MacDonald 65k

1

Entering edit mode

Note that it looks like there was a problem:

Creating index idx_pmfid on pmfeature... Error: UNIQUE constraint failed: pmfeature.fid

But maybe there wasn't? I built the cdfenv using makecdfenv, and it looks like some of the multiply-mapped probes got dropped by makecdfenv but not by pdInfoBuilder? Note that what I am calling a 'fid' is the field ID for the CEL file. When you read in a CEL file, you get a column of numbers"

> head(exprs(z))
  GSM2157164_IN_6391_B_II_PrimeView_.CEL
1                                    440
2                                  12831
3                                    644
4                                  12827
5                                    366
6                                    254

Where the FID is the row of the read-in data. So for example

> get("11715100_at", primeviewcdf)
          pm  mm
 [1,] 107740 NaN
 [2,] 182874 NaN
 [3,] 487733 NaN
 [4,] 496347 NaN
 [5,] 416415 NaN
 [6,]  78350 NaN
 [7,] 196959 NaN
 [8,] 424007 NaN
 [9,] 286779 NaN
[10,] 183515 NaN
[11,] 129465 NaN

Means that the probes for that probeset are in rows 107740, 182874, etc. And if there are multiply-mapping probes, they will show up more than once in the cdfenv or the pdInfoPackage, indicating that we use the same data more than once, for different probesets.

As an example

## query pdInfoPackage database directly
> con <- db(pd.primeview)
> fids.pd <- dbGetQuery(con, "select fid from pmfeature;")[,1]
> head(fids.pd[duplicated(fids.pd)])
[1] 19766 19766 19766 38066 38066 38066
## check the first duplicate
> dbGetQuery(con, "select * from pmfeature inner join featureSet using(fsetid)  where fid='19766';")
    fid fsetid x  y    man_fsetid strand
1 19766   4418 1 27 11719315_a_at      1
2 19766   4418 1 27 11719315_a_at      1
3 19766   4420 1 27 11719317_x_at      1
4 19766   4420 1 27 11719317_x_at      1
 ## weirdly this same probe is used twice for each probeset?
> dbGetQuery(con, "select * from featureSet inner join pmfeature using(fsetid) where man_fsetid='11719315_a_at';")
   fsetid    man_fsetid strand    fid   x   y
1    4418 11719315_a_at      1  19766   1  27
2    4418 11719315_a_at      1  19766   1  27
3    4418 11719315_a_at      1 374843  58 512
4    4418 11719315_a_at      1 382215 110 522
5    4418 11719315_a_at      1 153104 115 209
6    4418 11719315_a_at      1 432038 157 590
7    4418 11719315_a_at      1 432038 157 590
8    4418 11719315_a_at      1 283596 311 387
9    4418 11719315_a_at      1 338581 396 462
10   4418 11719315_a_at      1 370130 469 505
11   4418 11719315_a_at      1 268396 483 366
12   4418 11719315_a_at      1 239951 586 327
13   4418 11719315_a_at      1 485303 718 662
## two of them are used twice. Weird. Let's look at the cdfenv
> mget(c("11719315_a_at","11719317_x_at"), primeviewcdf)
$`11719315_a_at`
          pm  mm
 [1,] 485303 NaN
 [2,] 370130 NaN
 [3,] 432038 NaN
 [4,]  19766 NaN
 [5,] 374843 NaN
 [6,] 239951 NaN
 [7,] 338581 NaN
 [8,] 153104 NaN
 [9,] 283596 NaN
[10,] 268396 NaN
[11,] 382215 NaN

$`11719317_x_at`
          pm  mm
 [1,] 497071 NaN
 [2,] 371353 NaN
 [3,] 338364 NaN
 [4,] 383873 NaN
 [5,] 156170 NaN
 [6,] 432038 NaN
 [7,] 349744 NaN
 [8,] 489346 NaN
 [9,]  19766 NaN
[10,] 150707 NaN
[11,] 337348 NaN
## dups are only used once.

So the 'pdInfoPackage` has some weird duplications, but I think overall either that or the cdf package should be fine.

ADD REPLY • link 3.7 years ago James W. MacDonald 65k