Mapping Affymetrix annotations with Bioconductor annotations
@joao-sollari-lopes-6122
Last seen 9.6 years ago
Hi,

I am trying to compare the annotations provided by Affymetrix with the ones provided by Bioconductor for

Zebrafish Gene 1.1 ST Array Strip

I have compared the files

zebgene11stdrentrezg.db_17.1.0.tar.gz (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz)

ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip)

The trouble is that the first identifies the Units as "100000002_at", "100000006_at", ..., "84703_at" and the second as "12943944", "12943954", ..., "13276104". Is there an easy way to know which correspond to which?

Thanks in advance,
Joao Lopes
Instituto Gulbenkian de Ciencia, Portugal
@james-w-macdonald-5106
Last seen 11 hours ago
United States
Hi Joao,

On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
> I have compared the files
>
> zebgene11stdrentrezg.db_17.1.0.tar.gz
> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz)

That file isn't supplied by Bioconductor, it is supplied by MBNI at University of Michigan.

In addition, (if you read what they have on their site to know what you are using) the probesets for that CDF no longer correspond in any way to the original probesets that Affy defined. So comparing the two doesn't make any sense.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
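To make the difference between the two ID types concrete, here is a minimal sketch using the example IDs from the question. It assumes (this is my understanding of the MBNI entrezg custom CDFs, not something stated in this thread) that their probeset names are Entrez Gene IDs with an `_at` suffix; the variable names are only for illustration.

```r
## Probeset names from the MBNI entrezg CDF, as quoted in the question
mbni_ids <- c("100000002_at", "100000006_at", "84703_at")

## Assumption: the part before "_at" is an Entrez Gene ID
entrez_ids <- sub("_at$", "", mbni_ids)
entrez_ids
```

Under that assumption the MBNI units are gene-level identifiers, while IDs such as "12943944" in the Affymetrix csv are transcript cluster IDs, so there is no direct one-to-one lookup between the two.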
Hi Jim,

Thanks for your quick reply. Actually, I was able to do some kind of mapping through the position of the probes in the strips, using the files

zebgene11stdrentrezgprobe_17.1.0.tar.gz (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz)

and

pd.zebgene.1.1.st (provided by Bioconductor)

The annotations compare very well with each other; however, the information provided by Affymetrix (available in pd.zebgene.1.1.st) is somewhat more complete.

The trouble with working with the Affymetrix Array Strip is that there seems to be little annotation support for it in Bioconductor, particularly because the packages "annotate" and "annaffy" seem to work only with Affymetrix chips.

I know I have plenty of reading to do, but is there a best way to work with Array Strips and still use the packages "annotate" and "annaffy"? At the moment I am using the package "oligo".

Thanks,
Joao
Hi Joao,

Unfortunately there are no readily available packages for annotating all the new model organism arrays from Affy. However, the functions to create your own annotation package do exist. If you look at the AnnotationForge package, specifically the SQLForge vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf), it is pretty straightforward to make your own annotation package.

I am assuming you are summarizing at the transcript level, so would want to make a zebgene11sttranscriptcluster.db package. For this you need the transcript csv file from Affy (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip). From this you want to generate a two-column file with the probeset ID in the first column, and then GenBank or RefSeq IDs in the second.

This is the tough part, as the annotation files need to be parsed to create this file.

I wrote an Rscript to parse these files that you could use. It is pretty naive, but seems to do a relatively reasonable job. You will obviously need to change the first line to point to the correct directory, and will have to have the org.Dr.eg.db package installed.

<copy from below>

#!/data/programs/lib64/R/bin/Rscript
args <- commandArgs(TRUE)
if(length(args) < 3)
    stop(paste("Usage: parseAffyTranscriptCsv.R <transcript.csv> <organism.db package>",
               "<output file> <mrna column header> (optional)\n"), call. = FALSE)
probefile <- args[1]
orgpkg <- args[2]
fileout <- args[3]
if(length(args) == 4) headercol <- args[4] else headercol <- "mrna_assignment"

## read the transcript csv; "---" marks missing values
dat <- read.csv(probefile, comment.char = "#", stringsAsFactors = FALSE,
                na.strings = "---")
## keep the first ID per probeset that matches one of the accession patterns
mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
    grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST", x, value = TRUE)[1])

ens <- grep("^ENS", mrna, value = TRUE)
require(orgpkg, character.only = TRUE) ||
    stop(paste("You need to install the", orgpkg, "package first!"))
ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
ens <- ens[!duplicated(ens[,1]),]
## use accnum if refseq is NA
ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
## put mapped data back in mrna vector
mrna[match(ens[,1], mrna)] <- ens[,2]
mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
## write out
write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "")

<to here>

Paste this into a file, make it executable (if on Linux or Mac OS X), and change the path in the first line to point to the location of your Rscript, and it should create a fairly reasonable file for input to AnnotationForge.

You just call this script from the command line:

parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv org.Dr.eg.db zebgene_mapper.txt

Then after a while you will have a file zebgene_mapper.txt that you can use as input to AnnotationForge.

Best,

Jim
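To see what the parsing step in the script does, it can help to run it on a couple of toy strings. The sketch below uses made-up `mrna_assignment` entries (the accessions and descriptions are hypothetical, not real annotation rows) laid out with the " // " and " /// " delimiters the script splits on, and applies the same pattern as the script.

```r
## Hypothetical mrna_assignment entries; not real annotation rows
assignment <- c(
  "NM_000001 // RefSeq // toy description // chr1 /// ENSDART00000000001 // ENSEMBL // toy description // chr1",
  "ENSDART00000000002 // ENSEMBL // toy description // chr2",
  NA
)

## Same pattern as in the script
pattern <- "^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST"

## First matching ID per entry; NA when nothing matches
first_id <- sapply(strsplit(assignment, " // | /// "),
                   function(x) grep(pattern, x, value = TRUE)[1])
first_id

## IDs that the script would hand to select() for conversion
ens <- grep("^ENS", first_id, value = TRUE)
ens
```

The first entry keeps its leading RefSeq-style ID, the second keeps only an Ensembl ID (which the script then tries to convert through the org package), and the missing entry stays NA and ends up as an empty second column in the output file.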
Hi Jim,

Many thanks for that!

All the best,
Joao

Hi Jim,

Many thanks for this script; it has been very useful in helping me build an annotation file. I have used it to build the annotation for the Affymetrix HTA 2.0 array (GeneChip® Human Transcriptome Array 2.0). I am able to create the mapper file (so your script works great!), but when I use AnnotationForge to build the package I get an error:

Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: no such table: src.accession

The code I ran is below, any help with this would be much appreciated!

library(AnnotationForge)

affytranscriptfile <- "HTA-2_0.na34.hg19.transcript.csv"
orgdb <- "org.Hs.eg.db"
mapperfile <- "HTA20_mapper.txt"
dbschema <- "HUMANCHIP_DB"
fileprefix <- "hta20transcriptcluster"
annoversion <- "1.0.0"
chipname <- "Human Transcriptome Array 2.0"

 

makeDBPackage(
schema=dbschema,
affy=FALSE,
prefix=fileprefix,
fileName=mapperfile,
baseMapType="gbNRef",
outputDir = getwd(),
version= annoversion,
manufacturer = "Affymetrix",
chipName = chipname,
manufacturerUrl = "http://www.affymetrix.com")

 

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationForge_1.12.2 org.Hs.eg.db_3.2.3     RSQLite_1.0.0          DBI_0.3.1              AnnotationDbi_1.32.3  
 [6] IRanges_2.4.8          S4Vectors_0.8.11       Biobase_2.30.0         BiocGenerics_0.16.1    knitr_1.12.3          

loaded via a namespace (and not attached):
[1] tools_3.2.5
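One thing that may be worth ruling out before calling makeDBPackage is a malformed mapper file. The following is only a minimal sketch that checks the shape of a two-column, tab-separated file; the file content and the column names `probe` and `acc` are hypothetical stand-ins, and this is not a diagnosis of the `src.accession` error itself.

```r
## Hypothetical mapper content standing in for HTA20_mapper.txt:
## probeset ID, tab, accession (empty when nothing was mapped)
mapperfile <- tempfile(fileext = ".txt")
writeLines(c("id0001\tNM_000001", "id0002\t", "id0003\tNR_000002"), mapperfile)

mapper <- read.delim(mapperfile, header = FALSE,
                     col.names = c("probe", "acc"),
                     colClasses = "character", na.strings = "")

ncol(mapper)                 # number of columns read
sum(is.na(mapper$acc))       # probesets without an accession
anyDuplicated(mapper$probe)  # 0 means no repeated probeset IDs
head(mapper)
```

For the real file, `mapperfile` would simply point at HTA20_mapper.txt instead of a temporary file.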

 

 
