Mapping Affymetrix annotations with Bioconductor annotations

0


Joao Sollari Lopes ▴ 80

@joao-sollari-lopes-6122

Last seen 9.6 years ago

Hi Jim,

Following on the discussion on annotation in Affymetrix Gene ST arrays, I wonder if there is a standard way to deal with multiple mRNAs (from different genes) that are assigned to the same transcript cluster. Is it generally accepted to follow the naive approach of picking the first mRNA in the list? I know that the mRNA assignments are ordered by rank, so is it safe to simply rely on the ranking already performed by Affymetrix?

Joao

On 08/29/2013 04:22 PM, James W. MacDonald wrote:
> Hi Joao,
>
> Unfortunately there are no readily available packages for annotating all the new model organism arrays from Affy. However, the functions to create your own annotation package do exist. If you look at the AnnotationForge package, specifically the SQLForge vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf), it is pretty straightforward to make your own annotation package.
>
> I am assuming you are summarizing at the transcript level, so would want to make a zebgene11sttranscriptcluster.db package. For this you need the transcript csv file from Affy (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip). From this you want to generate a two-column file with the probeset ID in the first column, and then GenBank or RefSeq IDs in the second.
>
> This is the tough part, as the annotation files need to be parsed to create this file.
>
> I wrote an Rscript to parse these files that you could use. It is pretty naive, but seems to do a relatively reasonable job. You will obviously need to change the first line to point to the correct directory, and will have to have the org.Dr.eg.db package installed, but this should
>
> <copy from below>
>
> #!/data/programs/lib64/R/bin/Rscript
> args <- commandArgs(TRUE)
> if(length(args) < 3)
>     stop(paste("Usage: parseAffyTranscripts.R <transcript.csv> <organism.db package> <output file>",
>                "<mrna column header> (optional)\n"), call. = FALSE)
> probefile <- args[1]
> orgpkg <- args[2]
> fileout <- args[3]
> if(length(args) == 4) headercol <- args[4] else headercol <- "mrna_assignment"
>
> dat <- read.csv(probefile, comment.char = "#", stringsAsFactors = FALSE, na.strings = "---")
> ## keep the first field of each probeset that looks like a RefSeq, GenBank or Ensembl transcript ID
> mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
>     grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST", x, value = TRUE)[1])
>
> ## map Ensembl transcript IDs to RefSeq or GenBank accessions
> ens <- grep("^ENS", mrna, value = TRUE)
> require(orgpkg, character.only = TRUE) || stop(paste("You need to install the", orgpkg, "package first!"))
> ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
> ens <- ens[!duplicated(ens[,1]),]
> ## use accnum if refseq is NA
> ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
> ## put mapped data back in mrna vector
> mrna[match(ens[,1], mrna)] <- ens[,2]
> mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
> ## write out
> write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE, na = "")
>
> <to here>
>
> Paste this into a file, make it executable (if on linux or macosx), and change the path in the first line to point to the location of your Rscript and it should create a fairly reasonable file for input to AnnotationForge.
>
> You just call this script from the command line:
>
> parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv org.Dr.eg.db zebgene_mapper.txt
>
> then after a while you will have a file zebgene_mapper.txt that you can use as input to AnnotationForge
>
> Best,
>
> Jim
>
> On Thursday, August 29, 2013 10:39:38 AM, Joao Sollari Lopes wrote:
>> Hi Jim,
>>
>> Thanks for your quick reply. Actually I was able to do some kind of mapping through the position of the probes in the strips using the files:
>>
>> zebgene11stdrentrezgprobe_17.1.0.tar.gz (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz)
>>
>> and
>>
>> pd.zebgene.1.1.st (provided by Bioconductor)
>>
>> The annotations compare very well with each other, however the info provided by Affymetrix (available in pd.zebgene.1.1.st) is somewhat more complete.
>>
>> The trouble with working with Affymetrix Array Strips is that there seems to be little support in Bioconductor for them as far as annotation is concerned, particularly because the packages "annotate" and "annaffy" seem to work only with Affymetrix Chips.
>>
>> I know I have plenty of reading to do, but is there a best way to work with Array Strips and still use packages "annotate" and "annaffy"? At the moment I am using package "oligo".
>>
>> Thanks,
>> Joao
>>
>> On 08/29/2013 03:15 PM, James W. MacDonald wrote:
>>> Hi Joao,
>>>
>>> On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
>>>> Hi,
>>>>
>>>> I am trying to compare the annotations provided by Affymetrix with the ones provided by Bioconductor for
>>>>
>>>> Zebrafish Gene 1.1 ST Array Strip
>>>>
>>>> I have compared the files
>>>>
>>>> zebgene11stdrentrezg.db_17.1.0.tar.gz (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz)
>>>
>>> That file isn't supplied by Bioconductor, it is supplied by MBNI at University of Michigan.
>>>
>>> In addition, (if you read what they have on their site to know what you are using) the probesets for that CDF no longer correspond in any way to the original probesets that Affy defined. So comparing the two doesn't make any sense.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>> ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip)
>>>>
>>>> The trouble is that the first identifies the Units as "100000002_at", "100000006_at", ..., "84703_at" and the second as "12943944", "12943954", ..., "13276104". Is there an easy way to know which correspond to which?
>>>>
>>>> Thanks in advance,
>>>> Joao Lopes
>>>> Instituto Gulbenkian de Ciencia, Portugal
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
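Before deciding whether to trust the first-ranked mRNA, it may help to see how many transcript clusters are actually ambiguous. The following is a minimal sketch in base R, assuming the `mrna_assignment` column separates entries with " /// " and fields within an entry with " // " (as the regular expression in the script above implies). The toy data frame, its IDs and accessions, and the helper `get_accessions` are hypothetical and only for illustration.

```r
# Toy data; the cluster IDs and accession strings are made up for illustration.
dat <- data.frame(
  transcript_cluster_id = c("12943944", "12943954", "12943960"),
  mrna_assignment = c(
    "NM_000001 // RefSeq // gene A /// ENSDART00000000001 // ENSEMBL // gene A",
    "NM_000002 // RefSeq // gene B /// NM_000003 // RefSeq // gene C",
    NA
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split an assignment into entries (" /// "), then keep
# the first field (" // ") of each entry, which is the accession.
get_accessions <- function(x) {
  if (is.na(x)) return(character(0))
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  vapply(strsplit(entries, " // ", fixed = TRUE), `[`, character(1), 1)
}

acc <- lapply(dat$mrna_assignment, get_accessions)
names(acc) <- dat$transcript_cluster_id

# Count RefSeq-style accessions per cluster; clusters with more than one
# are candidates for manual review rather than blind use of the first entry.
n_refseq <- vapply(acc, function(a) sum(grepl("^[NX][MR]_", a)), integer(1))
ambiguous <- names(n_refseq)[n_refseq > 1]
print(n_refseq)
print(ambiguous)
```

Note that several accessions in one cluster do not necessarily mean several genes; mapping the accessions to gene IDs (for example with the organism package) would be needed to tell the two cases apart.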

Annotation Organism zebrafish cdf affy AnnotationForge • 1.8k views

updated 10.5 years ago by James W. MacDonald 65k • written 10.5 years ago by Joao Sollari Lopes ▴ 80

1


James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi Joao,

There isn't a standard way that I am familiar with. But this illustrates a conceptual difference between the purpose of these arrays and what people end up using them for. I have run headlong into this issue lately, trying to create annotation packages for the new 2.X ST arrays.

The annotations for these arrays are primarily directed towards the _transcripts_ that a given probeset measures, rather than the underlying gene. So the data we get from these arrays are supposed to represent the relative abundance of a given transcript, and the 'duplicate' probesets on the array are supposed to measure transcript variants (at least I assume this is in general true, as the new TAC software is supposed to work with Gene ST arrays).

We know that there actually are transcript variants for various genes, and that these variants may give rise to phenotypic differences. So it may well be interesting to measure these variants and try to figure out if they have a meaningful effect on a phenotype we might be interested in.

However, 100% of the researchers I come into contact with are completely uninterested in such things, and just want to know if there are differences in expression at the _gene_ level. This is true BTW for RNA-Seq as well. This may have more to do with the crowd I run with, rather than the general desires of the average biologist, so I may just be suffering from confirmation bias here.

But I think it is a bit ironic that Affymetrix keeps trying to push transcript level data on us (Exon arrays, Gene ST arrays, now HTA arrays), and we push back just as hard, collapsing all these data to gene level. I am not sure if this is a lack of imagination on our part or a failure to understand the customer on Affy's part. Or maybe it's just that I don't hang with the cool kids.
Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

10.5 years ago James W. MacDonald 65k
