Affymetrix annotation using R or NetAffx

0

Entering edit mode

Natasha ▴ 440

@natasha-4640

Last seen 9.6 years ago

Dear List, Normally I use the batch query mode of the NetAffx (affymetrix website) to annotate my gene lists (downside: It can do only 10K probes at a time). I decided to try the R packages to do the same. However, I decided to check a couple of probes to see if the output in either case is the same but it is not! So I do not know where the error lies: in my code, in the meta data packages or the website or if I have the incorrect array package? > eset ExpressionSet (storageMode: lockedEnvironment) assayData: 32321 features, 21 samples element names: exprs, se.exprs protocolData sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21 total) varLabels: ScanDate varMetadata: labelDescription phenoData sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21 total) varLabels: sample varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hugene10stv1 > library(annotate) > library(hugene10stv1cdf) > library(hugene10sttranscriptcluster.db) > ls("package:hugene10sttranscriptcluster.db") > ID <- featureNames(eset) > Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db") > Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME")) > Entrez <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENTREZID")) > tail(Symbol) 8180413 8180414 8180415 8180416 8180417 8180418 "ST6GALNAC4" "ST6GALNAC1" "MCART1" "OR3A2" "METTL2B" NA > tail(ID) [1] "8180413" "8180414" "8180415" "8180416" "8180417" "8180418" > tail(Entrez) [1] "27090" "55808" "92014" "4995" "55798" "NA" Which does not match the output from the website below for the same set of IDs: * GeneChip(r) Human Gene 1.0 ST Array Displaying Results: 1-6 of 6. Transcript Cluster ID GO Description Chromosome Gene Title Pathway Gene Symbol Entrez Gene ID Cytoband 8180416 8180417 8180418 8180413 8180414 8180415 > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] hugene10sttranscriptcluster.db_8.0.1 hugene10stprobeset.db_8.0.1 [3] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 [5] DBI_0.2-5 annotate_1.34.0 [7] rgl_0.92.892 R.utils_1.12.1 [9] R.oo_1.9.3 R.methodsS3_1.2.2 [11] plotrix_3.4 hugene10stv1cdf_2.10.0 [13] AnnotationDbi_1.18.0 scatterplot3d_0.3-33 [15] WriteXLS_2.1.0 affyQCReport_1.34.0 [17] lattice_0.20-6 affy_1.34.0 [19] Biobase_2.16.0 BiocGenerics_0.2.0 [21] gdata_2.8.2 limma_3.12.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 affyPLM_1.32.0 BiocInstaller_1.4.7 [4] Biostrings_2.24.1 gcrma_2.28.0 genefilter_1.38.0 [7] grid_2.15.0 gtools_2.6.2 IRanges_1.14.2 [10] preprocessCore_1.18.0 RColorBrewer_1.0-5 simpleaffy_2.32.0 [13] splines_2.15.0 stats4_2.15.0 survival_2.36-14 [16] tcltk_2.15.0 tools_2.15.0 xtable_1.7-0 [19] zlibbioc_1.2.0 Any help/suggestion would be appreciated. Apologies if this is posted more than once!! Many Thanks, Natasha [[alternative HTML version deleted]]

annotate annotate • 4.5k views

ADD COMMENT • link updated 11.7 years ago by James W. MacDonald 65k • written 11.7 years ago by Natasha ▴ 440

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Hi Natasha, On 8/15/2012 12:05 PM, Natasha Sahgal wrote: > Dear List, > > > > Normally I use the batch query mode of the NetAffx (affymetrix website) to annotate my gene lists (downside: It can do only 10K probes at a time). > > I decided to try the R packages to do the same. > > > > However, I decided to check a couple of probes to see if the output in either case is the same but it is not! So I do not know where the error lies: in my code, in the meta data packages or the website or if I have the incorrect array package? Well, it appears this is some weird disagreement between Affy's netaffx database and the transcript files they release. Note that the annotation files we supply are simply repackaged versions of data we get from the manufacturer. There is no implied warranty that the data are right. And in the case of these particular data packages, we don't even make them - they are supplied by Arthur Li. So back to the task at hand. If you query netaffx for the first ID you supply (8180413), it will tell you that it is unmapped. However, if you download the transcript annotations for this chip and look for that ID, you get this: NM_175039.3 // --- // Homo sapiens ST6 (alpha-N-acetyl-neuraminyl-2,3-beta- galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 4 (ST6GALNAC4), transcript variant 1, mRNA // --- // --- // --- // --- // --- // --- in the mrna_assignment column. In addition, in the 'category' column it says flmrna->unmapped So this probeset is either unmapped, or it queries a transcript variant of ST6GALNAC4. No telling which. But the annotation package claims the variant because that is what is in the annotation file. I suppose you could take all the probe sequences for this probeset and Blat them against the genome to see where they bind. Or you could just remove all control probesets before annotating. The gene and exon chips have way more control probes than the old school 3' biased chips had, and they don't have the convenient AFFX handle that would let you know they were creeping into your set of significant genes. On one hand, these should never end up in your list of significant genes because they are controls. On the other hand, what exactly is a control, and who is to say it won't change expression (seriously - does this probeset measure something or not?). Recently I have taken to summarily excluding all non-main probesets from my analyses, right after the eBayes() step in limma. You can do something super cool like > library(pd.hugene.1.0.st.v1) > con <- db(pd.hugene.1.0.st.v1) > types <- dbGetQuery(con, "select meta_fsetid, type from featureSet inner join core_mps on featureSet.fsetid=core_mps.fsetid;") > head(types) meta_fsetid type 1 7892501 6 2 7892502 7 3 7892503 7 4 7892504 7 5 7892505 7 6 7892506 7 ind <- types[,2] == 1 and then you can subset using the ind variable. Best, Jim > > > > > >> eset > ExpressionSet (storageMode: lockedEnvironment) > > assayData: 32321 features, 21 samples > > element names: exprs, se.exprs > > protocolData > > sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 > > 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21 > > total) > > varLabels: ScanDate > > varMetadata: labelDescription > > phenoData > > sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 > > 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21 > > total) > > varLabels: sample > > varMetadata: labelDescription > > featureData: none > > experimentData: use 'experimentData(object)' > > Annotation: hugene10stv1 > > > >> library(annotate) >> library(hugene10stv1cdf) >> library(hugene10sttranscriptcluster.db) >> ls("package:hugene10sttranscriptcluster.db") >> ID<- featureNames(eset) >> Symbol<- getSYMBOL(ID, "hugene10sttranscriptcluster.db") >> Name<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME")) >> Entrez<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENTREZID")) > > >> tail(Symbol) > 8180413 8180414 8180415 8180416 8180417 8180418 > > "ST6GALNAC4" "ST6GALNAC1" "MCART1" "OR3A2" "METTL2B" NA > > > >> tail(ID) > [1] "8180413" "8180414" "8180415" "8180416" "8180417" "8180418" > > > >> tail(Entrez) > [1] "27090" "55808" "92014" "4995" "55798" "NA" > > > > > > Which does not match the output from the website below for the same set of IDs: > > * GeneChip(r) Human Gene 1.0 ST Array > > Displaying Results: 1-6 of 6. > > > > Transcript Cluster ID > > GO Description > > Chromosome > > Gene Title > > Pathway > > Gene Symbol > > Entrez Gene ID > > Cytoband > > > 8180416 > > > > > > > > > > > > > > > > 8180417 > > > > > > > > > > > > > > > > 8180418 > > > > > > > > > > > > > > > > 8180413 > > > > > > > > > > > > > > > > 8180414 > > > > > > > > > > > > > > > > 8180415 > > > > > > > > > > > > > > > > > > >> sessionInfo() > R version 2.15.0 (2012-03-30) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] hugene10sttranscriptcluster.db_8.0.1 hugene10stprobeset.db_8.0.1 > > [3] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 > > [5] DBI_0.2-5 annotate_1.34.0 > > [7] rgl_0.92.892 R.utils_1.12.1 > > [9] R.oo_1.9.3 R.methodsS3_1.2.2 > > [11] plotrix_3.4 hugene10stv1cdf_2.10.0 > > [13] AnnotationDbi_1.18.0 scatterplot3d_0.3-33 > > [15] WriteXLS_2.1.0 affyQCReport_1.34.0 > > [17] lattice_0.20-6 affy_1.34.0 > > [19] Biobase_2.16.0 BiocGenerics_0.2.0 > > [21] gdata_2.8.2 limma_3.12.0 > > > > loaded via a namespace (and not attached): > > [1] affyio_1.24.0 affyPLM_1.32.0 BiocInstaller_1.4.7 > > [4] Biostrings_2.24.1 gcrma_2.28.0 genefilter_1.38.0 > > [7] grid_2.15.0 gtools_2.6.2 IRanges_1.14.2 > > [10] preprocessCore_1.18.0 RColorBrewer_1.0-5 simpleaffy_2.32.0 > > [13] splines_2.15.0 stats4_2.15.0 survival_2.36-14 > > [16] tcltk_2.15.0 tools_2.15.0 xtable_1.7-0 > > [19] zlibbioc_1.2.0 > > > > > > Any help/suggestion would be appreciated. > > Apologies if this is posted more than once!! > > > > > > Many Thanks, > > Natasha > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.7 years ago James W. MacDonald 65k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Hi Natasha, Please don't take things off list. On 8/17/2012 12:06 PM, Natasha Sahgal wrote: > Dear Jim, > > Thank you for this. It is indeed strange that the website query differs from the downloadable annotation file and the R/BioConductor package. > > I'll try the filter as you suggested (it makes sense, as I managed to find out 1=main all others are controls!) and try the annotation using R again. > > I did have a few (naive) questions though: > > 1) Why did you use the inner join command in the sql query? As it(result) does not appear to be different when not used (I have basic SQL knowledge but it is well rusty as I have not used it in a very long time!) Ugh. For want of a DISTINCT... The fsetid in the featureSet table pertains to the probeset (roughly exon) level, whereas the meta_fsetid pertains to the transcript level. The mapping between transcript and probeset is captured in the core_mps file, so we have to join on that table. But you do have to remember that distinct clause. > table(dbGetQuery(con, "select distinct meta_fsetid, type from featureSet inner join core_mps on featureSet.fsetid=core_mps.fsetid;")$type) 1 2 4 6 7 28869 57 45 1195 2904 You have presumably summarized at the transcript level (as I would strongly advise - there are LOTS of single probe probesets at the probeset level). So if you want to filter out the controls, you have to do the join. > >> types2 = dbGetQuery(con, "select fsetid, type from featureSet;") >> dim(types2) > [1] 257430 2 >> table(types2$type) > 1 2 4 6 7 > 253002 57 45 1195 2904 > >> types1 = dbGetQuery(con, "select meta_fsetid, type from featureSet inner join core_mps on featureSet.fsetid=core_mps.fsetid;") >> dim(types1) > [1] 257430 2 >> table(types1$type) > 1 2 4 6 7 > 253002 57 45 1195 2904 > > 2) In general when filtering low expressed (as in Illumina) or control probes (the method you mentioned below for the affy data) would it not be better to normalise and perform limma after filtering data? It's probably a toss up, but I tend to leave things in, and then filter after the eBayes() step. The idea behind the eBayes() step is to estimate the expected variability of probesets on the chip. As we can see, the control probes do actually vary, so can contribute some information. An alternative point of view would be that these probes aren't supposed to vary, so are only contributing noise and should be filtered first. But I don't ascribe to that POV, as I can't accommodate the cognitive dissonance inherent in thinking apparent differences in 'control' probesets are due only to noise, yet differences in 'main' probesets are due to biology. I haven't yet reconciled arbitrarily filtering out the controls either, but cannot explain to clients why they appear in the list of top genes, so I am going with expediency. > > 3) I just realised that (after coming across a older thread) that the oligo package should be used for Human gene arrays and not affy....would it affect my results drastically as I have used affy? Should I rather redo the analysis using oligo? I doubt it would affect your results much, and certainly not drastically (or I should hope not - I have never compared). Technically speaking the affy package isn't/wasn't designed for the random primer based arrays. However, Ben Bolstad went back and made changes to the underlying code so it should do the right thing. I use oligo for these arrays, but I am pedantic I suppose. I also don't hammer screws, nor do I pound nails with the butt-end of a screwdriver. Well, unless I am really far away from a hammer and it is a relatively small nail. ;-D So it has been my recommendation for years now that people should use oligo for these arrays. Best, Jim > > > Many thanks, > Natasha > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at uw.edu] > Sent: 16 August 2012 16:38 > To: Natasha Sahgal > Cc: bioconductor at r-project.org > Subject: Re: [BioC] Affymetrix annotation using R or NetAffx > > Hi Natasha, > > On 8/15/2012 12:05 PM, Natasha Sahgal wrote: >> Dear List, >> >> >> >> Normally I use the batch query mode of the NetAffx (affymetrix website) to annotate my gene lists (downside: It can do only 10K probes at a time). >> >> I decided to try the R packages to do the same. >> >> >> >> However, I decided to check a couple of probes to see if the output in either case is the same but it is not! So I do not know where the error lies: in my code, in the meta data packages or the website or if I have the incorrect array package? > Well, it appears this is some weird disagreement between Affy's netaffx database and the transcript files they release. > > Note that the annotation files we supply are simply repackaged versions of data we get from the manufacturer. There is no implied warranty that the data are right. And in the case of these particular data packages, we don't even make them - they are supplied by Arthur Li. > > So back to the task at hand. If you query netaffx for the first ID you supply (8180413), it will tell you that it is unmapped. However, if you download the transcript annotations for this chip and look for that ID, you get this: > > NM_175039.3 // --- // Homo sapiens ST6 > (alpha-N-acetyl-neuraminyl-2,3-beta- galactosyl-1,3)-N-acetylgalactosaminide > alpha-2,6-sialyltransferase 4 (ST6GALNAC4), transcript variant 1, mRNA // --- // --- // --- // --- // --- // --- > > > in the mrna_assignment column. In addition, in the 'category' column it says > flmrna->unmapped > > > So this probeset is either unmapped, or it queries a transcript variant of ST6GALNAC4. No telling which. But the annotation package claims the variant because that is what is in the annotation file. > > I suppose you could take all the probe sequences for this probeset and Blat them against the genome to see where they bind. Or you could just remove all control probesets before annotating. > > The gene and exon chips have way more control probes than the old school 3' biased chips had, and they don't have the convenient AFFX handle that would let you know they were creeping into your set of significant genes. On one hand, these should never end up in your list of significant genes because they are controls. On the other hand, what exactly is a control, and who is to say it won't change expression (seriously - does this probeset measure something or not?). > > Recently I have taken to summarily excluding all non-main probesets from my analyses, right after the eBayes() step in limma. You can do something super cool like > >> library(pd.hugene.1.0.st.v1) >> con<- db(pd.hugene.1.0.st.v1) >> types<- dbGetQuery(con, "select meta_fsetid, type from featureSet > inner join core_mps on featureSet.fsetid=core_mps.fsetid;") >> head(types) > meta_fsetid type > 1 7892501 6 > 2 7892502 7 > 3 7892503 7 > 4 7892504 7 > 5 7892505 7 > 6 7892506 7 > > ind<- types[,2] == 1 > > and then you can subset using the ind variable. > > Best, > > Jim > > >> >> >> >> >>> eset >> ExpressionSet (storageMode: lockedEnvironment) >> >> assayData: 32321 features, 21 samples >> >> element names: exprs, se.exprs >> >> protocolData >> >> sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 >> >> 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene- 1_0-st-v1).CEL >> (21 >> >> total) >> >> varLabels: ScanDate >> >> varMetadata: labelDescription >> >> phenoData >> >> sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 >> >> 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene- 1_0-st-v1).CEL >> (21 >> >> total) >> >> varLabels: sample >> >> varMetadata: labelDescription >> >> featureData: none >> >> experimentData: use 'experimentData(object)' >> >> Annotation: hugene10stv1 >> >> >> >>> library(annotate) >>> library(hugene10stv1cdf) >>> library(hugene10sttranscriptcluster.db) >>> ls("package:hugene10sttranscriptcluster.db") >>> ID<- featureNames(eset) >>> Symbol<- getSYMBOL(ID, "hugene10sttranscriptcluster.db") >>> Name<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", >>> "GENENAME")) >>> Entrez<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", >>> "ENTREZID")) >> >>> tail(Symbol) >> 8180413 8180414 8180415 8180416 8180417 8180418 >> >> "ST6GALNAC4" "ST6GALNAC1" "MCART1" "OR3A2" "METTL2B" NA >> >> >> >>> tail(ID) >> [1] "8180413" "8180414" "8180415" "8180416" "8180417" "8180418" >> >> >> >>> tail(Entrez) >> [1] "27090" "55808" "92014" "4995" "55798" "NA" >> >> >> >> >> >> Which does not match the output from the website below for the same set of IDs: >> >> * GeneChip(r) Human Gene 1.0 ST Array >> >> Displaying Results: 1-6 of 6. >> >> >> >> Transcript Cluster ID >> >> GO Description >> >> Chromosome >> >> Gene Title >> >> Pathway >> >> Gene Symbol >> >> Entrez Gene ID >> >> Cytoband >> >> >> 8180416 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180417 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180418 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180413 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180414 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180415 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] hugene10sttranscriptcluster.db_8.0.1 hugene10stprobeset.db_8.0.1 >> >> [3] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 >> >> [5] DBI_0.2-5 annotate_1.34.0 >> >> [7] rgl_0.92.892 R.utils_1.12.1 >> >> [9] R.oo_1.9.3 R.methodsS3_1.2.2 >> >> [11] plotrix_3.4 hugene10stv1cdf_2.10.0 >> >> [13] AnnotationDbi_1.18.0 scatterplot3d_0.3-33 >> >> [15] WriteXLS_2.1.0 affyQCReport_1.34.0 >> >> [17] lattice_0.20-6 affy_1.34.0 >> >> [19] Biobase_2.16.0 BiocGenerics_0.2.0 >> >> [21] gdata_2.8.2 limma_3.12.0 >> >> >> >> loaded via a namespace (and not attached): >> >> [1] affyio_1.24.0 affyPLM_1.32.0 BiocInstaller_1.4.7 >> >> [4] Biostrings_2.24.1 gcrma_2.28.0 genefilter_1.38.0 >> >> [7] grid_2.15.0 gtools_2.6.2 IRanges_1.14.2 >> >> [10] preprocessCore_1.18.0 RColorBrewer_1.0-5 simpleaffy_2.32.0 >> >> [13] splines_2.15.0 stats4_2.15.0 survival_2.36-14 >> >> [16] tcltk_2.15.0 tools_2.15.0 xtable_1.7-0 >> >> [19] zlibbioc_1.2.0 >> >> >> >> >> >> Any help/suggestion would be appreciated. >> >> Apologies if this is posted more than once!! >> >> >> >> >> >> Many Thanks, >> >> Natasha >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.7 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Jim, Many Thanks for your response. Sorry about taking off the list, it's (a rather bad) habit to hit reply rather than reply all mostly :). BW, Natasha -----Original Message----- From: James W. MacDonald [mailto:jmacdon@uw.edu] Sent: 17 August 2012 17:50 To: Natasha Sahgal Cc: Bioconductor at r-project.org Subject: Re: [BioC] Affymetrix annotation using R or NetAffx Hi Natasha, Please don't take things off list. On 8/17/2012 12:06 PM, Natasha Sahgal wrote: > Dear Jim, > > Thank you for this. It is indeed strange that the website query differs from the downloadable annotation file and the R/BioConductor package. > > I'll try the filter as you suggested (it makes sense, as I managed to find out 1=main all others are controls!) and try the annotation using R again. > > I did have a few (naive) questions though: > > 1) Why did you use the inner join command in the sql query? As > it(result) does not appear to be different when not used (I have basic > SQL knowledge but it is well rusty as I have not used it in a very > long time!) Ugh. For want of a DISTINCT... The fsetid in the featureSet table pertains to the probeset (roughly exon) level, whereas the meta_fsetid pertains to the transcript level. The mapping between transcript and probeset is captured in the core_mps file, so we have to join on that table. But you do have to remember that distinct clause. > table(dbGetQuery(con, "select distinct meta_fsetid, type from featureSet inner join core_mps on featureSet.fsetid=core_mps.fsetid;")$type) 1 2 4 6 7 28869 57 45 1195 2904 You have presumably summarized at the transcript level (as I would strongly advise - there are LOTS of single probe probesets at the probeset level). So if you want to filter out the controls, you have to do the join. > >> types2 = dbGetQuery(con, "select fsetid, type from featureSet;") >> dim(types2) > [1] 257430 2 >> table(types2$type) > 1 2 4 6 7 > 253002 57 45 1195 2904 > >> types1 = dbGetQuery(con, "select meta_fsetid, type from featureSet >> inner join core_mps on featureSet.fsetid=core_mps.fsetid;") >> dim(types1) > [1] 257430 2 >> table(types1$type) > 1 2 4 6 7 > 253002 57 45 1195 2904 > > 2) In general when filtering low expressed (as in Illumina) or control probes (the method you mentioned below for the affy data) would it not be better to normalise and perform limma after filtering data? It's probably a toss up, but I tend to leave things in, and then filter after the eBayes() step. The idea behind the eBayes() step is to estimate the expected variability of probesets on the chip. As we can see, the control probes do actually vary, so can contribute some information. An alternative point of view would be that these probes aren't supposed to vary, so are only contributing noise and should be filtered first. But I don't ascribe to that POV, as I can't accommodate the cognitive dissonance inherent in thinking apparent differences in 'control' probesets are due only to noise, yet differences in 'main' probesets are due to biology. I haven't yet reconciled arbitrarily filtering out the controls either, but cannot explain to clients why they appear in the list of top genes, so I am going with expediency. > > 3) I just realised that (after coming across a older thread) that the oligo package should be used for Human gene arrays and not affy....would it affect my results drastically as I have used affy? Should I rather redo the analysis using oligo? I doubt it would affect your results much, and certainly not drastically (or I should hope not - I have never compared). Technically speaking the affy package isn't/wasn't designed for the random primer based arrays. However, Ben Bolstad went back and made changes to the underlying code so it should do the right thing. I use oligo for these arrays, but I am pedantic I suppose. I also don't hammer screws, nor do I pound nails with the butt-end of a screwdriver. Well, unless I am really far away from a hammer and it is a relatively small nail. ;-D So it has been my recommendation for years now that people should use oligo for these arrays. Best, Jim > > > Many thanks, > Natasha > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at uw.edu] > Sent: 16 August 2012 16:38 > To: Natasha Sahgal > Cc: bioconductor at r-project.org > Subject: Re: [BioC] Affymetrix annotation using R or NetAffx > > Hi Natasha, > > On 8/15/2012 12:05 PM, Natasha Sahgal wrote: >> Dear List, >> >> >> >> Normally I use the batch query mode of the NetAffx (affymetrix website) to annotate my gene lists (downside: It can do only 10K probes at a time). >> >> I decided to try the R packages to do the same. >> >> >> >> However, I decided to check a couple of probes to see if the output in either case is the same but it is not! So I do not know where the error lies: in my code, in the meta data packages or the website or if I have the incorrect array package? > Well, it appears this is some weird disagreement between Affy's netaffx database and the transcript files they release. > > Note that the annotation files we supply are simply repackaged versions of data we get from the manufacturer. There is no implied warranty that the data are right. And in the case of these particular data packages, we don't even make them - they are supplied by Arthur Li. > > So back to the task at hand. If you query netaffx for the first ID you supply (8180413), it will tell you that it is unmapped. However, if you download the transcript annotations for this chip and look for that ID, you get this: > > NM_175039.3 // --- // Homo sapiens ST6 > (alpha-N-acetyl-neuraminyl-2,3-beta- galactosyl-1,3)-N-acetylgalactosam > inide alpha-2,6-sialyltransferase 4 (ST6GALNAC4), transcript variant > 1, mRNA // --- // --- // --- // --- // --- // --- > > > in the mrna_assignment column. In addition, in the 'category' column > it says > flmrna->unmapped > > > So this probeset is either unmapped, or it queries a transcript variant of ST6GALNAC4. No telling which. But the annotation package claims the variant because that is what is in the annotation file. > > I suppose you could take all the probe sequences for this probeset and Blat them against the genome to see where they bind. Or you could just remove all control probesets before annotating. > > The gene and exon chips have way more control probes than the old school 3' biased chips had, and they don't have the convenient AFFX handle that would let you know they were creeping into your set of significant genes. On one hand, these should never end up in your list of significant genes because they are controls. On the other hand, what exactly is a control, and who is to say it won't change expression (seriously - does this probeset measure something or not?). > > Recently I have taken to summarily excluding all non-main probesets > from my analyses, right after the eBayes() step in limma. You can do > something super cool like > >> library(pd.hugene.1.0.st.v1) >> con<- db(pd.hugene.1.0.st.v1) >> types<- dbGetQuery(con, "select meta_fsetid, type from featureSet > inner join core_mps on featureSet.fsetid=core_mps.fsetid;") >> head(types) > meta_fsetid type > 1 7892501 6 > 2 7892502 7 > 3 7892503 7 > 4 7892504 7 > 5 7892505 7 > 6 7892506 7 > > ind<- types[,2] == 1 > > and then you can subset using the ind variable. > > Best, > > Jim > > >> >> >> >> >>> eset >> ExpressionSet (storageMode: lockedEnvironment) >> >> assayData: 32321 features, 21 samples >> >> element names: exprs, se.exprs >> >> protocolData >> >> sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 >> >> 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene- 1_0-st-v1).CEL >> (21 >> >> total) >> >> varLabels: ScanDate >> >> varMetadata: labelDescription >> >> phenoData >> >> sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8 >> >> 5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene- 1_0-st-v1).CEL >> (21 >> >> total) >> >> varLabels: sample >> >> varMetadata: labelDescription >> >> featureData: none >> >> experimentData: use 'experimentData(object)' >> >> Annotation: hugene10stv1 >> >> >> >>> library(annotate) >>> library(hugene10stv1cdf) >>> library(hugene10sttranscriptcluster.db) >>> ls("package:hugene10sttranscriptcluster.db") >>> ID<- featureNames(eset) >>> Symbol<- getSYMBOL(ID, "hugene10sttranscriptcluster.db") >>> Name<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", >>> "GENENAME")) >>> Entrez<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", >>> "ENTREZID")) >> >>> tail(Symbol) >> 8180413 8180414 8180415 8180416 8180417 8180418 >> >> "ST6GALNAC4" "ST6GALNAC1" "MCART1" "OR3A2" "METTL2B" NA >> >> >> >>> tail(ID) >> [1] "8180413" "8180414" "8180415" "8180416" "8180417" "8180418" >> >> >> >>> tail(Entrez) >> [1] "27090" "55808" "92014" "4995" "55798" "NA" >> >> >> >> >> >> Which does not match the output from the website below for the same set of IDs: >> >> * GeneChip(r) Human Gene 1.0 ST Array >> >> Displaying Results: 1-6 of 6. >> >> >> >> Transcript Cluster ID >> >> GO Description >> >> Chromosome >> >> Gene Title >> >> Pathway >> >> Gene Symbol >> >> Entrez Gene ID >> >> Cytoband >> >> >> 8180416 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180417 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180418 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180413 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180414 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> 8180415 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] hugene10sttranscriptcluster.db_8.0.1 hugene10stprobeset.db_8.0.1 >> >> [3] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 >> >> [5] DBI_0.2-5 annotate_1.34.0 >> >> [7] rgl_0.92.892 R.utils_1.12.1 >> >> [9] R.oo_1.9.3 R.methodsS3_1.2.2 >> >> [11] plotrix_3.4 hugene10stv1cdf_2.10.0 >> >> [13] AnnotationDbi_1.18.0 scatterplot3d_0.3-33 >> >> [15] WriteXLS_2.1.0 affyQCReport_1.34.0 >> >> [17] lattice_0.20-6 affy_1.34.0 >> >> [19] Biobase_2.16.0 BiocGenerics_0.2.0 >> >> [21] gdata_2.8.2 limma_3.12.0 >> >> >> >> loaded via a namespace (and not attached): >> >> [1] affyio_1.24.0 affyPLM_1.32.0 BiocInstaller_1.4.7 >> >> [4] Biostrings_2.24.1 gcrma_2.28.0 genefilter_1.38.0 >> >> [7] grid_2.15.0 gtools_2.6.2 IRanges_1.14.2 >> >> [10] preprocessCore_1.18.0 RColorBrewer_1.0-5 simpleaffy_2.32.0 >> >> [13] splines_2.15.0 stats4_2.15.0 survival_2.36-14 >> >> [16] tcltk_2.15.0 tools_2.15.0 xtable_1.7-0 >> >> [19] zlibbioc_1.2.0 >> >> >> >> >> >> Any help/suggestion would be appreciated. >> >> Apologies if this is posted more than once!! >> >> >> >> >> >> Many Thanks, >> >> Natasha >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 11.7 years ago Natasha ▴ 440

Login before adding your answer.