biomaRt and Ensembl probe set filter....


Jesper Ryge ▴ 110

@jesper-ryge-1960

Last seen 11.3 years ago

Hi everybody,

q1. I have been using biomaRt to filter Affymetrix probe sets prior to statistical testing with tools such as limma or cyberT. That is, I only include probe sets that are annotated in Ensembl. In this way I get rid of probe sets that do not align correctly to the intended genes; at least that was my intention. I know this has been debated before (CDF files and the filtering of misaligned probe sets), and I find this to be the easiest way to exclude probes that might hybridize to the wrong transcripts.

I now find that since 2007 the number of annotated probe sets on the Affymetrix Rat 230_2 array has decreased from 17931 to 12919 out of 31099 (I was redoing some analysis and found this discrepancy between the analysis I did in 2007 and the one run against the new Ensembl database). I find that to be a rather drastic decrease, but perhaps it is not? In essence I "lose" a lot of probe sets, but if those that are filtered out are "false positives" it is of course worth it. That was my logic so far, at least. So, first I would like to know whether anybody considers this strategy wise or unwise. It just seems a bit surprising to me that the probe sets on the Affy chips mismatch to such a large extent that only about 40% of them remain in the analysis.

I then wanted to check this decrease in annotated Affy probe sets, which leads me to question 2, a pure biomaRt issue:

q2. I wish to access earlier Ensembl versions to check, and possibly plot, the decrease in annotated probe sets for the Rat 230_2 chip over time, but I run into a problem:

> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
Error in useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl", :
  Incorrect BioMart name, use the listMarts function to see which BioMart databases are available

even though the marts are listed in the archive:

> listMarts(archive=T)
   biomart                      version
1  ensembl_mart_47              ENSEMBL GENES 47 (SANGER)
2  genomic_features_mart_47     Genomic Features
3  snp_mart_47                  SNP
4  vega_mart_47                 Vega
5  compara_mart_homology_47     Compara homology
6  compara_mart_multiple_ga_47  Compara multiple alignments
7  compara_mart_pairwise_ga_47  Compara pairwise alignments
8  ensembl_mart_46              ENSEMBL GENES 46 (SANGER)
9  genomic_features_mart_46     Genomic Features
10 snp_mart_46                  SNP
11 vega_mart_46                 Vega
12 compara_mart_homology_46     Compara homology
13 compara_mart_multiple_ga_46  Compara multiple alignments
14 compara_mart_pairwise_ga_46  Compara pairwise alignments
15 ensembl_mart_45              ENSEMBL GENES 45 (SANGER)
16 snp_mart_45                  SNP
17 vega_mart_45                 Vega
18 compara_mart_homology_45     Compara homology
19 compara_mart_multiple_ga_45  Compara multiple alignments
20 compara_mart_pairwise_ga_45  Compara pairwise alignments
21 ensembl_mart_44              ENSEMBL GENES 44 (SANGER)
22 snp_mart_44                  SNP
23 vega_mart_44                 Vega
24 compara_mart_homology_44     Compara homology
25 compara_mart_pairwise_ga_44  Compara pairwise alignments
26 ensembl_mart_43              ENSEMBL GENES 43 (SANGER)
27 snp_mart_43                  SNP
28 vega_mart_43                 Vega
29 compara_mart_homology_43     Compara homology
30 compara_mart_pairwise_ga_43  Compara pairwise alignments

> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-apple-darwin9.5.0

locale:
C

attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] rat2302cdf_2.3.0 biomaRt_1.99.2 affy_1.20.0 Biobase_2.2.1

loaded via a namespace (and not attached):
[1] RCurl_0.94-0 XML_1.98-1 affyio_1.10.1
[4] preprocessCore_1.4.0

Cheers,
Jesper Ryge, PhD student
Karolinska Institutet
Dep. of Neuroscience
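The filtering strategy described in q1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual code: `fetch_annotated_probesets()` and `filter_by_annotation()` are hypothetical helper names, the attribute name `affy_rat230_2` is assumed and should be checked with `listAttributes(mart)`, and the small matrix stands in for real expression data.

```r
# Hypothetical helper: ask Ensembl which probe sets map to a gene.
# Needs the biomaRt package and network access, so it is only defined here.
# The attribute name "affy_rat230_2" is an assumption.
fetch_annotated_probesets <- function(mart, attribute = "affy_rat230_2") {
  bm <- biomaRt::getBM(attributes = c(attribute, "ensembl_gene_id"),
                       mart = mart)
  ids <- unique(bm[[attribute]])
  ids[!is.na(ids) & nzchar(ids)]
}

# Keep only the rows of an expression matrix whose probe set ID is annotated.
filter_by_annotation <- function(expr, annotated_ids) {
  expr[rownames(expr) %in% annotated_ids, , drop = FALSE]
}

# Toy data standing in for normalised expression values (assumption).
expr <- matrix(1:12, nrow = 4,
               dimnames = list(c("ps_a", "ps_b", "ps_c", "ps_d"),
                               c("s1", "s2", "s3")))
annotated <- c("ps_a", "ps_c", "ps_x")  # ps_x is not on the toy array
filtered <- filter_by_annotation(expr, annotated)
print(rownames(filtered))

# Fraction of the array retained, using the counts quoted in the post.
retained <- 12919 / 31099
print(round(retained, 3))
```

With a live connection, `annotated` would come from something like `fetch_annotated_probesets(biomaRt::useEnsembl(biomart = "genes", dataset = "rnorvegicus_gene_ensembl"))`. The filter is deliberately separate from the query, so the same probe set list can be reused or swapped for one from another Ensembl release.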

cdf probe affy graph limma biomaRt • 2.0k views

ADD COMMENT • link updated 16.9 years ago by James W. MacDonald 68k • written 16.9 years ago by Jesper Ryge ▴ 110


James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 9 days ago

United States

Hi Jesper,

Jesper Ryge wrote:
> q1. I have been using biomaRt to filter Affymetrix probe sets prior to statistical testing such
> as limma or cyberT. That is, I only include probe sets that are annotated in ensembl.
> [...]
> So, first i would like to know if anybody considers this strategy wise/unwise?

I think you are making a pretty strong assumption here. Do you know how Ensembl annotates Affy probe IDs to transcripts? It seems to me that you are assuming that Ensembl somehow checks which transcript the probes are complementary to, whereas they may in fact simply be taking the data from Affy and accepting them verbatim. I personally have no idea, but I would want to know that before I filtered data in this way.

> q2. I wish to access earlier ensembl versions [...] but i run into a problem:
>
>> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Error in useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl", :
> Incorrect BioMart name, use the listMarts function to see which BioMart databases are
> available
>
> though they are listed in the archive:

I don't know if this is the problem, but you have mixed a devel version of biomaRt (1.99.2) into your release version of R. This works for me with a release version of biomaRt:

> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
Checking attributes and filters ... ok
> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] biomaRt_1.16.0 fortunes_1.3-6
[3] RMySQL_0.6-1 DBI_0.2-4
[5] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.1
[7] Biostrings_2.10.1 IRanges_1.0.2

loaded via a namespace (and not attached):
[1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16 RCurl_0.92-0
[5] tools_2.8.0 XML_1.94-0.1

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662
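The release/devel mix-up mentioned in this reply can be spotted from the version number alone. By Bioconductor convention, a package version x.y.z with an even y belongs to a release and one with an odd y belongs to devel. A minimal sketch of that check, where `is_devel_version()` is a hypothetical helper name:

```r
# Bioconductor convention: in version x.y.z, an even y marks a release
# package and an odd y marks a devel package.
is_devel_version <- function(version) {
  parts <- unclass(package_version(version))[[1]]  # e.g. c(1, 99, 2)
  parts[2] %% 2 == 1
}

jesper_is_devel <- is_devel_version("1.99.2")  # biomaRt in Jesper's session
jim_is_devel    <- is_devel_version("1.16.0")  # biomaRt in Jim's session
print(c(jesper = jesper_is_devel, jim = jim_is_devel))
```

In a current installation, `BiocManager::valid()` reports packages that are out of step with the installed Bioconductor version, which is usually a more thorough check than inspecting one version string.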

ADD COMMENT • link 16.9 years ago James W. MacDonald 68k


James W. MacDonald wrote:
> I think you are making a pretty strong assumption here. Do you know how
> Ensembl is annotating Affy Probe IDs to transcript? It seems to me that
> you are assuming that Ensembl is somehow checking to see what transcript
> the probes are complementary to, whereas they may in fact be simply
> taking data from Affy and accepting them verbatim. I personally have no
> idea, but would want to know that before I filtered data in this way.

From http://www.ensembl.org/info/docs/microarray_probe_set_mapping.html

Step One: Genome Sequence Mapping

In the first step individual probes (oligonucleotides) are mapped to the genome sequence. The Ensembl analysis and annotation pipeline uses the Exonerate sequence comparison and alignment tool (Slater et al., 2005) and tolerates only 1 bp mismatch between the probe and the genome sequence assembly. Probes that hit to 100 or more locations (e.g. suspected Alu repeats) are discarded and not stored in the database.

Step Two: Ensembl Transcript Mapping

In the second step, we aim to associate microarray probe sets with Ensembl transcript predictions (ENST...). Individual probes are grouped into probe sets and generally it is required that more than 50% of the probes in a probe set hit a given transcript sequence. Probe set sizes are determined dynamically on a per probe set basis, rather than taking the array-wide value documented by the manufacturer. Transcript cDNA sequences are extended by the length of the UTR. Where annotated UTRs are absent a default UTR length is used, calculated for both five and three prime UTRs as the highest of either the mean or the median of all annotated UTRs for a given species. Probes mapping across exon boundaries are not currently captured as the transcript annotations are based on the genomic mappings from step one.
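The "more than 50% of the probes in a probe set" rule from step two can be illustrated with a small base R sketch. This is not Ensembl's pipeline code. The hit table is toy data, and probe set size is simplified to the number of distinct probes that appear in the table, so probes with no hits at all are not counted.

```r
# One row per (probe, transcript) alignment hit; toy data (assumption).
hits <- data.frame(
  probeset   = c("ps1", "ps1", "ps1", "ps2", "ps2"),
  probe      = c("p1",  "p2",  "p3",  "p4",  "p5"),
  transcript = c("T1",  "T1",  "T2",  "T3",  "T4"),
  stringsAsFactors = FALSE
)

# Probe set size, taken here as the number of distinct probes seen per set.
set_size <- tapply(hits$probe, hits$probeset,
                   function(p) length(unique(p)))

# Number of distinct probes hitting each (probe set, transcript) pair.
counts <- aggregate(probe ~ probeset + transcript,
                    data = unique(hits), FUN = length)
counts$fraction <- counts$probe / as.numeric(set_size[counts$probeset])

# Keep pairs where more than half of the probes hit the transcript.
mapped <- counts[counts$fraction > 0.5,
                 c("probeset", "transcript", "fraction")]
print(mapped)
```

In the toy data, ps1 has two of its three probes on T1 and is kept, while ps2 splits evenly between two transcripts and is left unannotated. A probe set can therefore drop out of the annotation without any single probe being "wrong".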

ADD REPLY • link 16.9 years ago Cei Abreu-Goodger ▴ 830


Hi James,

Cei Abreu-Goodger wrote:
> From http://www.ensembl.org/info/docs/microarray_probe_set_mapping.html
> [...]
> Individual probes are grouped into probe sets and generally it is required
> that more than 50% of the probes in a probe set hit a given transcript sequence.

Hm, it seems to me that the Ensembl team is making a decent effort to filter out non-specific probes... but I'm not convinced either way yet: to filter or not?

If I run a test for significantly differentially expressed genes (e.g. limma) on the full data set, I end up with a list that contains probe sets that Ensembl has discarded (not annotated to a gene or transcript, although the probe alignment information is available). This can be because they are not specific, or because they contain mismatching probes that potentially cross-hybridize to several other transcripts. These are obviously not very trustworthy probe sets, and they were my initial reason for eliminating them from my data set prior to any statistical analysis (so that they do not affect the false discovery rate)...

Does anybody have any comments on or experience with this? Surely the probes as designed by Affymetrix are not perfect, and as the sequence quality of the databases increases it makes sense to filter out the probes that turn out to have sequence similarity to regions not originally intended. But how?

Cheers,
Jesper
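The concern about the false discovery rate can be made concrete with base R's `p.adjust()`. The sketch below uses four made-up p-values and an assumed annotation flag; it only shows that the Benjamini-Hochberg adjustment depends on how many tests enter it, not which choice is right for a given study.

```r
# Toy p-values for four probe sets (assumption, not real data).
p_values  <- c(ps1 = 0.01, ps2 = 0.02, ps3 = 0.03, ps4 = 0.5)
# Assumed annotation status: only ps1 and ps2 are annotated in Ensembl.
annotated <- c(ps1 = TRUE, ps2 = TRUE, ps3 = FALSE, ps4 = FALSE)

# Benjamini-Hochberg adjustment on the full set and on the filtered set.
adj_full     <- p.adjust(p_values, method = "BH")
adj_filtered <- p.adjust(p_values[annotated], method = "BH")

print(adj_full)
print(adj_filtered)
```

The same raw p-values get different adjusted values once the unannotated probe sets are removed, because the correction is applied over fewer tests. Note that the filter here uses annotation only and never looks at the test statistics, which matters if the adjusted values are to keep their usual interpretation.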

ADD REPLY • link 16.9 years ago Jesper Ryge ▴ 110


Jesper Ryge wrote:
> surely the probes as they are designed by affymetrix are not perfect
> and as the data base sequence quality increases it makes sense to filter out the probes
> that shows to have sequence similarity to regions not originally intended. but how?

One way to do this is to throw away the probes believed to be irrelevant very early in the analysis, and to build new probe set descriptions (aka alternative CDF environments) based on sequence similarity. You will find bibliographic references fairly easily in which this is shown to improve results for Affymetrix arrays.

There is also an earlier thread on the topic:
https://stat.ethz.ch/pipermail/bioconductor/2006-January/011474.html

L.
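The idea of rebuilding probe set descriptions from alignments can be sketched in base R. This is only an outline of the general approach under stated assumptions, not the procedure used by any particular alternative CDF project: the alignment table is toy data and the minimum of two probes per set is an arbitrary choice for the example.

```r
# Toy alignment table: one row per (probe, gene) hit (assumption).
alignments <- data.frame(
  probe = c("p1", "p2", "p3", "p3", "p4", "p5", "p6"),
  gene  = c("G1", "G1", "G1", "G2", "G2", "G2", "G3"),
  stringsAsFactors = FALSE
)

# Step 1: drop probes that hit more than one gene.
genes_per_probe <- tapply(alignments$gene, alignments$probe,
                          function(g) length(unique(g)))
specific_probes <- names(genes_per_probe)[genes_per_probe == 1]
specific <- alignments[alignments$probe %in% specific_probes, ]

# Step 2: regroup the remaining probes by the gene they target.
new_sets <- split(specific$probe, specific$gene)

# Step 3: require a minimum number of probes per redefined set.
min_probes <- 2  # arbitrary threshold for this example
new_sets <- new_sets[lengths(new_sets) >= min_probes]
print(new_sets)
```

In the toy data, p3 is discarded because it hits both G1 and G2, and G3 is dropped because a single probe remains. Once such definitions are packaged as a CDF environment, `affy::ReadAffy()` can be pointed at it through its `cdfname` argument; the package name to pass depends on the custom CDF used.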

ADD REPLY • link 16.9 years ago Laurent Gautier ★ 2.3k


Hi Jim,

Thanks for the fast answer :-) There are still small bumps on the way, but it is getting better. I installed biomaRt_1.16.0 and now I can run:

mart <- useMart("ensembl_mart_47", dataset="rnorvegicus_gene_ensembl",archive=T)

This also works for ensembl_mart_46, but not for ensembl_mart_45 and earlier, even though they are listed by listMarts(mart, archive=T) down to ensembl_mart_43. So the functionality seems a little reduced?

Cheers,
Jesper

Jesper Ryge
Karolinska Institutet
tlf: +46 707 146 879

James W. MacDonald wrote:
> I don't know if this is the problem, but you have mixed a devel version
> of biomaRt in your release version of R. This works for me with a
> release version of biomaRt:
>
> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Checking attributes and filters ... ok
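For the plot over Ensembl releases asked for in q2, a sketch with the current biomaRt interface might look like the following. It assumes `useEnsembl()` with its `version` argument instead of the `archive=` argument used in this thread, and again assumes the attribute name `affy_rat230_2`. `count_annotated()` and `count_over_releases()` are hypothetical helpers, and the stand-in counter returns made-up numbers purely so the example runs offline.

```r
# Hypothetical helper: count probe sets annotated to a gene in one Ensembl
# release. Needs biomaRt and network access, so it is defined but not run.
count_annotated <- function(version, attribute = "affy_rat230_2") {
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "rnorvegicus_gene_ensembl",
                              version = version)
  bm <- biomaRt::getBM(attributes = c(attribute, "ensembl_gene_id"),
                       mart = mart)
  ids <- unique(bm[[attribute]])
  sum(!is.na(ids) & nzchar(ids))
}

# Collect counts for several releases; a release that cannot be reached
# yields NA instead of aborting the whole loop.
count_over_releases <- function(versions, counter = count_annotated) {
  vapply(versions, function(v) {
    tryCatch(as.numeric(counter(v)), error = function(e) NA_real_)
  }, numeric(1))
}

# Offline demonstration with a stand-in counter (made-up numbers).
fake_counter <- function(v) {
  if (v < 44) stop("release not available")
  c(`44` = 100, `45` = 90, `46` = 80)[[as.character(v)]]
}
versions <- c(43, 44, 45, 46)
counts <- count_over_releases(versions, counter = fake_counter)
print(data.frame(version = versions, annotated = counts))
```

With real data, `counts <- count_over_releases(versions)` followed by `plot(versions, counts, type = "b")` would give the graph. Which releases are still served can be checked with `biomaRt::listEnsemblArchives()`; releases as old as those in this thread may no longer be available.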

ADD REPLY • link 16.9 years ago Jesper Ryge ▴ 110
