Using biomaRt to check probe locations

0

Entering edit mode

Nathan Harmston ▴ 100

@nathan-harmston-2904

Last seen 9.6 years ago

Hi, I have just found a few of the probes which I find to be significantly diff expression in one of my analysis, have no attached genomic location in the annotation package I am using. So I thought a way around this would be to query ensembl using biomaRt and retrieve probe_set locations from ensembl (since I trust Ensembl infinitely more than Affy), however in the listAttributes(ensembl) I have been unable to determine the attribute that I require to determine the start and end location and strand of the actual probe involved, however all I can get back are transcript locations and gene locations. For example probe: 1552422_at from the hgu133plus2 array. Is there an attribute that I am missing that can be used? Also just out of interest, does getBM have the 3 second break automatically or not? or does it use one big query? I ve only been using 1 test probe, and would like to make sure I am not going to get myself and my university blocked. Many thanks in advance, Nathan

Annotation hgu133plus2 probe biomaRt Annotation hgu133plus2 probe biomaRt • 1.4k views

ADD COMMENT • link updated 15.7 years ago by James W. MacDonald 65k • written 15.7 years ago by Nathan Harmston ▴ 100

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 11 minutes ago

United States

Hi Nathan, Nathan Harmston wrote: > Hi, > > I have just found a few of the probes which I find to be significantly > diff expression in one of my analysis, have no attached genomic > location in the annotation package I am using. So I thought a way > around this would be to query ensembl using biomaRt and retrieve > probe_set locations from ensembl (since I trust Ensembl infinitely > more than Affy), however in the listAttributes(ensembl) I have been > unable to determine the attribute that I require to determine the > start and end location and strand of the actual probe involved, > however all I can get back are transcript locations and gene > locations. > > For example probe: 1552422_at from the hgu133plus2 array. > > Is there an attribute that I am missing that can be used? Not that I know of. Ensembl is pretty gene/transcript centric. If you want the start/stop coordinates (or if you just want to see where the probes map to), there are two ways that I can think of offhand. First, you could use the probe package to output the probe sequences in FASTA format and then upload to Blat at UCSC. You will then get where the probes map to, as well as the ability to look at the locations in the genome browser. This is the easier of the two, especially if I give you the code ;-D blatGene <- function(affyid, probe, filename){ ## affyid == Affy probeset ID ## probe == BioC probe package name ## filename == output file name require(probe, quietly = TRUE, character.only = TRUE) tmp <- data.frame(get(probe)) if(length(affyid) > 1){ seqnc <- vector() for(i in seq(along = affyid)) seqnc <- c(seqnc, tmp[tmp$Probe.Set.Name == affyid[i], 1]) }else{ seqnc <- tmp[tmp$Probe.Set.Name == affyid,1] } out <- vector() if(length(seqnc) > 25) warning("Blat will only return values for 25 or fewer sequences!", call. = FALSE) for(i in seq(along = seqnc)) out <- rbind(out, rbind(paste("> Probe", i, sep=""), seqnc[i])) write.table(out, filename, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE) } Second, you could take the probe sequences and use the Biostrings and BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the probes to the genome to get the start and stop coordinates. You could then use the rtracklayer package to put the locations on the genome browser. This will take more work, but will possibly pay dividends in the future if you need to do much mapping of things to the genome. I won't give you code for this, as the Biostrings package is in a state of rapid evolution and what I wrote 6 months ago may not work at all anymore. However, the GenomeSearching vignette (in the BSgenome package) contains the code I used as a template, and should be a good resource. Best, Jim > > Also just out of interest, does getBM have the 3 second break > automatically or not? or does it use one big query? I ve only been > using 1 test probe, and would like to make sure I am not going to get > myself and my university blocked. > > Many thanks in advance, > > > Nathan > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Hildebrandt Lab 8220D MSRB III 1150 W. Medical Center Drive Ann Arbor MI 48109-0646 734-936-8662

ADD COMMENT • link 15.7 years ago James W. MacDonald 65k

0

Entering edit mode

On Mon, Aug 11, 2008 at 9:26 AM, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > Hi Nathan, > > Nathan Harmston wrote: >> >> Hi, >> >> I have just found a few of the probes which I find to be significantly >> diff expression in one of my analysis, have no attached genomic >> location in the annotation package I am using. So I thought a way >> around this would be to query ensembl using biomaRt and retrieve >> probe_set locations from ensembl (since I trust Ensembl infinitely >> more than Affy), however in the listAttributes(ensembl) I have been >> unable to determine the attribute that I require to determine the >> start and end location and strand of the actual probe involved, >> however all I can get back are transcript locations and gene >> locations. >> >> For example probe: 1552422_at from the hgu133plus2 array. >> >> Is there an attribute that I am missing that can be used? > > Not that I know of. Ensembl is pretty gene/transcript centric. If you want > the start/stop coordinates (or if you just want to see where the probes map > to), there are two ways that I can think of offhand. > > First, you could use the probe package to output the probe sequences in > FASTA format and then upload to Blat at UCSC. You will then get where the > probes map to, as well as the ability to look at the locations in the genome > browser. This is the easier of the two, especially if I give you the code > ;-D > Second, you could take the probe sequences and use the Biostrings and > BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the > probes to the genome to get the start and stop coordinates. You could then > use the rtracklayer package to put the locations on the genome browser. This > will take more work, but will possibly pay dividends in the future if you > need to do much mapping of things to the genome. > > I won't give you code for this, as the Biostrings package is in a state of > rapid evolution and what I wrote 6 months ago may not work at all anymore. > However, the GenomeSearching vignette (in the BSgenome package) contains the > code I used as a template, and should be a good resource. There is a third method, as Ensembl DOES store the information about probe locations. They are not available via the Biomart interface, but they are available for querying directly from the "core" databases. Assuming you want to use the most recent human build, you can do this to get all probe mappings for all arrays: > library(RMySQL) Loading required package: DBI > con <- dbConnect('MySQL',db='homo_sapiens_core_47_36i',host='ensembl db.ensembl.org',user='anonymous') > dat <- dbGetQuery(con,"select op.*,of.*,oa.name from oligo_probe op join oligo_feature of on op.oligo_probe_id=of.oligo_probe_id join oligo_array oa on oa.oligo_array_id=op.oligo_array_id limit 10;") #used limit 10 here for speed. # you could also specify oa.name = 'HGU_U133A_2', or some other # array type if you want to limit things > dat oligo_probe_id oligo_array_id probeset name description 1 6234295 25 89898_at 224:361; <na> 2 6234619 24 40196_at 579:607; <na> 3 6234619 32 40196_at 579:607; <na> 4 6234817 28 Hs.137418.0.A1_3p_at 878:461; <na> 5 6234443 31 230040_at 58:459; <na> 6 6234443 37 230040_at 590:747; <na> 7 6234567 28 Hs.87134.0.A1_3p_at 210:371; <na> 8 6235060 28 Hs2.213065.1.S1_3p_s_at 376:253; <na> 9 6234839 25 74854_at 396:307; <na> 10 6234776 26 218630_at 551:581; <na> length oligo_feature_id seq_region_id seq_region_start seq_region_end 1 25 31760350 226060 1783255 1783279 2 25 31760351 226053 38000872 38000896 3 25 31760351 226053 38000872 38000896 4 25 31760352 226056 21841523 21841547 5 25 31760353 226036 75873860 75873884 6 25 31760353 226036 75873860 75873884 7 25 31760354 226034 162916270 162916294 8 25 31760355 226055 95978760 95978784 9 25 31760356 226043 134008271 134008295 10 25 31760357 226033 53638348 53638372 seq_region_strand mismatches oligo_probe_id analysis_id name 1 -1 0 6234295 5112 HG_U95E 2 1 0 6234619 5112 HG_U95A 3 1 0 6234619 5112 HG_U95Av2 4 -1 0 6234817 5112 U133_X3P 5 -1 0 6234443 5112 HG_U133B 6 -1 0 6234443 5112 HG_U133_Plus_2 7 1 0 6234567 5112 U133_X3P 8 -1 0 6235060 5112 U133_X3P 9 1 0 6234839 5112 HG_U95E 10 -1 0 6234776 5112 HG_U133A_2 Note that this will leave you with a mess, potentially, as not all of the probes from the same probe set necessarily map to the same location in the genome. Hope that helps. Sean

ADD REPLY • link 15.7 years ago Sean Davis 21k

0

Entering edit mode

Section 5 in the main vignette for the 'altcdfenvs' packages shows how to quickly check what probesets of interest are "really" made of. It uses Biostrings, and the code can be edited to suit custom needs. I have written quickly an example with the probeset mentioned (this is going through the refseq ID; doing a full-genome match, as James mentions it, could be interesting to check for possible cross-matches): library(hgu133plus2.db) get("1552422_at", hgu133plus2GENENAME) # [1] "chromosome 10 open reading frame 25" ## Let's see with refseq refseq_ids <- get("1552422_at", hgu133plus2REFSEQ) library(biomaRt) mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl") library(altcdfenvs) getSeq <- function(name) { seq <- getSequence(id = name, type = "refseq_dna", seqType = "cdna", mart = mart) targets <- seq$cdna if (is.null(targets)) return(character(0)) names(targets) <- paste(seq$refseq_dna, 1:nrow(seq), sep = "-") return(targets) } targets <- unlist(lapply(refseq_ids, getSeq)) library(hgu133plus2probe) ## taking code from altcdfenvs' matchAffyProbes() ## (only consider PM, this is a demo) probes <- subset(hgu133plus2probe, Probe.Set.Name == "1552422_at") stringset <- DNAStringSet(probes$sequence) pmdict <- PDict(stringset) targets <- as.list(targets) for (ii in seq(along = targets)) { if is.na(targets[[ii]])) { stop(paste("Target", ii, "is NA.")) } targets[[ii]] <- DNAString(targets[[ii]]) } m_pm <- vector("list", length = length(targets)) for (ii in seq(along = targets)) { md <- matchPDict(pmdict, targets[[ii]]) m_pm[[ii]] <- md } # cheap plot i <- 1 plot(0, 0, xlim=c(0, targets[[i]]@length), ylim=c(0, length(m_pm[[i]])), main=names(targets)[i], type="n") points(unlist(startIndex(m_pm[[i]])), seq(length(m_pm[[i]]))) > On Mon, Aug 11, 2008 at 9:26 AM, James W. MacDonald > <jmacdon at="" med.umich.edu=""> wrote: >> Hi Nathan, >> >> Nathan Harmston wrote: >>> >>> Hi, >>> >>> I have just found a few of the probes which I find to be significantly >>> diff expression in one of my analysis, have no attached genomic >>> location in the annotation package I am using. So I thought a way >>> around this would be to query ensembl using biomaRt and retrieve >>> probe_set locations from ensembl (since I trust Ensembl infinitely >>> more than Affy), however in the listAttributes(ensembl) I have been >>> unable to determine the attribute that I require to determine the >>> start and end location and strand of the actual probe involved, >>> however all I can get back are transcript locations and gene >>> locations. >>> >>> For example probe: 1552422_at from the hgu133plus2 array. >>> >>> Is there an attribute that I am missing that can be used? >> >> Not that I know of. Ensembl is pretty gene/transcript centric. If you >> want >> the start/stop coordinates (or if you just want to see where the probes >> map >> to), there are two ways that I can think of offhand. >> >> First, you could use the probe package to output the probe sequences in >> FASTA format and then upload to Blat at UCSC. You will then get where >> the >> probes map to, as well as the ability to look at the locations in the >> genome >> browser. This is the easier of the two, especially if I give you the >> code >> ;-D > >> Second, you could take the probe sequences and use the Biostrings and >> BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the >> probes to the genome to get the start and stop coordinates. You could >> then >> use the rtracklayer package to put the locations on the genome browser. >> This >> will take more work, but will possibly pay dividends in the future if >> you >> need to do much mapping of things to the genome. >> >> I won't give you code for this, as the Biostrings package is in a state >> of >> rapid evolution and what I wrote 6 months ago may not work at all >> anymore. >> However, the GenomeSearching vignette (in the BSgenome package) contains >> the >> code I used as a template, and should be a good resource. > > There is a third method, as Ensembl DOES store the information about > probe locations. They are not available via the Biomart interface, > but they are available for querying directly from the "core" > databases. Assuming you want to use the most recent human build, you > can do this to get all probe mappings for all arrays: > >> library(RMySQL) > Loading required package: DBI >> con <- >> dbConnect('MySQL',db='homo_sapiens_core_47_36i',host='ensembldb.ens embl.org',user='anonymous') >> dat <- dbGetQuery(con,"select op.*,of.*,oa.name from oligo_probe op join >> oligo_feature of on op.oligo_probe_id=of.oligo_probe_id join oligo_array >> oa on oa.oligo_array_id=op.oligo_array_id limit 10;") #used limit 10 >> here for speed. > # you could also specify oa.name = 'HGU_U133A_2', or some other > # array type if you want to limit things >> dat > oligo_probe_id oligo_array_id probeset name > description > 1 6234295 25 89898_at 224:361; > <na> > 2 6234619 24 40196_at 579:607; > <na> > 3 6234619 32 40196_at 579:607; > <na> > 4 6234817 28 Hs.137418.0.A1_3p_at 878:461; > <na> > 5 6234443 31 230040_at 58:459; > <na> > 6 6234443 37 230040_at 590:747; > <na> > 7 6234567 28 Hs.87134.0.A1_3p_at 210:371; > <na> > 8 6235060 28 Hs2.213065.1.S1_3p_s_at 376:253; > <na> > 9 6234839 25 74854_at 396:307; > <na> > 10 6234776 26 218630_at 551:581; > <na> > length oligo_feature_id seq_region_id seq_region_start seq_region_end > 1 25 31760350 226060 1783255 1783279 > 2 25 31760351 226053 38000872 38000896 > 3 25 31760351 226053 38000872 38000896 > 4 25 31760352 226056 21841523 21841547 > 5 25 31760353 226036 75873860 75873884 > 6 25 31760353 226036 75873860 75873884 > 7 25 31760354 226034 162916270 162916294 > 8 25 31760355 226055 95978760 95978784 > 9 25 31760356 226043 134008271 134008295 > 10 25 31760357 226033 53638348 53638372 > seq_region_strand mismatches oligo_probe_id analysis_id name > 1 -1 0 6234295 5112 HG_U95E > 2 1 0 6234619 5112 HG_U95A > 3 1 0 6234619 5112 HG_U95Av2 > 4 -1 0 6234817 5112 U133_X3P > 5 -1 0 6234443 5112 HG_U133B > 6 -1 0 6234443 5112 HG_U133_Plus_2 > 7 1 0 6234567 5112 U133_X3P > 8 -1 0 6235060 5112 U133_X3P > 9 1 0 6234839 5112 HG_U95E > 10 -1 0 6234776 5112 HG_U133A_2 > > Note that this will leave you with a mess, potentially, as not all of > the probes from the same probe set necessarily map to the same > location in the genome. > > Hope that helps. > > Sean > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 15.7 years ago lgautier@altern.org ▴ 950

Login before adding your answer.