biomaRt - batch query for chromosome location to gene identifier?
2
0
Entering edit mode
Kemal Akat ▴ 120
@kemal-akat-4351
Last seen 8.0 years ago
Hi all, I have a list of mapped sequence reads to hg18 for that I have the exact chromosomal location on NCBI build 36. Cluster ID Strand Chromosome Cluster_Begin Cluster_End slc754_chr2 + chr2 74295624 74295644 slc4695_chr1 - chr1 203949843 203949866 slc2213_chr8 - chr8 103464182 103464207 slc1866_chr17 - chr17 53272097 53272117 slc1642_chr12 - chr12 11150173 11150194 ... For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all? It is easy for a single entry: > geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36 > geneid entrezgene 1 10797 > However, so far I have failed to make a batch query out of it. I imported/created the following 1 column data frame with the localization formatted as necessary > tdp chromosomal_cocation slc754_chr2 2,74295624,74295644,1 slc4695_chr1 1,203949843,203949866,-1 slc2213_chr8 8,103464182,103464207,-1 slc1866_chr17 17,53272097,53272117,-1 slc1642_chr12 12,11150173,11150194,-1 ... I have two points where I failed: 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run: > geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54) I get 19733 gene ids; my dataset actually has only 161 locations. 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error. I tried splitting the localization infos into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...) and modified my query: > geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54) But this seems to combine the information in the different vectors as the result is over 20.000 entries. Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing archive!? Any help/idea appreciated! Thank you, Kemal Dr. med. Kemal Akat Postdoctoral Fellow Laboratory of RNA Molecular Biology The Rockefeller University 1230 York Avenue, Box #186 New York, NY 10065
biomaRt ASSIGN biomaRt ASSIGN • 2.1k views
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 22 hours ago
United States
Here is a solution "entirely in R" provided you have acquired the hg18 transcript database using GenomicFeatures makeTranscriptDbFromUCSC("hg18") and run saveFeatures on the result to yield file "hg18.txdb.sqlite": I put your data in space-delimited text, then transformed to GRanges preserving chromosome and strand, then find overlaps of your regions with transcripts for hg18. The names of these transcripts are given in the UCSC 'knownGene' vocabulary. After the session info, we use R 2.10 org.Hs.eg.db to resolve these to (I think) hg18 entrez ids (we only have hg19 mappings UCSCKG<->Entrez for current R, to the best of my knowledge, and I believe that even these are deprecated). > kemdat = read.table("kemdat.txt", sep=" ", h=TRUE,stringsAsFactors=FALSE) > library(GenomicRanges) > kem = with(kemdat, GRanges(ranges=IRanges(Cluster_Begin, Cluster_End), strand= + Rle(factor(Strand)), seqnames=Rle(factor(Chromosome)))) > kem GRanges with 5 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr2 [ 74295624, 74295644] + | [2] chr1 [203949843, 203949866] - | [3] chr8 [103464182, 103464207] - | [4] chr17 [ 53272097, 53272117] - | [5] chr12 [ 11150173, 11150194] - | seqlengths chr1 chr12 chr17 chr2 chr8 NA NA NA NA NA > library(GenomicFeatures) > hg18.txdb = loadFeatures("hg18.txdb.sqlite") > tx18 = transcripts(hg18.txdb) > kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name > dput(kg) c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1", "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2", "uc001qze.2", "uc001qzf.1", "uc001qzj.2") > sessionInfo() R version 2.13.0 Under development (unstable) (2010-10-29 r53474) Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.3.6 GenomicRanges_1.3.2 IRanges_1.9.6 loaded via a namespace (and not attached): [1] BSgenome_1.17.7 Biobase_2.10.0 Biostrings_2.17.47 DBI_0.2-5 [5] RCurl_1.4-3 RSQLite_0.9-2 XML_3.1-1 biomaRt_2.5.1 [9] rtracklayer_1.11.3 > Now use R 2.10.1 and org.Hs.eg.db 2.3.6 (current is 2.4.5 for devel, and the UCSCKG mapping is deprecated) > kgs = c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1", + "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2", + "uc001qze.2", "uc001qzf.1", "uc001qzj.2") > unlist(mget(kgs, revmap(org.Hs.egUCSCKG)) + ) uc002skj.1 uc002skk.1 uc010ffb.1 uc001hdb.2 uc003ykr.1 uc003yks.1 uc002ivc.1 "10797" "10797" "10797" "64710" "51366" "51366" "51649" uc009zhp.1 uc001qzb.2 uc001qzc.2 uc001qze.2 uc001qzf.1 uc001qzj.2 "11272" "5554" "5554" "5554" "5545" "5554" > sessionInfo() R version 2.10.1 RC (2009-12-10 r50697) i386-apple-darwin9.8.0 locale: [1] C attached base packages: [1] stats graphics grDevices datasets tools utils methods [8] base other attached packages: [1] org.Hs.eg.db_2.3.6 RSQLite_0.7-3 DBI_0.2-5 [4] AnnotationDbi_1.8.2 Biobase_2.6.1 weaver_1.11.1 [7] codetools_0.2-2 digest_0.4.1 On Thu, Nov 11, 2010 at 10:01 PM, Kemal Akat <kakat at="" mail.rockefeller.edu=""> wrote: > Hi all, > > I have a list of mapped sequence reads to hg18 for that I have the exact chromosomal location on NCBI build 36. > > Cluster ID ? ? ? ? ? ? ?Strand ?Chromosome ? ? ?Cluster_Begin ? Cluster_End > slc754_chr2 ? ? ? ? ? ? + ? ? ? ? ? ? ? chr2 ? ? ? ? ? ?74295624 ? ? ? ? ? ? ? ?74295644 > slc4695_chr1 ? ? ? ? ? ?- ? ? ? ? ? ? ? chr1 ? ? ? ? ? ? ? ? ? ?203949843 ? ? ? ? ? ? ? 203949866 > slc2213_chr8 ? ? ? ? ? ?- ? ? ? ? ? ? ? chr8 ? ? ? ? ? ?103464182 ? ? ? ? ? ? ? 103464207 > slc1866_chr17 ? - ? ? ? ? ? ? ? chr17 ? ? ? ? ? 53272097 ? ? ? ? ? ? ? ?53272117 > slc1642_chr12 ? - ? ? ? ? ? ? ? chr12 ? ? ? ? ? 11150173 ? ? ? ? ? ? ? ? ? ? ? ?11150194 > ... > > For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all? > > It is easy for a single entry: > >> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36 >> geneid > ?entrezgene > 1 ? ? ?10797 >> > > However, so far I have failed to make a batch query out of it. > > I imported/created the following 1 column data frame with the localization formatted as necessary > >> tdp > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? chromosomal_cocation > slc754_chr2 ? ? ? ? ? ? ?2,74295624,74295644,1 > slc4695_chr1 ? ?1,203949843,203949866,-1 > slc2213_chr8 ? ?8,103464182,103464207,-1 > slc1866_chr17 ? ? ? ? ? 17,53272097,53272117,-1 > slc1642_chr12 ? ? ? ? ? 12,11150173,11150194,-1 > ... > > I have two points where I failed: > > 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run: > >> geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54) > > I get 19733 gene ids; my dataset actually has only 161 locations. > > 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error. > I tried splitting the localization infos into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...) and modified my query: > >> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54) > > But this seems to combine the information in the different vectors as the result is over 20.000 entries. > > Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing archive!? > > Any help/idea appreciated! > > Thank you, > Kemal > > Dr. med. Kemal Akat > Postdoctoral Fellow > Laboratory of RNA Molecular Biology > The Rockefeller University > 1230 York Avenue, Box #186 > New York, NY 10065 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > ADD COMMENT 0 Entering edit mode On 11/12/2010 04:18 AM, Vincent Carey wrote: > tx18 = transcripts(hg18.txdb) >> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name Better to use the accessor matchMatrix(findOveralaps(kem, tx18)) Martin -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
0
Entering edit mode
Dear Vincent and Martin, thank you for your help and explanations. I will try your suggestions. Dear Wolfgang, sorry if the info I posted was incomplete. It was more a semantic explanation than a technical one. I realized the incorrect syntax, but that was just a typo (as I couldn't copy and paste back then). I'll try to be more precise in the future. For the sake of completeness here is the actual code I was running: 1) with one filter, referring to the column of the data frame > options(width = 800, max.print = 5E5) > library(biomaRt) > ensembl54 <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = "/biomart/martservice", archive = FALSE) > tdp <- read.delim("/Users/Kemal/Desktop/Projects/biomaRt/tdp.txt", row.names = 1) > genes <- getBM(attributes = "entrezgene", filters = c("chromosomal_region"), values = list(tdp$Chromosomal_Location), mart = ensembl54) > genes entrezgene 1 6964 2 651536 3 445347 4 3492 5 100133739 6 652494 ... 19731 55657 19732 267002 19733 692312 > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 > 2) with multiple filters and the localization info splitted into 4 separate vectors > options(width = 800, max.print = 5E5) # change display settings to allow larger data frames, modify as needed > library(biomaRt) # load the biomaRt package > #ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # to assign the variable ensembl (or else) to hg19, NCBI build 37 > ensembl54 <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = "/biomart/martservice", archive = FALSE) # to use the hg18, NCBI build 36 > > chromosome <-c(2, 1, 8, 17, 12, 10, 21, 2, 7, 16, 5, 4, 1, 19, 13, 13, 10, 7, 15, 7, 2, 12, 21, 20, 5, 11, 15, 12, 17, 3, 17, 14, 19, 13, 6, 14, 11, 13, 2, 20, 7, 10, 1, 16, "X", 22, 20, 1, 3, 8, 4, 1, 6, 15, 17, 4, 12, 7, 1, 14, 12, 17, 12, 6, 9, 22, "X", 7, 12, 10, 19, 5, 1, 8, 11, 8, 19, 7, 6, 5, 1, 6, 9, 19, 10, "X", 3, 5, 13, 17, 20, 3, 16, 5, 13, 12, 15, 19, 4, 16, 10, 8, 7, 7, 12, 6, 11, 21, 17, "X", 15, 10, 16, 15, 9, 5, 2, 6, 12, 5, 14, 14, 6, 6, 15, 4, 1, 9, 8, 1, 5, "X", 11, 2, 1, 19, 2, 2, 13, 1, 17, 13, "X", 13, 7, 11, 3, "X", 15, 17, 22, 11, 16, 19, 7, 2, 13, 9, 14, 12, 1) > > strand <- c(1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1) > > start <- c(74295624, 203949843, 103464182, 53272097, 11150173, 52616450, 29638098, 160316091, 26218430, 73763529, 118611423, 78176046, 39566219, 45430361, 34823356, 114106490, 85902288, 55407132, 97080019, 97322520, 241940312, 49774225, 43149920, 1298551, 71540977, 2967925, 75125438, 107566872, 32012029, 50130463, 24612590, 103222346, 49628998, 76795050,47700337, 61273382, 64288817, 99433818, 86857893, 60954640, 121562796, 102024852, 233341511, 46963400, 54115224, 17405158, 33706984, 233341179, 11574528, 117938795, 100202920, 191280320, 89641039, 91241813, 27712528, 72107017, 45043151, 115686015, 227526472, 59794222, 132193041, 2534782, 49500468, 64462329, 19040649, 34020658, 118261386, 5071002,47616321, 101146860, 41419808, 145858601, 9912662, 109329554, 47446160, 130922997, 42311362, 135263153, 64348945, 32179718, 224044250, 112127969, 37126090, 41697757, 7442786, 24005946, 52264406, 154175486, 76478566, 20845822, 30384091, 9561760, 55844512, 65510903, 73435394, 52920938, 42797538, 45273560, 174489559, 55535115, 847112, 26283582,79686310, 138640447, 751963, 124334398, 122435615, 29638867, 5277351, 46804967, 39560795, 89503999, 55844568, 42797166, 113346669, 139476285, 74236497, 43716256, 52725397, 137763505, 31647833, 101463626, 136622538, 76608878, 66287317, 174489713, 32145213, 114023332, 61632141, 160759702, 56575893, 73723822, 92851504, 74295579, 20108627, 51970080,86384720, 174692895, 76479389, 233342144, 19819930, 47962168, 70434946, 33344305, 135285934, 64288869, 143115273, 100555421, 40493944, 44486723, 37016836, 27475187, 14437226, 40184189, 104928588, 38862574, 96917747, 114023291, 89944103, 67952950, 223656250) > > end <- c(74295644, 203949866, 103464207, 53272117, 11150194, 52616474, 29638121, 160316115, 26218451, 73763550, 118611442, 78176077, 39566244, 45430401, 34823379, 114106513, 85902314, 55407154, 97080040, 97322541, 241940335, 49774248, 43149944, 1298573, 71540997, 2967950, 75125461, 107566904, 32012052, 50130483, 24612615, 103222367, 49629021, 76795071, 47700356, 61273403, 64288842, 99433838, 86857915, 60954662, 121562817, 102024873, 233341535, 46963421, 54115248, 17405178, 33707009, 233341202, 11574549, 117938820, 100202946, 191280339, 89641061, 91241835, 27712550, 72107038, 45043175, 115686058, 227526491, 59794245, 132193064, 2534802, 49500489, 64462352, 19040675, 34020682, 118261409, 5071024, 47616347, 101146884, 41419832, 145858626, 9912686, 109329575, 47446181, 130923017, 42311383, 135263177, 64348968, 32179741, 224044275, 112127989, 37126110, 41697783, 7442807, 24005968, 52264426, 154175508, 76478587, 20845844, 30384113, 9561782, 55844532, 65510922, 73435415, 52920963, 42797563, 45273580, 174489628, 55535135, 847134, 26283606, 79686336, 138640468, 751985, 124334422, 122435636, 29638895, 5277374, 46804991, 39560819, 89504022, 55844595, 42797186, 113346688, 139476306, 74236521, 43716279, 52725418, 137763531, 31647855, 101463647, 136622560, 76608900, 66287361, 174489736, 32145234, 114023358, 61632162, 160759726, 56575914, 73723844, 92851527, 74295600, 20108650, 51970105, 86384744, 174692915, 76479410, 233342163, 19819951, 47962190, 70434970, 33344326, 135285963, 64288896, 143115294, 100555442, 40493968, 44486745, 37016858, 27475207, 14437254, 40184209, 104928610, 38862598, 96917795, 114023316, 89944124, 67952970, 223656273) > > genes <- getBM(attributes = "entrezgene", filters = c("chromosome_name", "start", "end", "strand"), values = list(chromosome, start, end, strand), mart = ensembl54) > genes entrezgene 1 6964 2 651536 3 445347 4 3492 ... 19073 10806 19074 10000 19075 692312 > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 > Kind regards, Kemal Am 12.11.2010 um 09:16 schrieb Martin Morgan: > On 11/12/2010 04:18 AM, Vincent Carey wrote: >> tx18 = transcripts(hg18.txdb) >>> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name > > Better to use the accessor matchMatrix(findOveralaps(kem, tx18)) > > Martin > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793
0
Entering edit mode
library(biomaRt) # load the biomaRt package ensembl54 <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = "/biomart/martservice", archive = FALSE) # to use the hg18, NCBI build 36 chromosome <-c(2, 1, 8, 17, 12, 10, 21, 2, 7, 16, 5, 4, 1, 19, 13, 13, 10, 7, 15, 7, 2, 12, 21, 20, 5, 11, 15, 12, 17, 3, 17, 14, 19, 13, 6, 14, 11, 13, 2, 20, 7, 10, 1, 16, "X", 22, 20, 1, 3, 8, 4, 1, 6, 15, 17, 4, 12, 7, 1, 14, 12, 17, 12, 6, 9, 22, "X", 7, 12, 10, 19, 5, 1, 8, 11, 8, 19, 7, 6, 5, 1, 6, 9, 19, 10, "X", 3, 5, 13, 17, 20, 3, 16, 5, 13, 12, 15, 19, 4, 16, 10, 8, 7, 7, 12, 6, 11, 21, 17, "X", 15, 10, 16, 15, 9, 5, 2, 6, 12, 5, 14, 14, 6, 6, 15, 4, 1, 9, 8, 1, 5, "X", 11, 2, 1, 19, 2, 2, 13, 1, 17, 13, "X", 13, 7, 11, 3, "X", 15, 17, 22, 11, 16, 19, 7, 2, 13, 9, 14, 12, 1) strand <- c(1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1) start <- c(74295624, 203949843, 103464182, 53272097, 11150173, 52616450, 29638098, 160316091, 26218430, 73763529, 118611423, 78176046, 39566219, 45430361, 34823356, 114106490, 85902288, 55407132, 97080019, 97322520, 241940312, 49774225, 43149920, 1298551, 71540977, 2967925, 75125438, 107566872, 32012029, 50130463, 24612590, 103222346, 49628998, 76795050,47700337, 61273382, 64288817, 99433818, 86857893, 60954640, 121562796, 102024852, 233341511, 46963400, 54115224, 17405158, 33706984, 233341179, 11574528, 117938795, 100202920, 191280320, 89641039, 91241813, 27712528, 72107017, 45043151, 115686015, 227526472, 59794222, 132193041, 2534782, 49500468, 64462329, 19040649, 34020658, 118261386, 5071002,47616321, 101146860, 41419808, 145858601, 9912662, 109329554, 47446160, 130922997, 42311362, 135263153, 64348945, 32179718, 224044250, 112127969, 37126090, 41697757, 7442786, 24005946, 52264406, 154175486, 76478566, 20845822, 30384091, 9561760, 55844512, 65510903, 73435394, 52920938, 42797538, 45273560, 174489559, 55535115, 847112, 26283582,79686310, 138640447, 751963, 124334398, 122435615, 29638867, 5277351, 46804967, 39560795, 89503999, 55844568, 42797166, 113346669, 139476285, 74236497, 43716256, 52725397, 137763505, 31647833, 101463626, 136622538, 76608878, 66287317, 174489713, 32145213, 114023332, 61632141, 160759702, 56575893, 73723822, 92851504, 74295579, 20108627, 51970080,86384720, 174692895, 76479389, 233342144, 19819930, 47962168, 70434946, 33344305, 135285934, 64288869, 143115273, 100555421, 40493944, 44486723, 37016836, 27475187, 14437226, 40184189, 104928588, 38862574, 96917747, 114023291, 89944103, 67952950, 223656250) end <- c(74295644, 203949866, 103464207, 53272117, 11150194, 52616474, 29638121, 160316115, 26218451, 73763550, 118611442, 78176077, 39566244, 45430401, 34823379, 114106513, 85902314, 55407154, 97080040, 97322541, 241940335, 49774248, 43149944, 1298573, 71540997, 2967950, 75125461, 107566904, 32012052, 50130483, 24612615, 103222367, 49629021, 76795071, 47700356, 61273403, 64288842, 99433838, 86857915, 60954662, 121562817, 102024873, 233341535, 46963421, 54115248, 17405178, 33707009, 233341202, 11574549, 117938820, 100202946, 191280339, 89641061, 91241835, 27712550, 72107038, 45043175, 115686058, 227526491, 59794245, 132193064, 2534802, 49500489, 64462352, 19040675, 34020682, 118261409, 5071024, 47616347, 101146884, 41419832, 145858626, 9912686, 109329575, 47446181, 130923017, 42311383, 135263177, 64348968, 32179741, 224044275, 112127989, 37126110, 41697783, 7442807, 24005968, 52264426, 154175508, 76478587, 20845844, 30384113, 9561782, 55844532, 65510922, 73435415, 52920963, 42797563, 45273580, 174489628, 55535135, 847134, 26283606, 79686336, 138640468, 751985, 124334422, 122435636, 29638895, 5277374, 46804991, 39560819, 89504022, 55844595, 42797186, 113346688, 139476306, 74236521, 43716279, 52725418, 137763531, 31647855, 101463647, 136622560, 76608900, 66287361, 174489736, 32145234, 114023358, 61632162, 160759726, 56575914, 73723844, 92851527, 74295600, 20108650, 51970105, 86384744, 174692915, 76479410, 233342163, 19819951, 47962190, 70434970, 33344326, 135285963, 64288896, 143115294, 100555442, 40493968, 44486745, 37016858, 27475207, 14437254, 40184209, 104928610, 38862598, 96917795, 114023316, 89944124, 67952970, 223656273) sel = 1:3 ## try different choices - 1, 2, 3, many... genes <- getBM(attributes = c("entrezgene","chromosome_name", "start_position", "end_position", "strand"), filters = c("chromosome_name", "start", "end", "strand"), values = list(chromosome[sel], start[sel], end[sel], strand[sel]), mart = ensembl54)
0
Entering edit mode
Hi Kemal, The combo filter chromosome_name, start and end position is interpreted by the Ensembl BioMart as 'give everything on this chromosome between the given start and end position'. However this combo filter is an exception to other filters, that it doesn't know what to do when you give it a list of many chromosomes and many positions. This is not well annotated in the vignette but the current version of BioMart doesn't understand multiple locations given at once. It is better in this case to query for all genes and there positions and then select the ones you're interested in R: genes <- getBM(attributes = c("entrezgene","chromosome_name", "start", "end", "strand"), mart = ensembl54) By not specifying any filter you get all positions. I will update the vignette to reflect this info. Cheers, Steffen On Sun, Nov 14, 2010 at 11:48 AM, Wolfgang Huber <whuber@embl.de> wrote: > Dear Kemal, > > thank you for your explanation - I understand the 'conceptual' nature of > your post. Just, if someone that wants to help would like to start from what > you have tried and improve on it or make suggestions on it, it's more > efficient if the starting points agree as much as possible. > > I made some experiments with your query to the Ensembl BioMart (see > attached, try different choices for 'sel'), and, like you, also could not > get meaningful results out if. 'getBM' returns a huge results dataframe > whose origin or rationale I fail to understand. Maybe one of the Ensembl / > BioMart experts can say something? > > I agree with Vince & Martin, however, that the GenomicRanges-based approach > suggested by Vince is much more appropriate here, for many reasons, > including flexibility and speed. > > Best wishes > Wolfgang > > > Il Nov/13/10 1:19 AM, Kemal Akat ha scritto: > > Dear Vincent and Martin, >> >> thank you for your help and explanations. I will try your suggestions. >> >> Dear Wolfgang, >> >> sorry if the info I posted was incomplete. It was more a semantic >> explanation than a technical one. I realized the incorrect syntax, but that >> was just a typo (as I couldn't copy and paste back then). I'll try to be >> more precise in the future. >> >> For the sake of completeness here is the actual code I was running: >> >> 1) with one filter, referring to the column of the data frame >> >> options(width = 800, max.print = 5E5) >>> library(biomaRt) >>> ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = >>> "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = >>> "/biomart/martservice", archive = FALSE) >>> tdp<- read.delim("/Users/Kemal/Desktop/Projects/biomaRt/tdp.txt", >>> row.names = 1) >>> genes<- getBM(attributes = "entrezgene", filters = >>> c("chromosomal_region"), values = list(tdp$Chromosomal_Location), mart = >>> ensembl54) >>> genes >>> >> entrezgene >> 1 6964 >> 2 651536 >> 3 445347 >> 4 3492 >> 5 100133739 >> 6 652494 >> ... >> 19731 55657 >> 19732 267002 >> 19733 692312 >> >>> sessionInfo() >>> >> R version 2.12.0 (2010-10-15) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] C/en_US.UTF-8/C/C/C/C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] biomaRt_2.6.0 >> >> loaded via a namespace (and not attached): >> [1] RCurl_1.4-3 XML_3.2-0 >> >>> >>> >> 2) with multiple filters and the localization info splitted into 4 >> separate vectors >> >> options(width = 800, max.print = 5E5) # change display settings to allow >>> larger data frames, modify as needed >>> library(biomaRt) # load the biomaRt package >>> #ensembl<- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # to >>> assign the variable ensembl (or else) to hg19, NCBI build 37 >>> ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = >>> "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = >>> "/biomart/martservice", archive = FALSE) # to use the hg18, NCBI build 36 >>> >>> chromosome<-c(2, 1, 8, 17, 12, 10, 21, 2, 7, 16, 5, 4, 1, 19, 13, 13, 10, >>> 7, 15, 7, 2, 12, 21, 20, 5, 11, 15, 12, 17, 3, 17, 14, 19, 13, 6, 14, 11, >>> 13, 2, 20, 7, 10, 1, 16, "X", 22, 20, 1, 3, 8, 4, 1, 6, 15, 17, 4, 12, 7, 1, >>> 14, 12, 17, 12, 6, 9, 22, "X", 7, 12, 10, 19, 5, 1, 8, 11, 8, 19, 7, 6, 5, >>> 1, 6, 9, 19, 10, "X", 3, 5, 13, 17, 20, 3, 16, 5, 13, 12, 15, 19, 4, 16, 10, >>> 8, 7, 7, 12, 6, 11, 21, 17, "X", 15, 10, 16, 15, 9, 5, 2, 6, 12, 5, 14, 14, >>> 6, 6, 15, 4, 1, 9, 8, 1, 5, "X", 11, 2, 1, 19, 2, 2, 13, 1, 17, 13, "X", 13, >>> 7, 11, 3, "X", 15, 17, 22, 11, 16, 19, 7, 2, 13, 9, 14, 12, 1) >>> >>> strand<- c(1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, >>> -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, >>> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, >>> 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, >>> -1, -1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, >>> 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, >>> -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, >>> -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1) >>> >>> start<- c(74295624, 203949843, 103464182, 53272097, 11150173, 52616450, >>> 29638098, 160316091, 26218430, 73763529, 118611423, 78176046, 39566219, >>> 45430361, 34823356, 114106490, 85902288, 55407132, 97080019, 97322520, >>> 241940312, 49774225, 43149920, 1298551, 71540977, 2967925, 75125438, >>> 107566872, 32012029, 50130463, 24612590, 103222346, 49628998, >>> 76795050,47700337, 61273382, 64288817, 99433818, 86857893, 60954640, >>> 121562796, 102024852, 233341511, 46963400, 54115224, 17405158, 33706984, >>> 233341179, 11574528, 117938795, 100202920, 191280320, 89641039, 91241813, >>> 27712528, 72107017, 45043151, 115686015, 227526472, 59794222, 132193041, >>> 2534782, 49500468, 64462329, 19040649, 34020658, 118261386, >>> 5071002,47616321, 101146860, 41419808, 145858601, 9912662, 109329554, >>> 47446160, 130922997, 42311362, 135263153, 64348945, 32179718, 224044250, >>> 112127969, 37126090, 41697757, 7442786, 24005946, 52264406, 154175486, >>> 76478566, 20845822, 30384091, 9561760, 55844512, 65510903, 73435394, >>> 52920938! >>> >> > , 42797538, 45273560, 174489559, 55535115, 847112, 26283582,79686310, >> 138640447, 751963, 124334398, 122435615, 29638867, 5277351, 46804967, >> 39560795, 89503999, 55844568, 42797166, 113346669, 139476285, 74236497, >> 43716256, 52725397, 137763505, 31647833, 101463626, 136622538, 76608878, >> 66287317, 174489713, 32145213, 114023332, 61632141, 160759702, 56575893, >> 73723822, 92851504, 74295579, 20108627, 51970080,86384720, 174692895, >> 76479389, 233342144, 19819930, 47962168, 70434946, 33344305, 135285934, >> 64288869, 143115273, 100555421, 40493944, 44486723, 37016836, 27475187, >> 14437226, 40184189, 104928588, 38862574, 96917747, 114023291, 89944103, >> 67952950, 223656250) >> >>> >>> end<- c(74295644, 203949866, 103464207, 53272117, 11150194, 52616474, >>> 29638121, 160316115, 26218451, 73763550, 118611442, 78176077, 39566244, >>> 45430401, 34823379, 114106513, 85902314, 55407154, 97080040, 97322541, >>> 241940335, 49774248, 43149944, 1298573, 71540997, 2967950, 75125461, >>> 107566904, 32012052, 50130483, 24612615, 103222367, 49629021, 76795071, >>> 47700356, 61273403, 64288842, 99433838, 86857915, 60954662, 121562817, >>> 102024873, 233341535, 46963421, 54115248, 17405178, 33707009, 233341202, >>> 11574549, 117938820, 100202946, 191280339, 89641061, 91241835, 27712550, >>> 72107038, 45043175, 115686058, 227526491, 59794245, 132193064, 2534802, >>> 49500489, 64462352, 19040675, 34020682, 118261409, 5071024, 47616347, >>> 101146884, 41419832, 145858626, 9912686, 109329575, 47446181, 130923017, >>> 42311383, 135263177, 64348968, 32179741, 224044275, 112127989, 37126110, >>> 41697783, 7442807, 24005968, 52264426, 154175508, 76478587, 20845844, >>> 30384113, 9561782, 55844532, 65510922, 73435415, 52920963! >>> >> > , 42797563, 45273580, 174489628, 55535135, 847134, 26283606, 79686336, >> 138640468, 751985, 124334422, 122435636, 29638895, 5277374, 46804991, >> 39560819, 89504022, 55844595, 42797186, 113346688, 139476306, 74236521, >> 43716279, 52725418, 137763531, 31647855, 101463647, 136622560, 76608900, >> 66287361, 174489736, 32145234, 114023358, 61632162, 160759726, 56575914, >> 73723844, 92851527, 74295600, 20108650, 51970105, 86384744, 174692915, >> 76479410, 233342163, 19819951, 47962190, 70434970, 33344326, 135285963, >> 64288896, 143115294, 100555442, 40493968, 44486745, 37016858, 27475207, >> 14437254, 40184209, 104928610, 38862598, 96917795, 114023316, 89944124, >> 67952970, 223656273) >> >>> >>> genes<- getBM(attributes = "entrezgene", filters = c("chromosome_name", >>> "start", "end", "strand"), values = list(chromosome, start, end, strand), >>> mart = ensembl54) >>> genes >>> >> entrezgene >> 1 6964 >> 2 651536 >> 3 445347 >> 4 3492 >> ... >> 19073 10806 >> 19074 10000 >> 19075 692312 >> >>> sessionInfo() >>> >> R version 2.12.0 (2010-10-15) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] C/en_US.UTF-8/C/C/C/C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] biomaRt_2.6.0 >> >> loaded via a namespace (and not attached): >> [1] RCurl_1.4-3 XML_3.2-0 >> >>> >>> >> Kind regards, >> Kemal >> >> >> Am 12.11.2010 um 09:16 schrieb Martin Morgan: >> >> On 11/12/2010 04:18 AM, Vincent Carey wrote: >>> >>>> tx18 = transcripts(hg18.txdb) >>>> >>>>> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name >>>>> >>>> >>> Better to use the accessor matchMatrix(findOveralaps(kem, tx18)) >>> >>> Martin >>> >>> -- >>> Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 >>> Telephone: 206 667-2793 >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
0
Entering edit mode
Hi Steffen and Wolfgang, thanks for the update and info - and all your efforts providing help with R/Bioconductor on this list. Best regards, Kemal Am 14.11.2010 um 19:50 schrieb Steffen Durinck: Hi Kemal, The combo filter chromosome_name, start and end position is interpreted by the Ensembl BioMart as 'give everything on this chromosome between the given start and end position'. However this combo filter is an exception to other filters, that it doesn't know what to do when you give it a list of many chromosomes and many positions. This is not well annotated in the vignette but the current version of BioMart doesn't understand multiple locations given at once. It is better in this case to query for all genes and there positions and then select the ones you're interested in R: genes <- getBM(attributes = c("entrezgene","chromosome_name", "start", "end", "strand"), mart = ensembl54) By not specifying any filter you get all positions. I will update the vignette to reflect this info. Cheers, Steffen On Sun, Nov 14, 2010 at 11:48 AM, Wolfgang Huber <whuber@embl.de<mailto:whuber@embl.de>> wrote: Dear Kemal, thank you for your explanation - I understand the 'conceptual' nature of your post. Just, if someone that wants to help would like to start from what you have tried and improve on it or make suggestions on it, it's more efficient if the starting points agree as much as possible. I made some experiments with your query to the Ensembl BioMart (see attached, try different choices for 'sel'), and, like you, also could not get meaningful results out if. 'getBM' returns a huge results dataframe whose origin or rationale I fail to understand. Maybe one of the Ensembl / BioMart experts can say something? I agree with Vince & Martin, however, that the GenomicRanges-based approach suggested by Vince is much more appropriate here, for many reasons, including flexibility and speed. Best wishes Wolfgang Il Nov/13/10 1:19 AM, Kemal Akat ha scritto: Dear Vincent and Martin, thank you for your help and explanations. I will try your suggestions. Dear Wolfgang, sorry if the info I posted was incomplete. It was more a semantic explanation than a technical one. I realized the incorrect syntax, but that was just a typo (as I couldn't copy and paste back then). I'll try to be more precise in the future. For the sake of completeness here is the actual code I was running: 1) with one filter, referring to the column of the data frame options(width = 800, max.print = 5E5) library(biomaRt) ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org<http: may2009.archive.ensembl.org=""/>", path = "/biomart/martservice", archive = FALSE) tdp<- read.delim("/Users/Kemal/Desktop/Projects/biomaRt/tdp.txt", row.names = 1) genes<- getBM(attributes = "entrezgene", filters = c("chromosomal_region"), values = list(tdp$Chromosomal_Location), mart = ensembl54) genes entrezgene 1 6964 2 651536 3 445347 4 3492 5 100133739 6 652494 ... 19731 55657 19732 267002 19733 692312 sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 2) with multiple filters and the localization info splitted into 4 separate vectors options(width = 800, max.print = 5E5) # change display settings to allow larger data frames, modify as needed library(biomaRt) # load the biomaRt package #ensembl<- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # to assign the variable ensembl (or else) to hg19, NCBI build 37 ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org<http: may2009.archive.ensembl.org=""/>", path = "/biomart/martservice", archive = FALSE) # to use the hg18, NCBI build 36 chromosome<-c(2, 1, 8, 17, 12, 10, 21, 2, 7, 16, 5, 4, 1, 19, 13, 13, 10, 7, 15, 7, 2, 12, 21, 20, 5, 11, 15, 12, 17, 3, 17, 14, 19, 13, 6, 14, 11, 13, 2, 20, 7, 10, 1, 16, "X", 22, 20, 1, 3, 8, 4, 1, 6, 15, 17, 4, 12, 7, 1, 14, 12, 17, 12, 6, 9, 22, "X", 7, 12, 10, 19, 5, 1, 8, 11, 8, 19, 7, 6, 5, 1, 6, 9, 19, 10, "X", 3, 5, 13, 17, 20, 3, 16, 5, 13, 12, 15, 19, 4, 16, 10, 8, 7, 7, 12, 6, 11, 21, 17, "X", 15, 10, 16, 15, 9, 5, 2, 6, 12, 5, 14, 14, 6, 6, 15, 4, 1, 9, 8, 1, 5, "X", 11, 2, 1, 19, 2, 2, 13, 1, 17, 13, "X", 13, 7, 11, 3, "X", 15, 17, 22, 11, 16, 19, 7, 2, 13, 9, 14, 12, 1) strand<- c(1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1) start<- c(74295624, 203949843, 103464182, 53272097, 11150173, 52616450, 29638098, 160316091, 26218430, 73763529, 118611423, 78176046, 39566219, 45430361, 34823356, 114106490, 85902288, 55407132, 97080019, 97322520, 241940312, 49774225, 43149920, 1298551, 71540977, 2967925, 75125438, 107566872, 32012029, 50130463, 24612590, 103222346, 49628998, 76795050,47700337, 61273382, 64288817, 99433818, 86857893, 60954640, 121562796, 102024852, 233341511, 46963400, 54115224, 17405158, 33706984, 233341179, 11574528, 117938795, 100202920, 191280320, 89641039, 91241813, 27712528, 72107017, 45043151, 115686015, 227526472, 59794222, 132193041, 2534782, 49500468, 64462329, 19040649, 34020658, 118261386, 5071002,47616321, 101146860, 41419808, 145858601, 9912662, 109329554, 47446160, 130922997, 42311362, 135263153, 64348945, 32179718, 224044250, 112127969, 37126090, 41697757, 7442786, 24005946, 52264406, 154175486, 76478566, 20845822, 30384091, 9561760, 55844512, 65510903, 73435394, 52920938! , 42797538, 45273560, 174489559, 55535115, 847112, 26283582,79686310, 138640447, 751963, 124334398, 122435615, 29638867, 5277351, 46804967, 39560795, 89503999, 55844568, 42797166, 113346669, 139476285, 74236497, 43716256, 52725397, 137763505, 31647833, 101463626, 136622538, 76608878, 66287317, 174489713, 32145213, 114023332, 61632141, 160759702, 56575893, 73723822, 92851504, 74295579, 20108627, 51970080,86384720, 174692895, 76479389, 233342144, 19819930, 47962168, 70434946, 33344305, 135285934, 64288869, 143115273, 100555421, 40493944, 44486723, 37016836, 27475187, 14437226, 40184189, 104928588, 38862574, 96917747, 114023291, 89944103, 67952950, 223656250) end<- c(74295644, 203949866, 103464207, 53272117, 11150194, 52616474, 29638121, 160316115, 26218451, 73763550, 118611442, 78176077, 39566244, 45430401, 34823379, 114106513, 85902314, 55407154, 97080040, 97322541, 241940335, 49774248, 43149944, 1298573, 71540997, 2967950, 75125461, 107566904, 32012052, 50130483, 24612615, 103222367, 49629021, 76795071, 47700356, 61273403, 64288842, 99433838, 86857915, 60954662, 121562817, 102024873, 233341535, 46963421, 54115248, 17405178, 33707009, 233341202, 11574549, 117938820, 100202946, 191280339, 89641061, 91241835, 27712550, 72107038, 45043175, 115686058, 227526491, 59794245, 132193064, 2534802, 49500489, 64462352, 19040675, 34020682, 118261409, 5071024, 47616347, 101146884, 41419832, 145858626, 9912686, 109329575, 47446181, 130923017, 42311383, 135263177, 64348968, 32179741, 224044275, 112127989, 37126110, 41697783, 7442807, 24005968, 52264426, 154175508, 76478587, 20845844, 30384113, 9561782, 55844532, 65510922, 73435415, 52920963! , 42797563, 45273580, 174489628, 55535135, 847134, 26283606, 79686336, 138640468, 751985, 124334422, 122435636, 29638895, 5277374, 46804991, 39560819, 89504022, 55844595, 42797186, 113346688, 139476306, 74236521, 43716279, 52725418, 137763531, 31647855, 101463647, 136622560, 76608900, 66287361, 174489736, 32145234, 114023358, 61632162, 160759726, 56575914, 73723844, 92851527, 74295600, 20108650, 51970105, 86384744, 174692915, 76479410, 233342163, 19819951, 47962190, 70434970, 33344326, 135285963, 64288896, 143115294, 100555442, 40493968, 44486745, 37016858, 27475207, 14437254, 40184209, 104928610, 38862598, 96917795, 114023316, 89944124, 67952970, 223656273) genes<- getBM(attributes = "entrezgene", filters = c("chromosome_name", "start", "end", "strand"), values = list(chromosome, start, end, strand), mart = ensembl54) genes entrezgene 1 6964 2 651536 3 445347 4 3492 ... 19073 10806 19074 10000 19075 692312 sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 Kind regards, Kemal Am 12.11.2010 um 09:16 schrieb Martin Morgan: On 11/12/2010 04:18 AM, Vincent Carey wrote: tx18 = transcripts(hg18.txdb) kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name Better to use the accessor matchMatrix(findOveralaps(kem, tx18)) Martin -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
0
Entering edit mode
Dear Vincent, your solution works (almost) perfectly (see below ;-) ). In fact, I don't need the step using R2.10.1, because with the UCSC IDs from your solution, I can retrieve the Entrez gene IDs via biomaRt, or do my analysis with the other identifier. Thank you again! Dear Martin, I am not quite sure, about the advantage of your suggestion? Besides of the cleaner syntax (and probably higher efficiency?) the output looks "promising" as it seems, that the it tells me how all of my "query" ranges (= locations) are assigned to a "subject" range in a 1:n way. That would be very nice, but I don't get the content of the "subject" column? I should be a UCSC ID, but I couldn't figure out which (obviously, not the transcript ID)? > mygene.overlap = as.matrix(findOverlaps(mygene.granges,tx18)) > mygene.overlap query subject [1,] 1 6242 [2,] 2 821 [3,] 2 822 [4,] 2 823 ...... [429,] 160 64665 [430,] 160 64666 [431,] 161 64793 [432,] 161 64794 Related to the question above (and the only point that could be "smoothened" ;-) ): is there a way to have my cluster names returned in the results, too, so that I could at a first glimpse see which cluster got which annotation? I tried two positions for my cluster names in the GRanges object: 1) hijacked the metadata column, 2) discovered the name option. But I did not find a possibility to return them side by side with the results in the GenomicFeatures/Ranges, and IRanges vignettes. Best regards, Kemal Structure of data table "data.table" imported into R (I truncated it and that of the output below!): clusterID chromosome strand cluster_begin cluster_end seq1_chr1 chr1 - 103949843 103949866 seq2_chr1 chr1 + 19566219 19566244 seq3_chr1 chr1 - 223341511 223341535 seq4_chr1 chr1 - 3341179 3341202 ... seq161_chr20 chr20 + 2399999 2400010 R session (truncated to save space) > library("GenomicFeatures") Loading required package: IRanges Attaching package: 'IRanges' The following object(s) are masked from 'package:base': cbind, eval, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int, rbind, rep.int, table Loading required package: GenomicRanges > library("biomaRt") > hg18.txdb = loadFeatures("path/to/transcripts/hg18.txdb.sqlite") > tx18 = transcripts(hg18.txdb) > mygene.import = read.table("/path/to/my/table/data.table.txt", sep = "", h = TRUE, stringsAsFactors = FALSE) > mygene.granges = with(mygene.import, GRanges(ranges = IRanges(cluster_begin, cluster_end), strand = Rle(factor(strand)), seqnames = Rle(factor(chromosome)), elementMetadata = clusterID)) > mygene.granges GRanges with 161 ranges and 1 elementMetadata value seqnames ranges strand | elementMetadata <rle> <iranges> <rle> | <character> [1] chr1 [103949843, 103949866] - | seq1_chr1 [2] chr1 [ 19566219, 19566244] + | seq2_chr1 [3] chr1 [223341511, 233341535] - | seq3_chr1 [4] chr1 [3341179, 3341202] - | seq4_chr1 ... ... ... ... ... ... [161] chr20 [2399999, 2400010] + | seq161_chr20 seqlengths chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA >mygene.granges2 = with(mygene.import, GRanges(ranges = IRanges(cluster_begin, cluster_end, names = clusterID), strand = Rle(factor(strand)), seqnames = Rle(factor(chromosome)))) > mygene.granges2 GRanges with 161 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | seq1_chr1 chr1 [103949843, 103949866] - | seq2_chr1 chr1 [ 19566219, 19566244] + | seq3_chr1 chr1 [223341511, 233341535] - | seq4_chr1 chr1 [3341179, 3341202] - | ... ... ... ... ... seq161_chr20 chr20 [2399999, 2400010] + | seqlengths chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 GenomicFeatures_1.2.1 GenomicRanges_1.2.1 IRanges_1.8.2 loaded via a namespace (and not attached): [1] Biobase_2.10.0 Biostrings_2.18.0 BSgenome_1.18.1 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 rtracklayer_1.10.5 XML_3.2-0 > On Nov 12, 2010, at 7:18 AM, Vincent Carey wrote: > Here is a solution "entirely in R" provided you have acquired the > hg18 transcript database using GenomicFeatures makeTranscriptDbFromUCSC("hg18") > and run saveFeatures on the result to yield file "hg18.txdb.sqlite": > > I put your data in space-delimited text, then transformed to GRanges preserving > chromosome and strand, then find overlaps of your regions with transcripts > for hg18. The names of these transcripts are given in the UCSC 'knownGene' > vocabulary. After the session info, we use R 2.10 org.Hs.eg.db to resolve these > to (I think) hg18 entrez ids (we only have hg19 mappings UCSCKG<->Entrez for > current R, to the best of my knowledge, and I believe that even these > are deprecated). > >> kemdat = read.table("kemdat.txt", sep=" ", h=TRUE,stringsAsFactors=FALSE) >> library(GenomicRanges) >> kem = with(kemdat, GRanges(ranges=IRanges(Cluster_Begin, Cluster_End), strand= > + Rle(factor(Strand)), seqnames=Rle(factor(Chromosome)))) >> kem > GRanges with 5 ranges and 0 elementMetadata values > seqnames ranges strand | > <rle> <iranges> <rle> | > [1] chr2 [ 74295624, 74295644] + | > [2] chr1 [203949843, 203949866] - | > [3] chr8 [103464182, 103464207] - | > [4] chr17 [ 53272097, 53272117] - | > [5] chr12 [ 11150173, 11150194] - | > > seqlengths > chr1 chr12 chr17 chr2 chr8 > NA NA NA NA NA >> library(GenomicFeatures) >> hg18.txdb = loadFeatures("hg18.txdb.sqlite") >> tx18 = transcripts(hg18.txdb) >> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name >> dput(kg) > c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1", > "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2", > "uc001qze.2", "uc001qzf.1", "uc001qzj.2") >> sessionInfo() > R version 2.13.0 Under development (unstable) (2010-10-29 r53474) > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.3.6 GenomicRanges_1.3.2 IRanges_1.9.6 > > loaded via a namespace (and not attached): > [1] BSgenome_1.17.7 Biobase_2.10.0 Biostrings_2.17.47 DBI_0.2-5 > [5] RCurl_1.4-3 RSQLite_0.9-2 XML_3.1-1 biomaRt_2.5.1 > [9] rtracklayer_1.11.3 >> > > Now use R 2.10.1 and org.Hs.eg.db 2.3.6 (current is 2.4.5 for devel, > and the UCSCKG > mapping is deprecated) > >> kgs = c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1", > + "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2", > + "uc001qze.2", "uc001qzf.1", "uc001qzj.2") >> unlist(mget(kgs, revmap(org.Hs.egUCSCKG)) > + ) > uc002skj.1 uc002skk.1 uc010ffb.1 uc001hdb.2 uc003ykr.1 uc003yks.1 uc002ivc.1 > "10797" "10797" "10797" "64710" "51366" "51366" "51649" > uc009zhp.1 uc001qzb.2 uc001qzc.2 uc001qze.2 uc001qzf.1 uc001qzj.2 > "11272" "5554" "5554" "5554" "5545" "5554" >> sessionInfo() > R version 2.10.1 RC (2009-12-10 r50697) > i386-apple-darwin9.8.0 > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices datasets tools utils methods > [8] base > > other attached packages: > [1] org.Hs.eg.db_2.3.6 RSQLite_0.7-3 DBI_0.2-5 > [4] AnnotationDbi_1.8.2 Biobase_2.6.1 weaver_1.11.1 > [7] codetools_0.2-2 digest_0.4.1 > > > > On Thu, Nov 11, 2010 at 10:01 PM, Kemal Akat <kakat at="" mail.rockefeller.edu=""> wrote: >> Hi all, >> >> I have a list of mapped sequence reads to hg18 for that I have the exact chromosomal location on NCBI build 36. >> >> Cluster ID Strand Chromosome Cluster_Begin Cluster_End >> slc754_chr2 + chr2 74295624 74295644 >> slc4695_chr1 - chr1 203949843 203949866 >> slc2213_chr8 - chr8 103464182 103464207 >> slc1866_chr17 - chr17 53272097 53272117 >> slc1642_chr12 - chr12 11150173 11150194 >> ... >> >> For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all? >> >> It is easy for a single entry: >> >>> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36 >>> geneid >> entrezgene >> 1 10797 >>> >> >> However, so far I have failed to make a batch query out of it. >> >> I imported/created the following 1 column data frame with the localization formatted as necessary >> >>> tdp >> chromosomal_cocation >> slc754_chr2 2,74295624,74295644,1 >> slc4695_chr1 1,203949843,203949866,-1 >> slc2213_chr8 8,103464182,103464207,-1 >> slc1866_chr17 17,53272097,53272117,-1 >> slc1642_chr12 12,11150173,11150194,-1 >> ... >> >> I have two points where I failed: >> >> 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run: >> >>> geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54) >> >> I get 19733 gene ids; my dataset actually has only 161 locations. >> >> 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error. >> I tried splitting the localization infos into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...) and modified my query: >> >>> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54) >> >> But this seems to combine the information in the different vectors as the result is over 20.000 entries. >> >> Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing archive!? >> >> Any help/idea appreciated! >> >> Thank you, >> Kemal >> >> Dr. med. Kemal Akat >> Postdoctoral Fellow >> Laboratory of RNA Molecular Biology >> The Rockefeller University >> 1230 York Avenue, Box #186 >> New York, NY 10065 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> ADD REPLY 0 Entering edit mode @wolfgang-huber-3550 Last seen 9 days ago EMBL European Molecular Biology Laborat… Kemal, thanks for the feedback! Would you mind being so helpful and provide the actual code that you tried? The line that you claim to have tried is syntactically incorrect (or incomplete) in R, so it would never have run, and the definition of the object 'ensembl54' is missing. Also, please think of including the output of 'sessionInfo()'. Best wishes Wolfgang Kemal Akat scripsit 12/11/10 04:01: > Hi all, > > I have a list of mapped sequence reads to hg18 for that I have the exact chromosomal location on NCBI build 36. > > Cluster ID Strand Chromosome Cluster_Begin Cluster_End > slc754_chr2 + chr2 74295624 74295644 > slc4695_chr1 - chr1 203949843 203949866 > slc2213_chr8 - chr8 103464182 103464207 > slc1866_chr17 - chr17 53272097 53272117 > slc1642_chr12 - chr12 11150173 11150194 > ... > > For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all? > > It is easy for a single entry: > >> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36 >> geneid > entrezgene > 1 10797 >> > > However, so far I have failed to make a batch query out of it. > > I imported/created the following 1 column data frame with the localization formatted as necessary > >> tdp > chromosomal_cocation > slc754_chr2 2,74295624,74295644,1 > slc4695_chr1 1,203949843,203949866,-1 > slc2213_chr8 8,103464182,103464207,-1 > slc1866_chr17 17,53272097,53272117,-1 > slc1642_chr12 12,11150173,11150194,-1 > ... > > I have two points where I failed: > > 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run: > >> geneid<-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54) > > I get 19733 gene ids; my dataset actually has only 161 locations. > > 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp\$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error. > I tried splitting the localization infos into separate vectors, i.e. chromosome<- c(2,1,8,17,12,...), start<- c(74295624,....), end<- c(...), strand<- c(...) and modified my query: > >> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54) > > But this seems to combine the information in the different vectors as the result is over 20.000 entries. > > Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing archive!? > > Any help/idea appreciated! > > Thank you, > Kemal > > Dr. med. Kemal Akat > Postdoctoral Fellow > Laboratory of RNA Molecular Biology > The Rockefeller University > 1230 York Avenue, Box #186 > New York, NY 10065 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber