Here is a solution "entirely in R" provided you have acquired the
hg18 transcript database using GenomicFeatures
makeTranscriptDbFromUCSC("hg18")
and run saveFeatures on the result to yield file "hg18.txdb.sqlite":
I put your data in space-delimited text, then transformed to GRanges
preserving
chromosome and strand, then find overlaps of your regions with
transcripts
for hg18. The names of these transcripts are given in the UCSC
'knownGene'
vocabulary. After the session info, we use R 2.10 org.Hs.eg.db to
resolve these
to (I think) hg18 entrez ids (we only have hg19 mappings
UCSCKG<->Entrez for
current R, to the best of my knowledge, and I believe that even these
are deprecated).
> kemdat = read.table("kemdat.txt", sep=" ",
h=TRUE,stringsAsFactors=FALSE)
> library(GenomicRanges)
> kem = with(kemdat, GRanges(ranges=IRanges(Cluster_Begin,
Cluster_End), strand=
+ Rle(factor(Strand)), seqnames=Rle(factor(Chromosome))))
> kem
GRanges with 5 ranges and 0 elementMetadata values
seqnames ranges strand |
<rle> <iranges> <rle> |
[1] chr2 [ 74295624, 74295644] + |
[2] chr1 [203949843, 203949866] - |
[3] chr8 [103464182, 103464207] - |
[4] chr17 [ 53272097, 53272117] - |
[5] chr12 [ 11150173, 11150194] - |
seqlengths
chr1 chr12 chr17 chr2 chr8
NA NA NA NA NA
> library(GenomicFeatures)
> hg18.txdb = loadFeatures("hg18.txdb.sqlite")
> tx18 = transcripts(hg18.txdb)
> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name
> dput(kg)
c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2",
"uc003ykr.1",
"uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2",
"uc001qze.2", "uc001qzf.1", "uc001qzj.2")
> sessionInfo()
R version 2.13.0 Under development (unstable) (2010-10-29 r53474)
Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.3.6 GenomicRanges_1.3.2 IRanges_1.9.6
loaded via a namespace (and not attached):
[1] BSgenome_1.17.7 Biobase_2.10.0 Biostrings_2.17.47 DBI_0.2-5
[5] RCurl_1.4-3 RSQLite_0.9-2 XML_3.1-1
biomaRt_2.5.1
[9] rtracklayer_1.11.3
>
Now use R 2.10.1 and org.Hs.eg.db 2.3.6 (current is 2.4.5 for devel,
and the UCSCKG
mapping is deprecated)
> kgs = c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2",
"uc003ykr.1",
+ "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2",
"uc001qzc.2",
+ "uc001qze.2", "uc001qzf.1", "uc001qzj.2")
> unlist(mget(kgs, revmap(org.Hs.egUCSCKG))
+ )
uc002skj.1 uc002skk.1 uc010ffb.1 uc001hdb.2 uc003ykr.1 uc003yks.1
uc002ivc.1
"10797" "10797" "10797" "64710" "51366" "51366"
"51649"
uc009zhp.1 uc001qzb.2 uc001qzc.2 uc001qze.2 uc001qzf.1 uc001qzj.2
"11272" "5554" "5554" "5554" "5545" "5554"
> sessionInfo()
R version 2.10.1 RC (2009-12-10 r50697)
i386-apple-darwin9.8.0
locale:
[1] C
attached base packages:
[1] stats graphics grDevices datasets tools utils
methods
[8] base
other attached packages:
[1] org.Hs.eg.db_2.3.6 RSQLite_0.7-3 DBI_0.2-5
[4] AnnotationDbi_1.8.2 Biobase_2.6.1 weaver_1.11.1
[7] codetools_0.2-2 digest_0.4.1
On Thu, Nov 11, 2010 at 10:01 PM, Kemal Akat <kakat at="" mail.rockefeller.edu=""> wrote:
> Hi all,
>
> I have a list of mapped sequence reads to hg18 for that I have the
exact chromosomal location on NCBI build 36.
>
> Cluster ID ? ? ? ? ? ? ?Strand ?Chromosome ? ? ?Cluster_Begin ?
Cluster_End
> slc754_chr2 ? ? ? ? ? ? + ? ? ? ? ? ? ? chr2 ? ? ? ? ? ?74295624 ? ?
? ? ? ? ? ?74295644
> slc4695_chr1 ? ? ? ? ? ?- ? ? ? ? ? ? ? chr1 ? ? ? ? ? ? ? ? ?
?203949843 ? ? ? ? ? ? ? 203949866
> slc2213_chr8 ? ? ? ? ? ?- ? ? ? ? ? ? ? chr8 ? ? ? ? ? ?103464182 ?
? ? ? ? ? ? 103464207
> slc1866_chr17 ? - ? ? ? ? ? ? ? chr17 ? ? ? ? ? 53272097 ? ? ? ? ? ?
? ?53272117
> slc1642_chr12 ? - ? ? ? ? ? ? ? chr12 ? ? ? ? ? 11150173 ? ? ? ? ? ?
? ? ? ? ? ?11150194
> ...
>
> For the downstream analysis I would like to assign each location an
identifier (entrez gene id, ensembl gene id and so forth), and the
question is simply if I can use the biomaRt package for this at all?
>
> It is easy for a single entry:
>
>> geneid <-getBM(attributes="entrezgene",
filters=c("chromosome_name","start","end", "strand"), values =
list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the
archived build 54 = NCBI 36
>> geneid
> ?entrezgene
> 1 ? ? ?10797
>>
>
> However, so far I have failed to make a batch query out of it.
>
> I imported/created the following 1 column data frame with the
localization formatted as necessary
>
>> tdp
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? chromosomal_cocation
> slc754_chr2 ? ? ? ? ? ? ?2,74295624,74295644,1
> slc4695_chr1 ? ?1,203949843,203949866,-1
> slc2213_chr8 ? ?8,103464182,103464207,-1
> slc1866_chr17 ? ? ? ? ? 17,53272097,53272117,-1
> slc1642_chr12 ? ? ? ? ? 12,11150173,11150194,-1
> ...
>
> I have two points where I failed:
>
> 1) I have not found a single filter that replaces the multiple
filters above. When I use "chromosomal_region" as single filter and
run:
>
>> geneid <-getBM(attributes="entrezgene",
filters="chromosomal_region", values = list(tdp$chromosomal_location,
mart=ensembl54)
>
> I get 19733 gene ids; my dataset actually has only 161 locations.
>
> 2) If I use multiple filters like I did above in the first example,
"values" has to be a vector and the expression "values =
list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript
out of bounds" error.
> I tried splitting the localization infos into separate vectors, i.e.
chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <-
c(...), strand <- c(...) and modified my query:
>
>> geneid <-getBM(attributes="entrezgene",
filters=c("chromosome_name","start","end", "strand"), values =
list(chromosome, start, end, strand, mart=ensembl54)
>
> But this seems to combine the information in the different vectors
as the result is over 20.000 entries.
>
> Finally, I was thinking of a loop to complete the task, but this has
been discouraged by another post in the mailing archive!?
>
> Any help/idea appreciated!
>
> Thank you,
> Kemal
>
> Dr. med. Kemal Akat
> Postdoctoral Fellow
> Laboratory of RNA Molecular Biology
> The Rockefeller University
> 1230 York Avenue, Box #186
> New York, NY 10065
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>