Hi Allen --
> syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL")
The way to understand this is to start with the help page for lookUp
> ?lookUp
and to experiment with a single identifier, e.g.,
> id <- res[[1]][[1]]
> lookUp(id, "org.Hs.eg.db", "GENENAME")
$`24`
[1] "ATP-binding cassette, sub-family A (ABC1), member 4"
What I want to change is 'SYMBOL', i.e., the 'what' argument of
'lookUp', rather than the the data package (org.Hs.eg.db) that I want
to look up the information in, so I try
> lookUp(id, "org.Hs.eg.db", "SYMBOL")
$`24`
[1] "ABCA4"
Looks good, so I'll move on to trying to look up all entries from a
single element of my list, lets choose a short one, and use unlist in
the final line to print out more compactly...
> sapply(res, length)
[1] 1833 13 1165 162 425
> ids <- res[[2]]
> unlist(lookUp(ids, "org.Hs.eg.db", "SYMBOL"))
5737 8880 10561 10964 11080 23032 23266
26009
"PTGFR" "FUBP1" "IFI44" "IFI44L" "DNAJB4" "USP33" "LPHN2"
"ZZZ3"
26289 54810 64123 91624 374986
"AK5" "GIPC2" "ELTD1" "NEXN" "FAM73A"
Note that 'lookUp' automatically looking up a vector of
ids. Nice. Finally, we want to look up each element of 'nms'. We use
lapply, which can take a list as its first argument, and a function as
its second argument. Each element of the list becomes the first
arugment of the function. Subsequent arguments to the function can be
applied after the function is given (I'm getting this from looking at
the help page for lapply). This gets me to
> syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL")
If I wanted to replace the 'names' of one of the list elements in res,
I might
> names(nms[[1]]) <- syms[[1]]
or more cleverly for all elements of nms
> nms <- mapply("names<-", nms, sysms)
Can you see how this works?
Not sure how you know that some genes are 'hypothetical'. If it's by
looking at their description, then something like
> grep("^hypothetical", nms[[1]])
[1] 1079 1464 1466 1555 1584 1614 1710 1748 1781 1807 1810 1828
will tell you the index of names you'd like to remove. You could do
something like
> n1 <- nms[[1]]
> idx <- grep("^hypothetical", n1)
> nms[[1]] <- n1[-idx]
How would you write that to apply to all the elements in your 'nms'
list? What if a list element has no hypothetical gene names?
Sorry about ComputedCollection; it's in the development version of
GSEABase; just drop it from the argument list.
Martin
"affy snp" <affysnp at="" gmail.com=""> writes:
> Hi Martin,
> Thank you!
> I tried the code below as you suggested:
> ####################################################################
> library(annotate)
> library(org.Hs.eg.db)
> df <- data.frame(chr=c(1,1,2,2,3),
> ??????? start=c(2171559, 77465510, 1, 145329129, 1),
> ??????? end=c(245522847, 82255496, 245522847, 187974870, 59538720))
> filt <- function(eid, czome, start, end) {
> ???? loc <- abs(eid) # either strand
> ???? any(names(eid)==czome & loc>start & loc<end)> ?}
> eids <- function(crit, map) {
> ???? found <- unlist(eapply(map, filt,
> ??????????????????????????? czome=crit[["chr"]],
> ??????????????????????????? start=crit[["start"]],
> ??????????????????????????? end=crit[["end"]]))
> ???? ls(map[found])
> ?}
> res <- apply(df, 1, eids, map=org.Hs.egCHRLOC)
> nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")
> #############################################################
> But I have no idea how to apply 'org.Hs.egSYMBOL' correctly. I tried
> to replace 'org.Hs.eg.db' with 'org.Hs.egSYMBOL' but it was not the
> case.
> The several of last lines in nms are like:
> [[5]]$`643853`
> [1] "similar to F40B5.2b"
> [[5]]$`646424`
> [1] "serine peptidase inhibitor, Kazal type 8 (putative)"
> [[5]]$`646498`
> [1] "hypothetical LOC646498"
> [[5]]$`727811`
> [1] "similar to chemokine (C-C motif) receptor-like 2"
> Again, I would like to have the ID replaced by Gene Symbol.? But
just don't know
> how to use 'org.Hs.egSYMBOL'. I noticed there are some genes
assigned as 'hypothetical'
> predicted genes. Is there a way to filter out those genes?
> With regard to 'GSEABase' package, I installed the package and tried
> the code:
> gss <- mapply(GeneSet, res, setName=setNames,
> ?????????????? MoreArgs=list(
> ???????????????? geneIdType=EntrezIdentifier(),
> ???????????????? collectionType=ComputedCollection()))
> But an error popped up as:
>> library(GSEABase)
>> gss <- mapply(GeneSet, res, setName=setNames,
> +??????????????? MoreArgs=list(
> +????????????????? geneIdType=EntrezIdentifier(),
> +????????????????? collectionType=ComputedCollection()))
> Error in mapply(GeneSet, res, setName = setNames, MoreArgs =
list(geneIdType = EntrezIdentifier(),? :
> ? could not find function "ComputedCollection"
> I appreciate!
> Good evening!
> Allen
>
>
> On Dec 27, 2007 1:35 PM, Martin Morgan <mtmorgan at="" fhcrc.org="">
wrote:
>
> Hi Allen --
>
> affy snp wrote: > Hi Martin, > > Thanks
for brining this to my attention. > I am wondering: > >
(1) Should the location information only be specified > in a way
using 'e'?
> Will sth like 1234567 work?
>
>
> There's nothing magical going on. 'filt' just compares numeric
values in 'loc' to another numeric value. Any numeric comparison
is possible.
>
> > (2) While the result return the gene function, can
it also include > the gene symbol?
>
>
>
> The function returns a list, with names of the list elements
being the Entrez ids. There are other maps in the org.* packages
to get at other types of information, for instance org.Hs.egSYMBOL
to get
> symbol' values associated with each Entrez id. List
available maps with
>
> ?> library(org.Hs.eg.db)
>
>
> ?> ls("package: org.Hs.eg.db")
> and find out about each with
> ?> ?org.Hs.egSYMBOL
>
> > (3) In case I want to look up genes of multiple
chromosomal regions, > how easily can I do that if I supply with a
table of regions such as: > > chr ? ? ?Start ? ? ? ? ? ? ?
?End > 1 ? ?
> 2171559 ? ? ? ? 245522847 > 1 ? ? 77465510 ? ? ? 82255496
> 2 ? ? 1 ? ? ? ? ? ? ? ? ? 243018229 > 2 ? ? 145329129 ? ?
187974870 > 3 ? ? 1 ? ? ? ? ? ? ? ? ? ?59538720
>
>
>
> Pretty easily, though now you need to specify your problem a
little more fully, and write some R code to do what you want. One
possibility is that you would like the gene symbols satisfying all
conditions
> within a row (chr=1 & loc>s2711559 & end<245522847), but
for any of the rows. A first approach might start by generalizing
'filt'
> ?> filt <- function(eid, czome, start, end) {
>
> + ? ? loc <- abs(eid) # either strand
>
>
> + ? ? any(names(eid)==czome & loc>start & loc<end) +="" }=""> and putting the code to find Entrez ids into a function
> ?> eids <- function(crit, map) { + ? ? found <-
unlist(eapply(map, filt, + ? ? ? ? ? ? ? ? ? ? ? ? ?
?czome=crit[["chr"]], + ? ? ? ? ? ? ? ? ? ? ? ? ?
?start=crit[["start"]], + ? ? ? ? ? ? ? ? ? ?
> ? ? ? ?end=crit[["end"]])) + ? ? ls(map[found]) + }
> 'crit' is the criterion we'll use to select entrez ides from
'map', it will be a named vector with elements 'chr', 'start',
'end'.
> We'll 'apply' our 'eids' function to each row of a data frame
representing your criteria:
> ?> df <- data.frame(chr=c(1,1,2,2,3), + ? ? ? ?
start=c(2171559, 77465510, 1, 145329129, 1), + ? ? ? ?
end=c(245522847, 82255496, 245522847, 187974870, 59538720)) ?> res
<- apply(df, 1, eids,
> map=org.Hs.egCHRLOC) ?> str(res) List of 5 ?$ : chr
[1:1833] "24" "27" "34" "58" ... ?$ : chr [1:13] "5737" "8880"
"10561" "10964" ... ?$ : chr [1:1165] "14" "33" "52" "72" ...
?$ : chr
> [1:162] "90" "92" "390" "518" ... ?$ : chr [1:425] "30"
"93" "95" "211" ...
> The result is a list, with each element corresponding to the
Entrez ids found in the corresponding row of 'df'.
> It turns out that there are
> ?> sum(sapply(res, length)) [1] 3598
> Entrez IDs but some Entrez ids occur in more than one region
> ?> length(unique(unlist(res))) [1] 3423
> You could find the GENENAMEs for all of these with
> ?> nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")
> Depending on what you want to do in the future, you might also
think about making these into GeneSets, e.g.,
> ?> library(GSEABase) ?> setNames <- apply(df, 1, paste,
collapse="_") ?> gss <- mapply(GeneSet, res, setName=setNames,
+ ? ? ? ? ? ? ? MoreArgs=list( + ? ? ? ? ? ? ? ?
> geneIdType=EntrezIdentifier(), + ? ? ? ? ? ? ? ?
collectionType=ComputedCollection())) ?> gsc <-
GeneSetCollection(gss) ?> gsc GeneSetCollection ?names:
1_2171559_245522847,
> 1_77465510_82255496, ..., 3_1_59538720 (5 total)
?unique identifiers: 24, 27, ..., 727811 (3423 total) ?types in
collection: ? ?geneIdType: EntrezIdentifier (1 total) ?
?collectionType:
> ComputedCollection (1 total)
> Martin
>
> > > Thank you and happy holiday! > >
Allen > > On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at="" fhcrc.org="">
>
> > <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > ? ?
Hi Allen -- > > ? ? Also the 'org.*' annotation packages
provide organism-centric > ? ? annotations. These packages have
environment-like
> key-value structures > ? ? that typically map Entrez
identifiers to other information. The > ? ? CHRLOC > ? ?
variables map Entrez ids to named integer vectors. Names on the vector
> ? ? are
> chromosome locations, values are the strand (+ or -) and base
> ? ? start position. Thus > > ? ? > library(annotate) > ?
? > library(org.Hs.eg.db ) > ? ? > > ? ? > filt <-
function(eid) { >
> ? ? + ? ? loc <- abs(eid) # either strand > ? ? + ? ?
any(names(eid)=="3" & loc>1e8 & loc<1.1e8) > ? ? + } > ? ? >
found <- unlist(eapply(org.Hs.egCHRLOC , filt)) > ? ? > sum(found)
# named genes
> found > ? ? [1] 33 > ? ? > > ? ? > eids <-
ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids > ? ? >
head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6 > ? ?
$`214` > ? ? [1]
> "activated leukocyte cell adhesion molecule" > > ? ?
$`868` > ? ? [1] "Cas-Br-M (murine) ecotropic retroviral
transforming sequence b" > > ? ? $`961` > ? ? [1] "CD47
molecule" > > ?
> ? $`1295` > ? ? [1] "collagen, type VIII, alpha 1" >
> ? ? $`6152` > ? ? [1] "ribosomal protein L24" > > ? ?
$`9666` > ? ? [1] "zinc finger DAZ interacting protein 3" >
> ? ?
> The annotation packages are from data sources taken from a
> ? ? snapshot of > ? ? data sources (see org.Hs.eg_dbInfo()) that
might differ from the data > ? ? sources used for biomaRt; the
merits of
> this include (a) > ? ? reproduciblilty and (b) relative
speed of computation (no internet > ? ? download required). >
> ? ? Martin > >
>
> > ? ? toedling at ebi.ac.uk <mailto:toedling at="" ebi.ac.uk="">
writes: > > ? ? > Hi Allen, > ? ? > > ? ? > have a
look at the biomaRt package: > ? ? >
>
http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
> ? ? > and see the vignette section 4.5 Task 5. > ? ? > You may
want to select additional attributes returned, such as > ? ? >
> "description" etc. Use the "listAttributes" function to obtain
> ? ? of list of > ? ? > possible attributes. > ? ? > > ?
? > Best and Merry Christmas, > ? ? > Joern > ? ? > > ? ?
>>
> Dear list, > ? ? >> > ? ? >> I am wondering if I supply
with chromosome information, start > ? ? and end > ? ? >>
position > ? ? >> of a region, can any Bioconductor package could
return a list >
> ? ? of genes > ? ? >> with > ? ? >> short descriptions
within those regions? > ? ? >> > ? ? >> Thanks and Merry
Christmas! > ? ? >> > ? ? >> Allen > ? ? >> > ? ? >>
> ? ? > >
> ? ? > _______________________________________________ > ? ?
> Bioconductor mailing list > ? ? > Bioconductor at
stat.math.ethz.ch
>
>
> > ? ? <mailto:bioconductor at="" stat.math.ethz.ch="">
>
>
> > ? ? >
https://stat.ethz.ch/mailman/listinfo/bioconductor
> ? ? > Search the archives: > ? ?
http://news.gmane.org/gmane.science.biology.informatics.conductor
> > ? ? -- > ? ? Martin
> Morgan > ? ? Computational Biology / Fred Hutchinson Cancer
Research Center > ? ? 1100 Fairview Ave. N. > ? ? PO Box 19024
Seattle, WA 98109 > > ? ? Location: Arnold Building M2 B169
> ? ?
> Phone: (206) 667-2793 > >
>
>
>
>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793