How to retrieve gene information

0

Entering edit mode

affy snp ▴ 640

@affy-snp-2480

Last seen 9.6 years ago

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071224/ 4c4c3754/attachment.pl

• 605 views

ADD COMMENT • link updated 16.3 years ago by Martin Morgan 25k • written 16.3 years ago by affy snp ▴ 640

0

Entering edit mode

Joern Toedling ▴ 730

@joern-toedling-1244

Last seen 9.6 years ago

Hi Allen, have a look at the biomaRt package: http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html and see the vignette section 4.5 Task 5. You may want to select additional attributes returned, such as "description" etc. Use the "listAttributes" function to obtain of list of possible attributes. Best and Merry Christmas, Joern > Dear list, > > I am wondering if I supply with chromosome information, start and end > position > of a region, can any Bioconductor package could return a list of genes > with > short descriptions within those regions? > > Thanks and Merry Christmas! > > Allen > >

ADD COMMENT • link 16.3 years ago Joern Toedling ▴ 730

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

Hi Allen -- Also the 'org.*' annotation packages provide organism-centric annotations. These packages have environment-like key-value structures that typically map Entrez identifiers to other information. The CHRLOC variables map Entrez ids to named integer vectors. Names on the vector are chromosome locations, values are the strand (+ or -) and base start position. Thus > library(annotate) > library(org.Hs.eg.db) > > filt <- function(eid) { + loc <- abs(eid) # either strand + any(names(eid)=="3" & loc>1e8 & loc<1.1e8) + } > found <- unlist(eapply(org.Hs.egCHRLOC, filt)) > sum(found) # named genes found [1] 33 > > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6 $`214` [1] "activated leukocyte cell adhesion molecule" $`868` [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b" $`961` [1] "CD47 molecule" $`1295` [1] "collagen, type VIII, alpha 1" $`6152` [1] "ribosomal protein L24" $`9666` [1] "zinc finger DAZ interacting protein 3" The annotation packages are from data sources taken from a snapshot of data sources (see org.Hs.eg_dbInfo()) that might differ from the data sources used for biomaRt; the merits of this include (a) reproduciblilty and (b) relative speed of computation (no internet download required). Martin toedling at ebi.ac.uk writes: > Hi Allen, > > have a look at the biomaRt package: > http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html > and see the vignette section 4.5 Task 5. > You may want to select additional attributes returned, such as > "description" etc. Use the "listAttributes" function to obtain of list of > possible attributes. > > Best and Merry Christmas, > Joern > >> Dear list, >> >> I am wondering if I supply with chromosome information, start and end >> position >> of a region, can any Bioconductor package could return a list of genes >> with >> short descriptions within those regions? >> >> Thanks and Merry Christmas! >> >> Allen >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 16.3 years ago Martin Morgan 25k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071227/ 764078ff/attachment.pl

ADD REPLY • link 16.3 years ago affy snp ▴ 640

0

Entering edit mode

Hi Allen -- affy snp wrote: > Hi Martin, > > Thanks for brining this to my attention. > I am wondering: > > (1) Should the location information only be specified > in a way using 'e'? Will sth like 1234567 work? There's nothing magical going on. 'filt' just compares numeric values in 'loc' to another numeric value. Any numeric comparison is possible. > (2) While the result return the gene function, can it also include > the gene symbol? The function returns a list, with names of the list elements being the Entrez ids. There are other maps in the org.* packages to get at other types of information, for instance org.Hs.egSYMBOL to get 'symbol' values associated with each Entrez id. List available maps with > library(org.Hs.eg.db) > ls("package:org.Hs.eg.db") and find out about each with > ?org.Hs.egSYMBOL > (3) In case I want to look up genes of multiple chromosomal regions, > how easily can I do that if I supply with a table of regions such as: > > chr Start End > 1 2171559 245522847 > 1 77465510 82255496 > 2 1 243018229 > 2 145329129 187974870 > 3 1 59538720 Pretty easily, though now you need to specify your problem a little more fully, and write some R code to do what you want. One possibility is that you would like the gene symbols satisfying all conditions within a row (chr=1 & loc>s2711559 & end<245522847), but for any of the rows. A first approach might start by generalizing 'filt' > filt <- function(eid, czome, start, end) { + loc <- abs(eid) # either strand + any(names(eid)==czome & loc>start & loc<end) +="" }="" and="" putting="" the="" code="" to="" find="" entrez="" ids="" into="" a="" function=""> eids <- function(crit, map) { + found <- unlist(eapply(map, filt, + czome=crit[["chr"]], + start=crit[["start"]], + end=crit[["end"]])) + ls(map[found]) + } 'crit' is the criterion we'll use to select entrez ides from 'map', it will be a named vector with elements 'chr', 'start', 'end'. We'll 'apply' our 'eids' function to each row of a data frame representing your criteria: > df <- data.frame(chr=c(1,1,2,2,3), + start=c(2171559, 77465510, 1, 145329129, 1), + end=c(245522847, 82255496, 245522847, 187974870, 59538720)) > res <- apply(df, 1, eids, map=org.Hs.egCHRLOC) > str(res) List of 5 $ : chr [1:1833] "24" "27" "34" "58" ... $ : chr [1:13] "5737" "8880" "10561" "10964" ... $ : chr [1:1165] "14" "33" "52" "72" ... $ : chr [1:162] "90" "92" "390" "518" ... $ : chr [1:425] "30" "93" "95" "211" ... The result is a list, with each element corresponding to the Entrez ids found in the corresponding row of 'df'. It turns out that there are > sum(sapply(res, length)) [1] 3598 Entrez IDs but some Entrez ids occur in more than one region > length(unique(unlist(res))) [1] 3423 You could find the GENENAMEs for all of these with > nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME") Depending on what you want to do in the future, you might also think about making these into GeneSets, e.g., > library(GSEABase) > setNames <- apply(df, 1, paste, collapse="_") > gss <- mapply(GeneSet, res, setName=setNames, + MoreArgs=list( + geneIdType=EntrezIdentifier(), + collectionType=ComputedCollection())) > gsc <- GeneSetCollection(gss) > gsc GeneSetCollection names: 1_2171559_245522847, 1_77465510_82255496, ..., 3_1_59538720 (5 total) unique identifiers: 24, 27, ..., 727811 (3423 total) types in collection: geneIdType: EntrezIdentifier (1 total) collectionType: ComputedCollection (1 total) Martin > > Thank you and happy holiday! > > Allen > > On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > Hi Allen -- > > Also the 'org.*' annotation packages provide organism-centric > annotations. These packages have environment-like key-value structures > that typically map Entrez identifiers to other information. The > CHRLOC > variables map Entrez ids to named integer vectors. Names on the vector > are chromosome locations, values are the strand (+ or -) and base > start position. Thus > > > library(annotate) > > library(org.Hs.eg.db ) > > > > filt <- function(eid) { > + loc <- abs(eid) # either strand > + any(names(eid)=="3" & loc>1e8 & loc<1.1e8) > + } > > found <- unlist(eapply(org.Hs.egCHRLOC , filt)) > > sum(found) # named genes found > [1] 33 > > > > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids > > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6 > $`214` > [1] "activated leukocyte cell adhesion molecule" > > $`868` > [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b" > > $`961` > [1] "CD47 molecule" > > $`1295` > [1] "collagen, type VIII, alpha 1" > > $`6152` > [1] "ribosomal protein L24" > > $`9666` > [1] "zinc finger DAZ interacting protein 3" > > The annotation packages are from data sources taken from a > snapshot of > data sources (see org.Hs.eg_dbInfo()) that might differ from the data > sources used for biomaRt; the merits of this include (a) > reproduciblilty and (b) relative speed of computation (no internet > download required). > > Martin > > > toedling at ebi.ac.uk <mailto:toedling at="" ebi.ac.uk=""> writes: > > > Hi Allen, > > > > have a look at the biomaRt package: > > http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html > > and see the vignette section 4.5 Task 5. > > You may want to select additional attributes returned, such as > > "description" etc. Use the "listAttributes" function to obtain > of list of > > possible attributes. > > > > Best and Merry Christmas, > > Joern > > > >> Dear list, > >> > >> I am wondering if I supply with chromosome information, start > and end > >> position > >> of a region, can any Bioconductor package could return a list > of genes > >> with > >> short descriptions within those regions? > >> > >> Thanks and Merry Christmas! > >> > >> Allen > >> > >> > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M2 B169 > Phone: (206) 667-2793 > >

ADD REPLY • link 16.3 years ago Martin Morgan 25k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071227/ e9210bbd/attachment.pl

ADD REPLY • link 16.3 years ago affy snp ▴ 640

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

Hi Allen -- > syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL") The way to understand this is to start with the help page for lookUp > ?lookUp and to experiment with a single identifier, e.g., > id <- res[[1]][[1]] > lookUp(id, "org.Hs.eg.db", "GENENAME") $`24` [1] "ATP-binding cassette, sub-family A (ABC1), member 4" What I want to change is 'SYMBOL', i.e., the 'what' argument of 'lookUp', rather than the the data package (org.Hs.eg.db) that I want to look up the information in, so I try > lookUp(id, "org.Hs.eg.db", "SYMBOL") $`24` [1] "ABCA4" Looks good, so I'll move on to trying to look up all entries from a single element of my list, lets choose a short one, and use unlist in the final line to print out more compactly... > sapply(res, length) [1] 1833 13 1165 162 425 > ids <- res[[2]] > unlist(lookUp(ids, "org.Hs.eg.db", "SYMBOL")) 5737 8880 10561 10964 11080 23032 23266 26009 "PTGFR" "FUBP1" "IFI44" "IFI44L" "DNAJB4" "USP33" "LPHN2" "ZZZ3" 26289 54810 64123 91624 374986 "AK5" "GIPC2" "ELTD1" "NEXN" "FAM73A" Note that 'lookUp' automatically looking up a vector of ids. Nice. Finally, we want to look up each element of 'nms'. We use lapply, which can take a list as its first argument, and a function as its second argument. Each element of the list becomes the first arugment of the function. Subsequent arguments to the function can be applied after the function is given (I'm getting this from looking at the help page for lapply). This gets me to > syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL") If I wanted to replace the 'names' of one of the list elements in res, I might > names(nms[[1]]) <- syms[[1]] or more cleverly for all elements of nms > nms <- mapply("names<-", nms, sysms) Can you see how this works? Not sure how you know that some genes are 'hypothetical'. If it's by looking at their description, then something like > grep("^hypothetical", nms[[1]]) [1] 1079 1464 1466 1555 1584 1614 1710 1748 1781 1807 1810 1828 will tell you the index of names you'd like to remove. You could do something like > n1 <- nms[[1]] > idx <- grep("^hypothetical", n1) > nms[[1]] <- n1[-idx] How would you write that to apply to all the elements in your 'nms' list? What if a list element has no hypothetical gene names? Sorry about ComputedCollection; it's in the development version of GSEABase; just drop it from the argument list. Martin "affy snp" <affysnp at="" gmail.com=""> writes: > Hi Martin, > Thank you! > I tried the code below as you suggested: > #################################################################### > library(annotate) > library(org.Hs.eg.db) > df <- data.frame(chr=c(1,1,2,2,3), > ??????? start=c(2171559, 77465510, 1, 145329129, 1), > ??????? end=c(245522847, 82255496, 245522847, 187974870, 59538720)) > filt <- function(eid, czome, start, end) { > ???? loc <- abs(eid) # either strand > ???? any(names(eid)==czome & loc>start & loc<end)> ?} > eids <- function(crit, map) { > ???? found <- unlist(eapply(map, filt, > ??????????????????????????? czome=crit[["chr"]], > ??????????????????????????? start=crit[["start"]], > ??????????????????????????? end=crit[["end"]])) > ???? ls(map[found]) > ?} > res <- apply(df, 1, eids, map=org.Hs.egCHRLOC) > nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME") > ############################################################# > But I have no idea how to apply 'org.Hs.egSYMBOL' correctly. I tried > to replace 'org.Hs.eg.db' with 'org.Hs.egSYMBOL' but it was not the > case. > The several of last lines in nms are like: > [[5]]$`643853` > [1] "similar to F40B5.2b" > [[5]]$`646424` > [1] "serine peptidase inhibitor, Kazal type 8 (putative)" > [[5]]$`646498` > [1] "hypothetical LOC646498" > [[5]]$`727811` > [1] "similar to chemokine (C-C motif) receptor-like 2" > Again, I would like to have the ID replaced by Gene Symbol.? But just don't know > how to use 'org.Hs.egSYMBOL'. I noticed there are some genes assigned as 'hypothetical' > predicted genes. Is there a way to filter out those genes? > With regard to 'GSEABase' package, I installed the package and tried > the code: > gss <- mapply(GeneSet, res, setName=setNames, > ?????????????? MoreArgs=list( > ???????????????? geneIdType=EntrezIdentifier(), > ???????????????? collectionType=ComputedCollection())) > But an error popped up as: >> library(GSEABase) >> gss <- mapply(GeneSet, res, setName=setNames, > +??????????????? MoreArgs=list( > +????????????????? geneIdType=EntrezIdentifier(), > +????????????????? collectionType=ComputedCollection())) > Error in mapply(GeneSet, res, setName = setNames, MoreArgs = list(geneIdType = EntrezIdentifier(),? : > ? could not find function "ComputedCollection" > I appreciate! > Good evening! > Allen > > > On Dec 27, 2007 1:35 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > > Hi Allen -- > > affy snp wrote: > Hi Martin, > > Thanks for brining this to my attention. > I am wondering: > > (1) Should the location information only be specified > in a way using 'e'? > Will sth like 1234567 work? > > > There's nothing magical going on. 'filt' just compares numeric values in 'loc' to another numeric value. Any numeric comparison is possible. > > > (2) While the result return the gene function, can it also include > the gene symbol? > > > > The function returns a list, with names of the list elements being the Entrez ids. There are other maps in the org.* packages to get at other types of information, for instance org.Hs.egSYMBOL to get > symbol' values associated with each Entrez id. List available maps with > > ?> library(org.Hs.eg.db) > > > ?> ls("package: org.Hs.eg.db") > and find out about each with > ?> ?org.Hs.egSYMBOL > > > (3) In case I want to look up genes of multiple chromosomal regions, > how easily can I do that if I supply with a table of regions such as: > > chr ? ? ?Start ? ? ? ? ? ? ? ?End > 1 ? ? > 2171559 ? ? ? ? 245522847 > 1 ? ? 77465510 ? ? ? 82255496 > 2 ? ? 1 ? ? ? ? ? ? ? ? ? 243018229 > 2 ? ? 145329129 ? ? 187974870 > 3 ? ? 1 ? ? ? ? ? ? ? ? ? ?59538720 > > > > Pretty easily, though now you need to specify your problem a little more fully, and write some R code to do what you want. One possibility is that you would like the gene symbols satisfying all conditions > within a row (chr=1 & loc>s2711559 & end<245522847), but for any of the rows. A first approach might start by generalizing 'filt' > ?> filt <- function(eid, czome, start, end) { > > + ? ? loc <- abs(eid) # either strand > > > + ? ? any(names(eid)==czome & loc>start & loc<end) +="" }=""> and putting the code to find Entrez ids into a function > ?> eids <- function(crit, map) { + ? ? found <- unlist(eapply(map, filt, + ? ? ? ? ? ? ? ? ? ? ? ? ? ?czome=crit[["chr"]], + ? ? ? ? ? ? ? ? ? ? ? ? ? ?start=crit[["start"]], + ? ? ? ? ? ? ? ? ? ? > ? ? ? ?end=crit[["end"]])) + ? ? ls(map[found]) + } > 'crit' is the criterion we'll use to select entrez ides from 'map', it will be a named vector with elements 'chr', 'start', 'end'. > We'll 'apply' our 'eids' function to each row of a data frame representing your criteria: > ?> df <- data.frame(chr=c(1,1,2,2,3), + ? ? ? ? start=c(2171559, 77465510, 1, 145329129, 1), + ? ? ? ? end=c(245522847, 82255496, 245522847, 187974870, 59538720)) ?> res <- apply(df, 1, eids, > map=org.Hs.egCHRLOC) ?> str(res) List of 5 ?$ : chr [1:1833] "24" "27" "34" "58" ... ?$ : chr [1:13] "5737" "8880" "10561" "10964" ... ?$ : chr [1:1165] "14" "33" "52" "72" ... ?$ : chr > [1:162] "90" "92" "390" "518" ... ?$ : chr [1:425] "30" "93" "95" "211" ... > The result is a list, with each element corresponding to the Entrez ids found in the corresponding row of 'df'. > It turns out that there are > ?> sum(sapply(res, length)) [1] 3598 > Entrez IDs but some Entrez ids occur in more than one region > ?> length(unique(unlist(res))) [1] 3423 > You could find the GENENAMEs for all of these with > ?> nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME") > Depending on what you want to do in the future, you might also think about making these into GeneSets, e.g., > ?> library(GSEABase) ?> setNames <- apply(df, 1, paste, collapse="_") ?> gss <- mapply(GeneSet, res, setName=setNames, + ? ? ? ? ? ? ? MoreArgs=list( + ? ? ? ? ? ? ? ? > geneIdType=EntrezIdentifier(), + ? ? ? ? ? ? ? ? collectionType=ComputedCollection())) ?> gsc <- GeneSetCollection(gss) ?> gsc GeneSetCollection ?names: 1_2171559_245522847, > 1_77465510_82255496, ..., 3_1_59538720 (5 total) ?unique identifiers: 24, 27, ..., 727811 (3423 total) ?types in collection: ? ?geneIdType: EntrezIdentifier (1 total) ? ?collectionType: > ComputedCollection (1 total) > Martin > > > > Thank you and happy holiday! > > Allen > > On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> > > > <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > ? ? Hi Allen -- > > ? ? Also the 'org.*' annotation packages provide organism-centric > ? ? annotations. These packages have environment-like > key-value structures > ? ? that typically map Entrez identifiers to other information. The > ? ? CHRLOC > ? ? variables map Entrez ids to named integer vectors. Names on the vector > ? ? are > chromosome locations, values are the strand (+ or -) and base > ? ? start position. Thus > > ? ? > library(annotate) > ? ? > library(org.Hs.eg.db ) > ? ? > > ? ? > filt <- function(eid) { > > ? ? + ? ? loc <- abs(eid) # either strand > ? ? + ? ? any(names(eid)=="3" & loc>1e8 & loc<1.1e8) > ? ? + } > ? ? > found <- unlist(eapply(org.Hs.egCHRLOC , filt)) > ? ? > sum(found) # named genes > found > ? ? [1] 33 > ? ? > > ? ? > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids > ? ? > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6 > ? ? $`214` > ? ? [1] > "activated leukocyte cell adhesion molecule" > > ? ? $`868` > ? ? [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b" > > ? ? $`961` > ? ? [1] "CD47 molecule" > > ? > ? $`1295` > ? ? [1] "collagen, type VIII, alpha 1" > > ? ? $`6152` > ? ? [1] "ribosomal protein L24" > > ? ? $`9666` > ? ? [1] "zinc finger DAZ interacting protein 3" > > ? ? > The annotation packages are from data sources taken from a > ? ? snapshot of > ? ? data sources (see org.Hs.eg_dbInfo()) that might differ from the data > ? ? sources used for biomaRt; the merits of > this include (a) > ? ? reproduciblilty and (b) relative speed of computation (no internet > ? ? download required). > > ? ? Martin > > > > > ? ? toedling at ebi.ac.uk <mailto:toedling at="" ebi.ac.uk=""> writes: > > ? ? > Hi Allen, > ? ? > > ? ? > have a look at the biomaRt package: > ? ? > > http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html > ? ? > and see the vignette section 4.5 Task 5. > ? ? > You may want to select additional attributes returned, such as > ? ? > > "description" etc. Use the "listAttributes" function to obtain > ? ? of list of > ? ? > possible attributes. > ? ? > > ? ? > Best and Merry Christmas, > ? ? > Joern > ? ? > > ? ? >> > Dear list, > ? ? >> > ? ? >> I am wondering if I supply with chromosome information, start > ? ? and end > ? ? >> position > ? ? >> of a region, can any Bioconductor package could return a list > > ? ? of genes > ? ? >> with > ? ? >> short descriptions within those regions? > ? ? >> > ? ? >> Thanks and Merry Christmas! > ? ? >> > ? ? >> Allen > ? ? >> > ? ? >> > ? ? > > > ? ? > _______________________________________________ > ? ? > Bioconductor mailing list > ? ? > Bioconductor at stat.math.ethz.ch > > > > ? ? <mailto:bioconductor at="" stat.math.ethz.ch=""> > > > > ? ? > https://stat.ethz.ch/mailman/listinfo/bioconductor > ? ? > Search the archives: > ? ? http://news.gmane.org/gmane.science.biology.informatics.conductor > > ? ? -- > ? ? Martin > Morgan > ? ? Computational Biology / Fred Hutchinson Cancer Research Center > ? ? 1100 Fairview Ave. N. > ? ? PO Box 19024 Seattle, WA 98109 > > ? ? Location: Arnold Building M2 B169 > ? ? > Phone: (206) 667-2793 > > > > > > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 16.3 years ago Martin Morgan 25k

Login before adding your answer.