How to annotate genomic coordinates

0

Entering edit mode

José Luis Lavín ▴ 280

@jose-luis-lavin-5529

Last seen 9.6 years ago

Dear Bioconductor list, I write you this email asking for a Bioconductor module that allows me to annotate genomic coordinates and get different GeneIds. I'll show you an example of what I'm referring to: I have this data: Chromosome coordinate chr17 31246506 which can also be written this way by the program that yielded the result: chr17.31246506 And I need to convert this data into a gene name and known gene Ids, such as: Gene name Entrez_ID Ensembl_ID Tff3 NM_011575 20050 Can you please advice me about a module able to perform this ID conversion using a list of "chr17.31246506" type coordinates as input? Thanks in advance With best wishes -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

convert convert • 3.1k views

ADD COMMENT • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi Jose, On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: > Dear Bioconductor list, > > I write you this email asking for a Bioconductor module that allows me to > annotate genomic coordinates and get different GeneIds. > I'll show you an example of what I'm referring to: > > I have this data: > Chromosome coordinate > chr17 31246506 It depends on what that coordinate is. Is it the start of a transcript? A SNP? Do you really just have a single coordinate, or do you have a range? What species are we talking about here? Depending on what your data are, you might want to use either one of the TxDb packages, or a SNPlocs package. There really isn't much to go on here. If I assume this is a coordinate that one might think is within an exon, and if I further assume you are working with H. sapiens, I could suggest something like library(TxDb.Hsapiens.UCSC.hg19.knownGene) ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) ex[ex %in% x] or maybe more appropriately names(ex)[ex %in% x] which will give you the Gene ID, and you can go from there using the org.Hs.eg.db package. If however, your coordinate isn't in an exon, but might be in a UTR, you can look at ?exonsBy to see what other sequences you can extract to compare with. If these are SNPs, then you can look at the help pages for the relevant SNPlocs package. Best, Jim > > which can also be written this way by the program that yielded the result: > chr17.31246506 > > And I need to convert this data into a gene name and known gene Ids, such > as: > > Gene name Entrez_ID Ensembl_ID > > Tff3 NM_011575 20050 > Can you please advice me about a module able to perform this ID conversion > using a list of "chr17.31246506" type coordinates as input? > > Thanks in advance > > With best wishes > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.4 years ago James W. MacDonald 65k

0

Entering edit mode

Dear James, To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format: chr17.31246506 How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended? x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the Entrez/Ensembl Ids in two other columns : Coordinate Gene_name Entrez_ID Ensembl_ID Is it easy to do this in R? I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry Best JL 2012/11/8 James W. MacDonald <jmacdon@uw.edu> > Hi Jose, > > > On 11/8/2012 8:19 AM, José Luis Lavín wrote: > >> Dear Bioconductor list, >> >> I write you this email asking for a Bioconductor module that allows me to >> annotate genomic coordinates and get different GeneIds. >> I'll show you an example of what I'm referring to: >> >> I have this data: >> Chromosome coordinate >> chr17 31246506 >> > > It depends on what that coordinate is. Is it the start of a transcript? A > SNP? Do you really just have a single coordinate, or do you have a range? > What species are we talking about here? > > Depending on what your data are, you might want to use either one of the > TxDb packages, or a SNPlocs package. There really isn't much to go on here. > If I assume this is a coordinate that one might think is within an exon, > and if I further assume you are working with H. sapiens, I could suggest > something like > > library(TxDb.Hsapiens.UCSC.**hg19.knownGene) > ex <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") > > x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > ex[ex %in% x] > > or maybe more appropriately > > names(ex)[ex %in% x] > > which will give you the Gene ID, and you can go from there using the > org.Hs.eg.db package. > > If however, your coordinate isn't in an exon, but might be in a UTR, you > can look at ?exonsBy to see what other sequences you can extract to compare > with. > > If these are SNPs, then you can look at the help pages for the relevant > SNPlocs package. > > Best, > > Jim > > > >> which can also be written this way by the program that yielded the result: >> chr17.31246506 >> >> And I need to convert this data into a gene name and known gene Ids, such >> as: >> >> Gene name Entrez_ID Ensembl_ID >> >> Tff3 NM_011575 20050 >> Can you please advice me about a module able to perform this ID conversion >> using a list of "chr17.31246506" type coordinates as input? >> >> Thanks in advance >> >> With best wishes >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

try this biocLite('Mus.musculus') require(Mus.musculus) require(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') egid <- names(txdb) name <- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) length(name) == length(egid) ## TRUE, OK to use esbl <- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) length(esbl) == length(egid) ## FALSE, do not use probes <- GRanges(seq="chr17", IRanges(start=31245606, width=1), strand='*') genome(probes) <- 'mm9' ## prevents some stoopid mistakes geneBodyProbes <- findOverlaps(probes, txdb) geneBodyProbes ## Hits of length 1 ## queryLength: 1 ## subjectLength: 21761 ## queryHits subjectHits ## <integer> <integer> ## 1 1 1641 ## name[subjectHits(geneBodyProbes)] ## 11307 # egid ## "Abcg1" # name ## org.Mm.egCHRLOC[['11307']] ## 17 ## 31057694 ## org.Mm.egENSEMBL[['11307']] ## [1] "ENSMUSG00000024030" promotersByGene <- flank(txdb, 1500) # or however many bases you want promotersByGene <- flank(promotersByGene, 200, FALSE) # a little extra promoterProbes <- findOverlaps(probes, promotersByGene) promoterProbes ## Hits of length 0 ## queryLength: 1 ## subjectLength: 21761 ## ## read the GRanges and GenomicFeatures vignettes for more On Thu, Nov 8, 2012 at 7:28 AM, José Luis Lavín <jluis.lavin@unavarra.es>wrote: > Dear James, > > To answer your questions swiftly, the features are methylation sites > (that's why I only have a coordinate instead of having a Start/End pair) in > mouse (mm9) genome and I have a list of those in the following format: > > chr17.31246506 > > How could I read the list so that I can input the "chr" and the > "coordinate" parameters into the expression you recommended? > > x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > I 'm lookin forward to obtain a table where my "coordinate-based Id" > appears in a column, and the gene_name and if possible, the Entrez/Ensembl > Ids in two other columns : > > Coordinate Gene_name Entrez_ID Ensembl_ID > > Is it easy to do this in R? > > I'm still really new to R and I lack many concepts you may consider > basic... I'm awfully sorry > > Best > > JL > > 2012/11/8 James W. MacDonald <jmacdon@uw.edu> > > > Hi Jose, > > > > > > On 11/8/2012 8:19 AM, José Luis Lavín wrote: > > > >> Dear Bioconductor list, > >> > >> I write you this email asking for a Bioconductor module that allows me > to > >> annotate genomic coordinates and get different GeneIds. > >> I'll show you an example of what I'm referring to: > >> > >> I have this data: > >> Chromosome coordinate > >> chr17 31246506 > >> > > > > It depends on what that coordinate is. Is it the start of a transcript? A > > SNP? Do you really just have a single coordinate, or do you have a range? > > What species are we talking about here? > > > > Depending on what your data are, you might want to use either one of the > > TxDb packages, or a SNPlocs package. There really isn't much to go on > here. > > If I assume this is a coordinate that one might think is within an exon, > > and if I further assume you are working with H. sapiens, I could suggest > > something like > > > > library(TxDb.Hsapiens.UCSC.**hg19.knownGene) > > ex <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") > > > > x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > > > ex[ex %in% x] > > > > or maybe more appropriately > > > > names(ex)[ex %in% x] > > > > which will give you the Gene ID, and you can go from there using the > > org.Hs.eg.db package. > > > > If however, your coordinate isn't in an exon, but might be in a UTR, you > > can look at ?exonsBy to see what other sequences you can extract to > compare > > with. > > > > If these are SNPs, then you can look at the help pages for the relevant > > SNPlocs package. > > > > Best, > > > > Jim > > > > > > > >> which can also be written this way by the program that yielded the > result: > >> chr17.31246506 > >> > >> And I need to convert this data into a gene name and known gene Ids, > such > >> as: > >> > >> Gene name Entrez_ID Ensembl_ID > >> > >> Tff3 NM_011575 20050 > >> Can you please advice me about a module able to perform this ID > conversion > >> using a list of "chr17.31246506" type coordinates as input? > >> > >> Thanks in advance > >> > >> With best wishes > >> > >> > >> > >> ______________________________**_________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/**listinfo/bioconductor< > https://stat.ethz.ch/mailman/listinfo/bioconductor> > >> Search the archives: http://news.gmane.org/gmane.** > >> science.biology.informatics.**conductor< > http://news.gmane.org/gmane.science.biology.informatics.conductor> > >> > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > > > > -- > -- > Dr. José Luis Lavín Trueba > > Dpto. de Producción Agraria > Grupo de Genética y Microbiología > Universidad Pública de Navarra > 31006 Pamplona > Navarra > SPAIN > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Jose, On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: > Dear James, > > To answer your questions swiftly, the features are methylation sites > (that's why I only have a coordinate instead of having a Start/End > pair) in mouse (mm9) genome and I have a list of those in the > following format: > > chr17.31246506 > > How could I read the list so that I can input the "chr" and the > "coordinate" parameters into the expression you recommended? First you need to split your data to chr and pos. chr.pos <- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, "\\.")) Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent. > > x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like chr16 3423543 chr3 403992 chr18 3494202 then you can do x <- GRanges(y[,1], IRanges(start = y[,2], width = 1)) see ?GRanges for more info. As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package. > > I 'm lookin forward to obtain a table where my "coordinate-based Id" > appears in a column, and the gene_name and if possible, the > Entrez/Ensembl Ids in two other columns : > > Coordinate Gene_name Entrez_ID Ensembl_ID > > Is it easy to do this in R? Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts. You could do something like ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE) ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id then you could use unique Gene IDs thusly my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") That should at least give you a start. See where you can go on your own, and let us know if you get stuck. Best, Jim > > I'm still really new to R and I lack many concepts you may consider > basic... I'm awfully sorry > > Best > > JL > > 2012/11/8 James W. MacDonald <jmacdon at="" uw.edu="" <mailto:jmacdon="" at="" uw.edu="">> > > Hi Jose, > > > On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: > > Dear Bioconductor list, > > I write you this email asking for a Bioconductor module that > allows me to > annotate genomic coordinates and get different GeneIds. > I'll show you an example of what I'm referring to: > > I have this data: > Chromosome coordinate > chr17 31246506 > > > It depends on what that coordinate is. Is it the start of a > transcript? A SNP? Do you really just have a single coordinate, or > do you have a range? What species are we talking about here? > > Depending on what your data are, you might want to use either one > of the TxDb packages, or a SNPlocs package. There really isn't > much to go on here. If I assume this is a coordinate that one > might think is within an exon, and if I further assume you are > working with H. sapiens, I could suggest something like > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > > x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > ex[ex %in% x] > > or maybe more appropriately > > names(ex)[ex %in% x] > > which will give you the Gene ID, and you can go from there using > the org.Hs.eg.db package. > > If however, your coordinate isn't in an exon, but might be in a > UTR, you can look at ?exonsBy to see what other sequences you can > extract to compare with. > > If these are SNPs, then you can look at the help pages for the > relevant SNPlocs package. > > Best, > > Jim > > > > which can also be written this way by the program that yielded > the result: > chr17.31246506 > > And I need to convert this data into a gene name and known > gene Ids, such > as: > > Gene name Entrez_ID Ensembl_ID > > Tff3 NM_011575 20050 > Can you please advice me about a module able to perform this > ID conversion > using a list of "chr17.31246506" type coordinates as input? > > Thanks in advance > > With best wishes > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 11.4 years ago James W. MacDonald 65k

0

Entering edit mode

On the elementary end of all this... If the sites are on a *file*, one per line, you could do y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) etc. On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: > Hi Jose, > > On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: >> Dear James, >> >> To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format: >> >> chr17.31246506 >> >> How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended? > > First you need to split your data to chr and pos. > > chr.pos <- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, "\\.")) > > Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent. > >> >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like > > chr16 3423543 > chr3 403992 > chr18 3494202 > > then you can do > > x <- GRanges(y[,1], IRanges(start = y[,2], width = 1)) > > see ?GRanges for more info. > > As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package. > >> >> I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the Entrez/Ensembl Ids in two other columns : >> >> Coordinate Gene_name Entrez_ID Ensembl_ID >> >> Is it easy to do this in R? > > Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D > > But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts. > > You could do something like > > ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE) > ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id > > then you could use unique Gene IDs thusly > > my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") > > That should at least give you a start. See where you can go on your own, and let us know if you get stuck. > > Best, > > Jim > > > >> >> I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry >> >> Best >> >> JL >> >> 2012/11/8 James W. MacDonald <jmacdon at="" uw.edu="" <mailto:jmacdon="" at="" uw.edu="">> >> >> Hi Jose, >> >> >> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: >> >> Dear Bioconductor list, >> >> I write you this email asking for a Bioconductor module that >> allows me to >> annotate genomic coordinates and get different GeneIds. >> I'll show you an example of what I'm referring to: >> >> I have this data: >> Chromosome coordinate >> chr17 31246506 >> >> >> It depends on what that coordinate is. Is it the start of a >> transcript? A SNP? Do you really just have a single coordinate, or >> do you have a range? What species are we talking about here? >> >> Depending on what your data are, you might want to use either one >> of the TxDb packages, or a SNPlocs package. There really isn't >> much to go on here. If I assume this is a coordinate that one >> might think is within an exon, and if I further assume you are >> working with H. sapiens, I could suggest something like >> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") >> >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >> >> ex[ex %in% x] >> >> or maybe more appropriately >> >> names(ex)[ex %in% x] >> >> which will give you the Gene ID, and you can go from there using >> the org.Hs.eg.db package. >> >> If however, your coordinate isn't in an exon, but might be in a >> UTR, you can look at ?exonsBy to see what other sequences you can >> extract to compare with. >> >> If these are SNPs, then you can look at the help pages for the >> relevant SNPlocs package. >> >> Best, >> >> Jim >> >> >> >> which can also be written this way by the program that yielded >> the result: >> chr17.31246506 >> >> And I need to convert this data into a gene name and known >> gene Ids, such >> as: >> >> Gene name Entrez_ID Ensembl_ID >> >> Tff3 NM_011575 20050 >> Can you please advice me about a module able to perform this >> ID conversion >> using a list of "chr17.31246506" type coordinates as input? >> >> Thanks in advance >> >> With best wishes >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> -- James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> >> >> >> >> -- >> -- >> Dr. Jos? Luis Lav?n Trueba >> >> Dpto. de Producci?n Agraria >> Grupo de Gen?tica y Microbiolog?a >> Universidad P?blica de Navarra >> 31006 Pamplona >> Navarra >> SPAIN > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.4 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

Hello all, First of all I want to thank everybody that gave me advice on this subject. trying to follow the advice, I added some modifications mixing codes from Tim, Harris and James , but it seems I got lost somewhere in between... I'm sorry for bothering you all again, but I'm afraid I need some more help. I need to read my ids.txt file and then iterate use each line of id (chr.position) to perform the annotation process with it. I thought of a "for" loop to achieve it, but I do not seem to catch the essence of R processes and I guess I miss my tryout. Please have a look at my "disaster" and help me with it If such thing is possible... biocLite('Mus.musculus') require(Mus.musculus) require(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') egid <- names(txdb) name <- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) length(name) == length(egid) ## TRUE, OK to use esbl <- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) length(esbl) == length(egid) ## FALSE, do not use #read input table coordTable <- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is =TRUE) for(i in 1:length(coordTable)) { probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), strand='*') } genome(probes) <- 'mm9' ## prevents some stoopid mistakes geneBodyProbes <- findOverlaps(probes, txdb) geneBodyProbes write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") ## Hits of length 1 ## queryLength: 1 ## subjectLength: 21761 ## queryHits subjectHits ## <integer> <integer> ## 1 1 1641 ## name[subjectHits(geneBodyProbes)] ## 11307 # egid ## "Abcg1" # name ## org.Mm.egCHRLOC[['11307']] ## 17 ## 31057694 ## org.Mm.egENSEMBL[['11307']] ## [1] "ENSMUSG00000024030" promotersByGene <- flank(txdb, 1500) # or however many bases you want promotersByGene <- flank(promotersByGene, 200, FALSE) # a little extra promoterProbes <- findOverlaps(probes, promotersByGene) promoterProbes write.table (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep ="\t") ## Hits of length 0 ## queryLength: 1 ## subjectLength: 21761 ## ## read the GRanges and GenomicFeatures vignettes for more write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") *Thanks in advance* JL 2012/11/8 Harris A. Jaffee <hj@jhu.edu> > On the elementary end of all this... > > If the sites are on a *file*, one per line, you could do > > y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) > > etc. > > On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: > > > Hi Jose, > > > > On 11/8/2012 10:28 AM, José Luis Lavín wrote: > >> Dear James, > >> > >> To answer your questions swiftly, the features are methylation sites > (that's why I only have a coordinate instead of having a Start/End pair) in > mouse (mm9) genome and I have a list of those in the following format: > >> > >> chr17.31246506 > >> > >> How could I read the list so that I can input the "chr" and the > "coordinate" parameters into the expression you recommended? > > > > First you need to split your data to chr and pos. > > > > chr.pos <- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, > "\\.")) > > > > Note that your vector of chr.pos data may have been in as a factor, so > you may need to wrap with an as.character(). If you don't know what I mean > by that, you should first read 'An Introduction to R' ( > http://cran.r-project.org/doc/manuals/R-intro.html). That will be time > well spent. > > > >> > >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > > > Both the seqnames and ranges argument to GRanges() can be based on > vectors. So if you had a matrix (called 'y') like > > > > chr16 3423543 > > chr3 403992 > > chr18 3494202 > > > > then you can do > > > > x <- GRanges(y[,1], IRanges(start = y[,2], width = 1)) > > > > see ?GRanges for more info. > > > > As a side note, IIRC, methylation sites are not in general found in > exons, but are more likely to be found upstream of a given gene. You might > then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy > after loading the GenomicFeatures package. > > > >> > >> I 'm lookin forward to obtain a table where my "coordinate-based Id" > appears in a column, and the gene_name and if possible, the Entrez/Ensembl > Ids in two other columns : > >> > >> Coordinate Gene_name Entrez_ID Ensembl_ID > >> > >> Is it easy to do this in R? > > > > Of course! Everything is easy to do in R. You should see my sweet R > toast 'n coffee maker ;-D > > > > But you should note that the fiveUTRByTranscript() function is > transcript based (obvs), and so you will have multiple transcripts per > gene. This makes things much more difficult, as a given methylation site > may overlap multiple transcripts of the same gene. So that makes it hard to > merge your original data with the overlapping transcripts. > > > > You could do something like > > > > ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names > = TRUE) > > ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id > > > > then you could use unique Gene IDs thusly > > > > my.trans <- select(Mus.musculus, unique(as.character(ex2)), > c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") > > > > That should at least give you a start. See where you can go on your own, > and let us know if you get stuck. > > > > Best, > > > > Jim > > > > > > > >> > >> I'm still really new to R and I lack many concepts you may consider > basic... I'm awfully sorry > >> > >> Best > >> > >> JL > >> > >> 2012/11/8 James W. MacDonald <jmacdon@uw.edu <mailto:jmacdon@uw.edu="">> > >> > >> Hi Jose, > >> > >> > >> On 11/8/2012 8:19 AM, José Luis Lavín wrote: > >> > >> Dear Bioconductor list, > >> > >> I write you this email asking for a Bioconductor module that > >> allows me to > >> annotate genomic coordinates and get different GeneIds. > >> I'll show you an example of what I'm referring to: > >> > >> I have this data: > >> Chromosome coordinate > >> chr17 31246506 > >> > >> > >> It depends on what that coordinate is. Is it the start of a > >> transcript? A SNP? Do you really just have a single coordinate, or > >> do you have a range? What species are we talking about here? > >> > >> Depending on what your data are, you might want to use either one > >> of the TxDb packages, or a SNPlocs package. There really isn't > >> much to go on here. If I assume this is a coordinate that one > >> might think is within an exon, and if I further assume you are > >> working with H. sapiens, I could suggest something like > >> > >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > >> > >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > >> > >> ex[ex %in% x] > >> > >> or maybe more appropriately > >> > >> names(ex)[ex %in% x] > >> > >> which will give you the Gene ID, and you can go from there using > >> the org.Hs.eg.db package. > >> > >> If however, your coordinate isn't in an exon, but might be in a > >> UTR, you can look at ?exonsBy to see what other sequences you can > >> extract to compare with. > >> > >> If these are SNPs, then you can look at the help pages for the > >> relevant SNPlocs package. > >> > >> Best, > >> > >> Jim > >> > >> > >> > >> which can also be written this way by the program that yielded > >> the result: > >> chr17.31246506 > >> > >> And I need to convert this data into a gene name and known > >> gene Ids, such > >> as: > >> > >> Gene name Entrez_ID Ensembl_ID > >> > >> Tff3 NM_011575 20050 > >> Can you please advice me about a module able to perform this > >> ID conversion > >> using a list of "chr17.31246506" type coordinates as input? > >> > >> Thanks in advance > >> > >> With best wishes > >> > >> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> > http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> -- James W. MacDonald, M.S. > >> Biostatistician > >> University of Washington > >> Environmental and Occupational Health Sciences > >> 4225 Roosevelt Way NE, # 100 > >> Seattle WA 98105-6099 > >> > >> > >> > >> > >> -- > >> -- > >> Dr. José Luis Lavín Trueba > >> > >> Dpto. de Producción Agraria > >> Grupo de Genética y Microbiología > >> Universidad Pública de Navarra > >> 31006 Pamplona > >> Navarra > >> SPAIN > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi Jose, Here is a slightly different approach. library(Mus.musculus) library(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene ## I assume you've figured out how to read your data into a GRanges. ## Here we use a small example. gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) ## The locateVariants() function in the VariantAnnotation package will ## give you the GENEIDs for all ranges in 'query'. If the range does not ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and ## FOLLOWID are provided. The LOCATION column indicates what ## region the range fell in. See ?locateVariants for details on the ## different regions and the ability to set 'upstream' and 'downstream' ## values for promoter regions. library(VariantAnnotation) loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > loc GRanges with 1 range and 7 metadata columns: seqnames ranges strand | LOCATION QUERYID TXID <rle> <iranges> <rle> | <factor> <integer> <integer> [1] chr17 [31245606, 31245625] * | coding 1 47716 CDSID GENEID PRECEDEID FOLLOWID <integer> <character> <character> <character> [1] 184246 11307 <na> <na> ## Use the select() function to map these GENEID's to the other ## values you are interested in. The GENEID's from locateVariants() ## are Entrez ID's so we use those as our 'keytype'. entrezid <- loc$GENEID select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", "ENSEMBL")) > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", "ENSEMBL")) ENTREZID ENSEMBL 1 11307 ENSMUSG00000024030 GENENAME 1 ATP-binding cassette, sub-family G (WHITE), member 1 Valerie On 11/12/12 06:59, Jos? Luis Lav?n wrote: > Hello all, > > First of all I want to thank everybody that gave me advice on this subject. > trying to follow the advice, I added some modifications mixing codes from > Tim, Harris and James , but it seems I got lost somewhere in between... > I'm sorry for bothering you all again, but I'm afraid I need some more help. > > I need to read my ids.txt file and then iterate use each line of id > (chr.position) to perform the annotation process with it. I thought of a > "for" loop to achieve it, but I do not seem to catch the essence of R > processes and I guess I miss my tryout. > Please have a look at my "disaster" and help me with it If such thing is > possible... > > biocLite('Mus.musculus') > require(Mus.musculus) > require(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') > egid<- names(txdb) > name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) > length(name) == length(egid) ## TRUE, OK to use > esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) > length(esbl) == length(egid) ## FALSE, do not use > > #read input table > coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is > =TRUE) > > for(i in 1:length(coordTable)) > { > probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), > strand='*') > } > > genome(probes)<- 'mm9' ## prevents some stoopid mistakes > > geneBodyProbes<- findOverlaps(probes, txdb) > geneBodyProbes > > write.table > (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > ## Hits of length 1 > ## queryLength: 1 > ## subjectLength: 21761 > ## queryHits subjectHits > ##<integer> <integer> > ## 1 1 1641 > ## > > name[subjectHits(geneBodyProbes)] > > ## 11307 # egid > ## "Abcg1" # name > ## > > org.Mm.egCHRLOC[['11307']] > > ## 17 > ## 31057694 > ## > > org.Mm.egENSEMBL[['11307']] > > ## [1] "ENSMUSG00000024030" > > promotersByGene<- flank(txdb, 1500) # or however many bases you want > promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra > promoterProbes<- findOverlaps(probes, promotersByGene) > promoterProbes > > write.table > (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,s ep="\t") > > ## Hits of length 0 > ## queryLength: 1 > ## subjectLength: 21761 > ## > ## read the GRanges and GenomicFeatures vignettes for more > > write.table > (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > > *Thanks in advance* > > JL > > 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu=""> > >> On the elementary end of all this... >> >> If the sites are on a *file*, one per line, you could do >> >> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >> >> etc. >> >> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >> >>> Hi Jose, >>> >>> On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: >>>> Dear James, >>>> >>>> To answer your questions swiftly, the features are methylation sites >> (that's why I only have a coordinate instead of having a Start/End pair) in >> mouse (mm9) genome and I have a list of those in the following format: >>>> chr17.31246506 >>>> >>>> How could I read the list so that I can input the "chr" and the >> "coordinate" parameters into the expression you recommended? >>> First you need to split your data to chr and pos. >>> >>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >> "\\.")) >>> Note that your vector of chr.pos data may have been in as a factor, so >> you may need to wrap with an as.character(). If you don't know what I mean >> by that, you should first read 'An Introduction to R' ( >> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time >> well spent. >>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>> Both the seqnames and ranges argument to GRanges() can be based on >> vectors. So if you had a matrix (called 'y') like >>> chr16 3423543 >>> chr3 403992 >>> chr18 3494202 >>> >>> then you can do >>> >>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>> >>> see ?GRanges for more info. >>> >>> As a side note, IIRC, methylation sites are not in general found in >> exons, but are more likely to be found upstream of a given gene. You might >> then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy >> after loading the GenomicFeatures package. >>>> I 'm lookin forward to obtain a table where my "coordinate-based Id" >> appears in a column, and the gene_name and if possible, the Entrez/Ensembl >> Ids in two other columns : >>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>> >>>> Is it easy to do this in R? >>> Of course! Everything is easy to do in R. You should see my sweet R >> toast 'n coffee maker ;-D >>> But you should note that the fiveUTRByTranscript() function is >> transcript based (obvs), and so you will have multiple transcripts per >> gene. This makes things much more difficult, as a given methylation site >> may overlap multiple transcripts of the same gene. So that makes it hard to >> merge your original data with the overlapping transcripts. >>> You could do something like >>> >>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names >> = TRUE) >>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id >>> >>> then you could use unique Gene IDs thusly >>> >>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") >>> That should at least give you a start. See where you can go on your own, >> and let us know if you get stuck. >>> Best, >>> >>> Jim >>> >>> >>> >>>> I'm still really new to R and I lack many concepts you may consider >> basic... I'm awfully sorry >>>> Best >>>> >>>> JL >>>> >>>> 2012/11/8 James W. MacDonald<jmacdon at="" uw.edu<mailto:jmacdon="" at="" uw.edu="">> >>>> >>>> Hi Jose, >>>> >>>> >>>> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: >>>> >>>> Dear Bioconductor list, >>>> >>>> I write you this email asking for a Bioconductor module that >>>> allows me to >>>> annotate genomic coordinates and get different GeneIds. >>>> I'll show you an example of what I'm referring to: >>>> >>>> I have this data: >>>> Chromosome coordinate >>>> chr17 31246506 >>>> >>>> >>>> It depends on what that coordinate is. Is it the start of a >>>> transcript? A SNP? Do you really just have a single coordinate, or >>>> do you have a range? What species are we talking about here? >>>> >>>> Depending on what your data are, you might want to use either one >>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>> much to go on here. If I assume this is a coordinate that one >>>> might think is within an exon, and if I further assume you are >>>> working with H. sapiens, I could suggest something like >>>> >>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") >>>> >>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>> >>>> ex[ex %in% x] >>>> >>>> or maybe more appropriately >>>> >>>> names(ex)[ex %in% x] >>>> >>>> which will give you the Gene ID, and you can go from there using >>>> the org.Hs.eg.db package. >>>> >>>> If however, your coordinate isn't in an exon, but might be in a >>>> UTR, you can look at ?exonsBy to see what other sequences you can >>>> extract to compare with. >>>> >>>> If these are SNPs, then you can look at the help pages for the >>>> relevant SNPlocs package. >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> which can also be written this way by the program that yielded >>>> the result: >>>> chr17.31246506 >>>> >>>> And I need to convert this data into a gene name and known >>>> gene Ids, such >>>> as: >>>> >>>> Gene name Entrez_ID Ensembl_ID >>>> >>>> Tff3 NM_011575 20050 >>>> Can you please advice me about a module able to perform this >>>> ID conversion >>>> using a list of "chr17.31246506" type coordinates as input? >>>> >>>> Thanks in advance >>>> >>>> With best wishes >>>> >>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> -- James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr. Jos? Luis Lav?n Trueba >>>> >>>> Dpto. de Producci?n Agraria >>>> Grupo de Gen?tica y Microbiolog?a >>>> Universidad P?blica de Navarra >>>> 31006 Pamplona >>>> Navarra >>>> SPAIN >>> -- >>> James W. MacDonald, M.S. >>> Biostatistician >>> University of Washington >>> Environmental and Occupational Health Sciences >>> 4225 Roosevelt Way NE, # 100 >>> Seattle WA 98105-6099 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi all, On 11/12/2012 10:24 AM, Valerie Obenchain wrote: > Hi Jose, > > Here is a slightly different approach. > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > > ## I assume you've figured out how to read your data into a GRanges. > ## Here we use a small example. > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) > > ## The locateVariants() function in the VariantAnnotation package will > ## give you the GENEIDs for all ranges in 'query'. If the range does not > ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and > ## FOLLOWID are provided. The LOCATION column indicates what > ## region the range fell in. See ?locateVariants for details on the > ## different regions and the ability to set 'upstream' and 'downstream' > ## values for promoter regions. > library(VariantAnnotation) > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > > > loc > GRanges with 1 range and 7 metadata columns: > seqnames ranges strand | LOCATION QUERYID TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr17 [31245606, 31245625] * | coding 1 47716 > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] 184246 11307 <na> <na> Nice! So IIUC locateVariants() is not intrinsically for variants but seems to be generally useful for "locating" any set of genomic positions. Shouldn't we have this (or something similar) in GenomicFeatures (probably with a different name)? Cheers, H. > > > ## Use the select() function to map these GENEID's to the other > ## values you are interested in. The GENEID's from locateVariants() > ## are Entrez ID's so we use those as our 'keytype'. > entrezid <- loc$GENEID > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > > > Valerie > > > > On 11/12/12 06:59, Jos? Luis Lav?n wrote: >> Hello all, >> >> First of all I want to thank everybody that gave me advice on this >> subject. >> trying to follow the advice, I added some modifications mixing codes from >> Tim, Harris and James , but it seems I got lost somewhere in between... >> I'm sorry for bothering you all again, but I'm afraid I need some more >> help. >> >> I need to read my ids.txt file and then iterate use each line of id >> (chr.position) to perform the annotation process with it. I thought of a >> "for" loop to achieve it, but I do not seem to catch the essence of R >> processes and I guess I miss my tryout. >> Please have a look at my "disaster" and help me with it If such thing is >> possible... >> >> biocLite('Mus.musculus') >> require(Mus.musculus) >> require(TxDb.Mmusculus.UCSC.mm9.knownGene) >> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') >> egid<- names(txdb) >> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >> length(name) == length(egid) ## TRUE, OK to use >> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >> length(esbl) == length(egid) ## FALSE, do not use >> >> #read input table >> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, >> as.is >> =TRUE) >> >> for(i in 1:length(coordTable)) >> { >> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), >> strand='*') >> } >> >> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >> >> geneBodyProbes<- findOverlaps(probes, txdb) >> geneBodyProbes >> >> write.table >> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") >> >> ## Hits of length 1 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## queryHits subjectHits >> ##<integer> <integer> >> ## 1 1 1641 >> ## >> >> name[subjectHits(geneBodyProbes)] >> >> ## 11307 # egid >> ## "Abcg1" # name >> ## >> >> org.Mm.egCHRLOC[['11307']] >> >> ## 17 >> ## 31057694 >> ## >> >> org.Mm.egENSEMBL[['11307']] >> >> ## [1] "ENSMUSG00000024030" >> >> promotersByGene<- flank(txdb, 1500) # or however many bases you want >> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra >> promoterProbes<- findOverlaps(probes, promotersByGene) >> promoterProbes >> >> write.table >> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE, sep="\t") >> >> >> ## Hits of length 0 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## >> ## read the GRanges and GenomicFeatures vignettes for more >> >> write.table >> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") >> >> >> *Thanks in advance* >> >> JL >> >> 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu=""> >> >>> On the elementary end of all this... >>> >>> If the sites are on a *file*, one per line, you could do >>> >>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >>> >>> etc. >>> >>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >>> >>>> Hi Jose, >>>> >>>> On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: >>>>> Dear James, >>>>> >>>>> To answer your questions swiftly, the features are methylation sites >>> (that's why I only have a coordinate instead of having a Start/End >>> pair) in >>> mouse (mm9) genome and I have a list of those in the following format: >>>>> chr17.31246506 >>>>> >>>>> How could I read the list so that I can input the "chr" and the >>> "coordinate" parameters into the expression you recommended? >>>> First you need to split your data to chr and pos. >>>> >>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >>> "\\.")) >>>> Note that your vector of chr.pos data may have been in as a factor, so >>> you may need to wrap with an as.character(). If you don't know what I >>> mean >>> by that, you should first read 'An Introduction to R' ( >>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time >>> well spent. >>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>> Both the seqnames and ranges argument to GRanges() can be based on >>> vectors. So if you had a matrix (called 'y') like >>>> chr16 3423543 >>>> chr3 403992 >>>> chr18 3494202 >>>> >>>> then you can do >>>> >>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>>> >>>> see ?GRanges for more info. >>>> >>>> As a side note, IIRC, methylation sites are not in general found in >>> exons, but are more likely to be found upstream of a given gene. You >>> might >>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See >>> ?exonsBy >>> after loading the GenomicFeatures package. >>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id" >>> appears in a column, and the gene_name and if possible, the >>> Entrez/Ensembl >>> Ids in two other columns : >>>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>>> >>>>> Is it easy to do this in R? >>>> Of course! Everything is easy to do in R. You should see my sweet R >>> toast 'n coffee maker ;-D >>>> But you should note that the fiveUTRByTranscript() function is >>> transcript based (obvs), and so you will have multiple transcripts per >>> gene. This makes things much more difficult, as a given methylation site >>> may overlap multiple transcripts of the same gene. So that makes it >>> hard to >>> merge your original data with the overlapping transcripts. >>>> You could do something like >>>> >>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names >>> = TRUE) >>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id >>>> >>>> then you could use unique Gene IDs thusly >>>> >>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") >>>> That should at least give you a start. See where you can go on your >>>> own, >>> and let us know if you get stuck. >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>>> I'm still really new to R and I lack many concepts you may consider >>> basic... I'm awfully sorry >>>>> Best >>>>> >>>>> JL >>>>> >>>>> 2012/11/8 James W. MacDonald<jmacdon at="" uw.edu<mailto:jmacdon="" at="" uw.edu="">> >>>>> >>>>> Hi Jose, >>>>> >>>>> >>>>> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: >>>>> >>>>> Dear Bioconductor list, >>>>> >>>>> I write you this email asking for a Bioconductor module that >>>>> allows me to >>>>> annotate genomic coordinates and get different GeneIds. >>>>> I'll show you an example of what I'm referring to: >>>>> >>>>> I have this data: >>>>> Chromosome coordinate >>>>> chr17 31246506 >>>>> >>>>> >>>>> It depends on what that coordinate is. Is it the start of a >>>>> transcript? A SNP? Do you really just have a single coordinate, or >>>>> do you have a range? What species are we talking about here? >>>>> >>>>> Depending on what your data are, you might want to use either one >>>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>>> much to go on here. If I assume this is a coordinate that one >>>>> might think is within an exon, and if I further assume you are >>>>> working with H. sapiens, I could suggest something like >>>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") >>>>> >>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>> >>>>> ex[ex %in% x] >>>>> >>>>> or maybe more appropriately >>>>> >>>>> names(ex)[ex %in% x] >>>>> >>>>> which will give you the Gene ID, and you can go from there using >>>>> the org.Hs.eg.db package. >>>>> >>>>> If however, your coordinate isn't in an exon, but might be in a >>>>> UTR, you can look at ?exonsBy to see what other sequences you can >>>>> extract to compare with. >>>>> >>>>> If these are SNPs, then you can look at the help pages for the >>>>> relevant SNPlocs package. >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> which can also be written this way by the program that yielded >>>>> the result: >>>>> chr17.31246506 >>>>> >>>>> And I need to convert this data into a gene name and known >>>>> gene Ids, such >>>>> as: >>>>> >>>>> Gene name Entrez_ID Ensembl_ID >>>>> >>>>> Tff3 NM_011575 20050 >>>>> Can you please advice me about a module able to perform this >>>>> ID conversion >>>>> using a list of "chr17.31246506" type coordinates as input? >>>>> >>>>> Thanks in advance >>>>> >>>>> With best wishes >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> -- James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr. Jos? Luis Lav?n Trueba >>>>> >>>>> Dpto. de Producci?n Agraria >>>>> Grupo de Gen?tica y Microbiolog?a >>>>> Universidad P?blica de Navarra >>>>> 31006 Pamplona >>>>> Navarra >>>>> SPAIN >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, On 11/13/12 17:32, Hervé Pagès wrote: > Hi all, > > On 11/12/2012 10:24 AM, Valerie Obenchain wrote: >> Hi Jose, >> >> Here is a slightly different approach. >> >> library(Mus.musculus) >> library(TxDb.Mmusculus.UCSC.mm9.knownGene) >> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene >> >> ## I assume you've figured out how to read your data into a GRanges. >> ## Here we use a small example. >> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) >> >> ## The locateVariants() function in the VariantAnnotation package will >> ## give you the GENEIDs for all ranges in 'query'. If the range does not >> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and >> ## FOLLOWID are provided. The LOCATION column indicates what >> ## region the range fell in. See ?locateVariants for details on the >> ## different regions and the ability to set 'upstream' and 'downstream' >> ## values for promoter regions. >> library(VariantAnnotation) >> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) >> >> > loc >> GRanges with 1 range and 7 metadata columns: >> seqnames ranges strand | LOCATION QUERYID >> TXID >> <rle> <iranges> <rle> | <factor> <integer> <integer> >> [1] chr17 [31245606, 31245625] * | coding 1 >> 47716 >> CDSID GENEID PRECEDEID FOLLOWID >> <integer> <character> <character> <character> >> [1] 184246 11307 <na> <na> > > Nice! So IIUC locateVariants() is not intrinsically for variants but > seems to be generally useful for "locating" any set of genomic > positions. Shouldn't we have this (or something similar) in > GenomicFeatures (probably with a different name)? Yes, locateVariants() locates any range, not just variants. The name choice was probably not the best. I like the idea of possibly renaming and moving to GenomicRanges. Val > > Cheers, > H. > >> >> >> ## Use the select() function to map these GENEID's to the other >> ## values you are interested in. The GENEID's from locateVariants() >> ## are Entrez ID's so we use those as our 'keytype'. >> entrezid <- loc$GENEID >> select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> ENTREZID ENSEMBL >> 1 11307 ENSMUSG00000024030 >> GENENAME >> 1 ATP-binding cassette, sub-family G (WHITE), member 1 >> >> >> Valerie >> >> >> >> On 11/12/12 06:59, Jos? Luis Lav?n wrote: >>> Hello all, >>> >>> First of all I want to thank everybody that gave me advice on this >>> subject. >>> trying to follow the advice, I added some modifications mixing codes >>> from >>> Tim, Harris and James , but it seems I got lost somewhere in >>> between... >>> I'm sorry for bothering you all again, but I'm afraid I need some more >>> help. >>> >>> I need to read my ids.txt file and then iterate use each line of id >>> (chr.position) to perform the annotation process with it. I thought >>> of a >>> "for" loop to achieve it, but I do not seem to catch the essence of R >>> processes and I guess I miss my tryout. >>> Please have a look at my "disaster" and help me with it If such >>> thing is >>> possible... >>> >>> biocLite('Mus.musculus') >>> require(Mus.musculus) >>> require(TxDb.Mmusculus.UCSC.mm9.knownGene) >>> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') >>> egid<- names(txdb) >>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >>> length(name) == length(egid) ## TRUE, OK to use >>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >>> length(esbl) == length(egid) ## FALSE, do not use >>> >>> #read input table >>> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, >>> as.is >>> =TRUE) >>> >>> for(i in 1:length(coordTable)) >>> { >>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), >>> strand='*') >>> } >>> >>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >>> >>> geneBodyProbes<- findOverlaps(probes, txdb) >>> geneBodyProbes >>> >>> write.table >>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") >>> >>> ## Hits of length 1 >>> ## queryLength: 1 >>> ## subjectLength: 21761 >>> ## queryHits subjectHits >>> ##<integer> <integer> >>> ## 1 1 1641 >>> ## >>> >>> name[subjectHits(geneBodyProbes)] >>> >>> ## 11307 # egid >>> ## "Abcg1" # name >>> ## >>> >>> org.Mm.egCHRLOC[['11307']] >>> >>> ## 17 >>> ## 31057694 >>> ## >>> >>> org.Mm.egENSEMBL[['11307']] >>> >>> ## [1] "ENSMUSG00000024030" >>> >>> promotersByGene<- flank(txdb, 1500) # or however many bases you want >>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra >>> promoterProbes<- findOverlaps(probes, promotersByGene) >>> promoterProbes >>> >>> write.table >>> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE ,sep="\t") >>> >>> >>> >>> ## Hits of length 0 >>> ## queryLength: 1 >>> ## subjectLength: 21761 >>> ## >>> ## read the GRanges and GenomicFeatures vignettes for more >>> >>> write.table >>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") >>> >>> >>> *Thanks in advance* >>> >>> JL >>> >>> 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu=""> >>> >>>> On the elementary end of all this... >>>> >>>> If the sites are on a *file*, one per line, you could do >>>> >>>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >>>> >>>> etc. >>>> >>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >>>> >>>>> Hi Jose, >>>>> >>>>> On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: >>>>>> Dear James, >>>>>> >>>>>> To answer your questions swiftly, the features are methylation sites >>>> (that's why I only have a coordinate instead of having a Start/End >>>> pair) in >>>> mouse (mm9) genome and I have a list of those in the following format: >>>>>> chr17.31246506 >>>>>> >>>>>> How could I read the list so that I can input the "chr" and the >>>> "coordinate" parameters into the expression you recommended? >>>>> First you need to split your data to chr and pos. >>>>> >>>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >>>> "\\.")) >>>>> Note that your vector of chr.pos data may have been in as a >>>>> factor, so >>>> you may need to wrap with an as.character(). If you don't know what I >>>> mean >>>> by that, you should first read 'An Introduction to R' ( >>>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time >>>> well spent. >>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>> Both the seqnames and ranges argument to GRanges() can be based on >>>> vectors. So if you had a matrix (called 'y') like >>>>> chr16 3423543 >>>>> chr3 403992 >>>>> chr18 3494202 >>>>> >>>>> then you can do >>>>> >>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>>>> >>>>> see ?GRanges for more info. >>>>> >>>>> As a side note, IIRC, methylation sites are not in general found in >>>> exons, but are more likely to be found upstream of a given gene. You >>>> might >>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See >>>> ?exonsBy >>>> after loading the GenomicFeatures package. >>>>>> I 'm lookin forward to obtain a table where my "coordinate- based Id" >>>> appears in a column, and the gene_name and if possible, the >>>> Entrez/Ensembl >>>> Ids in two other columns : >>>>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>>>> >>>>>> Is it easy to do this in R? >>>>> Of course! Everything is easy to do in R. You should see my sweet R >>>> toast 'n coffee maker ;-D >>>>> But you should note that the fiveUTRByTranscript() function is >>>> transcript based (obvs), and so you will have multiple transcripts per >>>> gene. This makes things much more difficult, as a given methylation >>>> site >>>> may overlap multiple transcripts of the same gene. So that makes it >>>> hard to >>>> merge your original data with the overlapping transcripts. >>>>> You could do something like >>>>> >>>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, >>>>> use.names >>>> = TRUE) >>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], >>>>> elementMetadata))$exon_id >>>>> >>>>> then you could use unique Gene IDs thusly >>>>> >>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >>>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") >>>>> That should at least give you a start. See where you can go on your >>>>> own, >>>> and let us know if you get stuck. >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>>> I'm still really new to R and I lack many concepts you may consider >>>> basic... I'm awfully sorry >>>>>> Best >>>>>> >>>>>> JL >>>>>> >>>>>> 2012/11/8 James W. MacDonald<jmacdon at="" uw.edu<mailto:jmacdon="" at="" uw.edu="">> >>>>>> >>>>>> Hi Jose, >>>>>> >>>>>> >>>>>> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: >>>>>> >>>>>> Dear Bioconductor list, >>>>>> >>>>>> I write you this email asking for a Bioconductor module that >>>>>> allows me to >>>>>> annotate genomic coordinates and get different GeneIds. >>>>>> I'll show you an example of what I'm referring to: >>>>>> >>>>>> I have this data: >>>>>> Chromosome coordinate >>>>>> chr17 31246506 >>>>>> >>>>>> >>>>>> It depends on what that coordinate is. Is it the start of a >>>>>> transcript? A SNP? Do you really just have a single >>>>>> coordinate, or >>>>>> do you have a range? What species are we talking about here? >>>>>> >>>>>> Depending on what your data are, you might want to use either >>>>>> one >>>>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>>>> much to go on here. If I assume this is a coordinate that one >>>>>> might think is within an exon, and if I further assume you are >>>>>> working with H. sapiens, I could suggest something like >>>>>> >>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") >>>>>> >>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>>> >>>>>> ex[ex %in% x] >>>>>> >>>>>> or maybe more appropriately >>>>>> >>>>>> names(ex)[ex %in% x] >>>>>> >>>>>> which will give you the Gene ID, and you can go from there using >>>>>> the org.Hs.eg.db package. >>>>>> >>>>>> If however, your coordinate isn't in an exon, but might be in a >>>>>> UTR, you can look at ?exonsBy to see what other sequences you >>>>>> can >>>>>> extract to compare with. >>>>>> >>>>>> If these are SNPs, then you can look at the help pages for the >>>>>> relevant SNPlocs package. >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> which can also be written this way by the program that >>>>>> yielded >>>>>> the result: >>>>>> chr17.31246506 >>>>>> >>>>>> And I need to convert this data into a gene name and known >>>>>> gene Ids, such >>>>>> as: >>>>>> >>>>>> Gene name Entrez_ID Ensembl_ID >>>>>> >>>>>> Tff3 NM_011575 20050 >>>>>> Can you please advice me about a module able to perform this >>>>>> ID conversion >>>>>> using a list of "chr17.31246506" type coordinates as input? >>>>>> >>>>>> Thanks in advance >>>>>> >>>>>> With best wishes >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> >>>>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> -- James W. MacDonald, M.S. >>>>>> Biostatistician >>>>>> University of Washington >>>>>> Environmental and Occupational Health Sciences >>>>>> 4225 Roosevelt Way NE, # 100 >>>>>> Seattle WA 98105-6099 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> -- >>>>>> Dr. Jos? Luis Lav?n Trueba >>>>>> >>>>>> Dpto. de Producci?n Agraria >>>>>> Grupo de Gen?tica y Microbiolog?a >>>>>> Universidad P?blica de Navarra >>>>>> 31006 Pamplona >>>>>> Navarra >>>>>> SPAIN >>>>> -- >>>>> James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie, I have an issue with the results of your approach. Now I get a table with the following fields: ENTREZID ENSEMBL GENENAME 1 216285 ENSMUSG00000036602 ALX homeobox 1 But the Identifier I converted to obtain the previously mentioned data is not present in that table (eg. chr10.100837476) , so there's nothing I can do with this results table because I can't correlate my previous chr.location type identifiers to the results of the table. I guess I've done something wrong somewhere in the code I modified so that I missed my "chr.coordinate" ids column. So I'll paste the code here again to see if any of you can tell me where I lost that column. source("http://bioconductor.org/biocLite.R") biocLite('Mus.musculus') biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') biocLite('VariantAnnotation') library(Mus.musculus) library(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene y <- read.delim("/path_to_dir/ids_noM.txt", sep=".", header=FALSE, as.is =TRUE) probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) library(VariantAnnotation) loc <- locateVariants(query=probes, subject=txdb, region=AllVariants()) head(loc) entrezid <- loc$GENEID entrezid2<-select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", "ENSEMBL")) write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = ",", eol = "\n") Thank you again for your advice 2012/11/14 Hervé Pagès <hpages@fhcrc.org> > Hi all, > > > On 11/12/2012 10:24 AM, Valerie Obenchain wrote: > >> Hi Jose, >> >> Here is a slightly different approach. >> >> library(Mus.musculus) >> library(TxDb.Mmusculus.UCSC.**mm9.knownGene) >> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene >> >> ## I assume you've figured out how to read your data into a GRanges. >> ## Here we use a small example. >> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) >> >> ## The locateVariants() function in the VariantAnnotation package will >> ## give you the GENEIDs for all ranges in 'query'. If the range does not >> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and >> ## FOLLOWID are provided. The LOCATION column indicates what >> ## region the range fell in. See ?locateVariants for details on the >> ## different regions and the ability to set 'upstream' and 'downstream' >> ## values for promoter regions. >> library(VariantAnnotation) >> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) >> >> > loc >> GRanges with 1 range and 7 metadata columns: >> seqnames ranges strand | LOCATION QUERYID TXID >> <rle> <iranges> <rle> | <factor> <integer> <integer> >> [1] chr17 [31245606, 31245625] * | coding 1 47716 >> CDSID GENEID PRECEDEID FOLLOWID >> <integer> <character> <character> <character> >> [1] 184246 11307 <na> <na> >> > > Nice! So IIUC locateVariants() is not intrinsically for variants but > seems to be generally useful for "locating" any set of genomic > positions. Shouldn't we have this (or something similar) in > GenomicFeatures (probably with a different name)? > > Cheers, > H. > > > >> >> ## Use the select() function to map these GENEID's to the other >> ## values you are interested in. The GENEID's from locateVariants() >> ## are Entrez ID's so we use those as our 'keytype'. >> entrezid <- loc$GENEID >> select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> ENTREZID ENSEMBL >> 1 11307 ENSMUSG00000024030 >> GENENAME >> 1 ATP-binding cassette, sub-family G (WHITE), member 1 >> >> >> Valerie >> >> >> >> On 11/12/12 06:59, José Luis Lavín wrote: >> >>> Hello all, >>> >>> First of all I want to thank everybody that gave me advice on this >>> subject. >>> trying to follow the advice, I added some modifications mixing codes from >>> Tim, Harris and James , but it seems I got lost somewhere in between... >>> I'm sorry for bothering you all again, but I'm afraid I need some more >>> help. >>> >>> I need to read my ids.txt file and then iterate use each line of id >>> (chr.position) to perform the annotation process with it. I thought of a >>> "for" loop to achieve it, but I do not seem to catch the essence of R >>> processes and I guess I miss my tryout. >>> Please have a look at my "disaster" and help me with it If such thing is >>> possible... >>> >>> biocLite('Mus.musculus') >>> require(Mus.musculus) >>> require(TxDb.Mmusculus.UCSC.**mm9.knownGene) >>> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, 'gene') >>> egid<- names(txdb) >>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >>> length(name) == length(egid) ## TRUE, OK to use >>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >>> length(esbl) == length(egid) ## FALSE, do not use >>> >>> #read input table >>> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".", header=FALSE, >>> as.is >>> =TRUE) >>> >>> for(i in 1:length(coordTable)) >>> { >>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), >>> strand='*') >>> } >>> >>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >>> >>> geneBodyProbes<- findOverlaps(probes, txdb) >>> geneBodyProbes >>> >>> write.table >>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >>> FALSE,sep="\t") >>> >>> ## Hits of length 1 >>> ## queryLength: 1 >>> ## subjectLength: 21761 >>> ## queryHits subjectHits >>> ##<integer> <integer> >>> ## 1 1 1641 >>> ## >>> >>> name[subjectHits(**geneBodyProbes)] >>> >>> ## 11307 # egid >>> ## "Abcg1" # name >>> ## >>> >>> org.Mm.egCHRLOC[['11307']] >>> >>> ## 17 >>> ## 31057694 >>> ## >>> >>> org.Mm.egENSEMBL[['11307']] >>> >>> ## [1] "ENSMUSG00000024030" >>> >>> promotersByGene<- flank(txdb, 1500) # or however many bases you want >>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra >>> promoterProbes<- findOverlaps(probes, promotersByGene) >>> promoterProbes >>> >>> write.table >>> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",** >>> quote=FALSE,sep="\t") >>> >>> >>> ## Hits of length 0 >>> ## queryLength: 1 >>> ## subjectLength: 21761 >>> ## >>> ## read the GRanges and GenomicFeatures vignettes for more >>> >>> write.table >>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >>> FALSE,sep="\t") >>> >>> >>> *Thanks in advance* >>> >>> JL >>> >>> 2012/11/8 Harris A. Jaffee<hj@jhu.edu> >>> >>> On the elementary end of all this... >>>> >>>> If the sites are on a *file*, one per line, you could do >>>> >>>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >>>> >>>> etc. >>>> >>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >>>> >>>> Hi Jose, >>>>> >>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote: >>>>> >>>>>> Dear James, >>>>>> >>>>>> To answer your questions swiftly, the features are methylation sites >>>>>> >>>>> (that's why I only have a coordinate instead of having a Start/End >>>> pair) in >>>> mouse (mm9) genome and I have a list of those in the following format: >>>> >>>>> chr17.31246506 >>>>>> >>>>>> How could I read the list so that I can input the "chr" and the >>>>>> >>>>> "coordinate" parameters into the expression you recommended? >>>> >>>>> First you need to split your data to chr and pos. >>>>> >>>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >>>>> >>>> "\\.")) >>>> >>>>> Note that your vector of chr.pos data may have been in as a factor, so >>>>> >>>> you may need to wrap with an as.character(). If you don't know what I >>>> mean >>>> by that, you should first read 'An Introduction to R' ( >>>> http://cran.r-project.org/doc/**manuals/R-intro.html<http: cran.="" r-project.org="" doc="" manuals="" r-intro.html="">). >>>> That will be time >>>> well spent. >>>> >>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>>> >>>>> Both the seqnames and ranges argument to GRanges() can be based on >>>>> >>>> vectors. So if you had a matrix (called 'y') like >>>> >>>>> chr16 3423543 >>>>> chr3 403992 >>>>> chr18 3494202 >>>>> >>>>> then you can do >>>>> >>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>>>> >>>>> see ?GRanges for more info. >>>>> >>>>> As a side note, IIRC, methylation sites are not in general found in >>>>> >>>> exons, but are more likely to be found upstream of a given gene. You >>>> might >>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See >>>> ?exonsBy >>>> after loading the GenomicFeatures package. >>>> >>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id" >>>>>> >>>>> appears in a column, and the gene_name and if possible, the >>>> Entrez/Ensembl >>>> Ids in two other columns : >>>> >>>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>>>> >>>>>> Is it easy to do this in R? >>>>>> >>>>> Of course! Everything is easy to do in R. You should see my sweet R >>>>> >>>> toast 'n coffee maker ;-D >>>> >>>>> But you should note that the fiveUTRByTranscript() function is >>>>> >>>> transcript based (obvs), and so you will have multiple transcripts per >>>> gene. This makes things much more difficult, as a given methylation site >>>> may overlap multiple transcripts of the same gene. So that makes it >>>> hard to >>>> merge your original data with the overlapping transcripts. >>>> >>>>> You could do something like >>>>> >>>>> ex<- fiveUTRsByTranscript(TxDb.**Mmusculus.UCSC.mm10.knownGene, >>>>> use.names >>>>> >>>> = TRUE) >>>> >>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id >>>>> >>>>> then you could use unique Gene IDs thusly >>>>> >>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >>>>> >>>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID") >>>> >>>>> That should at least give you a start. See where you can go on your >>>>> own, >>>>> >>>> and let us know if you get stuck. >>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> I'm still really new to R and I lack many concepts you may consider >>>>>> >>>>> basic... I'm awfully sorry >>>> >>>>> Best >>>>>> >>>>>> JL >>>>>> >>>>>> 2012/11/8 James W. MacDonald<jmacdon@uw.edu<**mailto:jmacdon@uw.edu>> >>>>>> >>>>>> Hi Jose, >>>>>> >>>>>> >>>>>> On 11/8/2012 8:19 AM, José Luis Lavín wrote: >>>>>> >>>>>> Dear Bioconductor list, >>>>>> >>>>>> I write you this email asking for a Bioconductor module that >>>>>> allows me to >>>>>> annotate genomic coordinates and get different GeneIds. >>>>>> I'll show you an example of what I'm referring to: >>>>>> >>>>>> I have this data: >>>>>> Chromosome coordinate >>>>>> chr17 31246506 >>>>>> >>>>>> >>>>>> It depends on what that coordinate is. Is it the start of a >>>>>> transcript? A SNP? Do you really just have a single coordinate, or >>>>>> do you have a range? What species are we talking about here? >>>>>> >>>>>> Depending on what your data are, you might want to use either one >>>>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>>>> much to go on here. If I assume this is a coordinate that one >>>>>> might think is within an exon, and if I further assume you are >>>>>> working with H. sapiens, I could suggest something like >>>>>> >>>>>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene) >>>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") >>>>>> >>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>>> >>>>>> ex[ex %in% x] >>>>>> >>>>>> or maybe more appropriately >>>>>> >>>>>> names(ex)[ex %in% x] >>>>>> >>>>>> which will give you the Gene ID, and you can go from there using >>>>>> the org.Hs.eg.db package. >>>>>> >>>>>> If however, your coordinate isn't in an exon, but might be in a >>>>>> UTR, you can look at ?exonsBy to see what other sequences you can >>>>>> extract to compare with. >>>>>> >>>>>> If these are SNPs, then you can look at the help pages for the >>>>>> relevant SNPlocs package. >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> which can also be written this way by the program that yielded >>>>>> the result: >>>>>> chr17.31246506 >>>>>> >>>>>> And I need to convert this data into a gene name and known >>>>>> gene Ids, such >>>>>> as: >>>>>> >>>>>> Gene name Entrez_ID Ensembl_ID >>>>>> >>>>>> Tff3 NM_011575 20050 >>>>>> Can you please advice me about a module able to perform this >>>>>> ID conversion >>>>>> using a list of "chr17.31246506" type coordinates as input? >>>>>> >>>>>> Thanks in advance >>>>>> >>>>>> With best wishes >>>>>> >>>>>> >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org<**mailto:Bioconductor@r-project.** >>>>>> org <bioconductor@r-project.org>> >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<ht tps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>> Search the archives: >>>>>> >>>>>> http://news.gmane.org/gmane.**science.biology.informatics.** >>>> conductor<http: news.gmane.org="" gmane.science.biology.informatics="" .conductor=""> >>>> >>>>> >>>>>> -- James W. MacDonald, M.S. >>>>>> Biostatistician >>>>>> University of Washington >>>>>> Environmental and Occupational Health Sciences >>>>>> 4225 Roosevelt Way NE, # 100 >>>>>> Seattle WA 98105-6099 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> -- >>>>>> Dr. José Luis Lavín Trueba >>>>>> >>>>>> Dpto. de Producción Agraria >>>>>> Grupo de Genética y Microbiología >>>>>> Universidad Pública de Navarra >>>>>> 31006 Pamplona >>>>>> Navarra >>>>>> SPAIN >>>>>> >>>>> -- >>>>> James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: >>>>> >>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>>> >>>> >>> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.**conduct or<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hello, Are you familiar with the R help pages? For any function or class object you can type ? followed by the name. ?locateVariants ?select The page for locateVariants explains each of the the metadata columns in the result. One of the columns is 'QUERYID'. This id represents the row in your original query (i.e., the GRanges you entered as the 'query' argument). The page also explains that the result has one row for each range-transcript match. gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606, 11248806), width = 20)) loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) The first range hit an intergenic region and we therefore have PRECEDEID and FOLLOWID. The second range hit a coding region in a single transcript and the third range hit an intron in two different transcripts. Because the third range hit 2 transcripts we have 2 rows of data, each has a unique TXID but the same GENEID. > loc GRanges with 4 ranges and 7 metadata columns: seqnames ranges strand | LOCATION QUERYID TXID <rle> <iranges> <rle> | <factor> <integer> <integer> [1] chr1 [41245606, 41245625] * | intergenic 1 <na> [2] chr17 [31245606, 31245625] * | coding 2 47716 [3] chr16 [11248806, 11248825] * | intron 3 46316 [4] chr16 [11248806, 11248825] * | intron 3 46317 CDSID GENEID PRECEDEID FOLLOWID <integer> <character> <character> <character> [1] <na> <na> 211798 381339 [2] 184246 11307 <na> <na> [3] <na> 14852 <na> <na> [4] <na> 14852 <na> <na> Pull out the unique combinations of GENEID and QUERYID. idx <- loc[ ,c("QUERYID", "GENEID")] idx <- idx[!duplicated(idx)] > idx GRanges with 3 ranges and 2 metadata columns: seqnames ranges strand | QUERYID GENEID <rle> <iranges> <rle> | <integer> <character> [1] chr1 [41245606, 41245625] * | 1 <na> [2] chr17 [31245606, 31245625] * | 2 11307 [3] chr16 [11248806, 11248825] * | 3 14852 Remove NA GENEID's and call select(). geneid <- idx[!is.na(idx$GENEID)] genetable <- select(Mus.musculus, keytype="ENTREZID", keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL")) > genetable ENTREZID ENSEMBL 1 11307 ENSMUSG00000024030 2 14852 ENSMUSG00000062203 GENENAME 1 ATP-binding cassette, sub-family G (WHITE), member 1 2 G1 to S phase transition 1 You can use the QUERYID to map back to an original list of identifiers. If you just need 'chromosome.start' information you can extract that from the ranges in geneid. Note that the QUERYID maps back to the original 'query' but those same ranges are also included in the result. id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="") > data.frame(id, genetable) id ENTREZID ENSEMBL 1 chr17.31245606 11307 ENSMUSG00000024030 2 chr16.11248806 14852 ENSMUSG00000062203 GENENAME 1 ATP-binding cassette, sub-family G (WHITE), member 1 2 G1 to S phase transition 1 Valerie On 11/14/12 01:00, Jos? Luis Lav?n wrote: > Dear Valerie, > > I have an issue with the results of your approach. Now I get a table > with the following fields: > > ENTREZID ENSEMBL GENENAME > > 1 216285 ENSMUSG00000036602 ALX homeobox 1 > > > But the Identifier I converted to obtain the previously mentioned data > is not present in that table (eg. chr10.100837476) , so there's > nothing I can do with this results table because I can't correlate my > previous chr.location type identifiers to the results of the table. > I guess I've done something wrong somewhere in the code I modified so > that I missed my "chr.coordinate" ids column. So I'll paste the code > here again to see if any of you can tell me where I lost that column. > > source("http://bioconductor.org/biocLite.R") > biocLite('Mus.musculus') > biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') > > biocLite('VariantAnnotation') > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > y <- read.delim("/path_to_dir/ids_noM.txt", sep=".", header=FALSE, > as.is <http: as.is="">=TRUE) > probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) > > library(VariantAnnotation) > loc <- locateVariants(query=probes, subject=txdb, region=AllVariants()) > head(loc) > > entrezid <- loc$GENEID > entrezid2<-select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > > write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = > ",", eol = "\n") > > Thank you again for your advice > > > 2012/11/14 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi all, > > > On 11/12/2012 10:24 AM, Valerie Obenchain wrote: > > Hi Jose, > > Here is a slightly different approach. > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > > ## I assume you've figured out how to read your data into a > GRanges. > ## Here we use a small example. > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = > 20)) > > ## The locateVariants() function in the VariantAnnotation > package will > ## give you the GENEIDs for all ranges in 'query'. If the > range does not > ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and > ## FOLLOWID are provided. The LOCATION column indicates what > ## region the range fell in. See ?locateVariants for details > on the > ## different regions and the ability to set 'upstream' and > 'downstream' > ## values for promoter regions. > library(VariantAnnotation) > loc <- locateVariants(query=gr, subject=txdb, > region=AllVariants()) > > > loc > GRanges with 1 range and 7 metadata columns: > seqnames ranges strand | LOCATION > QUERYID TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr17 [31245606, 31245625] * | coding > 1 47716 > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] 184246 11307 <na> <na> > > > Nice! So IIUC locateVariants() is not intrinsically for variants but > seems to be generally useful for "locating" any set of genomic > positions. Shouldn't we have this (or something similar) in > GenomicFeatures (probably with a different name)? > > Cheers, > H. > > > > > ## Use the select() function to map these GENEID's to the other > ## values you are interested in. The GENEID's from > locateVariants() > ## are Entrez ID's so we use those as our 'keytype'. > entrezid <- loc$GENEID > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > > > Valerie > > > > On 11/12/12 06:59, Jos? Luis Lav?n wrote: > > Hello all, > > First of all I want to thank everybody that gave me advice > on this > subject. > trying to follow the advice, I added some modifications > mixing codes from > Tim, Harris and James , but it seems I got lost somewhere > in between... > I'm sorry for bothering you all again, but I'm afraid I > need some more > help. > > I need to read my ids.txt file and then iterate use each > line of id > (chr.position) to perform the annotation process with it. > I thought of a > "for" loop to achieve it, but I do not seem to catch the > essence of R > processes and I guess I miss my tryout. > Please have a look at my "disaster" and help me with it If > such thing is > possible... > > biocLite('Mus.musculus') > require(Mus.musculus) > require(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, > 'gene') > egid<- names(txdb) > name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) > length(name) == length(egid) ## TRUE, OK to use > esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) > length(esbl) == length(egid) ## FALSE, do not use > > #read input table > coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", > header=FALSE, > as.is <http: as.is=""> > =TRUE) > > for(i in 1:length(coordTable)) > { > probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], > width = 1), > strand='*') > } > > genome(probes)<- 'mm9' ## prevents some stoopid mistakes > > geneBodyProbes<- findOverlaps(probes, txdb) > geneBodyProbes > > write.table > (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > ## Hits of length 1 > ## queryLength: 1 > ## subjectLength: 21761 > ## queryHits subjectHits > ##<integer> <integer> > ## 1 1 1641 > ## > > name[subjectHits(geneBodyProbes)] > > ## 11307 # egid > ## "Abcg1" # name > ## > > org.Mm.egCHRLOC[['11307']] > > ## 17 > ## 31057694 > ## > > org.Mm.egENSEMBL[['11307']] > > ## [1] "ENSMUSG00000024030" > > promotersByGene<- flank(txdb, 1500) # or however many > bases you want > promotersByGene<- flank(promotersByGene, 200, FALSE) # a > little extra > promoterProbes<- findOverlaps(probes, promotersByGene) > promoterProbes > > write.table > (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",q uote=FALSE,sep="\t") > > > ## Hits of length 0 > ## queryLength: 1 > ## subjectLength: 21761 > ## > ## read the GRanges and GenomicFeatures vignettes for more > > write.table > (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > > *Thanks in advance* > > JL > > 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu="" <mailto:hj="" at="" jhu.edu="">> > > On the elementary end of all this... > > If the sites are on a *file*, one per line, you could do > > y<- read.delim(filename, sep=".", > header=FALSE, as.is <http: as.is="">=TRUE) > > etc. > > On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: > > Hi Jose, > > On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: > > Dear James, > > To answer your questions swiftly, the features > are methylation sites > > (that's why I only have a coordinate instead of having > a Start/End > pair) in > mouse (mm9) genome and I have a list of those in the > following format: > > chr17.31246506 > > How could I read the list so that I can input > the "chr" and the > > "coordinate" parameters into the expression you > recommended? > > First you need to split your data to chr and pos. > > chr.pos<- do.call("rbind", strsplit(<your vector=""> of chr17.pos data>, > > "\\.")) > > Note that your vector of chr.pos data may have > been in as a factor, so > > you may need to wrap with an as.character(). If you > don't know what I > mean > by that, you should first read 'An Introduction to R' ( > http://cran.r-project.org/doc/manuals/R-intro.html). > That will be time > well spent. > > x<- GRanges(seq = "chr17", IRanges(start = > 31245606, width = 1)) > > Both the seqnames and ranges argument to GRanges() > can be based on > > vectors. So if you had a matrix (called 'y') like > > chr16 3423543 > chr3 403992 > chr18 3494202 > > then you can do > > x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) > > see ?GRanges for more info. > > As a side note, IIRC, methylation sites are not in > general found in > > exons, but are more likely to be found upstream of a > given gene. You > might > then want to use fiveUTRsByTranscript() rather than > exonsBy(). See > ?exonsBy > after loading the GenomicFeatures package. > > I 'm lookin forward to obtain a table where my > "coordinate-based Id" > > appears in a column, and the gene_name and if > possible, the > Entrez/Ensembl > Ids in two other columns : > > Coordinate Gene_name Entrez_ID Ensembl_ID > > Is it easy to do this in R? > > Of course! Everything is easy to do in R. You > should see my sweet R > > toast 'n coffee maker ;-D > > But you should note that the fiveUTRByTranscript() > function is > > transcript based (obvs), and so you will have multiple > transcripts per > gene. This makes things much more difficult, as a > given methylation site > may overlap multiple transcripts of the same gene. So > that makes it > hard to > merge your original data with the overlapping transcripts. > > You could do something like > > ex<- > fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, > use.names > > = TRUE) > > ex2<- do.call("rbind", sapply(ex[ex %in% x], > elementMetadata))$exon_id > > then you could use unique Gene IDs thusly > > my.trans<- select(Mus.musculus, > unique(as.character(ex2)), > > c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") > > That should at least give you a start. See where > you can go on your > own, > > and let us know if you get stuck. > > Best, > > Jim > > > > I'm still really new to R and I lack many > concepts you may consider > > basic... I'm awfully sorry > > Best > > JL > > 2012/11/8 James W. MacDonald<jmacdon at="" uw.edu=""> <mailto:jmacdon at="" uw.edu=""><mailto:jmacdon at="" uw.edu=""> <mailto:jmacdon at="" uw.edu="">>> > > Hi Jose, > > > On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: > > Dear Bioconductor list, > > I write you this email asking for a > Bioconductor module that > allows me to > annotate genomic coordinates and get > different GeneIds. > I'll show you an example of what I'm > referring to: > > I have this data: > Chromosome coordinate > chr17 31246506 > > > It depends on what that coordinate is. Is > it the start of a > transcript? A SNP? Do you really just have > a single coordinate, or > do you have a range? What species are we > talking about here? > > Depending on what your data are, you might > want to use either one > of the TxDb packages, or a SNPlocs > package. There really isn't > much to go on here. If I assume this is a > coordinate that one > might think is within an exon, and if I > further assume you are > working with H. sapiens, I could suggest > something like > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > ex<- > exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > > x<- GRanges(seq = "chr17", IRanges(start = > 31245606, width = 1)) > > ex[ex %in% x] > > or maybe more appropriately > > names(ex)[ex %in% x] > > which will give you the Gene ID, and you > can go from there using > the org.Hs.eg.db package. > > If however, your coordinate isn't in an > exon, but might be in a > UTR, you can look at ?exonsBy to see what > other sequences you can > extract to compare with. > > If these are SNPs, then you can look at > the help pages for the > relevant SNPlocs package. > > Best, > > Jim > > > > which can also be written this way by > the program that yielded > the result: > chr17.31246506 > > And I need to convert this data into a > gene name and known > gene Ids, such > as: > > Gene name Entrez_ID Ensembl_ID > > Tff3 NM_011575 20050 > Can you please advice me about a > module able to perform this > ID conversion > using a list of "chr17.31246506" type > coordinates as input? > > Thanks in advance > > With best wishes > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""><mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie, *I've tried your last approach in the following way:* library(Mus.musculus) library(TxDb.Mmusculus.UCSC.mm9.knownGene) library(VariantAnnotation) txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene y <- read.delim("/path_to_file/ids_noM.txt", sep=".", header=FALSE, as.is =TRUE) gr<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width = 20)) #gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606, 11248806), width = 20)) loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) head(loc) idx <- loc[ ,c("QUERYID", "GENEID")] idx <- idx[!duplicated(idx)] #no duplicar ids head(idx) geneid <- idx[!is.na(idx$GENEID)] #Remove NA GENEID's and call select(). genetable <- select(Mus.musculus, keytype="ENTREZID", keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL")) head(genetable) id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="") data.frame(id, genetable) * and I get this error*: *Error in data.frame(id, genetable) : arguments imply differing number of rows: 180912, 12962* It seems we are losing rows in the process,but I have no clue where or why this happens... Please forgive me for being still so unable to catch R file processing dynamics properly. Thank you again 2012/11/14 Valerie Obenchain <vobencha@fhcrc.org> > Hello, > > Are you familiar with the R help pages? For any function or class object > you can type ? followed by the name. > > ?locateVariants > ?select > > The page for locateVariants explains each of the the metadata columns in > the result. One of the columns is 'QUERYID'. This id represents the row in > your original query (i.e., the GRanges you entered as the 'query' > argument). The page also explains that the result has one row for each > range-transcript match. > > gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606, > 11248806), width = 20)) > > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > > The first range hit an intergenic region and we therefore have PRECEDEID > and FOLLOWID. The second range hit a coding region in a single transcript > and the third range hit an intron in two different transcripts. Because the > third range hit 2 transcripts we have 2 rows of data, each has a unique > TXID but the same GENEID. > > > loc > GRanges with 4 ranges and 7 metadata columns: > > seqnames ranges strand | LOCATION QUERYID TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr1 [41245606, 41245625] * | intergenic 1 <na> > [2] chr17 [31245606, 31245625] * | coding 2 47716 > [3] chr16 [11248806, 11248825] * | intron 3 46316 > [4] chr16 [11248806, 11248825] * | intron 3 46317 > > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] <na> <na> 211798 381339 > [2] 184246 11307 <na> <na> > [3] <na> 14852 <na> <na> > [4] <na> 14852 <na> <na> > > > Pull out the unique combinations of GENEID and QUERYID. > > idx <- loc[ ,c("QUERYID", "GENEID")] > idx <- idx[!duplicated(idx)] > > idx > GRanges with 3 ranges and 2 metadata columns: > seqnames ranges strand | QUERYID GENEID > <rle> <iranges> <rle> | <integer> <character> > [1] chr1 [41245606, 41245625] * | 1 <na> > [2] chr17 [31245606, 31245625] * | 2 11307 > [3] chr16 [11248806, 11248825] * | 3 14852 > > Remove NA GENEID's and call select(). > geneid <- idx[!is.na(idx$GENEID)] > genetable <- select(Mus.musculus, keytype="ENTREZID", keys=geneid$GENEID, > cols=c("GENENAME", "ENSEMBL")) > > genetable > > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > 2 14852 ENSMUSG00000062203 > > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > 2 G1 to S phase transition 1 > > You can use the QUERYID to map back to an original list of identifiers. If > you just need 'chromosome.start' information you can extract that from the > ranges in geneid. Note that the QUERYID maps back to the original 'query' > but those same ranges are also included in the result. > id <- paste(as.character(seqnames(**geneid)), ".", start(geneid), sep="") > > data.frame(id, genetable) > id ENTREZID ENSEMBL > 1 chr17.31245606 11307 ENSMUSG00000024030 > 2 chr16.11248806 14852 ENSMUSG00000062203 > > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > 2 G1 to S phase transition 1 > > > Valerie > > > > On 11/14/12 01:00, José Luis Lavín wrote: > >> Dear Valerie, >> >> I have an issue with the results of your approach. Now I get a table with >> the following fields: >> >> ENTREZID ENSEMBL GENENAME >> >> 1 216285 ENSMUSG00000036602 ALX homeobox 1 >> >> >> But the Identifier I converted to obtain the previously mentioned data is >> not present in that table (eg. chr10.100837476) , so there's nothing I can >> do with this results table because I can't correlate my previous >> chr.location type identifiers to the results of the table. >> I guess I've done something wrong somewhere in the code I modified so >> that I missed my "chr.coordinate" ids column. So I'll paste the code here >> again to see if any of you can tell me where I lost that column. >> >> source("http://bioconductor.**org/biocLite.R<http: bioconductor.or="" g="" bioclite.r=""> >> ") >> biocLite('Mus.musculus') >> biocLite('TxDb.Mmusculus.UCSC.**mm9.knownGene') >> >> biocLite('VariantAnnotation') >> library(Mus.musculus) >> library(TxDb.Mmusculus.UCSC.**mm9.knownGene) >> >> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene >> y <- read.delim("/path_to_dir/ids_**noM.txt", sep=".", header=FALSE, >> as.is <http: as.is="">=TRUE) >> >> probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) >> >> library(VariantAnnotation) >> loc <- locateVariants(query=probes, subject=txdb, region=AllVariants()) >> head(loc) >> >> entrezid <- loc$GENEID >> entrezid2<-select(Mus.**musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> >> write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = >> ",", eol = "\n") >> >> Thank you again for your advice >> >> >> 2012/11/14 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi all, >> >> >> On 11/12/2012 10:24 AM, Valerie Obenchain wrote: >> >> Hi Jose, >> >> Here is a slightly different approach. >> >> library(Mus.musculus) >> library(TxDb.Mmusculus.UCSC.**mm9.knownGene) >> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene >> >> ## I assume you've figured out how to read your data into a >> GRanges. >> ## Here we use a small example. >> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = >> 20)) >> >> ## The locateVariants() function in the VariantAnnotation >> package will >> ## give you the GENEIDs for all ranges in 'query'. If the >> range does not >> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and >> ## FOLLOWID are provided. The LOCATION column indicates what >> ## region the range fell in. See ?locateVariants for details >> on the >> ## different regions and the ability to set 'upstream' and >> 'downstream' >> ## values for promoter regions. >> library(VariantAnnotation) >> loc <- locateVariants(query=gr, subject=txdb, >> region=AllVariants()) >> >> > loc >> GRanges with 1 range and 7 metadata columns: >> seqnames ranges strand | LOCATION >> QUERYID TXID >> <rle> <iranges> <rle> | <factor> <integer> <integer> >> [1] chr17 [31245606, 31245625] * | coding >> 1 47716 >> CDSID GENEID PRECEDEID FOLLOWID >> <integer> <character> <character> <character> >> [1] 184246 11307 <na> <na> >> >> >> Nice! So IIUC locateVariants() is not intrinsically for variants but >> seems to be generally useful for "locating" any set of genomic >> positions. Shouldn't we have this (or something similar) in >> GenomicFeatures (probably with a different name)? >> >> Cheers, >> H. >> >> >> >> >> ## Use the select() function to map these GENEID's to the other >> ## values you are interested in. The GENEID's from >> locateVariants() >> ## are Entrez ID's so we use those as our 'keytype'. >> entrezid <- loc$GENEID >> select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >> cols=c("GENENAME", "ENSEMBL")) >> ENTREZID ENSEMBL >> 1 11307 ENSMUSG00000024030 >> GENENAME >> 1 ATP-binding cassette, sub-family G (WHITE), member 1 >> >> >> Valerie >> >> >> >> On 11/12/12 06:59, José Luis Lavín wrote: >> >> Hello all, >> >> First of all I want to thank everybody that gave me advice >> on this >> subject. >> trying to follow the advice, I added some modifications >> mixing codes from >> Tim, Harris and James , but it seems I got lost somewhere >> in between... >> I'm sorry for bothering you all again, but I'm afraid I >> need some more >> help. >> >> I need to read my ids.txt file and then iterate use each >> line of id >> (chr.position) to perform the annotation process with it. >> I thought of a >> "for" loop to achieve it, but I do not seem to catch the >> essence of R >> processes and I guess I miss my tryout. >> Please have a look at my "disaster" and help me with it If >> such thing is >> possible... >> >> biocLite('Mus.musculus') >> require(Mus.musculus) >> require(TxDb.Mmusculus.UCSC.**mm9.knownGene) >> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, >> 'gene') >> egid<- names(txdb) >> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >> length(name) == length(egid) ## TRUE, OK to use >> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >> length(esbl) == length(egid) ## FALSE, do not use >> >> #read input table >> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".", >> header=FALSE, >> as.is <http: as.is=""> >> >> =TRUE) >> >> for(i in 1:length(coordTable)) >> { >> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], >> width = 1), >> strand='*') >> } >> >> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >> >> geneBodyProbes<- findOverlaps(probes, txdb) >> geneBodyProbes >> >> write.table >> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >> FALSE,sep="\t") >> >> ## Hits of length 1 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## queryHits subjectHits >> ##<integer> <integer> >> ## 1 1 1641 >> ## >> >> name[subjectHits(**geneBodyProbes)] >> >> ## 11307 # egid >> ## "Abcg1" # name >> ## >> >> org.Mm.egCHRLOC[['11307']] >> >> ## 17 >> ## 31057694 >> ## >> >> org.Mm.egENSEMBL[['11307']] >> >> ## [1] "ENSMUSG00000024030" >> >> promotersByGene<- flank(txdb, 1500) # or however many >> bases you want >> promotersByGene<- flank(promotersByGene, 200, FALSE) # a >> little extra >> promoterProbes<- findOverlaps(probes, promotersByGene) >> promoterProbes >> >> write.table >> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",** >> quote=FALSE,sep="\t") >> >> >> ## Hits of length 0 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## >> ## read the GRanges and GenomicFeatures vignettes for more >> >> write.table >> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >> FALSE,sep="\t") >> >> >> *Thanks in advance* >> >> JL >> >> 2012/11/8 Harris A. Jaffee<hj@jhu.edu <mailto:hj@jhu.edu="">> >> >> >> On the elementary end of all this... >> >> If the sites are on a *file*, one per line, you could do >> >> y<- read.delim(filename, sep=".", >> header=FALSE, as.is <http: as.is="">=TRUE) >> >> >> etc. >> >> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >> >> Hi Jose, >> >> On 11/8/2012 10:28 AM, José Luis Lavín wrote: >> >> Dear James, >> >> To answer your questions swiftly, the features >> are methylation sites >> >> (that's why I only have a coordinate instead of having >> a Start/End >> pair) in >> mouse (mm9) genome and I have a list of those in the >> following format: >> >> chr17.31246506 >> >> How could I read the list so that I can input >> the "chr" and the >> >> "coordinate" parameters into the expression you >> recommended? >> >> First you need to split your data to chr and pos. >> >> chr.pos<- do.call("rbind", strsplit(<your vector="">> of chr17.pos data>, >> >> "\\.")) >> >> Note that your vector of chr.pos data may have >> been in as a factor, so >> >> you may need to wrap with an as.character(). If you >> don't know what I >> mean >> by that, you should first read 'An Introduction to R' ( >> http://cran.r-project.org/doc/**manuals/R-intro.htm l<http: cran.r-project.org="" doc="" manuals="" r-intro.html=""> >> ). >> That will be time >> well spent. >> >> x<- GRanges(seq = "chr17", IRanges(start = >> 31245606, width = 1)) >> >> Both the seqnames and ranges argument to GRanges() >> can be based on >> >> vectors. So if you had a matrix (called 'y') like >> >> chr16 3423543 >> chr3 403992 >> chr18 3494202 >> >> then you can do >> >> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >> >> see ?GRanges for more info. >> >> As a side note, IIRC, methylation sites are not in >> general found in >> >> exons, but are more likely to be found upstream of a >> given gene. You >> might >> then want to use fiveUTRsByTranscript() rather than >> exonsBy(). See >> ?exonsBy >> after loading the GenomicFeatures package. >> >> I 'm lookin forward to obtain a table where my >> "coordinate-based Id" >> >> appears in a column, and the gene_name and if >> possible, the >> Entrez/Ensembl >> Ids in two other columns : >> >> Coordinate Gene_name Entrez_ID Ensembl_ID >> >> Is it easy to do this in R? >> >> Of course! Everything is easy to do in R. You >> should see my sweet R >> >> toast 'n coffee maker ;-D >> >> But you should note that the fiveUTRByTranscript() >> function is >> >> transcript based (obvs), and so you will have multiple >> transcripts per >> gene. This makes things much more difficult, as a >> given methylation site >> may overlap multiple transcripts of the same gene. So >> that makes it >> hard to >> merge your original data with the overlapping transcripts. >> >> You could do something like >> >> ex<- >> fiveUTRsByTranscript(TxDb.** >> Mmusculus.UCSC.mm10.knownGene, >> use.names >> >> = TRUE) >> >> ex2<- do.call("rbind", sapply(ex[ex %in% x], >> elementMetadata))$exon_id >> >> then you could use unique Gene IDs thusly >> >> my.trans<- select(Mus.musculus, >> unique(as.character(ex2)), >> >> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID") >> >> That should at least give you a start. See where >> you can go on your >> own, >> >> and let us know if you get stuck. >> >> Best, >> >> Jim >> >> >> >> I'm still really new to R and I lack many >> concepts you may consider >> >> basic... I'm awfully sorry >> >> Best >> >> JL >> >> 2012/11/8 James W. MacDonald<jmacdon@uw.edu>> <mailto:jmacdon@uw.edu><**mailto:jmacdon@uw.edu >> >> <mailto:jmacdon@uw.edu>>> >> >> Hi Jose, >> >> >> On 11/8/2012 8:19 AM, José Luis Lavín wrote: >> >> Dear Bioconductor list, >> >> I write you this email asking for a >> Bioconductor module that >> allows me to >> annotate genomic coordinates and get >> different GeneIds. >> I'll show you an example of what I'm >> referring to: >> >> I have this data: >> Chromosome coordinate >> chr17 31246506 >> >> >> It depends on what that coordinate is. Is >> it the start of a >> transcript? A SNP? Do you really just have >> a single coordinate, or >> do you have a range? What species are we >> talking about here? >> >> Depending on what your data are, you might >> want to use either one >> of the TxDb packages, or a SNPlocs >> package. There really isn't >> much to go on here. If I assume this is a >> coordinate that one >> might think is within an exon, and if I >> further assume you are >> working with H. sapiens, I could suggest >> something like >> >> library(TxDb.Hsapiens.UCSC.**hg19.knownGene) >> ex<- >> exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, >> "gene") >> >> x<- GRanges(seq = "chr17", IRanges(start = >> 31245606, width = 1)) >> >> ex[ex %in% x] >> >> or maybe more appropriately >> >> names(ex)[ex %in% x] >> >> which will give you the Gene ID, and you >> can go from there using >> the org.Hs.eg.db package. >> >> If however, your coordinate isn't in an >> exon, but might be in a >> UTR, you can look at ?exonsBy to see what >> other sequences you can >> extract to compare with. >> >> If these are SNPs, then you can look at >> the help pages for the >> relevant SNPlocs package. >> >> Best, >> >> Jim >> >> >> >> which can also be written this way by >> the program that yielded >> the result: >> chr17.31246506 >> >> And I need to convert this data into a >> gene name and known >> gene Ids, such >> as: >> >> Gene name Entrez_ID Ensembl_ID >> >> Tff3 NM_011575 20050 >> Can you please advice me about a >> module able to perform this >> ID conversion >> using a list of "chr17.31246506" type >> coordinates as input? >> >> Thanks in advance >> >> With best wishes >> >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> ><mailto:bioconduct**or@r-project.org <bioconductor@r-project.org=""> >> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >> >> https://stat.ethz.ch/mailman/** >> listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" biocond="" uctor=""> >> Search the archives: >> >> http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> >> -- James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> >> >> >> >> -- >> -- >> Dr. José Luis Lavín Trueba >> >> Dpto. de Producción Agraria >> Grupo de Genética y Microbiología >> Universidad Pública de Navarra >> 31006 Pamplona >> Navarra >> SPAIN >> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/**listinfo/biocond uctor<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> >> http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>> >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<ht tps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.** >> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.** >> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> -- Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> -- >> Dr. José Luis Lavín Trueba >> >> Dpto. de Producción Agraria >> Grupo de Genética y Microbiología >> Universidad Pública de Navarra >> 31006 Pamplona >> Navarra >> SPAIN >> > > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

On 11/14/2012 07:51 AM, Jos? Luis Lav?n wrote: > Dear Valerie, > > *I've tried your last approach in the following way:* > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > library(VariantAnnotation) > > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > > y <- read.delim("/path_to_file/ids_noM.txt", sep=".", header=FALSE, > as.is <http: as.is="">=TRUE) > gr<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width = 20)) > > #gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, > 31245606, 11248806), width = 20)) > > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > head(loc) > > idx <- loc[ ,c("QUERYID", "GENEID")] > idx <- idx[!duplicated(idx)] #no duplicar ids > head(idx) > geneid <- idx[!is.na <http: is.na="">(idx$GENEID)] #Remove NA GENEID's and > call select(). > > genetable <- select(Mus.musculus, keytype="ENTREZID", > keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL")) > head(genetable) > > id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="") > data.frame(id, genetable) > > * and I get this error*: > > /Error in data.frame(id, genetable) : > arguments imply differing number of rows: 180912, 12962/ The error means you have a different number of rows in 'id' and 'genetable'. Take a look at the ?select page and work through the examples at the bottom. What does select() return? What happens if one of the keys you give to select() doesn't have a match? What if there are multiple matches? If the results aren't one-to-one you will need to match the ENTREZID in the select() result to the ENTREZID in the genetable. Valerie > > > It seems we are losing rows in the process,but I have no clue where or > why this happens... > > Please forgive me for being still so unable to catch R file processing > dynamics properly. > > Thank you again > > > > > 2012/11/14 Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> > > Hello, > > Are you familiar with the R help pages? For any function or class > object you can type ? followed by the name. > > ?locateVariants > ?select > > The page for locateVariants explains each of the the metadata > columns in the result. One of the columns is 'QUERYID'. This id > represents the row in your original query (i.e., the GRanges you > entered as the 'query' argument). The page also explains that the > result has one row for each range-transcript match. > > gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, > 31245606, 11248806), width = 20)) > > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > > The first range hit an intergenic region and we therefore have > PRECEDEID and FOLLOWID. The second range hit a coding region in a > single transcript and the third range hit an intron in two different > transcripts. Because the third range hit 2 transcripts we have 2 > rows of data, each has a unique TXID but the same GENEID. > > > loc > GRanges with 4 ranges and 7 metadata columns: > > seqnames ranges strand | LOCATION QUERYID > TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr1 [41245606, 41245625] * | intergenic 1 <na> > [2] chr17 [31245606, 31245625] * | coding 2 > 47716 > [3] chr16 [11248806, 11248825] * | intron 3 > 46316 > [4] chr16 [11248806, 11248825] * | intron 3 > 46317 > > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] <na> <na> 211798 381339 > [2] 184246 11307 <na> <na> > [3] <na> 14852 <na> <na> > [4] <na> 14852 <na> <na> > > > Pull out the unique combinations of GENEID and QUERYID. > > idx <- loc[ ,c("QUERYID", "GENEID")] > idx <- idx[!duplicated(idx)] > > idx > GRanges with 3 ranges and 2 metadata columns: > seqnames ranges strand | QUERYID GENEID > <rle> <iranges> <rle> | <integer> <character> > [1] chr1 [41245606, 41245625] * | 1 <na> > [2] chr17 [31245606, 31245625] * | 2 11307 > [3] chr16 [11248806, 11248825] * | 3 14852 > > Remove NA GENEID's and call select(). > geneid <- idx[!is.na <http: is.na="">(idx$GENEID)] > genetable <- select(Mus.musculus, keytype="ENTREZID", > keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL")) > > genetable > > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > 2 14852 ENSMUSG00000062203 > > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > 2 G1 to S phase transition 1 > > You can use the QUERYID to map back to an original list of > identifiers. If you just need 'chromosome.start' information you can > extract that from the ranges in geneid. Note that the QUERYID maps > back to the original 'query' but those same ranges are also included > in the result. > id <- paste(as.character(seqnames(__geneid)), ".", start(geneid), > sep="") > > data.frame(id, genetable) > id ENTREZID ENSEMBL > 1 chr17.31245606 11307 ENSMUSG00000024030 > 2 chr16.11248806 14852 ENSMUSG00000062203 > > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > 2 G1 to S phase transition 1 > > > Valerie > > > > On 11/14/12 01:00, Jos? Luis Lav?n wrote: > > Dear Valerie, > > I have an issue with the results of your approach. Now I get a > table with the following fields: > > ENTREZID ENSEMBL GENENAME > > 1 216285 ENSMUSG00000036602 ALX homeobox 1 > > > But the Identifier I converted to obtain the previously > mentioned data is not present in that table (eg. > chr10.100837476) , so there's nothing I can do with this results > table because I can't correlate my previous chr.location type > identifiers to the results of the table. > I guess I've done something wrong somewhere in the code I > modified so that I missed my "chr.coordinate" ids column. So > I'll paste the code here again to see if any of you can tell me > where I lost that column. > > source("http://bioconductor.__org/biocLite.R > <http: bioconductor.org="" bioclite.r="">") > biocLite('Mus.musculus') > biocLite('TxDb.Mmusculus.UCSC.__mm9.knownGene') > > biocLite('VariantAnnotation') > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.__mm9.knownGene) > > txdb <- TxDb.Mmusculus.UCSC.mm9.__knownGene > y <- read.delim("/path_to_dir/ids___noM.txt", sep=".", > header=FALSE, as.is <http: as.is=""> <http: as.is="">=TRUE) > > probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], > width=1)) > > library(VariantAnnotation) > loc <- locateVariants(query=probes, subject=txdb, > region=AllVariants()) > head(loc) > > entrezid <- loc$GENEID > entrezid2<-select(Mus.__musculus, keytype="ENTREZID", > keys=entrezid, cols=c("GENENAME", "ENSEMBL")) > > write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, > sep = ",", eol = "\n") > > Thank you again for your advice > > > 2012/11/14 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>> > > > Hi all, > > > On 11/12/2012 10:24 AM, Valerie Obenchain wrote: > > Hi Jose, > > Here is a slightly different approach. > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.__mm9.knownGene) > txdb <- TxDb.Mmusculus.UCSC.mm9.__knownGene > > ## I assume you've figured out how to read your data into a > GRanges. > ## Here we use a small example. > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, > width = > 20)) > > ## The locateVariants() function in the VariantAnnotation > package will > ## give you the GENEIDs for all ranges in 'query'. If the > range does not > ## fall in a gene (i.e., it is intergenic), then the > PRECEDEID and > ## FOLLOWID are provided. The LOCATION column indicates > what > ## region the range fell in. See ?locateVariants for > details > on the > ## different regions and the ability to set 'upstream' and > 'downstream' > ## values for promoter regions. > library(VariantAnnotation) > loc <- locateVariants(query=gr, subject=txdb, > region=AllVariants()) > > > loc > GRanges with 1 range and 7 metadata columns: > seqnames ranges strand | LOCATION > QUERYID TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr17 [31245606, 31245625] * | coding > 1 47716 > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] 184246 11307 <na> <na> > > > Nice! So IIUC locateVariants() is not intrinsically for > variants but > seems to be generally useful for "locating" any set of genomic > positions. Shouldn't we have this (or something similar) in > GenomicFeatures (probably with a different name)? > > Cheers, > H. > > > > > ## Use the select() function to map these GENEID's to > the other > ## values you are interested in. The GENEID's from > locateVariants() > ## are Entrez ID's so we use those as our 'keytype'. > entrezid <- loc$GENEID > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > > > Valerie > > > > On 11/12/12 06:59, Jos? Luis Lav?n wrote: > > Hello all, > > First of all I want to thank everybody that gave me > advice > on this > subject. > trying to follow the advice, I added some modifications > mixing codes from > Tim, Harris and James , but it seems I got lost > somewhere > in between... > I'm sorry for bothering you all again, but I'm afraid I > need some more > help. > > I need to read my ids.txt file and then iterate use > each > line of id > (chr.position) to perform the annotation process > with it. > I thought of a > "for" loop to achieve it, but I do not seem to catch the > essence of R > processes and I guess I miss my tryout. > Please have a look at my "disaster" and help me > with it If > such thing is > possible... > > biocLite('Mus.musculus') > require(Mus.musculus) > require(TxDb.Mmusculus.UCSC.__mm9.knownGene) > txdb<- > transcriptsBy(TxDb.Mmusculus.__UCSC.mm9.knownGene, > 'gene') > egid<- names(txdb) > name<- unlist(mget(egid, org.Mm.egSYMBOL, > ifnotfound=NA)) > length(name) == length(egid) ## TRUE, OK to use > esbl<- unlist(mget(egid, org.Mm.egENSEMBL, > ifnotfound=NA)) > length(esbl) == length(egid) ## FALSE, do not use > > #read input table > coordTable<- read.delim(/path_to_dir/ids.__txt, > sep=".", > header=FALSE, > as.is <http: as.is=""> <http: as.is=""> > > =TRUE) > > for(i in 1:length(coordTable)) > { > probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], > width = 1), > strand='*') > } > > genome(probes)<- 'mm9' ## prevents some stoopid > mistakes > > geneBodyProbes<- findOverlaps(probes, txdb) > geneBodyProbes > > write.table > > (geneBodyProbes,file="/path___to_dir/trans_id.tsv",quote=__F ALSE,sep="\t") > > ## Hits of length 1 > ## queryLength: 1 > ## subjectLength: 21761 > ## queryHits subjectHits > ##<integer> <integer> > ## 1 1 1641 > ## > > name[subjectHits(__geneBodyProbes)] > > ## 11307 # egid > ## "Abcg1" # name > ## > > org.Mm.egCHRLOC[['11307']] > > ## 17 > ## 31057694 > ## > > org.Mm.egENSEMBL[['11307']] > > ## [1] "ENSMUSG00000024030" > > promotersByGene<- flank(txdb, 1500) # or however many > bases you want > promotersByGene<- flank(promotersByGene, 200, > FALSE) # a > little extra > promoterProbes<- findOverlaps(probes, promotersByGene) > promoterProbes > > write.table > > (promoterProbes,file="/path___to_dir/promo_trans_id.tsv",__q uote=FALSE,sep="\t") > > > ## Hits of length 0 > ## queryLength: 1 > ## subjectLength: 21761 > ## > ## read the GRanges and GenomicFeatures vignettes > for more > > write.table > > (geneBodyProbes,file="/path___to_dir/trans_id.tsv",quote=__F ALSE,sep="\t") > > > *Thanks in advance* > > JL > > 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu=""> <mailto:hj at="" jhu.edu=""> <mailto:hj at="" jhu.edu="" <mailto:hj="" at="" jhu.edu="">>> > > > On the elementary end of all this... > > If the sites are on a *file*, one per line, you > could do > > y<- read.delim(filename, sep=".", > header=FALSE, as.is <http: as.is=""> > <http: as.is="">=TRUE) > > > etc. > > On Nov 8, 2012, at 11:38 AM, James W. MacDonald > wrote: > > Hi Jose, > > On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: > > Dear James, > > To answer your questions swiftly, the > features > are methylation sites > > (that's why I only have a coordinate instead of > having > a Start/End > pair) in > mouse (mm9) genome and I have a list of those > in the > following format: > > chr17.31246506 > > How could I read the list so that I can > input > the "chr" and the > > "coordinate" parameters into the expression you > recommended? > > First you need to split your data to chr > and pos. > > chr.pos<- do.call("rbind", strsplit(<your> vector > of chr17.pos data>, > > "\\.")) > > Note that your vector of chr.pos data may have > been in as a factor, so > > you may need to wrap with an as.character(). If you > don't know what I > mean > by that, you should first read 'An Introduction > to R' ( > http://cran.r-project.org/doc/__manuals/R-intro.html > <http: cran.r-project.org="" doc="" manuals="" r-intro.html="">). > That will be time > well spent. > > x<- GRanges(seq = "chr17", IRanges(start = > 31245606, width = 1)) > > Both the seqnames and ranges argument to > GRanges() > can be based on > > vectors. So if you had a matrix (called 'y') like > > chr16 3423543 > chr3 403992 > chr18 3494202 > > then you can do > > x<- GRanges(y[,1], IRanges(start = y[,2], > width = 1)) > > see ?GRanges for more info. > > As a side note, IIRC, methylation sites are > not in > general found in > > exons, but are more likely to be found upstream > of a > given gene. You > might > then want to use fiveUTRsByTranscript() rather than > exonsBy(). See > ?exonsBy > after loading the GenomicFeatures package. > > I 'm lookin forward to obtain a table > where my > "coordinate-based Id" > > appears in a column, and the gene_name and if > possible, the > Entrez/Ensembl > Ids in two other columns : > > Coordinate Gene_name Entrez_ID Ensembl_ID > > Is it easy to do this in R? > > Of course! Everything is easy to do in R. You > should see my sweet R > > toast 'n coffee maker ;-D > > But you should note that the > fiveUTRByTranscript() > function is > > transcript based (obvs), and so you will have > multiple > transcripts per > gene. This makes things much more difficult, as a > given methylation site > may overlap multiple transcripts of the same > gene. So > that makes it > hard to > merge your original data with the overlapping > transcripts. > > You could do something like > > ex<- > > fiveUTRsByTranscript(TxDb.__Mmusculus.UCSC.mm10.knownGene, > use.names > > = TRUE) > > ex2<- do.call("rbind", sapply(ex[ex %in% x], > elementMetadata))$exon_id > > then you could use unique Gene IDs thusly > > my.trans<- select(Mus.musculus, > unique(as.character(ex2)), > > c("CHR","CHRLOC","ENTREZID","__ENSEMBL"), > "ENTREZID") > > That should at least give you a start. See > where > you can go on your > own, > > and let us know if you get stuck. > > Best, > > Jim > > > > I'm still really new to R and I lack many > concepts you may consider > > basic... I'm awfully sorry > > Best > > JL > > 2012/11/8 James W. > MacDonald<jmacdon at="" uw.edu="" <mailto:jmacdon="" at="" uw.edu=""> > <mailto:jmacdon at="" uw.edu=""> <mailto:jmacdon at="" uw.edu="">><__mailto:jmacdon at uw.edu > <mailto:jmacdon at="" uw.edu=""> > > <mailto:jmacdon at="" uw.edu="" <mailto:jmacdon="" at="" uw.edu="">>>> > > Hi Jose, > > > On 11/8/2012 8:19 AM, Jos? Luis > Lav?n wrote: > > Dear Bioconductor list, > > I write you this email asking for a > Bioconductor module that > allows me to > annotate genomic coordinates > and get > different GeneIds. > I'll show you an example of > what I'm > referring to: > > I have this data: > Chromosome coordinate > chr17 31246506 > > > It depends on what that coordinate > is. Is > it the start of a > transcript? A SNP? Do you really > just have > a single coordinate, or > do you have a range? What species > are we > talking about here? > > Depending on what your data are, > you might > want to use either one > of the TxDb packages, or a SNPlocs > package. There really isn't > much to go on here. If I assume > this is a > coordinate that one > might think is within an exon, and if I > further assume you are > working with H. sapiens, I could > suggest > something like > > > library(TxDb.Hsapiens.UCSC.__hg19.knownGene) > ex<- > > exonsBy(TxDb.Hsapiens.UCSC.__hg19.knownGene, "gene") > > x<- GRanges(seq = "chr17", > IRanges(start = > 31245606, width = 1)) > > ex[ex %in% x] > > or maybe more appropriately > > names(ex)[ex %in% x] > > which will give you the Gene ID, > and you > can go from there using > the org.Hs.eg.db package. > > If however, your coordinate isn't in an > exon, but might be in a > UTR, you can look at ?exonsBy to > see what > other sequences you can > extract to compare with. > > If these are SNPs, then you can look at > the help pages for the > relevant SNPlocs package. > > Best, > > Jim > > > > which can also be written this > way by > the program that yielded > the result: > chr17.31246506 > > And I need to convert this data > into a > gene name and known > gene Ids, such > as: > > Gene name Entrez_ID Ensembl_ID > > Tff3 NM_011575 20050 > Can you please advice me about a > module able to perform this > ID conversion > using a list of > "chr17.31246506" type > coordinates as input? > > Thanks in advance > > With best wishes > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">><mailto:bioconduct__or at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational > Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN > > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Jose, On 11/14/2012 4:00 AM, Jos? Luis Lav?n wrote: > Dear Valerie, > > I have an issue with the results of your approach. Now I get a table with > the following fields: > > ENTREZID ENSEMBL GENENAME > > 1 216285 ENSMUSG00000036602 ALX homeobox 1 > > > But the Identifier I converted to obtain the previously mentioned data is > not present in that table (eg. chr10.100837476) , so there's nothing I can > do with this results table because I can't correlate my previous > chr.location type identifiers to the results of the table. > I guess I've done something wrong somewhere in the code I modified so that > I missed my "chr.coordinate" ids column. So I'll paste the code here again > to see if any of you can tell me where I lost that column. You didn't lose a column, you just left data behind. Using Valerie's example, slightly modified: > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > loc <- as.data.frame(loc) > gn <- select(Mus.musculus, loc$GENEID, c("GENENAME","ENSEMBL"), "ENTREZID") > the.data <- merge(gn, loc, by.x = 1, by.y = 10) > the.data[,c(4,5,1:3)] seqnames start ENTREZID ENSEMBL 1 chr17 31245606 11307 ENSMUSG00000024030 GENENAME 1 ATP-binding cassette, sub-family G (WHITE), member 1 There are other data in the data.frame that you might want to keep as well. Best, Jim > > source("http://bioconductor.org/biocLite.R") > biocLite('Mus.musculus') > biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') > > biocLite('VariantAnnotation') > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > txdb<- TxDb.Mmusculus.UCSC.mm9.knownGene > y<- read.delim("/path_to_dir/ids_noM.txt", sep=".", header=FALSE, as.is > =TRUE) > probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) > > library(VariantAnnotation) > loc<- locateVariants(query=probes, subject=txdb, region=AllVariants()) > head(loc) > > entrezid<- loc$GENEID > entrezid2<-select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > > write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = ",", > eol = "\n") > > Thank you again for your advice > > > 2012/11/14 Hervé Pagès<hpages at="" fhcrc.org=""> > >> Hi all, >> >> >> On 11/12/2012 10:24 AM, Valerie Obenchain wrote: >> >>> Hi Jose, >>> >>> Here is a slightly different approach. >>> >>> library(Mus.musculus) >>> library(TxDb.Mmusculus.UCSC.**mm9.knownGene) >>> txdb<- TxDb.Mmusculus.UCSC.mm9.**knownGene >>> >>> ## I assume you've figured out how to read your data into a GRanges. >>> ## Here we use a small example. >>> gr<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) >>> >>> ## The locateVariants() function in the VariantAnnotation package will >>> ## give you the GENEIDs for all ranges in 'query'. If the range does not >>> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and >>> ## FOLLOWID are provided. The LOCATION column indicates what >>> ## region the range fell in. See ?locateVariants for details on the >>> ## different regions and the ability to set 'upstream' and 'downstream' >>> ## values for promoter regions. >>> library(VariantAnnotation) >>> loc<- locateVariants(query=gr, subject=txdb, region=AllVariants()) >>> >>> > loc >>> GRanges with 1 range and 7 metadata columns: >>> seqnames ranges strand | LOCATION QUERYID TXID >>> <rle> <iranges> <rle> |<factor> <integer> <integer> >>> [1] chr17 [31245606, 31245625] * | coding 1 47716 >>> CDSID GENEID PRECEDEID FOLLOWID >>> <integer> <character> <character> <character> >>> [1] 184246 11307<na> <na> >>> >> Nice! So IIUC locateVariants() is not intrinsically for variants but >> seems to be generally useful for "locating" any set of genomic >> positions. Shouldn't we have this (or something similar) in >> GenomicFeatures (probably with a different name)? >> >> Cheers, >> H. >> >> >> >>> ## Use the select() function to map these GENEID's to the other >>> ## values you are interested in. The GENEID's from locateVariants() >>> ## are Entrez ID's so we use those as our 'keytype'. >>> entrezid<- loc$GENEID >>> select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >>> cols=c("GENENAME", "ENSEMBL")) >>> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, >>> cols=c("GENENAME", "ENSEMBL")) >>> ENTREZID ENSEMBL >>> 1 11307 ENSMUSG00000024030 >>> GENENAME >>> 1 ATP-binding cassette, sub-family G (WHITE), member 1 >>> >>> >>> Valerie >>> >>> >>> >>> On 11/12/12 06:59, Jos? Luis Lav?n wrote: >>> >>>> Hello all, >>>> >>>> First of all I want to thank everybody that gave me advice on this >>>> subject. >>>> trying to follow the advice, I added some modifications mixing codes from >>>> Tim, Harris and James , but it seems I got lost somewhere in between... >>>> I'm sorry for bothering you all again, but I'm afraid I need some more >>>> help. >>>> >>>> I need to read my ids.txt file and then iterate use each line of id >>>> (chr.position) to perform the annotation process with it. I thought of a >>>> "for" loop to achieve it, but I do not seem to catch the essence of R >>>> processes and I guess I miss my tryout. >>>> Please have a look at my "disaster" and help me with it If such thing is >>>> possible... >>>> >>>> biocLite('Mus.musculus') >>>> require(Mus.musculus) >>>> require(TxDb.Mmusculus.UCSC.**mm9.knownGene) >>>> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, 'gene') >>>> egid<- names(txdb) >>>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >>>> length(name) == length(egid) ## TRUE, OK to use >>>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >>>> length(esbl) == length(egid) ## FALSE, do not use >>>> >>>> #read input table >>>> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".", header=FALSE, >>>> as.is >>>> =TRUE) >>>> >>>> for(i in 1:length(coordTable)) >>>> { >>>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), >>>> strand='*') >>>> } >>>> >>>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >>>> >>>> geneBodyProbes<- findOverlaps(probes, txdb) >>>> geneBodyProbes >>>> >>>> write.table >>>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >>>> FALSE,sep="\t") >>>> >>>> ## Hits of length 1 >>>> ## queryLength: 1 >>>> ## subjectLength: 21761 >>>> ## queryHits subjectHits >>>> ##<integer> <integer> >>>> ## 1 1 1641 >>>> ## >>>> >>>> name[subjectHits(**geneBodyProbes)] >>>> >>>> ## 11307 # egid >>>> ## "Abcg1" # name >>>> ## >>>> >>>> org.Mm.egCHRLOC[['11307']] >>>> >>>> ## 17 >>>> ## 31057694 >>>> ## >>>> >>>> org.Mm.egENSEMBL[['11307']] >>>> >>>> ## [1] "ENSMUSG00000024030" >>>> >>>> promotersByGene<- flank(txdb, 1500) # or however many bases you want >>>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra >>>> promoterProbes<- findOverlaps(probes, promotersByGene) >>>> promoterProbes >>>> >>>> write.table >>>> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",** >>>> quote=FALSE,sep="\t") >>>> >>>> >>>> ## Hits of length 0 >>>> ## queryLength: 1 >>>> ## subjectLength: 21761 >>>> ## >>>> ## read the GRanges and GenomicFeatures vignettes for more >>>> >>>> write.table >>>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >>>> FALSE,sep="\t") >>>> >>>> >>>> *Thanks in advance* >>>> >>>> JL >>>> >>>> 2012/11/8 Harris A. Jaffee<hj at="" jhu.edu=""> >>>> >>>> On the elementary end of all this... >>>>> If the sites are on a *file*, one per line, you could do >>>>> >>>>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >>>>> >>>>> etc. >>>>> >>>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >>>>> >>>>> Hi Jose, >>>>>> On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: >>>>>> >>>>>>> Dear James, >>>>>>> >>>>>>> To answer your questions swiftly, the features are methylation sites >>>>>>> >>>>>> (that's why I only have a coordinate instead of having a Start/End >>>>> pair) in >>>>> mouse (mm9) genome and I have a list of those in the following format: >>>>> >>>>>> chr17.31246506 >>>>>>> How could I read the list so that I can input the "chr" and the >>>>>>> >>>>>> "coordinate" parameters into the expression you recommended? >>>>>> First you need to split your data to chr and pos. >>>>>> >>>>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >>>>>> >>>>> "\\.")) >>>>> >>>>>> Note that your vector of chr.pos data may have been in as a factor, so >>>>>> >>>>> you may need to wrap with an as.character(). If you don't know what I >>>>> mean >>>>> by that, you should first read 'An Introduction to R' ( >>>>> http://cran.r-project.org/doc/**manuals/R-intro.html<http: cran="" .r-project.org="" doc="" manuals="" r-intro.html="">). >>>>> That will be time >>>>> well spent. >>>>> >>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>>> Both the seqnames and ranges argument to GRanges() can be based on >>>>>> >>>>> vectors. So if you had a matrix (called 'y') like >>>>> >>>>>> chr16 3423543 >>>>>> chr3 403992 >>>>>> chr18 3494202 >>>>>> >>>>>> then you can do >>>>>> >>>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>>>>> >>>>>> see ?GRanges for more info. >>>>>> >>>>>> As a side note, IIRC, methylation sites are not in general found in >>>>>> >>>>> exons, but are more likely to be found upstream of a given gene. You >>>>> might >>>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See >>>>> ?exonsBy >>>>> after loading the GenomicFeatures package. >>>>> >>>>>> I 'm lookin forward to obtain a table where my "coordinate- based Id" >>>>>> appears in a column, and the gene_name and if possible, the >>>>> Entrez/Ensembl >>>>> Ids in two other columns : >>>>> >>>>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>>>>> Is it easy to do this in R? >>>>>>> >>>>>> Of course! Everything is easy to do in R. You should see my sweet R >>>>>> >>>>> toast 'n coffee maker ;-D >>>>> >>>>>> But you should note that the fiveUTRByTranscript() function is >>>>>> >>>>> transcript based (obvs), and so you will have multiple transcripts per >>>>> gene. This makes things much more difficult, as a given methylation site >>>>> may overlap multiple transcripts of the same gene. So that makes it >>>>> hard to >>>>> merge your original data with the overlapping transcripts. >>>>> >>>>>> You could do something like >>>>>> >>>>>> ex<- fiveUTRsByTranscript(TxDb.**Mmusculus.UCSC.mm10.knownGene, >>>>>> use.names >>>>>> >>>>> = TRUE) >>>>> >>>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id >>>>>> >>>>>> then you could use unique Gene IDs thusly >>>>>> >>>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >>>>>> >>>>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID") >>>>> >>>>>> That should at least give you a start. See where you can go on your >>>>>> own, >>>>>> >>>>> and let us know if you get stuck. >>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> I'm still really new to R and I lack many concepts you may consider >>>>>> basic... I'm awfully sorry >>>>>> Best >>>>>>> JL >>>>>>> >>>>>>> 2012/11/8 James W. MacDonald<jmacdon at="" uw.edu<**mailto:jmacdon="" at="" uw.edu="">> >>>>>>> >>>>>>> Hi Jose, >>>>>>> >>>>>>> >>>>>>> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: >>>>>>> >>>>>>> Dear Bioconductor list, >>>>>>> >>>>>>> I write you this email asking for a Bioconductor module that >>>>>>> allows me to >>>>>>> annotate genomic coordinates and get different GeneIds. >>>>>>> I'll show you an example of what I'm referring to: >>>>>>> >>>>>>> I have this data: >>>>>>> Chromosome coordinate >>>>>>> chr17 31246506 >>>>>>> >>>>>>> >>>>>>> It depends on what that coordinate is. Is it the start of a >>>>>>> transcript? A SNP? Do you really just have a single coordinate, or >>>>>>> do you have a range? What species are we talking about here? >>>>>>> >>>>>>> Depending on what your data are, you might want to use either one >>>>>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>>>>> much to go on here. If I assume this is a coordinate that one >>>>>>> might think is within an exon, and if I further assume you are >>>>>>> working with H. sapiens, I could suggest something like >>>>>>> >>>>>>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene) >>>>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") >>>>>>> >>>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>>>> >>>>>>> ex[ex %in% x] >>>>>>> >>>>>>> or maybe more appropriately >>>>>>> >>>>>>> names(ex)[ex %in% x] >>>>>>> >>>>>>> which will give you the Gene ID, and you can go from there using >>>>>>> the org.Hs.eg.db package. >>>>>>> >>>>>>> If however, your coordinate isn't in an exon, but might be in a >>>>>>> UTR, you can look at ?exonsBy to see what other sequences you can >>>>>>> extract to compare with. >>>>>>> >>>>>>> If these are SNPs, then you can look at the help pages for the >>>>>>> relevant SNPlocs package. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> which can also be written this way by the program that yielded >>>>>>> the result: >>>>>>> chr17.31246506 >>>>>>> >>>>>>> And I need to convert this data into a gene name and known >>>>>>> gene Ids, such >>>>>>> as: >>>>>>> >>>>>>> Gene name Entrez_ID Ensembl_ID >>>>>>> >>>>>>> Tff3 NM_011575 20050 >>>>>>> Can you please advice me about a module able to perform this >>>>>>> ID conversion >>>>>>> using a list of "chr17.31246506" type coordinates as input? >>>>>>> >>>>>>> Thanks in advance >>>>>>> >>>>>>> With best wishes >>>>>>> >>>>>>> >>>>>>> >>>>>>> ______________________________**_________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org<**mailto:Bioconductor at r-project.** >>>>>>> org<bioconductor at="" r-project.org="">> >>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor< https://stat.ethz.ch/mailman/listinfo/bioconductor> >>>>>>> Search the archives: >>>>>>> >>>>>>> http://news.gmane.org/gmane.**science.biology.informatics.** >>>>> conductor<http: news.gmane.org="" gmane.science.biology.informatic="" s.conductor=""> >>>>> >>>>>>> -- James W. MacDonald, M.S. >>>>>>> Biostatistician >>>>>>> University of Washington >>>>>>> Environmental and Occupational Health Sciences >>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>> Seattle WA 98105-6099 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> -- >>>>>>> Dr. Jos? Luis Lav?n Trueba >>>>>>> >>>>>>> Dpto. de Producci?n Agraria >>>>>>> Grupo de Gen?tica y Microbiolog?a >>>>>>> Universidad P?blica de Navarra >>>>>>> 31006 Pamplona >>>>>>> Navarra >>>>>>> SPAIN >>>>>>> >>>>>> -- >>>>>> James W. MacDonald, M.S. >>>>>> Biostatistician >>>>>> University of Washington >>>>>> Environmental and Occupational Health Sciences >>>>>> 4225 Roosevelt Way NE, # 100 >>>>>> Seattle WA 98105-6099 >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>> Search the archives: >>>>>> >>>>> http://news.gmane.org/gmane.**science.biology.informatics.**cond uctor<http: news.gmane.org="" gmane.science.biology.informatics.conducto="" r=""> >>>>> >>>>> >>>>> >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 11.4 years ago James W. MacDonald 65k

0

Entering edit mode

You don't say what the disaster was, but I'll try to push you in a better direction. On Nov 12, 2012, at 9:59 AM, Jos? Luis Lav?n wrote: Hello all, > > First of all I want to thank everybody that gave me advice on this subject. trying to follow the advice, I added some modifications mixing codes from Tim, Harris and James , but it seems I got lost somewhere in between... > I'm sorry for bothering you all again, but I'm afraid I need some more help. > > I need to read my ids.txt file and then iterate use each line of id (chr.position) to perform the annotation process with it. I thought of a "for" loop to achieve it, but I do not seem to catch the essence of R processes and I guess I miss my tryout. > Please have a look at my "disaster" and help me with it If such thing is possible... > > biocLite('Mus.musculus') > require(Mus.musculus) > require(TxDb.Mmusculus.UCSC.mm9.knownGene) > txdb <- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene') > egid <- names(txdb) > name <- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) > length(name) == length(egid) ## TRUE, OK to use > esbl <- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) > length(esbl) == length(egid) ## FALSE, do not use > > #read input table > coordTable <- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is=TRUE) > > for(i in 1:length(coordTable)) > { > probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), > strand='*') > } If this appears to be working at all, it is because it is using an object 'y' from sometime earlier, when it should be using coordTable, obtained from the read.delim call. You don't need to loop, and if you did, it would probably be over 1:nrow(coordTable). (Examine length(coordTable) to see what it is.) I believe you can just do y <- coordTable <- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is=TRUE) probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) strand defaults to "*"; see ?GRanges > genome(probes) <- 'mm9' ## prevents some stoopid mistakes > > geneBodyProbes <- findOverlaps(probes, txdb) > geneBodyProbes > > write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > ## Hits of length 1 > ## queryLength: 1 > ## subjectLength: 21761 > ## queryHits subjectHits > ## <integer> <integer> > ## 1 1 1641 > ## > > name[subjectHits(geneBodyProbes)] > > ## 11307 # egid > ## "Abcg1" # name > ## > > org.Mm.egCHRLOC[['11307']] > > ## 17 > ## 31057694 > ## > > org.Mm.egENSEMBL[['11307']] > > ## [1] "ENSMUSG00000024030" > > promotersByGene <- flank(txdb, 1500) # or however many bases you want > promotersByGene <- flank(promotersByGene, 200, FALSE) # a little extra > promoterProbes <- findOverlaps(probes, promotersByGene) > promoterProbes > > write.table (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",q uote=FALSE,sep="\t") > > ## Hits of length 0 > ## queryLength: 1 > ## subjectLength: 21761 > ## > ## read the GRanges and GenomicFeatures vignettes for more > > write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t") > > > Thanks in advance > > JL > > 2012/11/8 Harris A. Jaffee <hj at="" jhu.edu=""> > On the elementary end of all this... > > If the sites are on a *file*, one per line, you could do > > y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) > > etc. > > On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: > > > Hi Jose, > > > > On 11/8/2012 10:28 AM, Jos? Luis Lav?n wrote: > >> Dear James, > >> > >> To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format: > >> > >> chr17.31246506 > >> > >> How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended? > > > > First you need to split your data to chr and pos. > > > > chr.pos <- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, "\\.")) > > > > Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent. > > > >> > >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > > > > Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like > > > > chr16 3423543 > > chr3 403992 > > chr18 3494202 > > > > then you can do > > > > x <- GRanges(y[,1], IRanges(start = y[,2], width = 1)) > > > > see ?GRanges for more info. > > > > As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package. > > > >> > >> I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the Entrez/Ensembl Ids in two other columns : > >> > >> Coordinate Gene_name Entrez_ID Ensembl_ID > >> > >> Is it easy to do this in R? > > > > Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D > > > > But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts. > > > > You could do something like > > > > ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE) > > ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id > > > > then you could use unique Gene IDs thusly > > > > my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID") > > > > That should at least give you a start. See where you can go on your own, and let us know if you get stuck. > > > > Best, > > > > Jim > > > > > > > >> > >> I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry > >> > >> Best > >> > >> JL > >> > >> 2012/11/8 James W. MacDonald <jmacdon at="" uw.edu="" <mailto:jmacdon="" at="" uw.edu="">> > >> > >> Hi Jose, > >> > >> > >> On 11/8/2012 8:19 AM, Jos? Luis Lav?n wrote: > >> > >> Dear Bioconductor list, > >> > >> I write you this email asking for a Bioconductor module that > >> allows me to > >> annotate genomic coordinates and get different GeneIds. > >> I'll show you an example of what I'm referring to: > >> > >> I have this data: > >> Chromosome coordinate > >> chr17 31246506 > >> > >> > >> It depends on what that coordinate is. Is it the start of a > >> transcript? A SNP? Do you really just have a single coordinate, or > >> do you have a range? What species are we talking about here? > >> > >> Depending on what your data are, you might want to use either one > >> of the TxDb packages, or a SNPlocs package. There really isn't > >> much to go on here. If I assume this is a coordinate that one > >> might think is within an exon, and if I further assume you are > >> working with H. sapiens, I could suggest something like > >> > >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > >> > >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) > >> > >> ex[ex %in% x] > >> > >> or maybe more appropriately > >> > >> names(ex)[ex %in% x] > >> > >> which will give you the Gene ID, and you can go from there using > >> the org.Hs.eg.db package. > >> > >> If however, your coordinate isn't in an exon, but might be in a > >> UTR, you can look at ?exonsBy to see what other sequences you can > >> extract to compare with. > >> > >> If these are SNPs, then you can look at the help pages for the > >> relevant SNPlocs package. > >> > >> Best, > >> > >> Jim > >> > >> > >> > >> which can also be written this way by the program that yielded > >> the result: > >> chr17.31246506 > >> > >> And I need to convert this data into a gene name and known > >> gene Ids, such > >> as: > >> > >> Gene name Entrez_ID Ensembl_ID > >> > >> Tff3 NM_011575 20050 > >> Can you please advice me about a module able to perform this > >> ID conversion > >> using a list of "chr17.31246506" type coordinates as input? > >> > >> Thanks in advance > >> > >> With best wishes > >> > >> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> -- James W. MacDonald, M.S. > >> Biostatistician > >> University of Washington > >> Environmental and Occupational Health Sciences > >> 4225 Roosevelt Way NE, # 100 > >> Seattle WA 98105-6099 > >> > >> > >> > >> > >> -- > >> -- > >> Dr. Jos? Luis Lav?n Trueba > >> > >> Dpto. de Producci?n Agraria > >> Grupo de Gen?tica y Microbiolog?a > >> Universidad P?blica de Navarra > >> 31006 Pamplona > >> Navarra > >> SPAIN > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > -- > Dr. Jos? Luis Lav?n Trueba > > Dpto. de Producci?n Agraria > Grupo de Gen?tica y Microbiolog?a > Universidad P?blica de Navarra > 31006 Pamplona > Navarra > SPAIN

ADD REPLY • link 11.4 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

José Luis Lavín ▴ 280

@jose-luis-lavin-5529

Last seen 9.6 years ago

Dear Bioconductor list, I write you this email asking for a Bioconductor module that allows me to annotate genomic coordinates and get different GeneIds. I'll show you an example of what I'm referring to: I have this data: Chromosome coordinate chr17 31246506 which can also be written this way by the program that yielded the result: chr17.31246506 And I need to convert this data into a gene name and known gene Ids, such as: Gene name Entrez_ID Ensembl_ID Tff3 NM_011575 20050 Can you please advice me about a module able to perform this ID conversion using a list of "chr17.31246506" type coordinates as input? Thanks in advance With best wishes -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD COMMENT • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

@Admin I've had a problem with some invalid embedded charset in my previous post tryouts, please remove any invalid submission. ---------------------------------------------------------------------- -------------------------------------------- Dear Bioconductor list, I write you this email asking for a Bioconductor module that allows me to annotate genomic coordinates and get different GeneIds. I'll show you an example of what I'm referring to: I have this data: Chromosome coordinate chr17 31246506 which can also be written this way by the program that yielded the result: chr17.31246506 And I need to convert this data into a gene name and known gene Ids, such as: Gene name Entrez_ID Ensembl_ID Tff3 NM_011575 20050 Can you please advice me about a module able to perform this ID conversion using a list of "chr17.31246506" type coordinates as input? Thanks in advance With best wishes JL [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

José Luis Lavín ▴ 280

@jose-luis-lavin-5529

Last seen 9.6 years ago

Dear all, I tried Valerie's approach but I came across an error I don't really know how to fix (I not familiar with R errors yet, but if I keep performing so badly I guess I will soon come across quite a few of them in a short time...) I will write the code I tested and the error I got: #I had to install some libraries from Bioconductor at the beginning source("http://bioconductor.org/biocLite.R") biocLite('Mus.musculus') biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') biocLite('VariantAnnotation') library(Mus.musculus) library(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene #gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) y <- read.delim("/path_to_dir/ids.txt", sep=".", header=FALSE, as.is=TRUE) probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) *#Lets see if ids.txt was read correctly into probes* head(probes) 1 chr13 21272514 2 chr13 21272519 3 chr13 21272525 4 chr13 21272533 5 chr13 21295151 6 chr13 21295172 *#seems correct to me* library(VariantAnnotation) loc <- locateVariants(query=y, subject=txdb, region=AllVariants()) *#Error in function (classes, fdef, mtable) : #unable to find an inherited method for function "locateVariants", #for signature "data.frame", "TranscriptDb", "AllVariants"* #*Everything stops here* loc entrezid <- loc$GENEID select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", "ENSEMBL")) select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", "ENSEMBL")) Thanks in advance to all the people that spends time helping newbies like me ;) 2012/11/12 Valerie Obenchain <vobencha@fhcrc.org> > Hi Jose, > > Here is a slightly different approach. > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.**mm9.knownGene) > txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene > > ## I assume you've figured out how to read your data into a GRanges. > ## Here we use a small example. > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) > > ## The locateVariants() function in the VariantAnnotation package will > ## give you the GENEIDs for all ranges in 'query'. If the range does not > ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and > ## FOLLOWID are provided. The LOCATION column indicates what > ## region the range fell in. See ?locateVariants for details on the > ## different regions and the ability to set 'upstream' and 'downstream' > ## values for promoter regions. > library(VariantAnnotation) > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants()) > > > loc > GRanges with 1 range and 7 metadata columns: > seqnames ranges strand | LOCATION QUERYID TXID > <rle> <iranges> <rle> | <factor> <integer> <integer> > [1] chr17 [31245606, 31245625] * | coding 1 47716 > CDSID GENEID PRECEDEID FOLLOWID > <integer> <character> <character> <character> > [1] 184246 11307 <na> <na> > > > ## Use the select() function to map these GENEID's to the other > ## values you are interested in. The GENEID's from locateVariants() > ## are Entrez ID's so we use those as our 'keytype'. > entrezid <- loc$GENEID > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, cols=c("GENENAME", > "ENSEMBL")) > > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, > cols=c("GENENAME", "ENSEMBL")) > ENTREZID ENSEMBL > 1 11307 ENSMUSG00000024030 > GENENAME > 1 ATP-binding cassette, sub-family G (WHITE), member 1 > > > Valerie > > > > > On 11/12/12 06:59, José Luis Lavín wrote: > >> Hello all, >> >> First of all I want to thank everybody that gave me advice on this >> subject. >> trying to follow the advice, I added some modifications mixing codes from >> Tim, Harris and James , but it seems I got lost somewhere in between... >> I'm sorry for bothering you all again, but I'm afraid I need some more >> help. >> >> I need to read my ids.txt file and then iterate use each line of id >> (chr.position) to perform the annotation process with it. I thought of a >> "for" loop to achieve it, but I do not seem to catch the essence of R >> processes and I guess I miss my tryout. >> Please have a look at my "disaster" and help me with it If such thing is >> possible... >> >> biocLite('Mus.musculus') >> require(Mus.musculus) >> require(TxDb.Mmusculus.UCSC.**mm9.knownGene) >> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, 'gene') >> egid<- names(txdb) >> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA)) >> length(name) == length(egid) ## TRUE, OK to use >> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA)) >> length(esbl) == length(egid) ## FALSE, do not use >> >> #read input table >> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".", header=FALSE, >> as.is >> =TRUE) >> >> for(i in 1:length(coordTable)) >> { >> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1), >> strand='*') >> } >> >> genome(probes)<- 'mm9' ## prevents some stoopid mistakes >> >> geneBodyProbes<- findOverlaps(probes, txdb) >> geneBodyProbes >> >> write.table >> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >> FALSE,sep="\t") >> >> ## Hits of length 1 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## queryHits subjectHits >> ##<integer> <integer> >> ## 1 1 1641 >> ## >> >> name[subjectHits(**geneBodyProbes)] >> >> ## 11307 # egid >> ## "Abcg1" # name >> ## >> >> org.Mm.egCHRLOC[['11307']] >> >> ## 17 >> ## 31057694 >> ## >> >> org.Mm.egENSEMBL[['11307']] >> >> ## [1] "ENSMUSG00000024030" >> >> promotersByGene<- flank(txdb, 1500) # or however many bases you want >> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra >> promoterProbes<- findOverlaps(probes, promotersByGene) >> promoterProbes >> >> write.table >> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",** >> quote=FALSE,sep="\t") >> >> ## Hits of length 0 >> ## queryLength: 1 >> ## subjectLength: 21761 >> ## >> ## read the GRanges and GenomicFeatures vignettes for more >> >> write.table >> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=** >> FALSE,sep="\t") >> >> >> *Thanks in advance* >> >> >> JL >> >> 2012/11/8 Harris A. Jaffee<hj@jhu.edu> >> >> On the elementary end of all this... >>> >>> If the sites are on a *file*, one per line, you could do >>> >>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE) >>> >>> etc. >>> >>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote: >>> >>> Hi Jose, >>>> >>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote: >>>> >>>>> Dear James, >>>>> >>>>> To answer your questions swiftly, the features are methylation sites >>>>> >>>> (that's why I only have a coordinate instead of having a Start/End >>> pair) in >>> mouse (mm9) genome and I have a list of those in the following format: >>> >>>> chr17.31246506 >>>>> >>>>> How could I read the list so that I can input the "chr" and the >>>>> >>>> "coordinate" parameters into the expression you recommended? >>> >>>> First you need to split your data to chr and pos. >>>> >>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">, >>>> >>> "\\.")) >>> >>>> Note that your vector of chr.pos data may have been in as a factor, so >>>> >>> you may need to wrap with an as.character(). If you don't know what I >>> mean >>> by that, you should first read 'An Introduction to R' ( >>> http://cran.r-project.org/doc/**manuals/R-intro.html<http: cran.r="" -project.org="" doc="" manuals="" r-intro.html="">). >>> That will be time >>> well spent. >>> >>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>> >>>> Both the seqnames and ranges argument to GRanges() can be based on >>>> >>> vectors. So if you had a matrix (called 'y') like >>> >>>> chr16 3423543 >>>> chr3 403992 >>>> chr18 3494202 >>>> >>>> then you can do >>>> >>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1)) >>>> >>>> see ?GRanges for more info. >>>> >>>> As a side note, IIRC, methylation sites are not in general found in >>>> >>> exons, but are more likely to be found upstream of a given gene. You >>> might >>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See >>> ?exonsBy >>> after loading the GenomicFeatures package. >>> >>>> I 'm lookin forward to obtain a table where my "coordinate-based Id" >>>>> >>>> appears in a column, and the gene_name and if possible, the >>> Entrez/Ensembl >>> Ids in two other columns : >>> >>>> Coordinate Gene_name Entrez_ID Ensembl_ID >>>>> >>>>> Is it easy to do this in R? >>>>> >>>> Of course! Everything is easy to do in R. You should see my sweet R >>>> >>> toast 'n coffee maker ;-D >>> >>>> But you should note that the fiveUTRByTranscript() function is >>>> >>> transcript based (obvs), and so you will have multiple transcripts per >>> gene. This makes things much more difficult, as a given methylation site >>> may overlap multiple transcripts of the same gene. So that makes it hard >>> to >>> merge your original data with the overlapping transcripts. >>> >>>> You could do something like >>>> >>>> ex<- fiveUTRsByTranscript(TxDb.**Mmusculus.UCSC.mm10.knownGene, >>>> use.names >>>> >>> = TRUE) >>> >>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id >>>> >>>> then you could use unique Gene IDs thusly >>>> >>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)), >>>> >>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID") >>> >>>> That should at least give you a start. See where you can go on your own, >>>> >>> and let us know if you get stuck. >>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> I'm still really new to R and I lack many concepts you may consider >>>>> >>>> basic... I'm awfully sorry >>> >>>> Best >>>>> >>>>> JL >>>>> >>>>> 2012/11/8 James W. MacDonald<jmacdon@uw.edu<**mailto:jmacdon@uw.edu>> >>>>> >>>>> Hi Jose, >>>>> >>>>> >>>>> On 11/8/2012 8:19 AM, José Luis Lavín wrote: >>>>> >>>>> Dear Bioconductor list, >>>>> >>>>> I write you this email asking for a Bioconductor module that >>>>> allows me to >>>>> annotate genomic coordinates and get different GeneIds. >>>>> I'll show you an example of what I'm referring to: >>>>> >>>>> I have this data: >>>>> Chromosome coordinate >>>>> chr17 31246506 >>>>> >>>>> >>>>> It depends on what that coordinate is. Is it the start of a >>>>> transcript? A SNP? Do you really just have a single coordinate, or >>>>> do you have a range? What species are we talking about here? >>>>> >>>>> Depending on what your data are, you might want to use either one >>>>> of the TxDb packages, or a SNPlocs package. There really isn't >>>>> much to go on here. If I assume this is a coordinate that one >>>>> might think is within an exon, and if I further assume you are >>>>> working with H. sapiens, I could suggest something like >>>>> >>>>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene) >>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") >>>>> >>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1)) >>>>> >>>>> ex[ex %in% x] >>>>> >>>>> or maybe more appropriately >>>>> >>>>> names(ex)[ex %in% x] >>>>> >>>>> which will give you the Gene ID, and you can go from there using >>>>> the org.Hs.eg.db package. >>>>> >>>>> If however, your coordinate isn't in an exon, but might be in a >>>>> UTR, you can look at ?exonsBy to see what other sequences you can >>>>> extract to compare with. >>>>> >>>>> If these are SNPs, then you can look at the help pages for the >>>>> relevant SNPlocs package. >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> which can also be written this way by the program that yielded >>>>> the result: >>>>> chr17.31246506 >>>>> >>>>> And I need to convert this data into a gene name and known >>>>> gene Ids, such >>>>> as: >>>>> >>>>> Gene name Entrez_ID Ensembl_ID >>>>> >>>>> Tff3 NM_011575 20050 >>>>> Can you please advice me about a module able to perform this >>>>> ID conversion >>>>> using a list of "chr17.31246506" type coordinates as input? >>>>> >>>>> Thanks in advance >>>>> >>>>> With best wishes >>>>> >>>>> >>>>> >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org<**mailto:Bioconductor@r-project.** >>>>> org <bioconductor@r-project.org>> >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<htt ps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: >>>>> >>>>> http://news.gmane.org/gmane.**science.biology.informatics.**con ductor<http: news.gmane.org="" gmane.science.biology.informatics.conduct="" or=""> >>> >>>> >>>>> -- James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr. José Luis Lavín Trueba >>>>> >>>>> Dpto. de Producción Agraria >>>>> Grupo de Genética y Microbiología >>>>> Universidad Pública de Navarra >>>>> 31006 Pamplona >>>>> Navarra >>>>> SPAIN >>>>> >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD COMMENT • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi, On Tue, Nov 13, 2012 at 5:00 AM, Jos? Luis Lav?n <jluis.lavin at="" unavarra.es=""> wrote: > Dear all, > > I tried Valerie's approach but I came across an error I don't really know > how to fix (I not familiar with R errors yet, but if I keep performing so > badly I guess I will soon come across quite a few of them in a short > time...) > > I will write the code I tested and the error I got: > > #I had to install some libraries from Bioconductor at the beginning > > source("http://bioconductor.org/biocLite.R") > biocLite('Mus.musculus') > biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') > biocLite('VariantAnnotation') > > > library(Mus.musculus) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > > #gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) > > y <- read.delim("/path_to_dir/ids.txt", sep=".", header=FALSE, as.is=TRUE) > > > probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) > > *#Lets see if ids.txt was read correctly into probes* > > head(probes) > > 1 chr13 21272514 > 2 chr13 21272519 > 3 chr13 21272525 > 4 chr13 21272533 > 5 chr13 21295151 > 6 chr13 21295172 > > *#seems correct to me* This doesn't seem correct to me, actually. Taking the `head` of a proper GRanges object should print some "GRanges" like stuff around the data, for instance: R> probes <- GRanges('chr13', IRanges(c(21272514, 21272519, 21272525), width=1)) R> head(probes) GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr13 [21272514, 21272514] * [2] chr13 [21272519, 21272519] * [3] chr13 [21272525, 21272525] * --- seqlengths: chr13 NA You see how these two are very different? You can also check to see what your `probes` object is, like so: R> class(probes) [1] "GRanges" attr(,"package") [1] "GenomicRanges" But I'm guessing your `probes` object is actually still a data.frame (in which case I am a bit surprised that your call to `probes <- GRanges(y[,1], ...)` didn't result in an error) because: > library(VariantAnnotation) > > loc <- locateVariants(query=y, subject=txdb, region=AllVariants()) > > *#Error in function (classes, fdef, mtable) : > #unable to find an inherited method for function "locateVariants", > #for signature "data.frame", "TranscriptDb", "AllVariants"* R is telling you that you are trying to call `locateVariants` on a query that is a `data.frame` object, and there is no `locateVariants` function defined like that. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 11.4 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Dear Steve, I repeated the whole process and this time i get a GRanges object like the one you shown after checking it as you said. GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr10 [100051744, 100051744] * [2] chr10 [100051760, 100051760] * [3] chr10 [100051762, 100051762] * [4] chr10 [100051780, 100051780] * [5] chr10 [100051790, 100051790] * [6] chr10 [100837476, 100837476] * --- seqlengths: chr1 chr10 chr11 ... chrY chrY_random NA NA NA ... NA NA You gave me the clue to find the error in my command, I was passing "*y*" instead of "*probes*", eg: loc <- locateVariants(query=*y*, subject=txdb, region=AllVariants()) instead of: loc <- locateVariants(query=*probes*, subject=txdb, region=AllVariants()) Thus the error shown was as you said: "But I'm guessing your `probes` object is actually still a data.frame (in which case I am a bit surprised that your call to `probes <- GRanges(y[,1], ...)` didn't result in an error) " Now after changing the command it works ;) Thank you all very much for your support! PD. I almost forgot it, but I also had to filter ChrM related features because otherwise I got the following error: *Error in .findOverlaps.circle(circle.length, seqselect(q_ranges, q_idx), : overlap type "within" is not yet supported for circular sequence chrM In addition: Warning messages: 1: In .setActiveSubjectSeq(query, subject) : circular sequence(s) in query 'chrM' ignored 2: In .setActiveSubjectSeq(query, subject) : circular sequence(s) in query 'chrM' ignored 3: In .setActiveSubjectSeq(query, subject) : circular sequence(s) in query 'chrM' ignored 4: In .setActiveSubjectSeq(query, subject) : circular sequence(s) in query 'chrM' ignored 5: In .setActiveSubjectSeq(query, subject) : circular sequence(s) in query 'chrM' ignored* Hope this may be of interest for someone sometime in the future ;) 2012/11/13 Steve Lianoglou <mailinglist.honeypot@gmail.com> > Hi, > > On Tue, Nov 13, 2012 at 5:00 AM, José Luis Lavín > <jluis.lavin@unavarra.es> wrote: > > Dear all, > > > > I tried Valerie's approach but I came across an error I don't really know > > how to fix (I not familiar with R errors yet, but if I keep performing so > > badly I guess I will soon come across quite a few of them in a short > > time...) > > > > I will write the code I tested and the error I got: > > > > #I had to install some libraries from Bioconductor at the beginning > > > > source("http://bioconductor.org/biocLite.R") > > biocLite('Mus.musculus') > > biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene') > > biocLite('VariantAnnotation') > > > > > > library(Mus.musculus) > > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > > > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene > > > > #gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20)) > > > > y <- read.delim("/path_to_dir/ids.txt", sep=".", header=FALSE, as.is > =TRUE) > > > > > > probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1)) > > > > *#Lets see if ids.txt was read correctly into probes* > > > > head(probes) > > > > 1 chr13 21272514 > > 2 chr13 21272519 > > 3 chr13 21272525 > > 4 chr13 21272533 > > 5 chr13 21295151 > > 6 chr13 21295172 > > > > *#seems correct to me* > > This doesn't seem correct to me, actually. Taking the `head` of a > proper GRanges object should print some "GRanges" like stuff around > the data, for instance: > > R> probes <- GRanges('chr13', IRanges(c(21272514, 21272519, 21272525), > width=1)) > R> head(probes) > > GRanges with 3 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr13 [21272514, 21272514] * > [2] chr13 [21272519, 21272519] * > [3] chr13 [21272525, 21272525] * > --- > seqlengths: > chr13 > NA > > You see how these two are very different? > > You can also check to see what your `probes` object is, like so: > > R> class(probes) > [1] "GRanges" > attr(,"package") > [1] "GenomicRanges" > > But I'm guessing your `probes` object is actually still a data.frame > (in which case I am a bit surprised that your call to `probes <- > GRanges(y[,1], ...)` didn't result in an error) because: > > > library(VariantAnnotation) > > > > loc <- locateVariants(query=y, subject=txdb, region=AllVariants()) > > > > *#Error in function (classes, fdef, mtable) : > > #unable to find an inherited method for function "locateVariants", > > #for signature "data.frame", "TranscriptDb", "AllVariants"* > > R is telling you that you are trying to call `locateVariants` on a > query that is a `data.frame` object, and there is no `locateVariants` > function defined like that. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- -- Dr. José Luis Lavín Trueba Dpto. de Producción Agraria Grupo de Genética y Microbiología Universidad Pública de Navarra 31006 Pamplona Navarra SPAIN [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi Jose, Thanks for reporting the bug. You should not have to remove circular chromosomes yourself, it should be done automatically. I've also reduced the number of warnings (re: circular sequences) to one vs multiple. Fixes are in devel, 1.5.13 and release 1.4.4 which will be available via biocLite() Wednesday ~9am PST or in svn immediately. Valerie > > PD. I almost forgot it, but I also had to filter ChrM related features > because otherwise I got the following error: > > *Error in .findOverlaps.circle(circle.length, seqselect(q_ranges, q_idx), : > overlap type "within" is not yet supported for circular sequence chrM > In addition: Warning messages: > 1: In .setActiveSubjectSeq(query, subject) : > circular sequence(s) in query 'chrM' ignored > 2: In .setActiveSubjectSeq(query, subject) : > circular sequence(s) in query 'chrM' ignored > 3: In .setActiveSubjectSeq(query, subject) : > circular sequence(s) in query 'chrM' ignored > 4: In .setActiveSubjectSeq(query, subject) : > circular sequence(s) in query 'chrM' ignored > 5: In .setActiveSubjectSeq(query, subject) : > circular sequence(s) in query 'chrM' ignored* > > Hope this may be of interest for someone sometime in the future ;) > >

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.