Dear all,
I tried Valerie's approach but I came across an error I don't really
know
how to fix (I not familiar with R errors yet, but if I keep performing
so
badly I guess I will soon come across quite a few of them in a short
time...)
I will write the code I tested and the error I got:
#I had to install some libraries from Bioconductor at the beginning
source("
http://bioconductor.org/biocLite.R")
biocLite('Mus.musculus')
biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene')
biocLite('VariantAnnotation')
library(Mus.musculus)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
#gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
y <- read.delim("/path_to_dir/ids.txt", sep=".", header=FALSE,
as.is=TRUE)
probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1))
*#Lets see if ids.txt was read correctly into probes*
head(probes)
1 chr13 21272514
2 chr13 21272519
3 chr13 21272525
4 chr13 21272533
5 chr13 21295151
6 chr13 21295172
*#seems correct to me*
library(VariantAnnotation)
loc <- locateVariants(query=y, subject=txdb, region=AllVariants())
*#Error in function (classes, fdef, mtable) :
#unable to find an inherited method for function "locateVariants",
#for signature "data.frame", "TranscriptDb", "AllVariants"*
#*Everything stops here*
loc
entrezid <- loc$GENEID
select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
cols=c("GENENAME",
"ENSEMBL"))
select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
cols=c("GENENAME",
"ENSEMBL"))
Thanks in advance to all the people that spends time helping newbies
like
me
;)
2012/11/12 Valerie Obenchain <vobencha@fhcrc.org>
> Hi Jose,
>
> Here is a slightly different approach.
>
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.**mm9.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene
>
> ## I assume you've figured out how to read your data into a GRanges.
> ## Here we use a small example.
> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
>
> ## The locateVariants() function in the VariantAnnotation package
will
> ## give you the GENEIDs for all ranges in 'query'. If the range does
not
> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
> ## FOLLOWID are provided. The LOCATION column indicates what
> ## region the range fell in. See ?locateVariants for details on the
> ## different regions and the ability to set 'upstream' and
'downstream'
> ## values for promoter regions.
> library(VariantAnnotation)
> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>
> > loc
> GRanges with 1 range and 7 metadata columns:
> seqnames ranges strand | LOCATION QUERYID
TXID
> <rle> <iranges> <rle> | <factor> <integer> <integer>
> [1] chr17 [31245606, 31245625] * | coding 1
47716
> CDSID GENEID PRECEDEID FOLLOWID
> <integer> <character> <character> <character>
> [1] 184246 11307 <na> <na>
>
>
> ## Use the select() function to map these GENEID's to the other
> ## values you are interested in. The GENEID's from locateVariants()
> ## are Entrez ID's so we use those as our 'keytype'.
> entrezid <- loc$GENEID
> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
cols=c("GENENAME",
> "ENSEMBL"))
> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
> ENTREZID ENSEMBL
> 1 11307 ENSMUSG00000024030
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>
>
> Valerie
>
>
>
>
> On 11/12/12 06:59, José Luis Lavín wrote:
>
>> Hello all,
>>
>> First of all I want to thank everybody that gave me advice on this
>> subject.
>> trying to follow the advice, I added some modifications mixing
codes from
>> Tim, Harris and James , but it seems I got lost somewhere in
between...
>> I'm sorry for bothering you all again, but I'm afraid I need some
more
>> help.
>>
>> I need to read my ids.txt file and then iterate use each line of id
>> (chr.position) to perform the annotation process with it. I thought
of a
>> "for" loop to achieve it, but I do not seem to catch the essence of
R
>> processes and I guess I miss my tryout.
>> Please have a look at my "disaster" and help me with it If such
thing is
>> possible...
>>
>> biocLite('Mus.musculus')
>> require(Mus.musculus)
>> require(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, 'gene')
>> egid<- names(txdb)
>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>> length(name) == length(egid) ## TRUE, OK to use
>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>> length(esbl) == length(egid) ## FALSE, do not use
>>
>> #read input table
>> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".",
header=FALSE,
>> as.is
>> =TRUE)
>>
>> for(i in 1:length(coordTable))
>> {
>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>> strand='*')
>> }
>>
>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>
>> geneBodyProbes<- findOverlaps(probes, txdb)
>> geneBodyProbes
>>
>> write.table
>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>> FALSE,sep="\t")
>>
>> ## Hits of length 1
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ## queryHits subjectHits
>> ##<integer> <integer>
>> ## 1 1 1641
>> ##
>>
>> name[subjectHits(**geneBodyProbes)]
>>
>> ## 11307 # egid
>> ## "Abcg1" # name
>> ##
>>
>> org.Mm.egCHRLOC[['11307']]
>>
>> ## 17
>> ## 31057694
>> ##
>>
>> org.Mm.egENSEMBL[['11307']]
>>
>> ## [1] "ENSMUSG00000024030"
>>
>> promotersByGene<- flank(txdb, 1500) # or however many bases you
want
>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little
extra
>> promoterProbes<- findOverlaps(probes, promotersByGene)
>> promoterProbes
>>
>> write.table
>> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",**
>> quote=FALSE,sep="\t")
>>
>> ## Hits of length 0
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ##
>> ## read the GRanges and GenomicFeatures vignettes for more
>>
>> write.table
>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>> FALSE,sep="\t")
>>
>>
>> *Thanks in advance*
>>
>>
>> JL
>>
>> 2012/11/8 Harris A. Jaffee<hj@jhu.edu>
>>
>> On the elementary end of all this...
>>>
>>> If the sites are on a *file*, one per line, you could do
>>>
>>> y<- read.delim(filename, sep=".", header=FALSE,
as.is=TRUE)
>>>
>>> etc.
>>>
>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>>
>>> Hi Jose,
>>>>
>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>>
>>>>> Dear James,
>>>>>
>>>>> To answer your questions swiftly, the features are methylation
sites
>>>>>
>>>> (that's why I only have a coordinate instead of having a
Start/End
>>> pair) in
>>> mouse (mm9) genome and I have a list of those in the following
format:
>>>
>>>> chr17.31246506
>>>>>
>>>>> How could I read the list so that I can input the "chr" and the
>>>>>
>>>> "coordinate" parameters into the expression you recommended?
>>>
>>>> First you need to split your data to chr and pos.
>>>>
>>>> chr.pos<- do.call("rbind", strsplit(<your vector="" of="" chr17.pos="" data="">,
>>>>
>>> "
\."))
>>>
>>>> Note that your vector of chr.pos data may have been in as a
factor, so
>>>>
>>> you may need to wrap with an as.character(). If you don't know
what I
>>> mean
>>> by that, you should first read 'An Introduction to R' (
>>>
http://cran.r-project.org/doc/**manuals/R-intro.html<http: cran.r="" -project.org="" doc="" manuals="" r-intro.html="">).
>>> That will be time
>>> well spent.
>>>
>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>
>>>> Both the seqnames and ranges argument to GRanges() can be based
on
>>>>
>>> vectors. So if you had a matrix (called 'y') like
>>>
>>>> chr16 3423543
>>>> chr3 403992
>>>> chr18 3494202
>>>>
>>>> then you can do
>>>>
>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>>
>>>> see ?GRanges for more info.
>>>>
>>>> As a side note, IIRC, methylation sites are not in general found
in
>>>>
>>> exons, but are more likely to be found upstream of a given gene.
You
>>> might
>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See
>>> ?exonsBy
>>> after loading the GenomicFeatures package.
>>>
>>>> I 'm lookin forward to obtain a table where my "coordinate-based
Id"
>>>>>
>>>> appears in a column, and the gene_name and if possible, the
>>> Entrez/Ensembl
>>> Ids in two other columns :
>>>
>>>> Coordinate Gene_name Entrez_ID Ensembl_ID
>>>>>
>>>>> Is it easy to do this in R?
>>>>>
>>>> Of course! Everything is easy to do in R. You should see my sweet
R
>>>>
>>> toast 'n coffee maker ;-D
>>>
>>>> But you should note that the fiveUTRByTranscript() function is
>>>>
>>> transcript based (obvs), and so you will have multiple transcripts
per
>>> gene. This makes things much more difficult, as a given
methylation site
>>> may overlap multiple transcripts of the same gene. So that makes
it hard
>>> to
>>> merge your original data with the overlapping transcripts.
>>>
>>>> You could do something like
>>>>
>>>> ex<- fiveUTRsByTranscript(TxDb.**Mmusculus.UCSC.mm10.knownGene,
>>>> use.names
>>>>
>>> = TRUE)
>>>
>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x],
elementMetadata))$exon_id
>>>>
>>>> then you could use unique Gene IDs thusly
>>>>
>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>>>>
>>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID")
>>>
>>>> That should at least give you a start. See where you can go on
your own,
>>>>
>>> and let us know if you get stuck.
>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>> I'm still really new to R and I lack many concepts you may
consider
>>>>>
>>>> basic... I'm awfully sorry
>>>
>>>> Best
>>>>>
>>>>> JL
>>>>>
>>>>> 2012/11/8 James W.
MacDonald<jmacdon@uw.edu<**mailto:jmacdon@uw.edu>>
>>>>>
>>>>> Hi Jose,
>>>>>
>>>>>
>>>>> On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>>
>>>>> Dear Bioconductor list,
>>>>>
>>>>> I write you this email asking for a Bioconductor module
that
>>>>> allows me to
>>>>> annotate genomic coordinates and get different GeneIds.
>>>>> I'll show you an example of what I'm referring to:
>>>>>
>>>>> I have this data:
>>>>> Chromosome coordinate
>>>>> chr17 31246506
>>>>>
>>>>>
>>>>> It depends on what that coordinate is. Is it the start of a
>>>>> transcript? A SNP? Do you really just have a single
coordinate, or
>>>>> do you have a range? What species are we talking about here?
>>>>>
>>>>> Depending on what your data are, you might want to use
either one
>>>>> of the TxDb packages, or a SNPlocs package. There really
isn't
>>>>> much to go on here. If I assume this is a coordinate that
one
>>>>> might think is within an exon, and if I further assume you
are
>>>>> working with H. sapiens, I could suggest something like
>>>>>
>>>>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene)
>>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>>>>
>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width =
1))
>>>>>
>>>>> ex[ex %in% x]
>>>>>
>>>>> or maybe more appropriately
>>>>>
>>>>> names(ex)[ex %in% x]
>>>>>
>>>>> which will give you the Gene ID, and you can go from there
using
>>>>> the org.Hs.eg.db package.
>>>>>
>>>>> If however, your coordinate isn't in an exon, but might be
in a
>>>>> UTR, you can look at ?exonsBy to see what other sequences
you can
>>>>> extract to compare with.
>>>>>
>>>>> If these are SNPs, then you can look at the help pages for
the
>>>>> relevant SNPlocs package.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>> which can also be written this way by the program that
yielded
>>>>> the result:
>>>>> chr17.31246506
>>>>>
>>>>> And I need to convert this data into a gene name and
known
>>>>> gene Ids, such
>>>>> as:
>>>>>
>>>>> Gene name Entrez_ID Ensembl_ID
>>>>>
>>>>> Tff3 NM_011575 20050
>>>>> Can you please advice me about a module able to perform
this
>>>>> ID conversion
>>>>> using a list of "chr17.31246506" type coordinates as
input?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> With best wishes
>>>>>
>>>>>
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>>
Bioconductor@r-project.org<**mailto:Bioconductor@r-project.**
>>>>> org <bioconductor@r-project.org>>
>>>>>
https://stat.ethz.ch/mailman/**listinfo/bioconductor<htt ps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>>> Search the archives:
>>>>>
>>>>>
http://news.gmane.org/gmane.**science.biology.informatics.**con
ductor<http: news.gmane.org="" gmane.science.biology.informatics.conduct="" or="">
>>>
>>>>
>>>>> -- James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> Dr. José Luis Lavín Trueba
>>>>>
>>>>> Dpto. de Producción Agraria
>>>>> Grupo de Genética y Microbiología
>>>>> Universidad Pública de Navarra
>>>>> 31006 Pamplona
>>>>> Navarra
>>>>> SPAIN
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor@r-project.org
>>>>
https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>> Search the archives:
>>>>
>>>
http://news.gmane.org/gmane.**science.biology.informatics.**conduc
tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>>
>>>
>>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>>
https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives:
http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor="">
>>
>
>
--
--
Dr. José Luis Lavín Trueba
Dpto. de Producción Agraria
Grupo de Genética y Microbiología
Universidad Pública de Navarra
31006 Pamplona
Navarra
SPAIN
--
--
Dr. José Luis Lavín Trueba
Dpto. de Producción Agraria
Grupo de Genética y Microbiología
Universidad Pública de Navarra
31006 Pamplona
Navarra
SPAIN
[[alternative HTML version deleted]]