Getting the start and end positions of a list of genes
Dear listserv,

I am a long-time R user, novice Bioconductor user. I am quickly realizing they are not the same thing.

I have a very basic question that I hope you can help me with. I have a list of genes in Arabidopsis thaliana. I want to input this list into R/Bioconductor and output a table listing the start and end positions of each gene.

Specific code that will get the job done will be the most helpful for me. Also, please tell me the specific packages and databases and such I must load into memory. I am a total newbie at this.

Thanks in advance,
-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org

-- output of sessionInfo():

> gene.pos <- data.frame(matrix(nrow = 3, ncol = 4))
> gene.list <- c("At5g35790", "AT5g60910", "AT1g16560")
> gene.pos[,1] <- gene.list
> colnames(gene.pos) <- c("gene", "chromosome", "nuc_sequence_start" , "nuc_sequence_end")
>
> gene.pos
       gene chromosome nuc_sequence_start nuc_sequence_end
1 At5g35790         NA                 NA               NA
2 AT5g60910         NA                 NA               NA
3 AT1g16560         NA                 NA               NA
>
> #now what? How do I fill in the blanks?

--
Sent via the guest posting facility at bioconductor.org.
good spec, but i can't get through the whole thing just now. this could get you started

source("http://bioconductor.org/biocLite.R")
biocLite("TxDb.Athaliana.BioMart.plantsmart12")
library(TxDb.Athaliana.BioMart.plantsmart12)
txdb = TxDb.Athaliana.BioMart.plantsmart12
tr = transcriptsBy(txdb, by="gene")

> tr
GRangesList of length 33602:
$AT1G01010
GRanges with 1 range and 2 elementMetadata cols:
      seqnames       ranges strand |     tx_id     tx_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]        1 [3631, 5899]      + |      9694 AT1G01010.1

$AT1G01020
GRanges with 2 ranges and 2 elementMetadata cols:
      seqnames       ranges strand | tx_id     tx_name
  [1]        1 [5928, 8737]      - | 29355 AT1G01020.1
  [2]        1 [6790, 8737]      - | 29354 AT1G01020.2

$AT1G01030
GRanges with 1 range and 2 elementMetadata cols:
      seqnames         ranges strand | tx_id     tx_name
  [1]        1 [11649, 13714]      - | 26358 AT1G01030.1

...
<33599 more elements>
---
seqlengths:
  3  4  1  5  2 Pt Mt
 NA NA NA NA NA NA NA

you could use an org.At* package a bit more simply, use the CHRLOC and CHRLOCEND elements. please look at the metadata page of bioconductor.org INSTALL node for your organism. this should be a standard use case or faq, perhaps

On Sun, Jun 17, 2012 at 6:33 PM, Josh [guest] <guest@bioconductor.org> wrote:
>
> Dear listserv,
>
> I am a long-time R user, novice Bioconductor user.

I'll get you a step further:

On 6/17/12 5:57 PM, "Vincent Carey" <stvjc at channing.harvard.edu> wrote:

# assuming that for each gene's coordinate, you want the extreme starts and ends of its (potentially multiple) transcripts:
gene.gr <- reduce(tr)  # ISA GenomicRange
gene.df <- as(gene.gr, 'data.frame')  # whose names are the gene identifiers
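
To make the "extreme starts and ends" idea concrete, here is a minimal base-R sketch that collapses a transcript table to one row per gene. The data frame `tx`, its column names, and the helper `gene_extent()` are my own stand-ins for illustration, typed by hand from the three genes printed earlier in this thread; they are not something the TxDb package produces. On real GRangesList objects, note that reduce() only merges overlapping or adjacent ranges, so a gene whose transcripts do not overlap could come back with more than one range; as far as I know, range() is the call that returns one span per gene (per chromosome and strand).

```r
# Toy transcript table; coordinates copied from the transcriptsBy() output above.
tx <- data.frame(
  gene   = c("AT1G01010", "AT1G01020", "AT1G01020", "AT1G01030"),
  chr    = c("1", "1", "1", "1"),
  start  = c(3631L, 5928L, 6790L, 11649L),
  end    = c(5899L, 8737L, 8737L, 13714L),
  strand = c("+", "-", "-", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per gene, taking the smallest start and the
# largest end over all of that gene's transcripts.
gene_extent <- function(tx) {
  starts <- tapply(tx$start, tx$gene, min)
  ends   <- tapply(tx$end,   tx$gene, max)
  genes  <- names(starts)
  first  <- match(genes, tx$gene)   # first transcript row of each gene
  data.frame(
    gene       = genes,
    chromosome = tx$chr[first],
    start      = as.integer(starts),
    end        = as.integer(ends[genes]),
    strand     = tx$strand[first],
    stringsAsFactors = FALSE
  )
}

gene.extent <- gene_extent(tx)
```

This assumes every transcript of a gene sits on the same chromosome and strand, which is worth checking on real data.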

Now it's a matter of coercing column names, and selecting from the BioMart data just the rows for your identifiers (and checking they are all there, and complaining if not).
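
A rough sketch of that selecting-and-checking step, in base R only. The table `coords` is a hand-typed stand-in for the gene-level data frame (values taken from the transcriptsBy() output earlier in the thread), and `lookup_genes()` is a hypothetical helper name of my own. One thing to watch: the identifiers in the original question are mixed case ("At5g35790"), while the TxDb names are upper case, so the sketch upper-cases the query before matching.

```r
# Stand-in for the gene-level coordinate table (values from the output above).
coords <- data.frame(
  gene               = c("AT1G01010", "AT1G01020", "AT1G01030"),
  chromosome         = "1",
  nuc_sequence_start = c(3631L, 5928L, 11649L),
  nuc_sequence_end   = c(5899L, 8737L, 13714L),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: return one row per requested identifier, in the order
# requested, and warn about any identifier that has no match.
lookup_genes <- function(ids, coords) {
  key <- toupper(ids)                  # "At1g01010" -> "AT1G01010"
  idx <- match(key, coords$gene)       # NA where the gene is not in the table
  if (anyNA(idx)) {
    warning("no coordinates found for: ",
            paste(ids[is.na(idx)], collapse = ", "))
  }
  out <- coords[idx, , drop = FALSE]   # unmatched identifiers become NA rows
  out$gene <- ids                      # keep the spelling the user supplied
  rownames(out) <- NULL
  out
}

gene.list <- c("At1g01010", "AT1g01030")
gene.pos  <- lookup_genes(gene.list, coords)
```

Unmatched identifiers are kept as rows of NA rather than dropped, so the result always lines up with the input list.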

Cheers,
Malcolm Cook 