Question: How do I use biomaRt to get upstreamFlank Genomic Sequence for many Genomes?
Hello all,

Problem: I would like to obtain the genomic sequence upstream (~500 bp) of a specific bacterial gene, for every bacterial genome that has the gene. On EcoCyc I see that many (> 100) bacteria have the gene, but I do not know how to retrieve all of the sequences in a high-throughput manner, so I was going to use biomaRt to get the sequences and send them to alignment programs later.

I have read through the vignette and tried to get the function to work with a non-Ensembl Mart, to no avail. I was also presented with an error (see below) that suggested I report it to the mailing list. It also looks like I will have to query each of the 249 bacterial genomes in the "bacterial_mart_7" Mart individually (with getLDS or getBM), which does not seem high-throughput at all. Are there any other suggestions that would let me take advantage of the large amount of bacterial genomic data for homology studies?

Thank you for your help.
Noah

Attempted Solution (for a single genome):

> bacGenome = useMart("bacterial_mart_7", dataset = "esc_20_gene")
Checking attributes ... ok
Checking filters ... ok
> filters = c("external_gene_id")
> attributes = c("external_gene_id","upstream_flank")
> values = list(external_gene_id = c("fis"), 500)
> seq = getBM(attributes=attributes, filters = filters, values = values, mart= bacGenome,
+ checkFilters= FALSE)
   V1
1 fis
Error in getBM(attributes = attributes, filters = filters, values = values, :
  The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. Please report this to the mailing list.

> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.3-1       bitops_1.0-4.1    biomaRt_2.4.0

loaded via a namespace (and not attached):
[1] Biobase_2.8.0       Biostrings_2.16.0   BSgenome_1.16.0     GenomicRanges_1.0.1 IRanges_1.6.0
[6] tools_2.11.0        XML_2.8-1
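
For reference, the "query each genome individually" approach can at least be scripted rather than done by hand. The following is only a rough sketch, not a tested solution: `query_all_datasets` and `fetch_upstream_all` are hypothetical helper names, and the attribute and filter names to pass in are assumptions that would need checking with `listAttributes()` and `listFilters()` for each dataset. The only biomaRt calls used are `useMart()`, `listDatasets()` and `getBM()`. Datasets that raise an error or return no rows are skipped, so a single failing genome does not stop the loop.

```r
# Apply a per-dataset query function to many datasets and stack the results.
# query_fun takes a dataset name and returns a data.frame (possibly empty).
query_all_datasets <- function(dataset_names, query_fun) {
  results <- list()
  for (ds in dataset_names) {
    # A failing dataset yields NULL instead of aborting the whole run.
    res <- tryCatch(query_fun(ds), error = function(e) NULL)
    if (!is.null(res) && nrow(res) > 0) {
      res$dataset <- ds          # remember which genome the rows came from
      results[[ds]] <- res
    }
  }
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Hypothetical wrapper: run the same getBM() query against every dataset
# of one mart. Requires biomaRt and network access when it is called.
# attributes/filters/values are passed through unchanged; whether a given
# name is valid in a given dataset is NOT checked here.
fetch_upstream_all <- function(mart_name, attributes, filters, values) {
  datasets <- biomaRt::listDatasets(biomaRt::useMart(mart_name))$dataset
  query_one <- function(ds) {
    mart <- biomaRt::useMart(mart_name, dataset = ds)
    biomaRt::getBM(attributes = attributes, filters = filters,
                   values = values, mart = mart)
  }
  query_all_datasets(as.character(datasets), query_one)
}
```

A call might then look like `fetch_upstream_all("bacterial_mart_7", attributes, filters, values)`, with one entry in `values` for each entry in `filters`. In the Ensembl marts the flank length is normally supplied through an `upstream_flank` filter and the sequence requested through a separate sequence attribute; whether the bacterial mart follows the same pattern is an assumption to verify before relying on this.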
