Question

How to use gene_flank on getBM()

0

Entering edit mode

kris.petrini • 0

@krispetrini-7786

Last seen 4.5 years ago

Italy

Hi,

I have to use getBM to retrive a flanking region with biomart, but I don't find any guide to set the parameters of the attribute "gene_flank" anyone can help me?

This is my script:

library("biomaRt")
agilent_ID<-scan("/home/cristiano/kris/Scrivania/Temporanei_tirocinio/R/Agilent ID spot.csv", ,what="character",skip=1,quiet=TRUE)[1]
oldmouse<-useMart(biomart="ENSEMBL_MART_ENSEMBL", host="may2012.archive.ensembl.org", path="/biomart/martservice",dataset ="mmusculus_gene_ensembl")
annot <- c("ensembl_gene_id", "external_gene_id", "start_position", "strand", "gene_flank", "5utr", "3utr", "3_utr_start", "5_utr_start", "description", "gene_exon_intron" , "gene_exon")
filt <- "efg_agilent_wholegenome_4x44k_v1"
tutto <- getBM(annot, filt, agilent_ID, oldmouse)
write(tutto, "tutto.fa")

getBm gene_flank upstream_flank • 1.8k views

ADD COMMENT • link 9.5 years ago kris.petrini • 0

score 0 · Answer 1 · 2015-05-29

Hi Kris,

Not sure what's going on but here is another way to get your flanking sequences. It uses getBM() in a minimalistic way, only to retrieve the mapping between gene IDs and Agilent IDs. Then it relies on the TxDb and BSgenome infrastructure to extract the flanking sequences.

We proceed in 5 steps:

1. Use getBM() to retrieve the mapping between genes and Agilent IDs, and use this mapping to map the Agilent IDs you have in agilent_ID:

gene2agilentid <- getBM(c("ensembl_gene_id", "efg_agilent_wholegenome_4x44k_v1"),
                        mart=oldmouse)

Let's say your Agilent IDs are:

agilent_ID <- c("A_52_P368690", "A_52_P108579")

Map them to the corresponding gene IDs with:

gene_ID <- gene2agilentid[match(agilent_ID, gene2agilentid[[2]]), 1]
gene_ID
#  [1] "ENSMUSG00000048534" "ENSMUSG00000074155"

2. Prepare the TxDb object (it will contain the genomic locations of all transcripts, exons, cds, and genes):

library(GenomicFeatures)
txdb <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl",
                            host="may2012.archive.ensembl.org")

Note that you can save the TxDb object with saveDb() for later use or to share with collaborators (load it again with loadDb()).

3. Extract the locations of the flanking regions from the TxDb object:

Extract the gene locations:

mm9_genes <- genes(txdb)

Get rid of the genes located on the scaffolds (we won't be able to make any use of them because the BSgenome object for mm9 doesn't contain the scaffold sequences):

mm9_genes <- keepStandardChromosomes(mm9_genes)

Extract the locations of the (upstream) flanking regions:

mm9_flanks <- flank(mm9_genes, 1000)  # 1000 upstream nucleotides
## or, if you want the downstream flanking regions:
#mm9_flanks <- flank(mm9_genes, 1000, start=FALSE)

4. Extract the flanking sequences from the corresponding BSgenome object:

txdb is based on the NCBIM37 assembly so we will use the BSgenome.Mmusculus.UCSC.mm9 package which contains the chromosome sequences for this assembly.

library(BSgenome.Mmusculus.UCSC.mm9)
genome <- BSgenome.Mmusculus.UCSC.mm9

Before we can use getSeq() to extract the flanking sequences, we need to make sure that genome and mm9_flanks use the same chromosome naming convention. Right now they don't (genome uses UCSC convention, and m9_flanks uses NCBI convention), but we can fix this with:

seqlevelsStyle(mm9_flanks) <- seqlevelsStyle(genome)

Now we can extract the sequences:

mm9_flank_seqs <- getSeq(BSgenome.Mmusculus.UCSC.mm9, mm9_flanks)

5. Subset with the gene_ID vector to get the flanking sequences corresponding to your Agilent IDs:

mm9_flank_seqs[gene_ID]
#   A DNAStringSet instance of length 2
#     width seq                                               names
# [1]  1000 GTTGTTGTTTTGTTGTTGTTTAG...AGCCACCACAGAAGGCATGCACA ENSMUSG00000048534
# [2]  1000 CCAATATTCAGAGACATAAACAT...TCCCAGAACCACCTAGGCCTCTC ENSMUSG00000074155

The final result is a DNAStringSet object parallel to your original vector of Agilent IDs (agilent_ID), that is, with 1 DNA sequence per ID in agilent_ID.

Hope this helps,
H.

score 0 · Answer 2 · 2015-06-01

Thanks for the answer, I've tried to do what you suggested me, but I receive this error:

Carico il pacchetto richiesto: IRanges
Error in unloadNamespace(package) : 
  namespace ‘IRanges’ is imported by ‘GenomeInfoDb’ so cannot be unloaded
Error in library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc) : 
  Package ‘IRanges’ version 2.2.1 cannot be unloaded

do you know what is the problem?

PS with this procedure can I retrieve also 5utr, 3utr, exons, and introns?