Hi Kris,
Not sure what's going on but here is another way to get your flanking sequences. It uses getBM()
in a minimalistic way, only to retrieve the mapping between gene IDs and Agilent IDs. Then it relies on the TxDb and BSgenome infrastructure to extract the flanking sequences.
We proceed in 5 steps:
1. Use getBM()
to retrieve the mapping between genes and Agilent IDs, and use this mapping to map the Agilent IDs you have in agilent_ID
:
gene2agilentid <- getBM(c("ensembl_gene_id", "efg_agilent_wholegenome_4x44k_v1"),
mart=oldmouse)
Let's say your Agilent IDs are:
agilent_ID <- c("A_52_P368690", "A_52_P108579")
Map them to the corresponding gene IDs with:
gene_ID <- gene2agilentid[match(agilent_ID, gene2agilentid[[2]]), 1]
gene_ID
# [1] "ENSMUSG00000048534" "ENSMUSG00000074155"
2. Prepare the TxDb object (it will contain the genomic locations of all transcripts, exons, cds, and genes):
library(GenomicFeatures)
txdb <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl",
host="may2012.archive.ensembl.org")
Note that you can save the TxDb object with saveDb()
for later use or to share with collaborators (load it again with loadDb()
).
3. Extract the locations of the flanking regions from the TxDb object:
Extract the gene locations:
mm9_genes <- genes(txdb)
Get rid of the genes located on the scaffolds (we won't be able to make any use of them because the BSgenome object for mm9 doesn't contain the scaffold sequences):
mm9_genes <- keepStandardChromosomes(mm9_genes)
Extract the locations of the (upstream) flanking regions:
mm9_flanks <- flank(mm9_genes, 1000) # 1000 upstream nucleotides
## or, if you want the downstream flanking regions:
#mm9_flanks <- flank(mm9_genes, 1000, start=FALSE)
4. Extract the flanking sequences from the corresponding BSgenome object:
txdb
is based on the NCBIM37 assembly so we will use the BSgenome.Mmusculus.UCSC.mm9 package which contains the chromosome sequences for this assembly.
library(BSgenome.Mmusculus.UCSC.mm9)
genome <- BSgenome.Mmusculus.UCSC.mm9
Before we can use getSeq()
to extract the flanking sequences, we need to make sure that genome
and mm9_flanks
use the same chromosome naming convention. Right now they don't (genome
uses UCSC convention, and m9_flanks
uses NCBI convention), but we can fix this with:
seqlevelsStyle(mm9_flanks) <- seqlevelsStyle(genome)
Now we can extract the sequences:
mm9_flank_seqs <- getSeq(BSgenome.Mmusculus.UCSC.mm9, mm9_flanks)
5. Subset with the gene_ID
vector to get the flanking sequences corresponding to your Agilent IDs:
mm9_flank_seqs[gene_ID]
# A DNAStringSet instance of length 2
# width seq names
# [1] 1000 GTTGTTGTTTTGTTGTTGTTTAG...AGCCACCACAGAAGGCATGCACA ENSMUSG00000048534
# [2] 1000 CCAATATTCAGAGACATAAACAT...TCCCAGAACCACCTAGGCCTCTC ENSMUSG00000074155
The final result is a DNAStringSet object parallel to your original vector of Agilent IDs (agilent_ID
), that is, with 1 DNA sequence per ID in agilent_ID
.
Hope this helps,
H.