Question: GenomicFeatures and custom genomes?
gravatar for david.rinker
5 weeks ago by
david.rinker10 wrote:

Hi I want to use GenomicFeatures to extract upstream sequences from a genome using gene_IDs

My organism (Anopheles melas) has an assembled genome (fasta) and a gtf file.

It seems like these are all the files I should need, however I am not finding it easy to figure out on how to load these two data types for genomicFeatures to use? So far all I have figured out that I need to make a TxDb from gff; I have no idea what to do with the fasta file.

Could someone (BRIEFLY) tell me which packages and objects I need to use and how they all mesh together? 

I have read over the manual as well as the Rsamtools manual but am more confused than anything.

Thank you

ADD COMMENTlink modified 4 weeks ago by Hervé Pagès ♦♦ 13k • written 5 weeks ago by david.rinker10
gravatar for Martin Morgan
4 weeks ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

Depending on your OS, i'd suggest creating a '2bit' (rtracklayer, TwoBitFile) or indexed fasta (FaFile, Rsamtools) file.

For the gff, I think you can rtracklayer::import() it and get something lighter weight but still useful, or create the TxDb for greater flexibility.

The glue is Biostrings::getSeq(), which will take the 2bit or fasta file as first argument and genomic ranges as second returning the sequences corresponding to the ranges.

ADD COMMENTlink modified 29 days ago • written 4 weeks ago by Martin Morgan ♦♦ 20k
gravatar for Hervé Pagès
4 weeks ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:


Using extractUpstreamSeqs() from the GenomicFeatures package:

genome <- readDNAStringSet(fasta_file)

txdb <- makeTxDbFromGFF(gtf_file)
extractUpstreamSeqs(genome, txdb)

If you have a 2-bit file instead of FASTA, replace genome <- readDNAStringSet(fasta_file) with:

genome <- TwoBitFile(twobit_file)

The 2-bit format supports very efficient random access so this solution avoids loading the full genome sequences in memory. Note however that this won't work if the sequences contain IUAPAC ambiguity letters other than N (the 2-bit format does not support them).

More information and examples in ?extractUpstreamSeqs.



ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Hervé Pagès ♦♦ 13k

Just realizing now that extractUpstreamSeqs() doesn't take a DNAStringSet object, sorry. We should add this at some point. So for now you could either use:

genome <- FaFile(fasta_file)

as Martin suggested, or convert your FASTA file to 2-bit format with something like:

export(readDNAStringSet(fasta_file), "mygenome.2bit")

Note that using FaFile on a compressed FASTA file has been reported to be unreliable on Windows in the past.

Hope this helps,


ADD REPLYlink written 4 weeks ago by Hervé Pagès ♦♦ 13k
gravatar for david.rinker
4 weeks ago by
david.rinker10 wrote:

Ok, guess this question is a bust. 

If the developers are reading, it might be a useful addition to the manual and/or vignette to outline this process. More and more "non-model" organisms are being sequenced and analyzed--those users are inevitably going to want to do what I'm currently trying to do. 

If I can figure this out I'll post my summary here to hopefully enlighten others. 

ADD COMMENTlink written 4 weeks ago by david.rinker10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 178 users visited in the last hour