Search
Question: GenomicFeatures and custom genomes?
1
9 months ago by
david.rinker10
david.rinker10 wrote:

Hi I want to use GenomicFeatures to extract upstream sequences from a genome using gene_IDs

My organism (Anopheles melas) has an assembled genome (fasta) and a gtf file.

It seems like these are all the files I should need, however I am not finding it easy to figure out on how to load these two data types for genomicFeatures to use? So far all I have figured out that I need to make a TxDb from gff; I have no idea what to do with the fasta file.

Could someone (BRIEFLY) tell me which packages and objects I need to use and how they all mesh together?

I have read over the manual as well as the Rsamtools manual but am more confused than anything.

Thank you

modified 9 months ago by Hervé Pagès ♦♦ 13k • written 9 months ago by david.rinker10
2
9 months ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

Depending on your OS, i'd suggest creating a '2bit' (rtracklayer, TwoBitFile) or indexed fasta (FaFile, Rsamtools) file.

For the gff, I think you can rtracklayer::import() it and get something lighter weight but still useful, or create the TxDb for greater flexibility.

The glue is Biostrings::getSeq(), which will take the 2bit or fasta file as first argument and genomic ranges as second returning the sequences corresponding to the ranges.

1
9 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Using extractUpstreamSeqs() from the GenomicFeatures package:

library(Biostrings)

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtf_file)
extractUpstreamSeqs(genome, txdb)

If you have a 2-bit file instead of FASTA, replace genome <- readDNAStringSet(fasta_file) with:

library(rtracklayer)
genome <- TwoBitFile(twobit_file)

The 2-bit format supports very efficient random access so this solution avoids loading the full genome sequences in memory. Note however that this won't work if the sequences contain IUAPAC ambiguity letters other than N (the 2-bit format does not support them).

More information and examples in ?extractUpstreamSeqs.

Cheers,

H.

1

Just realizing now that extractUpstreamSeqs() doesn't take a DNAStringSet object, sorry. We should add this at some point. So for now you could either use:

library(Rsamtools)
genome <- FaFile(fasta_file)


as Martin suggested, or convert your FASTA file to 2-bit format with something like:

export(readDNAStringSet(fasta_file), "mygenome.2bit")

Note that using FaFile on a compressed FASTA file has been reported to be unreliable on Windows in the past.

Hope this helps,

H.

0
9 months ago by
david.rinker10
david.rinker10 wrote:

Ok, guess this question is a bust.

If the developers are reading, it might be a useful addition to the manual and/or vignette to outline this process. More and more "non-model" organisms are being sequenced and analyzed--those users are inevitably going to want to do what I'm currently trying to do.

If I can figure this out I'll post my summary here to hopefully enlighten others.