Using GenomicFeatures to extract the upstream and downstream region of a transcript
1
0
Entering edit mode
@courtneystairs-13043
Last seen 5.8 years ago

Hi there,

I'm trying to use GenomicFeatures to extract the sequences upstream and downstream of my transcripts. I have loaded my GTF file, however, I'm not sure how to proceed.

I'm trying to extract 150 bp upstream and downstream of each gene as well as the CDS itself. If this isn't possible, I can extract the upstream and downstream regions and stitch them with the CDS. I see there is a extractUpstreamSeqs() and promoters() option, but I'm not quite sure how to use this (I'm terrible at R).

I'm working on a non-model organism that does not have UTRs encoded in the GTF file or else I could likely use a simpler method.

Any help would be greatly appreciated :)

Courtney

genomicfeatures • 5.7k views
ADD COMMENT
1
Entering edit mode

You can try this, but probably won't work for cds.

library(RMariaDB)
library(GenomicFeatures)

TX<-makeTxDbFromUCSC("name.of.reference.genome.on.UCSC","ensGene")
TX
xscript<-transcripts(TX)      # GRanges for transcripts
genes<-genes(TX)      # GRanges for genes
ex<-exonsBy(TX,by="gene") # GRanges for each exon of each gene
cd<-cds(TX) #GRanges for each CDS

promoters(genes,upstream=150,downstream=150)->p1   # up/down 150bp from each gene

promoters(xscript,upstream=150,downstream=150)->p2 # up/down 150bp from each transcript
ADD REPLY
5
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Please be aware that some familiarity with R is highly recommended before you can efficiently use Bioconductor.

It's not clear to me whether you want to extract the upstream and downstream regions (i.e. their coordinates with respect to the reference genome) or the genomic sequences (DNA) corresponding to these regions. If you want the latter you'll need to have access to the genomic sequences of your organism. I'm assuming you want the latter. The genomic sequences are typically made available by the annotation provider as FASTA or 2-bit files. They need to match exactly the assembly that the annotations in your GTF file are based on.

First, use makeTxDbFromGFF() to import your GTF file as a TxDb object:

txdb <- makeTxDbFromGFF("path/to/GTF/file")

Note that makeTxDbFromGFF() only imports gene, transcript, exon, and CDS locations. Strictly speaking that's all what is needed to represent gene models. This means that even if your GTF file had entries for UTRs, makeTxDbFromGFF() would ignore them (UTRs can easily be inferred from exons and CDS, which is exactly what fiveUTRsByTranscript(txdb) and threeUTRsByTranscript(txdb) do for you).

Then to extract the 150 bp upstream sequences (with respect to the genes):

genes <- genes(txdb)
gene_upstream150_seqs <- extractUpstreamSeqs(x, genes, width=150)

where x is typically a BSgenome object containing the genomic sequences of your assembly, or a TwoBitFile or FaFile object (see ?extractUpstreamSeqs for more information). Making a BSgenome data package for your organism is a whole topic per se so I won't describe it here. See the "How to forge a BSgenome data package" vignette in the BSgenome software package if you are interested.

gene_upstream150_seqs will be a named DNAStringSet object parallel to genes i.e. it will contain one upstream sequence per gene in genes. Also the names on it will be the same as the names on genes. In addition it will carry some metadata of interest that you can get with mcols(gene_upstream150_seqs).

Unfortunately we don't have an extractDownstreamSeqs() function (was never requested, should be easy to add though), so a little bit more work is required in order to extract the 150 bp downstream sequences:

gene_downstream150 <- trim(suppressWarnings(flank(genes, width=150, start=FALSE)))
gene_downstream150_seqs <- getSeq(x, gene_downstream150)
mcols(gene_downstream150_seqs) <- mcols(gene_upstream150_seqs)

gene_downstream150_seqs will also be a named DNAStringSet object parallel to genes.

Finally to extract the CDS sequences:

cds <- cdsBy(txdb, by="tx")
cds_seqs <- extractTranscriptSeqs(x, cds)

cds_seqs will again be a named DNAStringSet object but this time it won't necessarily be parallel to genes. That's because a given gene can have zero or more than one CDS sequence. The names on cds_seqs will be the internal ids of the corresponding transcripts. The corresponding transcript names and gene ids can be put on the object as metadata columns with:

tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
mapping <- mcols(tx)
mapping$gene_id <- as.character(mapping$gene_id)
rownames(mapping) <- mapping$tx_id
mcols(cds_seqs) <- mapping[names(cds_seqs), ]

Hope this will help get you started.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Wonderful! Thank you Herve!! Thank you for your description of the methods, i really appreciate it. I will give this a go and come back if I have any questions :)

ADD REPLY
0
Entering edit mode

Can I import a file in fasta format for use in the x variable as for the GTF Hervé Pagès?

ADD REPLY
0
Entering edit mode

Sorry that I missed this.

The answer is in:

> showMethods(getSeq)
Function: getSeq (package Biostrings)
x="BSgenome"
x="FaFile"
x="FaFileList"
x="TwoBitFile"
x="XStringSet"

So you can't just do getSeq("path/to/my/fasta/file", gene_downstream150). You need to crate a FaFile object first. Please see ?FaFile in the Rsamtools package for the details.

It's generally better to start a new question rather than asking in the comment section of an old thread. This will make your question much less likely to be missed.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 703 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6