Entering edit mode
Paul Shannon
▴
750
@paul-shannon-5161
Last seen 10.2 years ago
To summarize our off-list discussion, and present our final suggested
answer, using a sample entrez geneID to demonstrate, we offer the code
shown below.
A somewhat lengthy exposition of the 9 steps involved will be found
below, following the 18 lines of excutable code.
Thanks for posing this question, Eric, and for your patience while I
worked up a reply.
- Paul
library(Biostrings)
library(org.Hs.eg.db)
library (BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
trak2 <- "66008" # sample gene
transcripts.grL <- transcriptsBy(txdb, by="gene")[trak2]
transcript.names <- mcols(unlist(transcripts.grL))$tx_name #
"uc002uyb.4" "uc002uyc.2"
intronsByTx.grL <- intronsByTranscript(txdb,
use.names=TRUE)[transcript.names]
print(elementLengths(intronsByTx.grL))
intron.seqs <- getSeq(Hsapiens, unlist(intronsByTx.grL))
intron.seqs.by.transcript <- relist(intron.seqs,
skeleton=intronsByTx.grL)
print(elementLengths(intron.seqs.by.transcript))
# uc002uyb.4 uc002uyc.2
# 15 7
exonsByTx.grL <- exonsBy(txdb, by="tx",
use.names=TRUE)[transcript.names]
print(elementLengths(exonsByTx.grL))
exons.seq <- getSeq(Hsapiens, unlist(exonsByTx.grL))
exon.seqs.by.transcript <- relist(exons.seq, skeleton=exonsByTx.grL)
print(elementLengths(exon.seqs.by.transcript))
# uc002uyb.4 uc002uyc.2
# 16 8
---- The strategy
1) DNA sequence for the introns and exons of a gene, is best
understood as
"sequence for the introns and exons for each known transcript of a
gene"
2) So the first task is to identify all the transcripts associated
with the gene of interest
This will be represented as a GRangesList, where each element in the
list is
a) named by the geneID
b) contains a GRanges of the coordinates, name and id of each
associated transcript
All of this information comes from a Bioc "TxDB" object, which in this
case is our representation
of the UCSC "knownGene" table.
3) Once you have such a GRangesList, for one or more geneIDs, you want
to ignore (for now)
that gene information, and just extract the names of all of the
transcripts. To do
this extraction, you must flatted (or "unlist") the GRangesList.
4) Now you have a simple character vector containing the names of all
the
transcripts associated with your gene of interest.
5) Use those names to obtain, as a GRangesList structured by
transcript name, the
coordinates of all introns, then of all exons. Don't be confused or
put off
that we provide similar methods with slighly dissimilar names:
intronsByTranscript()
exonsBy()
6) You are almost ready to call getSeq () with the coordinates
returned at step 5.
But getSeq will not accept a GRangesList. It does accept a GRanges.
So you
must unlist the intron and exon coordinates, shedding the "by
transcript" structure.
7) getSeq returns a DNAStringSet corresponding to the GRanges
(actually, the unlisted GRangesList)
you supplied.
8) This DNAstringSet now needs to be relisted, so that the by-
transcript organization is
imposed, producing a DNAStringSetList whose elements each have a
transcript name.
9) All of these intron and exon sequences can associated with the
geneID/s you started with
by referring to the "transcripts.grL" you created at the very start.
On Jun 19, 2013, at 5:33 PM, Eric Foss [guest] wrote:
>
> Is there a way for me to use Bioconductor to take a list of gene
names and give me back a list of genomic sequences, preferably with
the exons and introns easily differentiable (e.g. exons in upper case
and introns in lower case)?
>
> Thank you.
>
> Eric
>
>
> -- output of sessionInfo():
>
> not relevant
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor