Dear Valerie (and other of the Bioconductor team),
Following toy example used to work with Bioconductor 3.0 but does not work with the current version (3.1):
library(VariantAnnotation)
## simulated data with two genes on a single chromosome
# GFF3 information for two genes with two exons each, one on the + and one on the - strand
gene.name <- paste0(rep("gene", 6), rep(LETTERS[1:2], each=3))
exon.name <- c("", "exon1", "exon2", "", "exon2", "exon1")
exon.rank <- as.integer(c(0, 1, 2, 0, 2, 1))
sim.gr <- GRanges(seqnames=Rle("chr1", 6),
ranges=IRanges(rep(c(1, 1, 11), 2), end=rep(c(15, 5, 15), 2),
names=paste0(gene.name, exon.name)),
strand=Rle(c("+", "-"), c(3, 3)), score=rep(0, 6),
phase=rep(0, 6), type=rep(c("mRNA", "CDS", "CDS"), 2),
Parent=gene.name, exon_rank=exon.rank)
CDS.flag <- sim.gr$type == "CDS"
sim.grl <- split(sim.gr[CDS.flag], substr(names(sim.gr[CDS.flag]), 1, 5))
# simulated genome with single chromosome of length 20
sim.genome.ss <- DNAStringSet(c("AAAAACCCCCGGGGGTTTTT"))
names(sim.genome.ss) <- c("chr1")
writeXStringSet(sim.genome.ss, "Sim.Genome.fasta")
sim.genome.fa <- open(FaFile("Sim.Genome.fasta"))
# transcriptome from genome, exon_rank column is necessary for correct order
# of exons on negative strand
sim.trans.ss <- getTranscriptSeqs(sim.grl, sim.genome.fa)
Output from sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.14.10 Rsamtools_1.20.4 Biostrings_2.36.3 XVector_0.8.0 GenomicFeatures_1.20.1 [6] AnnotationDbi_1.30.1 Biobase_2.28.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1 IRanges_2.2.7 [11] S4Vectors_0.6.3 BiocGenerics_0.14.0 BiocInstaller_1.18.4 loaded via a namespace (and not attached): [1] zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.20 BSgenome_1.36.3 tools_3.2.1 [6] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1 rtracklayer_1.28.7 futile.options_1.0.0 [11] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0 RSQLite_1.0.0 XML_3.98-1.3
I get exactly the same error on my Mac running OS X Mavericks. Any suggestions as to what I am doing wrong would be nice.
Tjeerd
One more thing. When I run your code with 1.14.11 I get this error from GenomicFeatures::extractTranscriptSeqs():
> sim.trans.ss <- getTranscriptSeqs(sim.grl, sim.genome.fa)
Error in .check_exon_rank(tx1) :
"exon_rank" inner metadata column in GRangesList object 'transcripts'
does not contain increasing consecutive ranks for some transcripts
In addition: Warning message:
'getTranscriptSeqs' is deprecated.
Use 'extractTransciptSeqs' instead.
See help("Deprecated")
extractTranscriptSeqs() handles exons on the negative strand - there's no need to re-order them such that the rank is decreasing for the negative strand.
Valerie
Hi Valerie,
Thank you for your quick answer. Code now certainly runs by using GenomicFeatures::extractTranscriptSeqs() and getting rid of the exon_rank feature. However, extractTranscriptSeqs() gives a different output from gffread (from the Cufflinks package) for negative strand genes. Specifically, it seems to take the reverse complement of each exon but orders the exons same as on the plus strand. Here is the same toy example, slightly elaborated:
## simulated data with two genes on a single chromosome
library(GenomicFeatures); library(Rsamtools); library(rtracklayer)
# GFF3 information for two genes with two exons each, one on the + and one on the - strand
gene.name <- paste0(rep("gene", 6), rep(LETTERS[1:2], each=3))
exon.name <- c("", "exon1", "exon2", "", "exon2", "exon1")
sim.gr <- GRanges(seqnames=Rle("chr1", 6),
ranges=IRanges(rep(c(1, 1, 11), 2), end=rep(c(15, 5, 15), 2),
names=paste0(gene.name, exon.name)),
strand=Rle(c("+", "-"), c(3, 3)), score=rep(0, 6),
phase=rep(0, 6), type=rep(c("mRNA", "CDS", "CDS"), 2),
Parent=gene.name)
# split CDS part on gene.name to make a list of two genes with two exons each
CDS.flag <- sim.gr$type == "CDS"
sim.grl <- split(sim.gr[CDS.flag], substr(names(sim.gr[CDS.flag]), 1, 5))
# simulated genome with single chromosome of length 20
sim.genome.ss <- DNAStringSet(c("TAAAACGCCCGGCGGTTTAT"))
names(sim.genome.ss) <- c("chr1")
writeXStringSet(sim.genome.ss, "Sim.Genome.fasta")
sim.genome.fa <- open(FaFile("Sim.Genome.fasta"))
# transcriptome from genome
sim.trans.ss <- extractTranscriptSeqs(sim.genome.fa, sim.grl)
writeXStringSet(sim.trans.ss, "sim.trans.biocond.fasta")
# Parent column is necessary for correct grouping of exons by gffread
export.gff3(sim.gr, "Sim.gff3")
# generate transcriptome FASTA from genome FASTA and GFF/GTF file with
# /Applications/BioInformatics/cufflinks-2.2.1/gffread \
# -w sim.trans.gffread.fasta -g Sim.Genome.fasta Sim.gff3
with file sim.trans.bicond.fasta reading:
>geneA
TAAAAGGCGG
>geneB
TTTTACCGCC
with file sim.trans.gffread.fasta reading:
>geneA CDS=1-10
TAAAAGGCGG
>geneB CDS=1-10
CCGCCTTTTA
The problem is with the order of the exons on the "-" strand.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons <- exonsBy(txdb, by="tx")
The GenomicFeatures extractors always return exons (or any feature) in the order of transcription. For the "-" strand this means the first exon (ie, rank 1) is the most 3' wrt the "+" strand and the last is the most 5' wrt the "+" strand.
> exons[4074]
GRangesList object of length 1:
$4074
GRanges object with 4 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] chr1 [16607, 16765] - | 13970 <NA> 1
[2] chr1 [15796, 15942] - | 13968 <NA> 2
[3] chr1 [14970, 15038] - | 13966 <NA> 3
[4] chr1 [14362, 14829] - | 13964 <NA> 4
Changing the order of the "-" strand exons in sim.grl gives the same answer you see with Cufflinks.
> sim.grl2 = GRangesList(sim.grl[[1]], rev(sim.grl[[2]]))
> sim.trans.ss <- extractTranscriptSeqs(sim.genome.fa, sim.grl2)
> sim.trans.ss
A DNAStringSet instance of length 2
width seq
[1] 10 TAAAAGGCGG
[2] 10 CCGCCTTTTA
Valerie
Hi Valerie,
This works indeed, thank you. It is actually in the examples of the manual for extractTranscriptSeqs(), in the TOY EXAMPLE as
However, the fact that tx2 needs to be reversed is not commented upon, could be stressed more IMHO. More generally, it would be nice to add an example based on GFF annotation files. For people working outside of human and standard model organisms (mouse, drosophila, yeast etc) GFF file are often the standard annotation files. It would be nice if there were some more relevant examples in the Bioconductor documentation.
Take care from Tjeerd
Hi Tjeerd,
Thanks for your feedback. I'll add something to the man page for
extractTranscriptSeqs()
about this.Note that when working with GFF/GTF files, the recommended path is to first load the annotations into a TxDb object with
makeTxDbFromGFF()
, and then to extract the exons grouped by transcript withexonsBy()
.exonsBy()
takes care of returning the exons sorted by ascending rank for each transcript (that is, ordered from 5' to 3' in transcript space), whatever the strand is, so the user doesn't need to worry about getting the exons in the right order before callingextractTranscriptSeqs()
.Hope this helps,
H,
Done in GenomicFeatures 1.20.3 (release) and 1.21.17 (devel).
Cheers,
H.
Hi Herve,
This certainly helped, thank you very much. One more qualm: my toy example is intentionally structured as a GeneDB GFF3 file (I use Pknowlesi.noseq.gff3) . These files have CDS feature types but *no* exon features. Such a GFF3 file leads to the following error in exonsBy (or cdsBy):
Error in .make_splicings(exons, cds, stop_codons) : some CDS cannot be mapped to an exon
When I change "CDS" to "exon" all is fine. Of course I can run a global find and replace to replace all occurrences of "CDS" with "exon" in the file but this seems rather inelegant (especially since the gffread utility works fine with the original file).
More generally, I actually read the GenomicFeatures vignette a while ago but gave up on this route when I saw that I needed to compile my genome into a BSGenome package. This is a misreading from my end but none of the worked-out examples in the vignette were relevant for my use case. While excellent, your documentation is a bit skewed to the human/model organism researchers who are probably in the majority. It would be nice to have an example in the manual where one starts with just a genome FASTA file and a GFF annotation file. I study a monkey malaria (Plasmodium knowlesi) and a monkey host (Macaca mulatta) and for both organisms information is distributed as genome FASTA with GFF3/GTF annotation file. I imagine this use case would apply to many more investigators.
Hi Tjeerd,
makeTxDbFromGFF()
supports some kind of GFF files where CDS are direct children of genes. In that case the exons and transcripts are inferred from the CDS. The GenomicFeatures package actually contains such a file (used in the unit tests):Your file doesn't seem to fall into this category though but I'd be happy to look at it and try to make
makeTxDbFromGFF()
work on it. Can you please provide the link to the Pknowlesi.noseq.gff3 file? Also it would help if you could show the exact commands you used that caused this error.Thanks,
H.
Hi Herve,
Here is some code that compares the transcriptome from bioconductor with the one from gffread. For Mmul this works fine but I am struggling with Pkno. An (indirect) link to the Pknowlesi.noseq.gff is from ftp.sanger.ac.uk/pub/project/pathogens/gff3/CURRENT
# source("TransFromGFF.R")
# extracting transcripts from a genome FASTA file using GFF annotation files
#
# Version 0.4, 23 August 2015 by Tjeerd Dijkstra
# Tested with R 3.2.2 and Bioconductor 3.1 under OS X 10.9.5 on an i7-2635QM@2GHz
library(GenomicFeatures)
## transcriptome from genome using TxDb for Macaca mulatta
# Mmul genome and annotation from www.unmc.edu/rhesusgenechip/index.htm#NewRhesusGenome
Mm.txdb <- makeTxDbFromGFF("~/Expres/RUMPHI/Mmul-Genome/Mmul_Annotation_v7.6.8.gtf")
# Mm.tx.grl <- cdsBy(Mm.txdb, by = "tx", use.names = TRUE)
Mm.tx.grl <- exonsBy(Mm.txdb, by = "tx", use.names = TRUE)
Mm.genome.fa <- open(FaFile("~/Expres/RUMPHI/Mmul-Genome/Mmul_Genome_v7.fasta"))
Mm.trans.biocond <- extractTranscriptSeqs(Mm.genome.fa, Mm.tx.grl)
writeXStringSet(Mm.trans.biocond, "Mm.trans.biocond.fasta")
# generate transcriptome FASTA from genome FASTA and GTF file with gffread
system2("/Applications/BioInformatics/cufflinks-2.2.1/gffread",
args = paste("-w Mm.trans.gffread.fasta",
"-g ~/Expres/RUMPHI/Mmul-Genome/Mmul_Genome_v7.fasta",
"~/Expres/RUMPHI/Mmul-Genome/Mmul_Annotation_v7.6.8.gtf"))
# compare bioconductor transcriptome with gffread one
Mm.trans.gffread <- readDNAStringSet("Mm.trans.gffread.fasta")
# gffread adds stuff to the name of transcripts which I cut off by some grep magic
match.index <- match(names(Mm.trans.biocond), gsub("(\\ ).*", "", names(Mm.trans.gffread)))
# names match except for CNTFR_transcript_01 that exonsBy warns about
cat(sprintf("Number of names of bioconductor and gffread transcriptome that do not match %d\n",
sumis.na(match.index))))
nchar.biocond <- nchar(Mm.trans.biocond); nchar.gffread <- nchar(Mm.trans.gffread[match.index])
# all lengths are the same ... great
length.differ.flag <- (nchar.biocond != nchar.gffread)
cat(sprintf("Number of lengths of bioconductor and gffread transcriptome that do not match %d\n",
sum(length.differ.flag)))
## transcriptome from genome using TxDb for Plasmodium knowlesi
# data downloaded by following the ftp link on www.genedb.org/Homepage/Pknowlesi
# Error in .make_splicings(exons, cds, stop_codons): some CDS cannot be mapped to an exon
Pk.txdb <- makeTxDbFromGFF("~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.noseq.gff3")
# # following attempt to filter out non-protein coding genes also fails
# Pk.gene.cds.gr <- Pk.gr[Pk.gr$type == "gene" | Pk.gr$type == "mRNA"| Pk.gr$type == "CDS"]
# Pk.txdb <- makeTxDbFromGRangesPk.gene.cds.gr)
Pk.tx.grl <- cdsBy(Pk.txdb, by = "tx", use.names = TRUE)
Pk.genome.fa <- open(FaFile("~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.genome.fasta"))
Pk.trans.biocond <- extractTranscriptSeqs(Pk.genome.fa, Pk.tx.grl)
writeXStringSet(Pk.trans.biocond, "Pk.trans.biocond.fasta")
# generate transcriptome FASTA from genome FASTA and GFF/GTF file with gffread
system2("/Applications/BioInformatics/cufflinks-2.2.1/gffread",
args = paste("-w Pk.trans.gffread.fasta",
"-g ~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.genome.fasta",
"~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.noseq.gff3"))
Pk.trans.gffread <- readDNAStringSet("Pk.trans.gffread.fasta")
# gffread adds stuff to the name of transcripts which I cut off by some grep magic
match.index <- match(names(Pk.trans.biocond), gsub("(\\ ).*", "", names(Pk.trans.gffread)))
# names match except for CNTFR_transcript_01 that exonsBy warns about
cat(sprintf("Number of names of bioconductor and gffread transcriptome that do not match %d\n",
sumis.na(match.index))))
nchar.biocond <- nchar(Pk.trans.biocond); nchar.gffread <- nchar(Pk.trans.gffread[match.index])
# all lengths are the same ... great
length.differ.flag <- (nchar.biocond != nchar.gffread)
cat(sprintf("Number of lengths of bioconductor and gffread transcriptome that do not match %d\n",
sum(length.differ.flag)))
Thank you once more for looking into this! Tjeerd
Hi Tjeerd,
Thanks for providing the file. I'll look at it and will get back here once I have a solution.
H.
Hi Herve,
Not in a hurry but it would be nice to have a solution for my struggles with the GeneDB GFF3 files.
Take care,
Tjeerd
I'm on it. Thanks for your patience.
H.
Hi Herve,
Did you manage to get to it?
Cheers from Tjeerd
Not yet sorry. Other urgent matters keep showing up. I won't be able to tackle this before next week. Thanks again for your patience.
H.
Hi Herve,
Now that BioConductor 3.2 is out, would you have time for my (little) issue?
Take care,
Tjeerd
Hi Tjeerd,
I finally had a chance to add support for the GFF files from GeneDB to
makeTxDbFromGFF()
. This is in GenomicFeatures 1.22.1, which will become available in BioC 3.2 (the current release) over the weekend.Sorry that it took so long and thanks for your patience. Please don't hesitate to let us know if you run into any further problem.
Thanks,
H.
Hi Herve,
The new makeTxDbFromGFF() does fine with my GeneDB GFF file and gives the same transcripts as the gffread utility. Thank you for your efforts, this makes manipulation of genomic features much easier for me.
Take care,
Tjeerd
This is good news. Thanks for the feedback! I really appreciate the fact that you took the time to compare with what gffread gives you.
Cheers,
H.