Hi everyone,
I am new to TxDb objects and the like, so please excuse my questions. I downloaded both the base feature and the repeat feature data sets for Aedes aegypti from VectorBase (https://www.vectorbase.org/downloads?field_organism_taxonomy_tid=372&field_download_file_type_tid=All&field_download_file_format_tid=474&field_status_value=Current).
I would then like to create a TxDb object, and from that an annoGR object, to use with annotatePeakInBatch from ChIPpeakAnno, but I fail to do so with the repeat feature data set: the created object is simply empty. With the base feature data set (and the same bit of code) it works quite well.
Does anyone have any suggestions for how to solve this?
Thanks a ton,
Rebecca
> Aaegypti_TxDb_rep = makeTxDbFromGFF("Aedes-aegypti-Liverpool_REPEATFEATURES_AaegL3.gff3", format="auto", dataSource = "EnsembleMetazoa", organism = "Aedes aegypti")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .local(con, format, text, ...) :
  gff-version directive indicates version is 3, not 3
> Aaegypti_TxDb_rep
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: EnsembleMetazoa
# Organism: Aedes aegypti
# Taxonomy ID: 7159
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 0
# exon_nrow: 0
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-11-19 13:57:31 +0100 (Thu, 19 Nov 2015)
# GenomicFeatures version at creation time: 1.22.4
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
> keytypes(Aaegypti_TxDb_rep)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"
> columns(Aaegypti_TxDb_rep)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"
[11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"
[21] "TXTYPE"
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] BiocInstaller_1.20.0
loaded via a namespace (and not attached):
[1] tools_3.2.2
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.2 (BiocInstaller 1.20.0), ?biocLite for help
I actually tried to call makeTxDbFromGFF() on Aedes-aegypti-Liverpool_BASEFEATURES_AaegL3.3.gff3.gz and most transcripts in the file got imported into the TxDb object, except some unusual types of transcripts (e.g. SRP_RNA, RNase_P_RNA, RNase_MRP_RNA, and others). As a result, makeTxDbFromGFF() displayed a warning message saying that some exons were orphaned and dropped.
I just made a change to makeTxDbFromGFF() to address this, and now all transcripts and exons in the file are imported (18840 transcripts and 72345 exons). The change is in GenomicFeatures 1.22.5 (BioC release) and 1.23.9 (BioC devel). Both packages should become available via biocLite() on Saturday.
Cheers,
H.
Hi Herve,
thanks for your answer. Yes, it is working quite well with the base feature data set, but not with the other one. Is there a way to make that one a TxDb object as well? I have a peak list and I would just like to see which of my peaks are close to or inside those repeat features (using annotatePeakInBatch). The repeat feature data set from VectorBase contains all the information needed for that (supercontig, start, end, feature name), just like the genes in the base feature data set, so it is not quite clear to me why it should not work.
Thanks a ton,
Rebecca
A TxDb object is a container for storing transcript, exon, cds, and gene information. The Aedes-aegypti-Liverpool_REPEATFEATURES_AaegL3.gff3.gz file contains no gene or transcript information, only information about repeat regions. This is why makeTxDbFromGFF() produces an empty TxDb on that file.
Try instead to import the file as a GRanges object (with import() from the rtracklayer package) and you'll get all the repeat regions in the object. I don't know if annotatePeakInBatch can take a GRanges object as input though; that's something you would need to check. If it can't, or if you can't find this information in annotatePeakInBatch's documentation, you can always ask the authors of annotatePeakInBatch for help (please ask a new question if you do so and tag it with annotatePeakInBatch).
Cheers,
H.
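For anyone landing here later, the import() route suggested above could look roughly like the sketch below. It is only an illustration under some assumptions: the three GFF3 lines are a made-up stand-in for the VectorBase repeat file (the real file may use a different feature type and attributes), the peaks are invented, and distanceToNearest() from GenomicRanges is used as a plain substitute for annotatePeakInBatch. The commented-out annotatePeakInBatch call at the end is untested here and assumes it accepts a named GRanges as AnnotationData, so check ?annotatePeakInBatch first.

```r
library(rtracklayer)    # import() for GFF3 files
library(GenomicRanges)  # GRanges, distanceToNearest()

# Hypothetical stand-in for the repeat-feature file, so the sketch runs
# without the VectorBase download. Feature type and attributes are assumptions.
gff_lines <- c(
  "##gff-version 3",
  "supercont1.1\tRepeatMasker\trepeat_region\t100\t200\t.\t+\t.\tID=rep1;Name=repA",
  "supercont1.1\tRepeatMasker\trepeat_region\t1000\t1200\t.\t-\t.\tID=rep2;Name=repB"
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# Import every feature as a GRanges object instead of building a TxDb.
repeats <- import(gff_file, format = "gff3")

# Listing the feature types shows why a TxDb built from such a file is empty:
# there are no gene, transcript, exon, or CDS features in it.
print(table(repeats$type))

# Name the ranges by their ID attribute so results can refer to them.
names(repeats) <- repeats$ID

# Invented peaks, for illustration only.
peaks <- GRanges("supercont1.1",
                 IRanges(start = c(150, 900), end = c(160, 950)))
names(peaks) <- c("peak1", "peak2")

# For each peak, find the nearest repeat and the gap between them
# (a distance of 0 means the peak overlaps the repeat).
hits <- distanceToNearest(peaks, repeats, ignore.strand = TRUE)
nearest_repeat <- data.frame(
  peak      = names(peaks)[queryHits(hits)],
  repeat_id = names(repeats)[subjectHits(hits)],
  distance  = mcols(hits)$distance
)
print(nearest_repeat)

# Untested assumption: with ChIPpeakAnno installed, the same GRanges might be
# usable directly as annotation data.
# anno <- ChIPpeakAnno::annotatePeakInBatch(peaks, AnnotationData = repeats)
```

With the real data, gff_file would simply be the path to the downloaded REPEATFEATURES file, and peaks would come from your own peak list.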