makeTxDbFromGFF creates empty object
@rebeccahalbach-9222
Netherlands

Hi everyone,

I would like to create a TxDb object and, from that, an annoGR object to use with annotatePeakInBatch from ChIPpeakAnno, but this fails with the repeat feature data set: the created object is just empty. With the base feature data set (and the same bit of code) it works quite well.

Does anyone have any suggestions for how to solve this?

Thanks a ton,

Rebecca

> Aaegypti_TxDb_rep = makeTxDbFromGFF("Aedes-aegypti-Liverpool_REPEATFEATURES_AaegL3.gff3",
      format = "auto", dataSource = "EnsembleMetazoa",
      organism = "Aedes aegypti")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .local(con, format, text, ...) : gff-version directive indicates version is   3, not 3

> Aaegypti_TxDb_rep
TxDb object:
#Db type: TxDb
#Supporting package: GenomicFeatures
#Data source: EnsembleMetazoa
#Organism: Aedes aegypti
#Taxonomy ID: 7159
#miRBase build ID: NA
#Genome: NA
#transcript_nrow: 0
#exon_nrow: 0
#cds_nrow: 0
#Db created by: GenomicFeatures package from Bioconductor
#Creation time: 2015-11-19 13:57:31 +0100 (Thu, 19 Nov 2015)
#GenomicFeatures version at creation time: 1.22.4
#RSQLite version at creation time: 1.0.0
#DBSCHEMAVERSION: 1.1

> keytypes(Aaegypti_TxDb_rep)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

> columns(Aaegypti_TxDb_rep)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"
[11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"
[21] "TXTYPE"

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.20.0

loaded via a namespace (and not attached):
[1] tools_3.2.2

> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.2 (BiocInstaller 1.20.0), ?biocLite for help
@herve-pages-1542
Seattle, WA, United States

Hi Rebecca,

There are 2 GFF3 files available at the URL you provided:

1. Aedes-aegypti-Liverpool_BASEFEATURES_AaegL3.3.gff3.gz: Liverpool strain AaegL3.3 geneset in GFF3 format.
2. Aedes-aegypti-Liverpool_REPEATFEATURES_AaegL3.gff3.gz: Liverpool strain AaegL3 repeat features (RepeatMasker, Dust, TRF) in GFF3 format.

You picked the 2nd one, which doesn't seem to contain any gene or transcript information. You'll probably have better luck with the 1st one.

Cheers,

H.


I actually tried calling makeTxDbFromGFF() on Aedes-aegypti-Liverpool_BASEFEATURES_AaegL3.3.gff3.gz, and most transcripts in the file were imported into the TxDb object, except for some unusual transcript types (e.g. SRP_RNA, RNase_P_RNA, RNase_MRP_RNA, and others). As a result, makeTxDbFromGFF() displayed a warning message saying that some exons were orphaned and dropped.

I just made a change to makeTxDbFromGFF() to address this and now all transcripts and exons in the file are imported (18840 transcripts and 72345 exons). The change is in GenomicFeatures 1.22.5 (BioC release) and 1.23.9 (BioC devel). Both packages should become available via biocLite() on Saturday.

Cheers,

H.


Hi Herve,

thanks for your answer. Yes, it is working quite well with the base feature data set, but not with the other one. Is there a way to make that one a TxDb object as well? I have a peak list and I would just like to see which of my peaks are close to or within those repeat features (using annotatePeakInBatch). The repeat feature data set from VectorBase contains all the information needed for that (supercontig, start, end, feature name), just like the genes in the base feature data set, so it is not clear to me why it should not work.

Thanks a ton,

Rebecca


A TxDb object is a container for storing transcript, exon, cds, and gene information. The Aedes-aegypti-Liverpool_REPEATFEATURES_AaegL3.gff3.gz file contains no gene or transcript information, only information about repeat regions. This is why makeTxDbFromGFF() produces an empty TxDb on that file.
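One quick way to see this for yourself is to tabulate the feature types in the file. The sketch below is only an illustration: it uses a tiny made-up GFF3 (the `repeat_region` type, the coordinates, and the names are assumptions, not taken from the real VectorBase file), and `txdb_types` lists just a few common transcript-related types rather than everything makeTxDbFromGFF() recognizes.

```r
library(rtracklayer)  # provides import()

# Hypothetical stand-in for a repeat-features GFF3 file (made-up records).
gff_lines <- c(
  "##gff-version 3",
  "supercont1.1\tRepeatMasker\trepeat_region\t100\t200\t.\t+\t.\tID=rep1;Name=LINE1",
  "supercont1.1\tTRF\trepeat_region\t500\t560\t.\t.\t.\tID=rep2;Name=trf",
  "supercont1.2\tDust\trepeat_region\t10\t40\t.\t.\t.\tID=rep3;Name=dust"
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# Import every feature as a GRanges object and count features per type.
features <- import(gff_file, format = "gff3")
type_counts <- table(features$type)
print(type_counts)

# A few common transcript-related types (illustrative, not exhaustive).
txdb_types <- c("gene", "mRNA", "transcript", "exon", "CDS")
has_tx_features <- any(names(type_counts) %in% txdb_types)
print(has_tx_features)
```

If none of the types in your real file are transcript-related, an empty TxDb is the expected result.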

Try instead to import the file as a GRanges object (with import() from the rtracklayer package) and you'll get all the repeat regions in the object. I don't know whether annotatePeakInBatch can take a GRanges object as input though; that's something you would need to check. If it can't, or if you can't find this information in the annotatePeakInBatch documentation, you can always ask the authors of annotatePeakInBatch for help (please ask a new question if you do so and tag it with annotatePeakInBatch).
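Here is a minimal sketch of that approach. It deliberately does not use annotatePeakInBatch; it swaps in findOverlaps() and distanceToNearest() from GenomicRanges to relate peaks to repeats. The GFF3 records, the `peaks` object, and the `peak_id` column are all made up for illustration.

```r
library(rtracklayer)    # import()
library(GenomicRanges)  # GRanges(), findOverlaps(), distanceToNearest()

# Hypothetical stand-in for the repeat-features file (made-up records).
gff_lines <- c(
  "##gff-version 3",
  "supercont1.1\tRepeatMasker\trepeat_region\t100\t200\t.\t+\t.\tID=rep1;Name=LINE1",
  "supercont1.1\tTRF\trepeat_region\t500\t560\t.\t.\t.\tID=rep2;Name=trf",
  "supercont1.2\tDust\trepeat_region\t10\t40\t.\t.\t.\tID=rep3;Name=dust"
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# All repeat regions end up in a GRanges object.
repeats <- import(gff_file, format = "gff3")

# Hypothetical peak list; in practice this would come from your peak caller.
peaks <- GRanges(
  seqnames = c("supercont1.1", "supercont1.1", "supercont1.2"),
  ranges = IRanges(start = c(150, 300, 1000), end = c(180, 320, 1050)),
  peak_id = c("peak1", "peak2", "peak3")
)

# Peaks that fall inside (overlap) a repeat.
hits <- findOverlaps(peaks, repeats)
overlapping <- data.frame(
  peak = peaks$peak_id[queryHits(hits)],
  repeat_name = repeats$Name[subjectHits(hits)]
)
print(overlapping)

# Distance from each peak to its nearest repeat (0 means overlapping).
nearest <- distanceToNearest(peaks, repeats)
peaks$dist_to_repeat <- NA_integer_
peaks$dist_to_repeat[queryHits(nearest)] <- mcols(nearest)$distance
print(peaks)
```

From there you can filter `peaks` on `dist_to_repeat` with whatever cutoff you consider "close".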

Cheers,

H.