Hi,
This file contains only lines of type miRNA_primary_transcript
or miRNA
, and it has no Parent
attribute so it's not really suited for being stored in a TxDb object. Note that one of the primary purposes of a TxDb object is to describe the hierarchical relationship between genes, transcripts, exons, and CDS. AFAIK microRNAs don't have exons and are not linked to genes so they don't really fit the TxDb model. Maybe they don't really fit the DESeq2 (and summarizeOverlaps()
) models either because these tools typically count reads using a GRangesList object that represents the exons grouped by transcript or by gene. Of course you can always force your microRNAs into that model by considering that each miRNA is made of 1 exon:
library(rtracklayer)
gr <- import("dre.gff3")
library(GenomicFeatures)
transcripts <- data.frame(
tx_id=seq_along(gr),
tx_name=mcols(gr)$ID,
tx_type=mcols(gr)$type,
tx_chrom=as.factor(seqnames(gr)),
tx_start=start(gr),
tx_end=end(gr),
tx_strand=as.factor(strand(gr))
)
splicings <- data.frame(
tx_id=transcripts$tx_id,
exon_rank=1,
exon_start=transcripts$tx_start,
exon_end=transcripts$tx_end
)
txdb <- makeTxDb(transcripts, splicings, reassign.ids=TRUE)
Note that the above code is for the current devel version of BioC (3.1). If you use the current release (BioC 3.0), you need to use makeTranscriptDb()
instead of makeTxDb()
and the tx_type
column will be ignored.
Then you can extract the exons grouped by transcript in the usual way:
exonsBy(txdb, by="tx", use.names=TRUE)
# GRangesList object of length 888:
# $MI0002023
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | exon_id exon_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] 1 [451141, 451218] + | 1 <NA> 1
#
# $MIMAT0001851
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | exon_id exon_name exon_rank
# [1] 1 [451151, 451172] + | 2 <NA> 1
#
# $MI0002180
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | exon_id exon_name exon_rank
# [1] 1 [1275348, 1275428] + | 3 <NA> 1
#
# ...
# <885 more elements>
# -------
# seqinfo: 28 sequences from an unspecified genome; no seqlengths
But note that this GRangesList object is not much different from the original GRanges object gr
.
I would suggest that you consult with the DESeq2 authors first to check that this is a valid approach. If so, and if having the microRNAs in a TxDb makes it easier to do DE analysis, then maybe we should modify makeTxDbFromGFF()
to support this kind of GFF3 file with only microRNAs.
Anyway for now I applied a patch to rtracklayer and GenomicFeatures in BioC release to make makeTranscriptDbFromGFF()
fail more gracefully (and with a more informative error message) on this kind of GFF3 file.
Cheers,
H.
Hi,
This error comes from
rtracklayer::import()
, whichmakeTranscriptDbFromGFF()
calls internally to load the GFF or GTF data into a GRanges object. I typically get this error on compressed GTF files when I forget to specifyformat="gtf"
because in that casertracklayer::import()
fails to automatically detect the format and tries to load the data as GFF. You can try to run that step manually first to confirm.Otherwise it would help if we could access to these files so we can reproduce the problem. Also please provide the output of
sessionInfo()
.Thanks,
H.