For a GTF file downloaded from iGenome, say, hg19.gtf, a significant proportion of genes will be dropped without warning when use makeTxDbFromGFF to import it.
As I examined those genes in the source GTF file, I believe the genes with multiple chromosome locations (e.g. chr1 and chr2) will be dropped when using makeTxDbFromGFF.
For example, if you grep a gene name called "XGY2" in the iGenome GTF file, you get:
chrX unknown exon 2670337 2670376 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556"; chrX unknown exon 2688591 2688632 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556"; chrX unknown exon 2692757 2693037 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556"; chrY unknown exon 2620337 2620376 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943"; chrY unknown exon 2638591 2638632 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943"; chrY unknown exon 2642757 2643037 . + . gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943";
However, this gene will not be imported into the TxDb using makeTxDbFromGFF, it is just silently dropped.
I don't know if anyone has faced the problem before. It could be the case that TxDb is designed using unique gene id, but I would consider it as a severe bug as people are losing their annotations without awareness.
Please help to reproduce the problem, thanks.
I'm facing the same problem and I found that the makeTxDbFromGFF() function already dropped the exons if they span multiple chromosomes and/or multiple strands.
I built a GTF file that contains only four exons from one gene, but from different strands of chr1. The makeTxDbFromGFF() function gave warning messages:
> s2=makeTxDbFromGFF("../del4-2str.GTF")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(type, phase) :
The "phase" metadata column contains non-NA values for features of type
exon. This information was ignored.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
The following transcripts were dropped because no genomic ranges could
be found for them and their ranges could not be inferred from their
exons either (because they have them on both strands): TRN_100126488
and the returned object have no exons in it. I'm using GenomicFeatures_1.32.0