makeTxDbFromGFF drops genes which have multiple chromosome locations. (with iGenome GTF)
Entering edit mode
Marlin ▴ 20
Last seen 3.9 years ago

For a GTF file downloaded from iGenome, say, hg19.gtf, a significant proportion of genes will be dropped without warning when use makeTxDbFromGFF to import it.

As I examined those genes in the source GTF file, I believe the genes with multiple chromosome locations (e.g. chr1 and chr2) will be dropped when using makeTxDbFromGFF.


For example, if you grep a gene name called "XGY2" in the iGenome GTF file, you get:

chrX    unknown    exon    2670337    2670376    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556";
chrX    unknown    exon    2688591    2688632    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556";
chrX    unknown    exon    2692757    2693037    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254"; tss_id "TSS24556";
chrY    unknown    exon    2620337    2620376    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943";
chrY    unknown    exon    2638591    2638632    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943";
chrY    unknown    exon    2642757    2643037    .    +    .    gene_id "XGY2"; gene_name "XGY2"; transcript_id "NR_003254_1"; tss_id "TSS3943";

However, this gene will not be imported into the TxDb using makeTxDbFromGFF, it is just silently dropped.

I don't know if anyone has faced the problem before. It could be the case that TxDb is designed using unique gene id, but I would consider it as a severe bug as people are losing their annotations without awareness.

Please help to reproduce the problem, thanks.


genomicfeatures gtf maketxdbfromgff iGenome • 1.4k views
Entering edit mode
Last seen 7 days ago
United States

Marlin provided a nice reproducible example on the Bioc-devel mailing list. The problem is not that genes on multiple chromosomes are not imported, but rather that genes() by default reports only genes that have all exons on the same strand. In the current GenomicFeatures (1.26.0) the help page ?genes shows

     genes(x, columns="gene_id", filter=NULL, vals=NULL,
single.strand.genes.only: 'TRUE' or 'FALSE'. If 'TRUE' (the default),
          then genes that have exons located on both strands of the
          same chromosome or on two different chromosomes are dropped.
          In that case, the genes are returned in a GRanges object.
          Otherwise, all genes are returned in a GRangesList object
          with the columns specified thru the 'columns' argument set as
          _top level_ metadata columns. (Please keep in mind that the
          _top level_ metadata columns of a GRangesList object are not
          displayed by the 'show' method.)

I believe this is because the original contract of genes() was to return a GRanges() object, which is not consistent with exons on separate chromosomes. Other functions, e.g., exons(), did not face this constraint and report all exons.

A related recent change (in response to keepSeqlevels for GRangesList: drop elements with multiple seqnames ) adds pruning.mode to seqlevels<-, so that genes with exons on several seqlevels are not necessarily dropped; I believe this change is still being incorporated.



Entering edit mode

I'm facing the same problem and I found that the makeTxDbFromGFF() function already dropped the exons if they span multiple chromosomes and/or multiple strands.

I built a GTF file that contains only four exons from one gene, but from different strands of chr1. The makeTxDbFromGFF() function gave warning messages:

> s2=makeTxDbFromGFF("../del4-2str.GTF")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(type, phase) :
  The "phase" metadata column contains non-NA values for features of type
  exon. This information was ignored.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because no genomic ranges could
  be found for them and their ranges could not be inferred from their
  exons either (because they have them on both strands): TRN_100126488

and the returned object have no exons in it. I'm using GenomicFeatures_1.32.0


Login before adding your answer.

Traffic: 368 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6