A warning : stop codons that cannot be mapped to an exon in using makeTxDbFromGFF
0
0
Entering edit mode
@shangguandong1996-21805
Last seen 2.2 years ago
China

Hello A warnging appears when I make a Txdb using GenomicFeaures

I am sorry I can not find the gtf download link for Araport11

Welcome to Araport.org. Funding of project has been discontinued so teams from The Arabidopsis Information Resource (TAIR), the National Center for Genome Resources (NCGR), and the Bio-Analytic Resource for Plant Biology (BAR) have taken over its operation and have refreshed/expanded the functionalities that were available at Araport as in the table below.

But the GFF download link is TAIR_download

library(GenomicFeatures)
Tx_At <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",
                         chrominfo = Seqinfo(seqnames = paste0("Chr",c(1:5,"C","M")),
                                             seqlengths = c(30427671,19698289,23459830,18585056,26975502,154478,366924)))

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of
  type stop_codon, exon. This information was ignored.
2: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have stop
  codons that cannot be mapped to an exon: AT1G07320.4, AT1G18180.2,
  AT1G30230.1, AT1G36380.1, AT1G52415.1, AT1G52940.1, AT2G18110.1,
  AT2G35130.1, AT2G35130.3, AT2G39050.1, AT3G13445.1, AT3G17660.1,
  AT3G17660.3, AT3G52070.2, AT3G54680.1, AT3G59450.2, AT4G13730.2,
  AT4G17730.1, AT4G39620.2, AT5G01520.2, AT5G22794.1, AT5G27710.2,
  AT5G45240.2

And it can make the Txdb lose 6 genes compared with my featureCounts result

> genes(Tx_At)
GRanges object with 37330 ranges and 1 metadata column:
            seqnames        ranges strand |     gene_id
               <Rle>     <IRanges>  <Rle> | <character>
  AT1G01010     Chr1     3631-5899      + |   AT1G01010
  AT1G01020     Chr1     6788-9130      - |   AT1G01020
  AT1G01030     Chr1   11649-13714      - |   AT1G01030
  AT1G01040     Chr1   23121-31227      + |   AT1G01040
  AT1G01050     Chr1   31170-33171      - |   AT1G01050
        ...      ...           ...    ... .         ...
  ATMG09730     ChrM 181268-181347      + |   ATMG09730
  ATMG09740     ChrM 191954-192025      - |   ATMG09740
  ATMG09950     ChrM 333651-333725      - |   ATMG09950
  ATMG09960     ChrM 347266-347348      + |   ATMG09960
  ATMG09980     ChrM 359666-359739      - |   ATMG09980
  -------
  seqinfo: 7 sequences from an unspecified genome
# My featureCounts result
> head(data)
          WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G01010       15       15      152      106
AT1G01020      353      461      255      279
AT1G03987        1        0        3        0
AT1G01030       23       39       74       56
AT1G01040      695      746     1831     1611
AT1G03993        0        0        0        0

> str(data)
'data.frame':   37336 obs. of  4 variables:
 $ WT_0h_R1: int  15 353 1 23 695 0 412 0 64 13 ...
 $ WT_0h_R2: int  15 461 0 39 746 0 541 0 118 7 ...
 $ WT_4h_R1: int  152 255 3 74 1831 0 517 0 349 0 ...
 $ WT_4h_R2: int  106 279 0 56 1611 0 512 0 629 0 ...

I have checked these 6 genes. And these genes are indeed the ones in the warning report

> rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
[1] "AT1G36380" "AT1G52415" "AT1G52940" "AT2G18110" "AT2G39050" "AT3G54680"
> gene_notin <- rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
> data[gene_notin, ]
          WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G36380       75       82      140      179
AT1G52415        0        0        0        0
AT1G52940        0        0        0        0
AT2G18110      522      488      622      686
AT2G39050       26       31      133      182
AT3G54680     1236     1777      727      728

I also check one of genes structure in GTF

Chr3    Araport11       5UTR    20244284        20244567        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       exon    20244284        20245202        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       start_codon     20244568        20244570        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       CDS     20244568        20245202        .       +       0       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       stop_codon      20245687        20245689        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       CDS     20245689        20245689        .       +       1       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       exon    20245689        20246080        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3    Araport11       3UTR    20245690        20246080        .       +       .       transcript_id "AT3G54680.1"; gene_id "AT3G54680";

So I am wondering whether I can maintain 6 genes when make Txdb from GTF

I also find a assoc question in warning message from makeTxDbFromGFF in GenomicFeatures package

genomicranges genomicfeatures • 1.1k views
ADD COMMENT

Login before adding your answer.

Traffic: 474 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6