Hello A warnging appears when I make a Txdb using GenomicFeaures
I am sorry I can not find the gtf download link for Araport11
Welcome to Araport.org. Funding of project has been discontinued so teams from The Arabidopsis Information Resource (TAIR), the National Center for Genome Resources (NCGR), and the Bio-Analytic Resource for Plant Biology (BAR) have taken over its operation and have refreshed/expanded the functionalities that were available at Araport as in the table below.
But the GFF download link is TAIR_download
library(GenomicFeatures)
Tx_At <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",
chrominfo = Seqinfo(seqnames = paste0("Chr",c(1:5,"C","M")),
seqlengths = c(30427671,19698289,23459830,18585056,26975502,154478,366924)))
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of
type stop_codon, exon. This information was ignored.
2: In .reject_transcripts(bad_tx, because) :
The following transcripts were rejected because they have stop
codons that cannot be mapped to an exon: AT1G07320.4, AT1G18180.2,
AT1G30230.1, AT1G36380.1, AT1G52415.1, AT1G52940.1, AT2G18110.1,
AT2G35130.1, AT2G35130.3, AT2G39050.1, AT3G13445.1, AT3G17660.1,
AT3G17660.3, AT3G52070.2, AT3G54680.1, AT3G59450.2, AT4G13730.2,
AT4G17730.1, AT4G39620.2, AT5G01520.2, AT5G22794.1, AT5G27710.2,
AT5G45240.2
And it can make the Txdb lose 6 genes compared with my featureCounts result
> genes(Tx_At)
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 Chr1 3631-5899 + | AT1G01010
AT1G01020 Chr1 6788-9130 - | AT1G01020
AT1G01030 Chr1 11649-13714 - | AT1G01030
AT1G01040 Chr1 23121-31227 + | AT1G01040
AT1G01050 Chr1 31170-33171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 181268-181347 + | ATMG09730
ATMG09740 ChrM 191954-192025 - | ATMG09740
ATMG09950 ChrM 333651-333725 - | ATMG09950
ATMG09960 ChrM 347266-347348 + | ATMG09960
ATMG09980 ChrM 359666-359739 - | ATMG09980
-------
seqinfo: 7 sequences from an unspecified genome
# My featureCounts result
> head(data)
WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G01010 15 15 152 106
AT1G01020 353 461 255 279
AT1G03987 1 0 3 0
AT1G01030 23 39 74 56
AT1G01040 695 746 1831 1611
AT1G03993 0 0 0 0
> str(data)
'data.frame': 37336 obs. of 4 variables:
$ WT_0h_R1: int 15 353 1 23 695 0 412 0 64 13 ...
$ WT_0h_R2: int 15 461 0 39 746 0 541 0 118 7 ...
$ WT_4h_R1: int 152 255 3 74 1831 0 517 0 349 0 ...
$ WT_4h_R2: int 106 279 0 56 1611 0 512 0 629 0 ...
I have checked these 6 genes. And these genes are indeed the ones in the warning report
> rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
[1] "AT1G36380" "AT1G52415" "AT1G52940" "AT2G18110" "AT2G39050" "AT3G54680"
> gene_notin <- rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
> data[gene_notin, ]
WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G36380 75 82 140 179
AT1G52415 0 0 0 0
AT1G52940 0 0 0 0
AT2G18110 522 488 622 686
AT2G39050 26 31 133 182
AT3G54680 1236 1777 727 728
I also check one of genes structure in GTF
Chr3 Araport11 5UTR 20244284 20244567 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 exon 20244284 20245202 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 start_codon 20244568 20244570 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 CDS 20244568 20245202 . + 0 transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 stop_codon 20245687 20245689 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 CDS 20245689 20245689 . + 1 transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 exon 20245689 20246080 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
Chr3 Araport11 3UTR 20245690 20246080 . + . transcript_id "AT3G54680.1"; gene_id "AT3G54680";
So I am wondering whether I can maintain 6 genes when make Txdb from GTF
I also find a assoc question in warning message from makeTxDbFromGFF in GenomicFeatures package