Question

How to prepare TxDb object using makeTxDbFromGFF with polycistronic transcripts?

0

Entering edit mode

Piotr Gawronski • 0

@piotr-gawronski-9027

Last seen 24 months ago

Poland

Dear All,

I want to prepare TxDb object from gff3 file using makeTxDbFromGFF function. The gff3 file contains the following lines:

###

Pt Araport11 gene 293 1522 . - . ID=ATCG00020;Name=psbA

Pt Araport11 mRNA 293 1522 . - . ID=ATCG00020.1;Parent=ATCG00020;Name=psbA

Pt Araport11 CDS 383 1444 . - 0 ID=ATCG00020:CDS:1;Parent=ATCG00020.1;Name=psbA:CDS:1

Pt Araport11 exon 293 1522 . - . ID=ATCG00020:exon:1;Parent=ATCG00020.1;Name=psbA:exon:1

Pt Araport11 protein 383 1444 . - . ID=ATCG00020.1-Protein;Name=D1;Derives_from=ATCG00020.1

###

Pt Araport11 gene 6853 7758 . + . ID=ATCG00070;Name=psbK

Pt Araport11 gene 6853 7758 . + . ID=ATCG00080;Name=psbI

Pt Araport11 mRNA 6853 7758 . + . ID=ATCG00070.1-ATCG00080.1;Parent=ATCG00070,ATCG00080;Name=psbK-psbI

Pt Araport11 exon 6853 7758 . + . ID=ATCG00070:exon:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK-psbI

Pt Araport11 CDS 7017 7202 . + 0 ID=ATCG00070:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK:CDS:1

Pt Araport11 CDS 7583 7693 . + 0 ID=ATCG00080:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbI:CDS:1

###

There are two mRNAs. First, encodes for one CDS while the second one contains 2 CDSs (bicistronic transcript). The gff3 file (especially second mRNA with two CDSs) was prepared according to https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md. While running the command:

library(rtracklayer); library(GenomicFeatures)

txdb <- makeTxDbFromGFF("path_to_gff3_file.gff", format = "gff3")

the following massage appears:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon): ATCG00070.1-ATCG00080.1OK

And indeed, there is only one CDS for second transcript (ATCG00070.1-ATCG00080.1)

So my question is if it is possible to represent polycistronic transcripts in TxDb object and if so, how to prepare/load gff file?

Thank you for help!

Piotr

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.28.4 AnnotationDbi_1.38.1   Biobase_2.36.2         BSgenome_1.44.0        rtracklayer_1.36.4     Biostrings_2.44.1      XVector_0.16.0        
 [8] GenomicRanges_1.28.4   GenomeInfoDb_1.12.2    IRanges_2.10.2         S4Vectors_0.14.3       BiocGenerics_0.22.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12               compiler_3.4.2             bitops_1.0-6               tools_3.4.2                zlibbioc_1.22.0            biomaRt_2.32.1            
 [7] digest_0.6.12              bit_1.1-12                 RSQLite_2.0                memoise_1.1.0              tibble_1.3.4               lattice_0.20-35           
[13] pkgconfig_2.0.1            rlang_0.1.2                Matrix_1.2-11              DelayedArray_0.2.7         DBI_0.7                    yaml_2.1.14               
[19] GenomeInfoDbData_0.99.0    knitr_1.17                 bit64_0.9-7                grid_3.4.2                 XML_3.98-1.9               BiocParallel_1.10.1       
[25] blob_1.1.0                 Rsamtools_1.28.0           matrixStats_0.52.2         GenomicAlignments_1.12.1   SummarizedExperiment_1.6.3 RCurl_1.95-4.8

txdb gff3 polycistronic transcript • 4.4k views

ADD COMMENT • link 6.5 years ago Piotr Gawronski • 0

score 2 · Answer 1 · 2017-11-08

Hi,

A TxDb object can represent a polycistronic transcript only if it is "split" i.e. if it is assigned more than one transcript ID. Note that this "split" is what the major annotation providers seem to be doing e.g. NCBI reports 6 mRNAs for Entrez gene 114548 but NM_004895 and NM_001243133 actually correspond to the same mRNA, they only differ by their CDS (different start codon). See:

https://www.ncbi.nlm.nih.gov/gene/114548

If we import the refGene track in a TxDb object, this is what we see for gene 114548:

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("hg38", "refGene")
txlens <- transcriptLengths(txdb, with.cds_len=TRUE,
                                  with.utr5_len=TRUE,
                                  with.utr3_len=TRUE)
subset(txlens, gene_id == "114548")
#      tx_id      tx_name gene_id nexon tx_len cds_len utr5_len utr3_len
# 3359  3359 NM_001079821  114548    11   3849    3111      138      600
# 3360  3360 NM_001127461  114548     9   4320    2940      780      600
# 3361  3361 NM_001127462  114548     8   4286    2940      746      600
# 3362  3362 NM_001243133  114548     9   4457    3105      752      600
# 3363  3363    NM_004895  114548     9   4457    3111      746      600
# 3364  3364    NM_183395  114548     7   4115    2769      746      600

UCSC does this too e.g. mRNAs uc011cez.3 and uc062ysb.1 (gene 54790) are the same but the CDS differ:

txdb <- makeTxDbFromUCSC("hg38", "knownGene")
txlens <- transcriptLengths(txdb, with.cds_len=TRUE,
                                  with.utr5_len=TRUE,
                                  with.utr3_len=TRUE)
subset(txlens, gene_id == "54790")
#       tx_id    tx_name gene_id nexon tx_len cds_len utr5_len utr3_len
# 42798 42798 uc021xqk.1   54790     3   9233    3498      404     5331
# 42799 42799 uc011cez.3   54790    11  10166    6072      797     3297
# 42800 42800 uc062ysb.1   54790    11  10166    6009      860     3297
# 42801 42801 uc003hxj.3   54790    10   9588    3585      386     5617
# 42802 42802 uc003hxk.4   54790    11   9679    6009      386     3284
# 42803 42803 uc062ysc.1   54790     3    455       0        0        0
# 42804 42804 uc010ilp.3   54790     3   4973    3504      140     1329
# 42805 42805 uc062ysd.1   54790     4   4935    1914      103     2918
# 42806 42806 uc062yse.1   54790     2    391       0        0        0
# 42807 42807 uc062ysf.1   54790     3    865     330      535        0

Ensembl also does something similar.

The advantage of this "split" is that it avoids introducing a one-to-many relationship between transcript IDs and CDS or proteins. Keeping this relationship many-to-one simplifies data representation and downstream analyses.

That being said, it's true that the GFF3 format allows (and even seems to recommend) collapsing identical mRNAs that differ only by their CDS into a single mRNA line, thus making the resulting mRNA polycistronic. Unfortunately, makeTxDbFromGFF() doesn't support this at the moment. Note that in theory support could be added. What it would do is "split" each polycistronic mRNA by assigning more than one TxDb internal id (tx_id) to it (one id per CDS). For the moment, the workaround is to perform this "split" upstream i.e. when the GFF3 file is generated or before.

In the case of your GFF3 file, this means:

1) replacing mRNA line

Pt Araport11 mRNA 6853 7758 . + . ID=ATCG00070.1-ATCG00080.1;Parent=ATCG00070,ATCG00080;Name=psbK-psbI

with 2 lines

Pt Araport11 mRNA 6853 7758 . + . ID=ATCG00070.1;Parent=ATCG00070;Name=psbK
Pt Araport11 mRNA 6853 7758 . + . ID=ATCG00080.1;Parent=ATCG00080;Name=psbI

2) then modifying exon line

Pt Araport11 exon 6853 7758 . + . ID=ATCG00070:exon:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK-psbI

so it becomes

Pt Araport11 exon 6853 7758 . + . Parent=ATCG00070.1,ATCG00080.1

(note the use of , instead of - to separate the 2 parents, also note that exon lines don't need to have an ID),

3) and finally modifying the CDS lines so that each of them points to the corresponding mRNA parent.

H.

score 0 · Answer 2 · 2017-11-09

0

Entering edit mode

Piotr Gawronski • 0

@piotr-gawronski-9027

Last seen 24 months ago

Poland

Hi Hervé Pagès,

thank you very much for your suggestion!

cheers!

ADD COMMENT • link 6.5 years ago Piotr Gawronski • 0