NA transcript values in my Txdb
1
1
Entering edit mode
A.J. ▴ 20
@aj-24333
Last seen 22 months ago
United States

Hi,

I have had this problem for a while now, I kind of solved it by using Ensembl's GFF file for the prairie vole, but it'd really like to get MakeTxdbFromGFF working from the NCBI GFF file (Here)

When I try to make a txdb out of that GFF file, I get

> txdb_maybe <- makeTxDbFromGFF("/mnt/Prairie_Vole_Data/Arjen_Folder/Arjen/genomefiles_mo/NCBI/GTF/GCF_000317375.1_MicOch1.0_genomic.gff") 

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in the
  TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "transcript_id"
  attribute are not unique
3: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS (only the first CDS was
  kept for each exon): rna-NM_001289870.1, rna-NM_001290102.1, rna-NM_001290499.1,

So that if I do:

> transcripts(txdb_maybe)[10:5000]
GRanges object with 4991 ranges and 2 metadata columns:
            seqnames            ranges strand |     tx_id        tx_name
               <Rle>         <IRanges>  <Rle> | <integer>    <character>
     [1] NC_022009.1   3465941-3467344      + |        10 XM_005343117.2
     [2] NC_022009.1   3877767-3886973      + |        11 XM_026782825.1
     [3] NC_022009.1   3880864-3885230      + |        12 XM_026782826.1
     [4] NC_022009.1   3880864-3886973      + |        13 XM_005343118.2
     [5] NC_022009.1   3925279-3992427      + |        14 XM_013349067.2
     ...         ...               ...    ... .       ...            ...
  [4987] NC_022011.1 29621640-29622008      - |      4996           <NA>
  [4988] NC_022011.1 29655176-29690024      - |      4997 XM_005345785.3
  [4989] NC_022011.1 29698713-29726003      - |      4998 XM_005345787.3
  [4990] NC_022011.1 30128495-30135431      - |      4999 XM_005345790.3
  [4991] NC_022011.1 30128495-30135932      - |      5000 XM_026789513.1
  -------
  seqinfo: 571 sequences from an unspecified genome; no seqlengths

Some entries don't get a transcript name. But when I look for this specific entry in the GTF file, I find:

NC_022011.1 Gnomon  gene    29621640    29622008    .   -   .   gene_id "LOC101999479"; db_xref "GeneID:101999479"; gbkey "Gene"; gene "LOC101999479"; gene_biotype "pseudogene"; pseudo "true"; 

So according to my GTF this entry is a gene without a transcript. But why does it gets marked up as a transcript in my txdb then? And is there any way to prevent that from happening (like exclude those "NA" transcipts from my txdb)?

Many thanks, Arjen

GFF txdb • 1.9k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 7 hours ago
Seattle, WA, United States

Hi Arjen,

Hmm.. I'm surprised you can import this GTF file at all. Are we talking about the GTF file GCF_000317375.1_MicOch1.0_genomic.gtf located here? I get the following error when I try to import it with makeTxDbFromGFF():

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gtf")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
#   some CDS cannot be mapped to an exon
# In addition: Warning message:
# In .get_cds_IDX(mcols0$type, mcols0$phase) :
#   The "phase" metadata column contains non-NA values for features of type
#   stop_codon. This information was ignored.

But if I use the GFF3 file (from the same location), everything is fine:

txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   some transcripts have no "transcript_id" attribute ==> their name
#   ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   the transcript names ("tx_name" column in the TxDb object) imported
#   from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than one CDS
#   (only the first CDS was kept for each exon): rna-NM_001289870.1,
#   rna-NM_001290102.1, rna-NM_001290499.1, rna-NM_001291243.1

See my sessionInfo() below.

About the unnamed transcripts: I do see an unnamed transcript at coordinates 29621640-29622008 like you did:

transcripts(txdb)[4994:4998]
# GRanges object with 5 ranges and 2 metadata columns:
#          seqnames            ranges strand |     tx_id        tx_name
#             <Rle>         <IRanges>  <Rle> | <integer>    <character>
#   [1] NC_022011.1 29591627-29617240      - |      4994 XM_005345783.2
#   [2] NC_022011.1 29591875-29614709      - |      4995 XM_013354377.2
#   [3] NC_022011.1 29621640-29622008      - |      4996           <NA>
#   [4] NC_022011.1 29655176-29690024      - |      4997 XM_005345785.3
#   [5] NC_022011.1 29698713-29726003      - |      4998 XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

This is actually what I would expect based on what I see in the GFF3 file:

NC_022011.1     Gnomon  pseudogene      29621640        29622008        .       -       .       ID=gene-LOC101999479;Dbxref=GeneID:101999479;Name=LOC101999479;gbkey=Gene;gene=LOC101999479;...
NC_022011.1     Gnomon  exon    29621640        29622008        .       -       .       ID=id-LOC101999479;Parent=gene-LOC101999479;Dbxref=GeneID:101999479;gbkey=exon;gene=LOC101999479;...

This is a gene (pseudogene to be precise) with no reported transcript but one reported exon. What happened here is that makeTxDbFromGFF() treated it as if it had one transcript that spans the entire gene/exon. This is a feature. However it seems that makeTxDbFromGFF() didn't assign a name to this inferred transcript, which may sound unfortunate, but the name of the associated gene is not far away:

transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
#         seqnames            ranges strand |         gene_id     tx_id
#            <Rle>         <IRanges>  <Rle> | <CharacterList> <integer>
#   [1] NC_022011.1 29591627-29617240      - |          Klhdc4      4994
#   [2] NC_022011.1 29591875-29614709      - |          Klhdc4      4995
#   [3] NC_022011.1 29621640-29622008      - |    LOC101999479      4996
#   [4] NC_022011.1 29655176-29690024      - |          Slc7a5      4997
#   [5] NC_022011.1 29698713-29726003      - |            Ca5a      4998
#              tx_name
#          <character>
#   [1] XM_005345783.2
#   [2] XM_013354377.2
#   [3]           <NA>
#   [4] XM_005345785.3
#   [5] XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

So it's easy to assign the gene name to all the unnamed transcripts with something like:

tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
unnamed_tx_idx <- which(is.na(mcols(tx)$tx_name))
mcols(tx)$tx_name[unnamed_tx_idx] <- as.character(mcols(tx)$gene_id[unnamed_tx_idx])
tx[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
#          seqnames            ranges strand |         gene_id     tx_id
#             <Rle>         <IRanges>  <Rle> | <CharacterList> <integer>
#   [1] NC_022011.1 29591627-29617240      - |          Klhdc4      4994
#   [2] NC_022011.1 29591875-29614709      - |          Klhdc4      4995
#   [3] NC_022011.1 29621640-29622008      - |    LOC101999479      4996
#   [4] NC_022011.1 29655176-29690024      - |          Slc7a5      4997
#   [5] NC_022011.1 29698713-29726003      - |            Ca5a      4998
#              tx_name
#          <character>
#   [1] XM_005345783.2
#   [2] XM_013354377.2
#   [3]   LOC101999479
#   [4] XM_005345785.3
#   [5] XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

As for why makeTxDbFromGFF() choked on the GTF file, I'm not sure, I didn't investigate. Maybe the file is too broken for makeTxDbFromGFF()'s taste? Even if NCBI, UCSC, and Ensembl are still using it, GTF is a deprecated format and GFF3 should be used whenever possible. See http://gmod.org/wiki/GFF2.

Hope this helps,

H.

sessionInfo():

R version 4.1.0 beta (2021-05-06 r80268)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r80268/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r80268/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.44.2 AnnotationDbi_1.54.1   Biobase_2.52.0        
[4] GenomicRanges_1.44.0   GenomeInfoDb_1.28.4    IRanges_2.26.0        
[7] S4Vectors_0.31.0       BiocGenerics_0.38.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.8.0             Biostrings_2.60.2          
 [7] assertthat_0.2.1            digest_0.6.28              
 [9] utf8_1.2.2                  BiocFileCache_2.0.0        
[11] R6_2.5.1                    RSQLite_2.2.8              
[13] httr_1.4.2                  pillar_1.6.3               
[15] zlibbioc_1.38.0             rlang_0.4.11               
[17] progress_1.2.2              curl_4.3.2                 
[19] rstudioapi_0.13             blob_1.2.2                 
[21] Matrix_1.3-4                BiocParallel_1.26.2        
[23] stringr_1.4.0               RCurl_1.98-1.5             
[25] bit_4.0.4                   biomaRt_2.48.3             
[27] DelayedArray_0.18.0         compiler_4.1.0             
[29] rtracklayer_1.52.1          pkgconfig_2.0.3            
[31] tidyselect_1.1.1            KEGGREST_1.32.0            
[33] SummarizedExperiment_1.22.0 tibble_3.1.5               
[35] GenomeInfoDbData_1.2.6      matrixStats_0.61.0         
[37] XML_3.99-0.8                fansi_0.5.0                
[39] crayon_1.4.1                dplyr_1.0.7                
[41] dbplyr_2.1.1                GenomicAlignments_1.28.0   
[43] bitops_1.0-7                rappdirs_0.3.3             
[45] grid_4.1.0                  lifecycle_1.0.1            
[47] DBI_1.1.1                   magrittr_2.0.1             
[49] stringi_1.7.5               cachem_1.0.6               
[51] XVector_0.32.0              xml2_1.3.2                 
[53] ellipsis_0.3.2              filelock_1.0.2             
[55] generics_0.1.0              vctrs_0.3.8                
[57] rjson_0.2.20                restfulr_0.0.13            
[59] tools_4.1.0                 bit64_4.0.5                
[61] glue_1.4.2                  purrr_0.3.4                
[63] MatrixGenerics_1.4.3        hms_1.1.1                  
[65] fastmap_1.1.0               yaml_2.2.1                 
[67] memoise_2.0.0               BiocIO_1.2.0
ADD COMMENT
1
Entering edit mode

Yes! Wow, with this information I understood how to replace NA values in a GRanges object (that I made from my gff) and use that Granges object to make a txdb that I can use in the GeneRegionTrack of GVIZ. I am so happy...!

Thanks a lot, Arjen

P.S. I must have put the wrong the link for the GTF, as I was also using the GFF file. Sorry about that....!

ADD REPLY
0
Entering edit mode

Hi I'm having the same issue with NAs in my txdb: some entries do not have transcript_id= atributes. This won't work bc I need the txdb object itself to get cds using cdsBy().

# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in
  the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
     the transcript names ("tx_name" column in the TxDb object) imported from the
     "transcript_id" attribute are not unique

> tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
> tx
GRanges object with 33172 ranges and 3 metadata columns:
                seqnames     ranges strand |         gene_id     tx_id        tx_name
                   <Rle>  <IRanges>  <Rle> | <CharacterList> <integer>    <character>
      [1]    NC_029475.1       1-70      + |      AZ255_gt01         1           <NA>
      [2]    NC_029475.1    71-1051      + |      AZ255_gr02         2           <NA>
      [3]    NC_029475.1  1046-1115      + |      AZ255_gt02         3           <NA>
      [4]    NC_029475.1  1116-2707      + |      AZ255_gr01         4           <NA>
      [5]    NC_029475.1  2707-2781      + |      AZ255_gt03         5           <NA>
      ...            ...        ...    ... .             ...       ...            ...
  [33168] NW_021820213.1    56-1496      - |    LOC110471444     33168 XM_021532114.1
  [33169] NW_021820218.1   462-4045      + |    LOC110480953     33169 XM_021549355.1
  [33170] NW_021820218.1  3543-3612      + |    LOC116184804     33170 XR_004149481.1
  [33171] NW_021820218.1 5930-11773      - |    LOC110480959     33171 XM_021549366.2
  [33172] NW_021820219.1    5-35982      - |    LOC110473356     33172 XM_021535836.1
  -------
  seqinfo: 449 sequences from an unspecified genome; no seqlengths

# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript)

# use lapply function and unlist function and diretcly convert into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_trnascript, function(x) {unlist(x)})) 

# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")

Error in .Call2("write_XStringSet_to_fasta", x, filexp_list, width, lkup,  : 
  'names(x)' contains NAs

Is there a way I can create transcript_id=gene_id for these names = <NA> entries or even just drop them in any step of this? Thank you

ADD REPLY
1
Entering edit mode
> tx2 <- tx[!is.na(tx$tx_name)]
> tx2
GRanges object with 32844 ranges and 3 metadata columns:
                seqnames        ranges strand |         gene_id     tx_id
                   <Rle>     <IRanges>  <Rle> | <CharacterList> <integer>
      [1]    NC_042565.1   41567-45737      + |    LOC110474963        38
      [2]    NC_042565.1   41568-45737      + |    LOC110474963        39
      [3]    NC_042565.1  61571-102813      + |           DCHS1        40
      [4]    NC_042565.1 108172-113800      + |            TPP1        41
      [5]    NC_042565.1 113961-115690      + |           TAF10        42
      ...            ...           ...    ... .             ...       ...
  [32840] NW_021820213.1       56-1496      - |    LOC110471444     33168
  [32841] NW_021820218.1      462-4045      + |    LOC110480953     33169
  [32842] NW_021820218.1     3543-3612      + |    LOC116184804     33170
  [32843] NW_021820218.1    5930-11773      - |    LOC110480959     33171
  [32844] NW_021820219.1       5-35982      - |    LOC110473356     33172
                 tx_name
             <character>
      [1] XR_002466193.2
      [2] XR_002466194.2
      [3] XM_021538775.2
      [4] XM_021538772.1
      [5] XM_021538773.1
      ...            ...
  [32840] XM_021532114.1
  [32841] XM_021549355.1
  [32842] XR_004149481.1
  [32843] XM_021549366.2
  [32844] XM_021535836.1
  -------
  seqinfo: 449 sequences from an unspecified genome; no seqlengths
ADD REPLY
0
Entering edit mode

Hi Thank you for the reply! I need to preserve the txdb structure in order to pass it in

cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript)

is there a way to then retrieve it based on the modified tx?

ADD REPLY
0
Entering edit mode

Rather than tacking this on to the end of an existing thread, please post a new question. And tag it with the packages you are using, and show how you get the 'dna' object.

ADD REPLY

Login before adding your answer.

Traffic: 620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6