Hi,
I have had this problem for a while now, I kind of solved it by using Ensembl's GFF file for the prairie vole, but it'd really like to get MakeTxdbFromGFF working from the NCBI GFF file (Here)
When I try to make a txdb out of that GFF file, I get
> txdb_maybe <- makeTxDbFromGFF("/mnt/Prairie_Vole_Data/Arjen_Folder/Arjen/genomefiles_mo/NCBI/GTF/GCF_000317375.1_MicOch1.0_genomic.gff")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in the
TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
the transcript names ("tx_name" column in the TxDb object) imported from the "transcript_id"
attribute are not unique
3: In .find_exon_cds(exons, cds) :
The following transcripts have exons that contain more than one CDS (only the first CDS was
kept for each exon): rna-NM_001289870.1, rna-NM_001290102.1, rna-NM_001290499.1,
So that if I do:
> transcripts(txdb_maybe)[10:5000]
GRanges object with 4991 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] NC_022009.1 3465941-3467344 + | 10 XM_005343117.2
[2] NC_022009.1 3877767-3886973 + | 11 XM_026782825.1
[3] NC_022009.1 3880864-3885230 + | 12 XM_026782826.1
[4] NC_022009.1 3880864-3886973 + | 13 XM_005343118.2
[5] NC_022009.1 3925279-3992427 + | 14 XM_013349067.2
... ... ... ... . ... ...
[4987] NC_022011.1 29621640-29622008 - | 4996 <NA>
[4988] NC_022011.1 29655176-29690024 - | 4997 XM_005345785.3
[4989] NC_022011.1 29698713-29726003 - | 4998 XM_005345787.3
[4990] NC_022011.1 30128495-30135431 - | 4999 XM_005345790.3
[4991] NC_022011.1 30128495-30135932 - | 5000 XM_026789513.1
-------
seqinfo: 571 sequences from an unspecified genome; no seqlengths
Some entries don't get a transcript name. But when I look for this specific entry in the GTF file, I find:
NC_022011.1 Gnomon gene 29621640 29622008 . - . gene_id "LOC101999479"; db_xref "GeneID:101999479"; gbkey "Gene"; gene "LOC101999479"; gene_biotype "pseudogene"; pseudo "true";
So according to my GTF this entry is a gene without a transcript. But why does it gets marked up as a transcript in my txdb then? And is there any way to prevent that from happening (like exclude those "NA" transcipts from my txdb)?
Many thanks, Arjen
Yes! Wow, with this information I understood how to replace NA values in a GRanges object (that I made from my gff) and use that Granges object to make a txdb that I can use in the GeneRegionTrack of GVIZ. I am so happy...!
Thanks a lot, Arjen
P.S. I must have put the wrong the link for the GTF, as I was also using the GFF file. Sorry about that....!
Hi I'm having the same issue with NAs in my txdb: some entries do not have transcript_id= atributes. This won't work bc I need the txdb object itself to get cds using cdsBy().
Is there a way I can create transcript_id=gene_id for these names = <NA> entries or even just drop them in any step of this? Thank you
Hi Thank you for the reply! I need to preserve the txdb structure in order to pass it in
is there a way to then retrieve it based on the modified
tx
?Rather than tacking this on to the end of an existing thread, please post a new question. And tag it with the packages you are using, and show how you get the 'dna' object.