Hi I'm having NAs in my txdb: some entries do not have transcript_id= atributes.
# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in
the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
the transcript names ("tx_name" column in the TxDb object) imported from the
"transcript_id" attribute are not unique
> tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
> tx
GRanges object with 33172 ranges and 3 metadata columns:
seqnames ranges strand | gene_id tx_id tx_name
<Rle> <IRanges> <Rle> | <CharacterList> <integer> <character>
[1] NC_029475.1 1-70 + | AZ255_gt01 1 <NA>
[2] NC_029475.1 71-1051 + | AZ255_gr02 2 <NA>
[3] NC_029475.1 1046-1115 + | AZ255_gt02 3 <NA>
[4] NC_029475.1 1116-2707 + | AZ255_gr01 4 <NA>
[5] NC_029475.1 2707-2781 + | AZ255_gt03 5 <NA>
... ... ... ... . ... ... ...
[33168] NW_021820213.1 56-1496 - | LOC110471444 33168 XM_021532114.1
[33169] NW_021820218.1 462-4045 + | LOC110480953 33169 XM_021549355.1
[33170] NW_021820218.1 3543-3612 + | LOC116184804 33170 XR_004149481.1
[33171] NW_021820218.1 5930-11773 - | LOC110480959 33171 XM_021549366.2
[33172] NW_021820219.1 5-35982 - | LOC110473356 33172 XM_021535836.1
-------
seqinfo: 449 sequences from an unspecified genome; no seqlengths
# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript, use.names = TRUE)
# Get dna seq
dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")
# use lapply function and unlist function and diretcly convert into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_trnascript, function(x) {unlist(x)}))
# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")
Error in .Call2("write_XStringSet_to_fasta", x, filexp_list, width, lkup, :
'names(x)' contains NAs
This won't work,
> tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
> tx
> tx2 <- tx[!is.na(tx$tx_name)]
> tx2
I need to preserve the txdb structure in order to pass it in
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript)
is there a way to then retrieve it based on the modified tx?
or is there a way I can create transcript_id=gene_id for these names = <NA> entries or even just drop them in any step of this? Thank you
Perfect! Dropping them works for me. They are only a few... Sorry, I did use cdsBy(...use.names = TRUE). I guess just got lost when I was playing around trying to understand these objects. Thank you!