Question

NAs in my txdb tx_name

0

Entering edit mode

Madza Farias-Virgens • 0

@mfarias-virgens

Last seen 2.3 years ago

United States

Hi I'm having NAs in my txdb: some entries do not have transcript_id= atributes.

# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in
  the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
     the transcript names ("tx_name" column in the TxDb object) imported from the
     "transcript_id" attribute are not unique

> tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
> tx
GRanges object with 33172 ranges and 3 metadata columns:
                seqnames     ranges strand |         gene_id     tx_id        tx_name
                   <Rle>  <IRanges>  <Rle> | <CharacterList> <integer>    <character>
      [1]    NC_029475.1       1-70      + |      AZ255_gt01         1           <NA>
      [2]    NC_029475.1    71-1051      + |      AZ255_gr02         2           <NA>
      [3]    NC_029475.1  1046-1115      + |      AZ255_gt02         3           <NA>
      [4]    NC_029475.1  1116-2707      + |      AZ255_gr01         4           <NA>
      [5]    NC_029475.1  2707-2781      + |      AZ255_gt03         5           <NA>
      ...            ...        ...    ... .             ...       ...            ...
  [33168] NW_021820213.1    56-1496      - |    LOC110471444     33168 XM_021532114.1
  [33169] NW_021820218.1   462-4045      + |    LOC110480953     33169 XM_021549355.1
  [33170] NW_021820218.1  3543-3612      + |    LOC116184804     33170 XR_004149481.1
  [33171] NW_021820218.1 5930-11773      - |    LOC110480959     33171 XM_021549366.2
  [33172] NW_021820219.1    5-35982      - |    LOC110473356     33172 XM_021535836.1
  -------
  seqinfo: 449 sequences from an unspecified genome; no seqlengths

# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript, use.names = TRUE)

# Get dna seq

dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")

# use lapply function and unlist function and diretcly convert into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_trnascript, function(x) {unlist(x)})) 

# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")

Error in .Call2("write_XStringSet_to_fasta", x, filexp_list, width, lkup,  : 
  'names(x)' contains NAs

This won't work,

> tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
> tx
> tx2 <- tx[!is.na(tx$tx_name)]
> tx2

I need to preserve the txdb structure in order to pass it in

cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript)

is there a way to then retrieve it based on the modified tx?

or is there a way I can create transcript_id=gene_id for these names = <NA> entries or even just drop them in any step of this? Thank you

GenomicFeatures Biostrings • 1.0k views

ADD COMMENT • link 2.4 years ago Madza Farias-Virgens • 0

score 1 · Answer 1 · 2021-11-15

Hi,

Where is txdb.cds_by_transcript coming from? Did you call cdsBy(txdb, by="tx", use.names=TRUE) to get it? Note that by using use.names=TRUE you're asking that the transcript names be set on the returned GRangesList object. And since some of those names are NAs, they cause problems later. In particular they end up on DNAStringSet object LargeDNAStringSet that you are trying to export to a FASTA file with Biostrings::writeXStringSet(). Problem is that Biostrings::writeXStringSet() wants to use the names that are on the DNAStringSet object to write the header line of each FASTA record.

So the question is: what do you want to see in the header lines of the FASTA file for those transcripts that don't have a name? Or do you want to just get rid of those transcripts?

The latter is easily done with:

named_idx <-  which(!is.na(names(LargeDNAStringSet)))
LargeDNAStringSet <- LargeDNAStringSet[named_idx]

For the former, you could make up some names with something like this:

unnamed_idx <- which(is.na(names(LargeDNAStringSet)))
names(LargeDNAStringSet)[unnamed_idx] <- paste0("UNNAMEDTRANSCRIPT", seq_along(unnamed_idx))

Hope this helps,

H.