Exporting a FASTA file from a getBM vs getSequence query
1
0
Entering edit mode
Glubbdrubb • 0
@glubbdrubb-13951
Last seen 7.2 years ago

A summary:

I want to get sequences for a list of genes, with their stable gene IDs, transcript IDs, and gene symbols in the ID lines in the resultant fasta file, like so:

>MGP_SPRETEiJ_G0015912|MGP_SPRETEiJ_T0023337|Tab2
ATGACTTCTCTCAACTTGGACTTGCAGTCACAGAATGTTTACCACCATGGAAGAGAAGGA
AGTCGAGTGAATGGAAGCAG

MGP_SPRETEiJ is the species ID.

 

getSequence provides easy export of sequences to FASTA file, but does not provide all that data in the ID line for each sequence

getBM provides all the sections I need, but I can't export the sequences to FASTA.

 

My methods:

First, getSequence:

mart <- useMart("ensembl", dataset="caperea_gene_ensembl")
seq = getSequence(id = genes$V1,
                  seqType = "cdna",
                  type = "hgnc_symbol",
                  mart = mart)

This is exported to FASTA by:

exportFASTA(seq, file='Data/biomartExportTest.seq')

But the fasta ID lines only have the gene name, not the ensembl stable gene ID, or the transcript ID:

>PDPK1
ATCCAGTCCAGCGTGGTGTTATGTGCCTGCCCATCCCCATCAATGGTGAGGACACAGAC

 

Now the getBM method:

fasta = getBM(attributes=c("coding",
                           "ensembl_gene_id",
                           "ensembl_transcript_id",
                           "external_gene_name"),
                    filters = "hgnc_symbol",
                    values = genes$V1,
                    mart = ensembl)

The resultant dataframe has all the data I need, but I  don't know how to export it to FASTA:

 

biomart fasta • 2.4k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 35 minutes ago
United States
library(Biostrings)

z <- DNAStringSet(fasta[,1])

names(z) <- apply(fasta[,-1], 2, paste, collapse = "; ")

writeXStringSet(z, "biomartExportTest.seq")
ADD COMMENT
0
Entering edit mode

Biostrings looks like a very useful package! Thanks for your answer :-)

But something went wrong during the paste function.

Here's the error I get:

> writeXStringSet(z, 'Data/biomartEportTest.fasta')
Error in .Call2("write_XStringSet_to_fasta", x, filexp_list, width, lkup,  :
  'names(x)' contains NAs

 

When I view z:

> z
  A DNAStringSet instance of length 89
     width seq                                                                         names               
 [1]  1776 GACCGTGTGGCGCGCATCGGACTGGGCGTGGTTTTG...AGTTTGGAAGGGGCTGCACCCATGGGCCCACCTTAA ENSCAPG0000000376...
 [2]  3714 ATGACTACAAAACGAAGTTTGTTTGTGCGGTTGGTA...AAAATTATTCAGTCTGGCCTTGCACTCAGTATGTAG ENSCAPT0000000419...
 [3]  1176 ATGCTAACCCCTTCCCCGGTCCTGGTTCTCAACCTC...GCGCTCAGCCAGCCCCATCCGCCCTTCTTCAGGTAG SARM1; ERBIN; MAP...
 [4]  1851 ATGCAGAGCACTCCTAACCATCTGTGGCTTTTATCT...ATGGATGTTGGCCTACGGAATGTCGACGGTCTTTAG <NA>
 [5]  1611 GAACTTGATCTCAAAATAGCTATTAAGTCTTGCCGA...ATGATGAATCTTGATTGGAGTTGGTTAACTGAATGA <NA>
 ...   ... ...
[85]  1320 ATCCAGTCCAGCGTGGTGTTATGTGCCTGCCCATCC...CGATACCAGAGCCACCCAGATGCTGCTGTGCAGTGA <NA>
[86]  1113 ATGGACAGTAAATGTGTGTTCATCTCTGTCAGTAAA...AGCAAGAGAAACAGCTGGTATGTGATTACAACTTGA <NA>
[87]  2418 ATGTCTGACAGTGGATCACAACCTGGTTCAATGGGT...GCCATTGAAGAAACAGAAGGATTTGGACAAGAGTAA <NA>
[88]   999 ATCCAGTCCAGCGTGGTGTTATGTGCCTGCCCATCC...CGATACCAGAGCCACCCAGATGCTGCTGTGCAGTGA <NA>
[89]  1119 ACATTACTACCTCCCCATTCATACCCCAAAGAAACT...CCAGCTACTCCCAGCTCTCCCATGTATGTTGATTGA <NA>

Many of the sequences have NA instead of names. And many sequences have more than one name set.

ADD REPLY
0
Entering edit mode

I fixed it!

In the line:

names(z) <- apply(fasta[,-1], 2, paste, collapse = "; ")

I changed 2 to 1:

names(z) <- apply(fasta[,-1], 1, paste, collapse = "; ")

 

 

And it works now!

 

Thank you so much for your help!

 

ADD REPLY

Login before adding your answer.

Traffic: 958 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6