Question

Reading fasta file with multiple sequences

Riot • 0

Hello all,

I'm trying to read a FASTA file that has over 5000 sequences. The plan is to create an object that holds all of the sequences, translate them into protein, and then carry the protein sequences over to Bio Linux. I've managed to do this, but only with one sequence at a time (and I'm still new to RStudio). Please see below for the code I'm using. Can someone please tell me where I'm going wrong?

> contigs = read.fasta("contigs.fasta", seqtype = "DNA")

> contigsdnaseq = contigs[[1]]  # I think this is the part where things go wrong. I'm not sure what code to use in order for the program to recognize the 5000+ sequences.

> getTrans(contigsdnaseq, sens = "F", NAstring = "X", ambiguous = FALSE, frame = 0, numcode = 1)

> contigs_aa= getTrans(contigsdnaseq,sens = "F")

> write.fasta(contigs_aa,contigs_aa,file.out = "contigs_aa.fasta")

> contigsaafile = read.fasta("contigs_aa.fasta", seqtype = "AA")

> getAnnot(contigsaafile)

multiple sequences bioconductor rstudio seqinr • 4.5k views

updated 7.2 years ago by Martin Morgan 25k • written 7.2 years ago by Riot • 0

score 2 · Answer 1 · 2017-02-08

seqinr is a CRAN package so you'd have to ask elsewhere for help.

In Bioconductor, you'd use

library(Biostrings)
dna = readDNAStringSet("your.fasta")
aa = translate(dna)
writeXStringSet(aa, "aa.fasta")

This would process all of the sequences in your FASTA file in one go; there is no need to iterate, because translate() operates on the whole set at once.
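To make the "one go" point concrete, here is a minimal sketch that runs the same four calls on a toy two-record FASTA file. The file contents, the temporary paths, and the variable names fasta_in and fasta_out are made up for illustration; they are not the asker's contigs.

```r
library(Biostrings)

# Toy input (an assumption for illustration): two short records written to a temp file.
fasta_in <- tempfile(fileext = ".fasta")
writeLines(c(">contig1", "ATGGCCAAA", ">contig2", "ATGTTTGGATAA"), fasta_in)

# Read every record at once; the result has one element per FASTA record.
dna <- readDNAStringSet(fasta_in)

# translate() works on the whole set, so no loop over sequences is needed.
aa <- translate(dna)

# Write all translated sequences to a single multi-record FASTA file.
fasta_out <- tempfile(fileext = ".fasta")
writeXStringSet(aa, fasta_out)

length(aa)        # number of translated sequences
as.character(aa)  # protein strings, one per input record
```

Real contigs often contain ambiguity codes such as N. In that case translate() may stop with an error under its default settings; its if.fuzzy.codon argument (for example if.fuzzy.codon = "X") is one way to handle this, though whether that is appropriate depends on your data.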

I'm not really sure what getAnnot() retrieves for amino acid sequences; it seems like it's just the identifier, which here would be names(aa). If you need more than that, you would use one of the Bioconductor 'org' packages (e.g., org.Hs.eg.db) or biomaRt; see the vignette "AnnotationDbi: Introduction To Bioconductor Annotation Packages" in the AnnotationDbi package, or the biomaRt package vignette, for more.
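As a rough illustration of the names(aa) remark, the sketch below builds a small set in memory and inspects the identifiers carried through translation. The sequence names contig1 and contig2 are hypothetical; with a real file they would come from the FASTA header lines.

```r
library(Biostrings)

# Hypothetical named sequences standing in for records read from a FASTA file.
dna <- DNAStringSet(c(contig1 = "ATGGCCAAA", contig2 = "ATGTTTGGATAA"))
aa <- translate(dna)

names(aa)  # identifiers, carried over from the DNA set
width(aa)  # length of each translated sequence, in amino acids
```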