Problem importing a FASTA file: Biostrings or seqinr?

m a ▴ 10

@m-a-3789

Last seen 10.6 years ago

Hello,

I would like to compute simple statistics on a specific DNA sequence. To do that, I need to import a sequence from a FASTA file:

http://www.ncbi.nlm.nih.gov/nuccore/9626243?report=fasta&log$=seqview

After downloading it, I ran the following code with the seqinr package:

    dnafile <- system.file("sequences/seqbac.fasta", package = "seqinr")
    cc <- read.fasta(file = dnafile)

cc then gives me the following vector:

    ...
    [47764] "t" "c" "c" "c" "t"
    ......

My problem is that I would now like to use that vector to compute basic statistics, e.g. GC content and base frequencies, and I don't see how. For instance, a histogram of the vector with hist(cc) doesn't work.

My first intention, by the way, was to use the Biostrings package to import the FASTA file, with something like readFASTA(" directory", strip.desc=TRUE). But how should I know which directory I have to put the data in? I have tried a few directories, but it still does not find my data.

Thanks in advance,

Moses
Student in biostatistics

Updated 15.4 years ago by Hervé Pagès 16k • written 15.4 years ago by m a ▴ 10

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Moses,

Once you've figured out where your FASTA file is located, you can do:

    library(Biostrings)
    myseqs <- read.DNAStringSet("path/to/your/fasta_file.fa", "fasta")
    myseqs
    myseq <- myseqs[[1]]

    ## For base frequencies:
    alphabetFrequency(myseq)

    ## For GC content:
    dinucleotideFrequency(myseq)
    dinucleotideFrequency(myseq, as.prob=TRUE)

Biostrings also has trinucleotideFrequency(), oligonucleotideFrequency(), and much more (see man pages for more info about those functions).

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
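A note for readers on a current Biostrings release: as far as I know, read.DNAStringSet() was later retired in favour of readDNAStringSet(), and the overall G+C fraction (as opposed to dinucleotide counts) can be obtained with letterFrequency(). The sketch below is only an illustration under those assumptions. The file, its header line and its ten-base sequence are hypothetical toy data written to a temporary location so that the example is self-contained.

```r
library(Biostrings)

# Hypothetical toy file so the sketch is self-contained; in practice,
# fasta_path would point at the downloaded FASTA file.
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">toy_sequence", "TCCCTAGAAG"), fasta_path)

# Confirm that R can see the file before trying to read it.
stopifnot(file.exists(fasta_path))

# Assumption: a Biostrings release recent enough to provide
# readDNAStringSet() in place of read.DNAStringSet().
myseqs <- readDNAStringSet(fasta_path, format = "fasta")
myseq <- myseqs[[1]]

# Counts of A, C, G and T (plus an "other" bucket).
base_counts <- alphabetFrequency(myseq, baseOnly = TRUE)

# Fraction of positions that are either G or C.
gc_fraction <- letterFrequency(myseq, letters = "GC", as.prob = TRUE)

print(base_counts)
print(gc_fraction)
```

If file.exists() returns FALSE for your own path, the path is probably being resolved relative to the working directory reported by getwd(); passing the full absolute path of the file avoids that.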

Written 15.4 years ago by Hervé Pagès 16k

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 2.1 years ago

United States

Hi Moses,

On Nov 9, 2009, at 7:21 AM, m a wrote:

> [...]
> cc gives me then the following vector
> ...
> [47764] "t" "c" "c" "c" "t"
> ......
>
> My problem is I would like now to use that vector to perform basic
> statistics eg; GC content analysis, base frequencies . I hardly see how ?
> For instance an histogram on my vector like hist(cc) don't work

It looks like the call through seqinr::read.fasta returns you a character vector for the sequence? (I'm guessing, I haven't used it.)

If that's the case, one way to get frequencies would be via the table command, e.g.:

    R> fa <- c("t", "c", "c", "c", "t", "a", "g", "a", "a", "g")
    R> table(fa)
    fa
    a c g t
    3 3 2 2

Though, I'd probably prefer using Biostrings:

> My first intention by the way was to use biostring package to import
> fasta file, like readFASTA(" directory",strip.desc=TRUE). But how sould I
> know under which directory I have to put data ? Because I ve tried few
> directories but he still do not found my data

How is it that you don't know where to find your data? I'm not sure there's anything we can do to help you find it, so ... just find it :-)

Once you know where it is, you can pass the absolute path *of the file* to the readFASTA function. In your example above, it looks like you want to call "readFASTA" on a directory, which won't work.

For instance, on my computer (I'm using OS X), in order to read in some file on my HD, I'd do:

    library(Biostrings)
    my.fasta <- readFASTA('/Users/stavros/Data/YeastPromoters.fa')

Does that help?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
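The table() idea above extends to GC content using base R alone. This is a minimal sketch, assuming the sequence is already a character vector with one lower-case base per element; the vector fa below is the same toy data as in the reply, not the real sequence. As far as I know, seqinr's read.fasta() returns a list with one such vector per FASTA record, so with real data the first sequence would be taken with something like cc[[1]].

```r
# Toy stand-in for a single sequence: one lower-case base per element.
fa <- c("t", "c", "c", "c", "t", "a", "g", "a", "a", "g")

# Count each base; fixing the factor levels keeps any base with a
# zero count in the result.
base_counts <- table(factor(fa, levels = c("a", "c", "g", "t")))

# Convert the counts to relative frequencies.
base_freqs <- prop.table(base_counts)

# GC content: the fraction of positions that are g or c.
gc_content <- mean(fa %in% c("g", "c"))

print(base_counts)
print(base_freqs)
print(gc_content)
```

hist() fails on this kind of input because it expects numeric values; calling barplot(base_counts) on the table of counts gives the picture the question was after.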

Written 15.4 years ago by Steve Lianoglou ★ 13k
