Problem importing a FASTA file: Biostrings or seqinr?
m a
@m-a-3789
Hello,

I would like to compute simple statistics on a specific DNA sequence. To do that, I need to import a sequence from a FASTA file:

http://www.ncbi.nlm.nih.gov/nuccore/9626243?report=fasta&log$=seqview

After downloading it, I ran the following code with the seqinr package:

    dnafile <- system.file("sequences/seqbac.fasta", package = "seqinr")
    cc <- read.fasta(file = dnafile)

Printing cc then gives me the following vector:

    ...
    [47764] "t" "c" "c" "c" "t"
    ...

I would now like to use that vector for basic statistics, e.g. GC content and base frequencies, but I do not see how. For instance, a histogram of the vector, hist(cc), does not work.

My first intention, by the way, was to use the Biostrings package to import the FASTA file, with something like readFASTA(" directory", strip.desc=TRUE). But how should I know which directory I have to put the data in? I have tried a few directories, but R still does not find my data.

Thanks in advance,

Moses
Student in biostatistics
@herve-pages-1542
Hi Moses,

Once you've figured out where your FASTA file is located, you can do:

    library(Biostrings)
    myseqs <- read.DNAStringSet("path/to/your/fasta_file.fa", "fasta")
    myseqs
    myseq <- myseqs[[1]]

    ## For base frequencies:
    alphabetFrequency(myseq)

    ## For dinucleotide frequencies (including the GC dinucleotide):
    dinucleotideFrequency(myseq)
    dinucleotideFrequency(myseq, as.prob=TRUE)

Biostrings also has trinucleotideFrequency(), oligonucleotideFrequency(), and much more (see the man pages for more information about those functions).

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
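A small sketch of the Biostrings route, for readers who want the overall G+C fraction rather than the dinucleotide counts. It makes a few assumptions that are not in the reply: the toy sequence and the temporary file are hypothetical stand-ins for the real FASTA file, and the reader is called readDNAStringSet(), which, as far as I know, is the name later Biostrings releases use for the read.DNAStringSet() function mentioned in the reply.

```r
library(Biostrings)

# Hypothetical toy record written to a temporary FASTA file,
# standing in for "path/to/your/fasta_file.fa"
fasta_path <- tempfile(fileext = ".fa")
writeXStringSet(DNAStringSet(c(toy = "TCCCTAGAAG")), fasta_path)

# Read the file back; the result is a DNAStringSet (one element per record)
myseqs <- readDNAStringSet(fasta_path)
myseq <- myseqs[[1]]

# Counts of A, C, G and T (ambiguity codes are pooled under "other")
base_counts <- alphabetFrequency(myseq, baseOnly = TRUE)

# Overall GC content: "GC" is counted as "G or C", returned as a fraction
gc_fraction <- letterFrequency(myseq, letters = "GC", as.prob = TRUE)

print(base_counts)
print(gc_fraction)
```

With a multi-record file, the same calls can be applied to the whole DNAStringSet instead of a single element, in which case they return one row per sequence.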
@steve-lianoglou-2771
Hi Moses,

On Nov 9, 2009, at 7:21 AM, m a wrote:

> After download I run the following code with the package seqinr:
>
> dnafile <- system.file("sequences/seqbac.fasta", package = "seqinr")
> cc <- read.fasta(file = dnafile)
>
> cc gives me then the following vector
> ...
> [47764] "t" "c" "c" "c" "t"
> ...
>
> My problem is I would like now to use that vector to perform basic
> statistics, e.g. GC content analysis, base frequencies. I hardly see how?
> For instance a histogram on my vector like hist(cc) doesn't work

It looks like the call through seqinr::read.fasta returns a character vector for the sequence? (I'm guessing, I haven't used it.) If that's the case, one way to get frequencies would be via the table command, e.g.:

    R> fa <- c("t", "c", "c", "c", "t", "a", "g", "a", "a", "g")
    R> table(fa)
    fa
    a c g t
    3 3 2 2

Though, I'd probably prefer using Biostrings:

> My first intention by the way was to use the Biostrings package to import
> the fasta file, like readFASTA(" directory", strip.desc=TRUE). But how
> should I know under which directory I have to put the data? Because I've
> tried a few directories but it still does not find my data

How is it that you don't know where to find your data? I'm not sure there's anything we can do to help you find it, so ... just find it :-)

Once you know where it is, you can pass the absolute path *of the file* to the readFASTA function. In your example above, it looks like you want to call "readFASTA" on a directory, which won't work.

For instance, on my computer (I'm using OS X), in order to read in some file on my HD, I'd do:

    library(Biostrings)
    my.fasta <- readFASTA('/Users/stavros/Data/YeastPromoters.fa')

Does that help?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
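The two points in this reply (pass the path of the file itself, then tabulate the bases) can be sketched in base R alone. This is only an illustration under stated assumptions: the temporary file and its toy sequence are hypothetical, and the parser handles a single-record FASTA file only; it is not how seqinr or Biostrings read files.

```r
# Hypothetical toy FASTA file in a temporary location
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_sequence", "TCCCT", "AGAAG"), fasta_path)

# Check that R can see the file; getwd() shows where relative paths start
stopifnot(file.exists(fasta_path))
working_dir <- getwd()

# Minimal single-record parser: drop the header line, join the rest,
# then split into one lower-case character per base
fasta_lines <- readLines(fasta_path)
seq_string <- paste(fasta_lines[!grepl("^>", fasta_lines)], collapse = "")
bases <- strsplit(tolower(seq_string), "")[[1]]

# Base counts and relative frequencies; fixed levels keep zero counts visible
base_counts <- table(factor(bases, levels = c("a", "c", "g", "t")))
base_freqs <- prop.table(base_counts)

# GC content: the fraction of positions that are g or c
gc_content <- mean(bases %in% c("g", "c"))

print(base_counts)
print(gc_content)
```

hist() expects numeric data, which is why it fails on a vector of letters; barplot(base_counts) is the usual way to draw categorical counts like these.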