read sequences from fasta file starting with > sign and untill next > sign
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hi: I am trying to read sequences from a fasta file starting with > till the next > sign: library(ShortRead) setwd("fastafolder"); con <- file("somefastafile.fa"); open(con) pattern <- as.character("TACC") while(length(res <- readLines(con, n=1))) { #do something } close(con) With this while statement I am able to read a single line from the fasta file each time. But I want to read a chunk of links each time from the fasta file starting with > sign and till the next > sign. Example >AAATTT TAGGCT ATTTGC >CGATTT And I want to read the following in the first run of while loop >AAATTT TAGGCT ATTTGC Thanks for your help. Regards: Jack -- output of sessionInfo(): > sessionInfo() R version 2.15.1 (2012-06-22) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] CHNOSZ_0.9-7 ShortRead_1.14.4 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.8.6 lattice_0.20-10 Biostrings_2.24.1 [8] GenomicRanges_1.8.13 IRanges_1.14.4 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.1 hwriter_1.3 stats4_2.15.1 tools_2.15.1 zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org.
• 753 views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States
On 09/14/2012 06:11 AM, Jack [guest] wrote: > > Hi: > > I am trying to read sequences from a fasta file starting with > till the next > sign: > > library(ShortRead) > setwd("fastafolder"); > con <- file("somefastafile.fa"); > open(con) > pattern <- as.character("TACC") > while(length(res <- readLines(con, n=1))) > { > #do something > } > close(con) > > With this while statement I am able to read a single line from the fasta file each time. But I want to read a chunk of links each time from the fasta file starting with > sign and till the next > sign. > > Example >> AAATTT > TAGGCT > ATTTGC >> CGATTT > > And I want to read the following in the first run of while loop >> AAATTT > TAGGCT > ATTTGC > > Thanks for your help. Hi Jack -- in R it's not generally useful to operate one record at a time, so think about reading many records (a million?) in and then processing as a vector. You could use FaFile in Rsamtools fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa") ## create an index, if neceessary indexFa(fl) fa <- FaFile(fl) hdr <- scanFaIndex(fa) ## record-at-a-time open(fa) for (i in seq_along(hdr)) scanFa(fa, hdr[i]) close(fa) ## groups, e.g., 10 idx <- seq_along(hdr) grps <- split(idx, cut(idx, 10, labels=FALSE)) open(fa) for (grp in grps) scanFa(fa, hdr[grp]) close(fa) or role your own fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa") con <- file(fl) open(con) buf <- character() while (length(inpt <- readLines(con, 10))) { lastidx <- tail(grep("^>", inpt), 1) complete <- c(buf, inpt[seq(1, lastidx - 1L)]) buf <- inpt[seq(lastidx, length(inpt))] if (length(complete)) { ## 'complete' contains complete record(s) } } ## 'buf' might contain some final, complete, records close(con) Martin > > Regards: > Jack > > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] CHNOSZ_0.9-7 ShortRead_1.14.4 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.8.6 lattice_0.20-10 Biostrings_2.24.1 > [8] GenomicRanges_1.8.13 IRanges_1.14.4 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.1 hwriter_1.3 stats4_2.15.1 tools_2.15.1 zlibbioc_1.2.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6