Question

sorting FASTQ file by ID

Ramzi TEMANNI ▴ 160

@ramzi-temanni-3819

Last seen 11.4 years ago

Hi everyone,

I have paired-end data in FASTQ format where the forward and reverse files have different numbers of reads and are not ordered by read id. I wrote the following code to mate the reads, but it seems that *srsort* does not sort by id. Could anyone tell me which function I should use instead, and whether there is any way to tune the code, as the FASTQ files to process are around 6 GB each? I'm working on a server with 16 cores and 16 GB of RAM.

    overlapingreads <- function(m1.filename, m2.filename)
    {
      fastq.m1 <- readFastq(m1.filename)  # read forward fq file
      fastq.m2 <- readFastq(m2.filename)  # read reverse fq file

      # example id: HWI-EA332_0007_FC622U7:6:1:2761:1100#0/2
      # extract tile and coordinates as key for matching forward and reverse reads
      id1 = subseq(id(fastq.m1), 26, nchar(id(fastq.m1)) - 4)
      id2 = subseq(id(fastq.m2), 26, nchar(id(fastq.m2)) - 4)

      cid = sort(intersect(id1, id2))

      tmp1 = srsort(fastq.m1[id1 %in% cid])
      tmp2 = srsort(fastq.m2[id2 %in% cid])

      writeFastq(tmp1, paste("sorted_", m1.filename, sep = ""))
      writeFastq(tmp2, paste("sorted_", m2.filename, sep = ""))
    }

Thanks in advance for your help and comments.

Regards,
Ramzi

    > sessionInfo()
    R version 2.12.1 (2010-12-16)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.utf8        LC_NUMERIC=C
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.utf8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.utf8
     [7] LC_PAPER=en_US.utf8        LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.utf8  LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] ShortRead_1.8.2   Rsamtools_1.2.2     lattice_0.19-17
    [4] Biostrings_2.18.2 GenomicRanges_1.2.2 IRanges_1.8.8

    loaded via a namespace (and not attached):
    [1] Biobase_2.10.0 grid_2.12.1 hwriter_1.3 tools_2.12.1
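To make the mating step concrete, here is a minimal base-R sketch of ordering two id vectors by a shared key with match(). It is an illustration under assumptions, not a tested replacement for the function above: the toy ids are made up, pair_key() is a hypothetical helper that simply strips a trailing /1 or /2 (a different key from the subseq() call in the question), and ids are assumed to be unique within each file. As far as I understand, srsort on a ShortReadQ object orders by the read sequences rather than by id, which would explain the behaviour described.

```r
# Hypothetical helper: drop the trailing mate suffix ("/1" or "/2") from read ids
pair_key <- function(ids) sub("/[12]$", "", ids)

# Toy ids standing in for as.character(id(fastq.m1)) and as.character(id(fastq.m2))
ids1 <- c("HWI:6:1:2761:1100#0/1", "HWI:6:1:1500:0900#0/1", "HWI:6:1:3000:2000#0/1")
ids2 <- c("HWI:6:1:3000:2000#0/2", "HWI:6:1:2761:1100#0/2", "HWI:6:1:9999:9999#0/2")

key1 <- pair_key(ids1)
key2 <- pair_key(ids2)

# Keys present in both files, in sorted order
common <- sort(intersect(key1, key2))

# match() gives, for each common key, its position in each file;
# indexing with these vectors puts both files in the same order
# (assumes each key occurs at most once per file)
idx1 <- match(common, key1)
idx2 <- match(common, key2)

paired1 <- ids1[idx1]
paired2 <- ids2[idx2]

# With ShortRead objects the same index vectors would be applied as
# fastq.m1[idx1] and fastq.m2[idx2] before calling writeFastq()
```

The point of using match() instead of %in% followed by a sort is that the index vectors both filter and reorder, so the two outputs line up read for read without relying on any sorting method of the FASTQ object itself.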

PROcess • 2.5k views

15.0 years ago Ramzi TEMANNI ▴ 160