Ramzi TEMANNI
Hi everyone,
I have paired-end data in FASTQ format where the forward and reverse files
have different numbers of reads and are not ordered by read id.
I wrote the following code to mate the reads, but it seems that *srsort*
does not sort by id.
Could anyone tell me which function to use instead, and whether there is any
way to tune the code, as the FASTQ files to process are around 6 GB each? I'm
working on a 16-core / 16 GB server.
library(ShortRead)

overlapingreads <- function(m1.filename, m2.filename)
{
    fastq.m1 <- readFastq(m1.filename)  # read forward fq file
    fastq.m2 <- readFastq(m2.filename)  # read reverse fq file
    # example id: HWI-EA332_0007_FC622U7:6:1:2761:1100#0/2
    # extract tile and coordinates as key for matching forward and reverse reads
    id1 <- subseq(id(fastq.m1), 26, nchar(id(fastq.m1)) - 4)
    id2 <- subseq(id(fastq.m2), 26, nchar(id(fastq.m2)) - 4)
    # keys present in both files
    cid <- sort(intersect(id1, id2))
    # keep only the mated reads, then sort them
    tmp1 <- srsort(fastq.m1[id1 %in% cid])
    tmp2 <- srsort(fastq.m2[id2 %in% cid])
    writeFastq(tmp1, paste("sorted_", m1.filename, sep = ""))
    writeFastq(tmp2, paste("sorted_", m2.filename, sep = ""))
}
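For reference, the ordering I am after can be sketched on plain character vectors. As far as I understand, srsort orders reads by their sequence rather than by id, so this sketch uses match() on the extracted keys instead. The ids below are made up, and read_key is a hypothetical helper, not a ShortRead function; it assumes every id looks like the example in the comment above.

```r
# Made-up ids in the same layout as HWI-EA332_0007_FC622U7:6:1:2761:1100#0/2
ids1 <- c("HWI-EA332_0007_FC622U7:6:1:2761:1100#0/1",
          "HWI-EA332_0007_FC622U7:6:1:1500:2000#0/1",
          "HWI-EA332_0007_FC622U7:6:1:3000:4000#0/1")
ids2 <- c("HWI-EA332_0007_FC622U7:6:1:9999:1234#0/2",
          "HWI-EA332_0007_FC622U7:6:1:1500:2000#0/2",
          "HWI-EA332_0007_FC622U7:6:1:2761:1100#0/2")

# Hypothetical helper: drop the "instrument:lane:" prefix and the "#0/1" suffix,
# leaving tile:x:y as the key shared by both mates.
read_key <- function(ids) sub("^[^:]+:[^:]+:(.*)#.*$", "\\1", ids)

key1 <- read_key(ids1)
key2 <- read_key(ids2)

# Keys found in both files, in one fixed order
common <- sort(intersect(key1, key2))

# Positions of those keys in each file; indexing with them drops
# unmated reads and puts both files in the same order.
idx1 <- match(common, key1)
idx2 <- match(common, key2)
```

With the real data the keys would presumably come from as.character(id(fastq.m1)), and the reads would be subset as fastq.m1[idx1] and fastq.m2[idx2] before writeFastq, but I have not verified that on files of this size.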
Thanks in advance for your help and comments
Regards,
Ramzi
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.8.2 Rsamtools_1.2.2 lattice_0.19-17
[4] Biostrings_2.18.2 GenomicRanges_1.2.2 IRanges_1.8.8
loaded via a namespace (and not attached):
[1] Biobase_2.10.0 grid_2.12.1 hwriter_1.3 tools_2.12.1
>