Deleting object rows while looping

Deleting object rows while looping - II

0

Entering edit mode

Daniel.Berner@unibas.ch ▴ 90

@danielbernerunibasch-4268

Last seen 3.2 years ago

Martin, Steve Thanks for your suggestions, much appreciated! My core problem is that the operations inside the loop are too complex for me to allow implementing any function() and apply() combination (there are multiple interleaved if() lines etc). and I was not aware of the data.table package, but this also looks hard to me, given the many steps performed within the loop. to be more specific, I want to reduce a number of Illumina short reads at genomic loci to diploid or haploid (depending on coverage) consensus genotypes, by evaluating several test criteria. Below I am copying the full code. the input is an alignment in BAM format. Any suggestion to make this faster is most welcome, as the code takes more than a week to run for each individual, and I have to process 96 of them (I have multicore, but still)! It seems to me that the rate-limiting step is the subsetting; the other manipulations are fast enough to work in such a primitive loop. cheers, Daniel Berner Zoological Institute University of Basel ######################### library(Rsamtools) rm(list=ls()) M<-"GAAGC" ### specify MID (individual) lib<-25 ### specify library # genotyping settings: foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2 het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2) # infile upload: #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="") param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute # extract the relevant data: chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character (f$rname[grep("-",f$strand,invert=F)])) posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$stran d,invert=F)]) seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,in vert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)])) chrpos<-paste(chr, posi, sep=' ') u.chrpos<-sort(unique(chrpos)) length(posi) #total number of alignments length(u.chrpos) #this gives the number of unique loci length(posi)/length(u.chrpos) #mean coverage per locus c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage # derive the data object to process: dat<-data.frame(cbind(chrpos, seqs)) # clean the environment and release memory: rm(f, chr, posi, seqs, chrpos) #no longer used gc() # process the dat object while writing to 'cons' # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq cons<-NULL for (i in 1:length(u.chrpos)){ stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2])) stack.le<-length(stack) if(stack.le<=c.thr){ #monomorphic: if (length(unique(stack))==1){ #well covered: if(stack.le>=sum.1.2){ gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ") gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ") cons<-rbind(cons, gtp.X, gtp.Y) } #poorly covered: if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ") cons<-rbind(cons, gtp.X) } } #polymorphic: if(length(unique(stack))>1){ cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] #no repeat issue: if((sum(cnts)/stack.le)>prop.1.2){ #heterozygote: if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ") gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ") cons<-rbind(cons, gtp.X, gtp.Y) } #homozygote (minor variant too rare, artifact): if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ") gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ") cons<-rbind(cons, gtp.X, gtp.Y) } #coverage poor, gtp unclear, haploid call: if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ") cons<-rbind(cons, gtp.X) } } } } } write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file ### END [[alternative HTML version deleted]]

Coverage Alignment PROcess Coverage Alignment PROcess • 1.5k views

ADD COMMENT • link updated 11.0 years ago by Martin Morgan 25k • written 11.0 years ago by Daniel.Berner@unibas.ch ▴ 90

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, This is, unfortunately, a bit too much code for me to digest right now, but a few points. (1) You are building an ever growing "something" that you are storing in cons: you start with `cons <- NULL` then are reconstructing it w/in every iteration w/in the loop, ie you have `cons <- c(cons, something) This is going to kill you -- can you set `cons` to be a "whatever" of the length you are trying to build? (it looks like it will be of length (or nrow) `length(u.chrpos)` (2) You create a `stack` object in each loop iteration that is some DNAStringSet of consisting of all the sequences at the chromosome -- but then you don't use it as a DNAStringSet ... in fact you revert it to a character (as.character(stack)) often -- don't make a `stack` DNAStringSet to begin with. (3) It seems you are outputting a character vector per row that you then have to parse -- why not just output a data.frame with the relevant columns that you will eventually use. (4) Here is a shell of a script you would use if you wanted to know how many reads and how many of them are unique start at each locus: library(data.table) ## ... your init stuff up to building `dat`, which is where data.table ## will take over dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) result <- dat[, { ## .N is simply the number of rows in this group -- read the data.table docs. ## the `seqs` column is injected into the scope evaluated here ## so I can manipulate the sequences within this chr/pos group directly list(n.reads=.N, n.unique=length(unique(seqs))) }, by=c('chr', 'pos')] Now look at `result`, you will have a data.table with as many rows as there are unique chr,pos pairs, and you will have counted the number of reads, and number of unique reads at each locus. (5) Hopefully the speed with which (4) completes will encourage you to take the time to read through the data.table documentation to apply it to your problem. It will take me more time than I have to read through your code ... giving a small subset of your data and the expected output would be a lot more helpful for people trying to help you. Fortunately, you will end up largely swapping in the core of your for loop within the data.table `j` expression, eg: result <- dat[, { ## a large core of your for loop will go here ## don't make a DNAStringSet in here because you are not ## using it anyway! }, by=c('chr', 'pos')] Anyway, good luck with it. -steve On Mon, Apr 29, 2013 at 11:52 PM, Daniel Berner <daniel.berner at="" unibas.ch=""> wrote: > Martin, Steve > > Thanks for your suggestions, much appreciated! My core problem is that the > operations inside the loop are too complex for me to allow implementing any > function() and apply() combination (there are multiple interleaved if() lines > etc). and I was not aware of the data.table package, but this also looks > hard to me, given the many steps performed within the loop. > > to be more specific, I want to reduce a number of Illumina short reads at > genomic loci to diploid or haploid (depending on coverage) consensus > genotypes, by evaluating several test criteria. Below I am copying the full > code. the input is an alignment in BAM format. > > Any suggestion to make this faster is most welcome, as the code takes more > than a week to run for each individual, and I have to process 96 of them (I > have multicore, but still)! It seems to me that the rate-limiting step is the > subsetting; the other manipulations are fast enough to work in such a > primitive loop. > > cheers, > Daniel Berner > Zoological Institute > University of Basel > > ######################### > library(Rsamtools) > rm(list=ls()) > M<-"GAAGC" ### specify MID (individual) > lib<-25 ### specify library > > # genotyping settings: > foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats > prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats > sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2 > het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het > min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2) > > # infile upload: > #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs > folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs > infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path > outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path > #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="") > param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) > f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute > > # extract the relevant data: > chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.charact er(f$rname[grep("-",f$strand,invert=F)])) > posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$str and,invert=F)]) > seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand, invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)])) > chrpos<-paste(chr, posi, sep=' ') > u.chrpos<-sort(unique(chrpos)) > length(posi) #total number of alignments > length(u.chrpos) #this gives the number of unique loci > length(posi)/length(u.chrpos) #mean coverage per locus > c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage > > # derive the data object to process: > dat<-data.frame(cbind(chrpos, seqs)) > > # clean the environment and release memory: > rm(f, chr, posi, seqs, chrpos) #no longer used > gc() > > # process the dat object while writing to 'cons' > # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq > cons<-NULL > for (i in 1:length(u.chrpos)){ > stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2])) > stack.le<-length(stack) > if(stack.le<=c.thr){ > #monomorphic: > if (length(unique(stack))==1){ > #well covered: > if(stack.le>=sum.1.2){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ") > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #poorly covered: > if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ") > cons<-rbind(cons, gtp.X) > } > } > #polymorphic: > if(length(unique(stack))>1){ > cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] > #no repeat issue: > if((sum(cnts)/stack.le)>prop.1.2){ > #heterozygote: > if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ") > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #homozygote (minor variant too rare, artifact): > if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote > gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ") > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #coverage poor, gtp unclear, haploid call: > if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ") > cons<-rbind(cons, gtp.X) > } > } > } > } > } > write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file > ### END > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech

ADD COMMENT • link 11.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi there, Yep, lot of good advice from Steve. I'll only add a few things: (1) Yes, as Steve said, building an ever growing 'cons' is not good. And since you don't seem to make any use of that 'cons' object in the end anyway, except sending it to a file with write.table(), why not send the little pieces to the file in each iteration by calling write.table() with append=TRUE, instead of sticking them to the ever growing 'cons' object? (2) Why not load your alignments in a GappedAlignments (renamed GAlignments) object instead of a data.frame? Then all the code you use before entering the big loop is simpler: library(Rsamtools) M <- "GAAGC" ### specify MID (individual) lib <- 25 ### specify library # genotyping settings: foldcov <- 3 prop.1.2 <- 0.7 sum.1.2 <- 15 het <- 1/3 min <- 2 # infile upload: folder <- 'C:\\Documents and Settings\\daniel\\Desktop\\temp' infile <- paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") outfile <- paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") param <- ScanBamParam(what="seq", tag="Z3", reverseComplement=TRUE) f <- readGappedAlignments(infile, param=param) Then, if you try to display the first 4 alignments, you'll see something like: > f[1:4] GappedAlignments with 4 alignments and 2 metadata columns: seqnames strand cigar qwidth start <rle> <rle> <character> <integer> <integer> [1] seq1 + 36M 36 1 [2] seq1 + 35M 35 3 [3] seq1 + 35M 35 5 [4] seq1 + 36M 36 6 end width ngap | <integer> <integer> <integer> | [1] 36 36 0 | [2] 37 35 0 | [3] 39 35 0 | [4] 41 36 0 | seq Z3 <dnastringset> <logical> [1] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <na> [2] CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <na> [3] AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC <na> [4] GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT <na> --- seqlengths: seq1 seq2 1575 1584 Looks good huh? ;-) # extract the relevant data: mcols(f)$seq <- narrow(mcols(f)$seq, start=6, end=94) Just to make sure: the above call to narrow() is trimming the first 5 bases so if you want to trim the last 5 bases and if your sequence is a 99-mer, you are good. But if it's a 100-mer, you would need to use end=95. chrpos <- paste(seqnames(f), start(f), sep=' ') u.chrpos <- sort(unique(chrpos)) length(f) #total number of alignments length(u.chrpos) #this gives the number of unique loci length(f)/length(u.chrpos) #mean coverage per locus Note that I would not call this the "mean coverage per locus". A given locus receives more coverage than just from the alignments that start exactly at that locus but also coverage from alignments that start before that locus and end at or after that locus. A more accurate way to get the "mean coverage per locus" would be to do something like: f_cvg <- coverage(f) cvg_at_loci <- lapply(names(f_cvg), function(seqname) { cvg <- f_cvg[[seqname]] cvg[start(f)[as.character(seqnames(f)) == seqname]]}) mean(do.call(c, cvg_at_loci)) Note that will give you a much higher "mean coverage per locus"! (3) The big loop: c.thr <- as.integer(length(f)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage # big loop: for (i in seq_along(u.chrpos)) { idx <- which(chrpos == u.chrpos[i]) stack <- mcols(f)$seq[idx] stack.le <- length(stack) if (stack.le <= c.thr) { ... } } OK so your loop is not doing anything at loci for which stack.le > c.thr, and also, by looking deeper in the code (not shown above) it seems that it won't do anything either at loci for which stack.le == 1 i.e. if there is only 1 read that starts at the loci. So maybe getting rid upfront of those loci where you're not doing anything would reduce substantially the nb of loci that you actually need to examine? This can easily be done with: # big loop: tmp <- Rle(sort(chrpos)) my.u.chrpos <- runValue(tmp)[runLength(tmp) <= c.thr & runLength(tmp) >= 2] for (i in seq_along(my.u.chrpos)) { idx <- which(chrpos == my.u.chrpos[i]) stack <- mcols(f)$seq[idx] ... } (4) Don't compute twice the same thing, especially if it's (potentially) an expensive operation: Not good: if (length(unique(stack))==1){ ... } if(length(unique(stack))>1){ ... } Better: u.stack <- unique(stack) if (length(u.stack) == 1) { ... } if (length(u.stack) > 1) { ... } (5) Put all this code in a function. Even if you only use it in 1 place. It's always a good exercise, and it almost always lead to better code. (6) Direct comparison of the query sequences of the reads that are aligned to the same locus is not always meaningful. For example it's not meaningful if the alignments are not all on the same strand, unless you use reverseComplement=FALSE (which is the default, maybe for a good reason). So you should not use reverseComplement=TRUE. Even if you use reverseComplement=FALSE, comparing the query sequences is still not meaningful if the alignments don't have the same cigar. But maybe you know that all your alignments have a simple cigar, e.g. "99M" or something like that? If not, then you would need to check that. Hope this helps, Cheers, H. On 04/30/2013 09:52 AM, Steve Lianoglou wrote: > Hi, > > This is, unfortunately, a bit too much code for me to digest right > now, but a few points. > > (1) You are building an ever growing "something" that you are storing > in cons: you start with `cons <- NULL` then are reconstructing it w/in > every iteration w/in the loop, ie you have `cons <- c(cons, something) > > This is going to kill you -- can you set `cons` to be a "whatever" of > the length you are trying to build? (it looks like it will be of > length (or nrow) `length(u.chrpos)` > > (2) You create a `stack` object in each loop iteration that is some > DNAStringSet of consisting of all the sequences at the chromosome -- > but then you don't use it as a DNAStringSet ... in fact you revert it > to a character (as.character(stack)) often -- don't make a `stack` > DNAStringSet to begin with. > > (3) It seems you are outputting a character vector per row that you > then have to parse -- why not just output a data.frame with the > relevant columns that you will eventually use. > > (4) Here is a shell of a script you would use if you wanted to know > how many reads and how many of them are unique start at each locus: > > library(data.table) > ## ... your init stuff up to building `dat`, which is where data.table > ## will take over > dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) > result <- dat[, { > ## .N is simply the number of rows in this group -- read the data.table docs. > ## the `seqs` column is injected into the scope evaluated here > ## so I can manipulate the sequences within this chr/pos group directly > list(n.reads=.N, n.unique=length(unique(seqs))) > }, by=c('chr', 'pos')] > > Now look at `result`, you will have a data.table with as many rows as > there are unique chr,pos pairs, and you will have counted the number > of reads, and number of unique reads at each locus. > > (5) Hopefully the speed with which (4) completes will encourage you to > take the time to read through the data.table documentation to apply it > to your problem. It will take me more time than I have to read through > your code ... giving a small subset of your data and the expected > output would be a lot more helpful for people trying to help you. > Fortunately, you will end up largely swapping in the core of your for > loop within the data.table `j` expression, eg: > > result <- dat[, { > ## a large core of your for loop will go here > ## don't make a DNAStringSet in here because you are not > ## using it anyway! > }, by=c('chr', 'pos')] > > Anyway, good luck with it. > > -steve > > > On Mon, Apr 29, 2013 at 11:52 PM, Daniel Berner <daniel.berner at="" unibas.ch=""> wrote: >> Martin, Steve >> >> Thanks for your suggestions, much appreciated! My core problem is that the >> operations inside the loop are too complex for me to allow implementing any >> function() and apply() combination (there are multiple interleaved if() lines >> etc). and I was not aware of the data.table package, but this also looks >> hard to me, given the many steps performed within the loop. >> >> to be more specific, I want to reduce a number of Illumina short reads at >> genomic loci to diploid or haploid (depending on coverage) consensus >> genotypes, by evaluating several test criteria. Below I am copying the full >> code. the input is an alignment in BAM format. >> >> Any suggestion to make this faster is most welcome, as the code takes more >> than a week to run for each individual, and I have to process 96 of them (I >> have multicore, but still)! It seems to me that the rate-limiting step is the >> subsetting; the other manipulations are fast enough to work in such a >> primitive loop. >> >> cheers, >> Daniel Berner >> Zoological Institute >> University of Basel >> >> ######################### >> library(Rsamtools) >> rm(list=ls()) >> M<-"GAAGC" ### specify MID (individual) >> lib<-25 ### specify library >> >> # genotyping settings: >> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats >> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats >> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2 >> het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het >> min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2) >> >> # infile upload: >> #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs >> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs >> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path >> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path >> #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="") >> param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) >> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute >> >> # extract the relevant data: >> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.charac ter(f$rname[grep("-",f$strand,invert=F)])) >> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$st rand,invert=F)]) >> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand ,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]) ) >> chrpos<-paste(chr, posi, sep=' ') >> u.chrpos<-sort(unique(chrpos)) >> length(posi) #total number of alignments >> length(u.chrpos) #this gives the number of unique loci >> length(posi)/length(u.chrpos) #mean coverage per locus >> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage >> >> # derive the data object to process: >> dat<-data.frame(cbind(chrpos, seqs)) >> >> # clean the environment and release memory: >> rm(f, chr, posi, seqs, chrpos) #no longer used >> gc() >> >> # process the dat object while writing to 'cons' >> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq >> cons<-NULL >> for (i in 1:length(u.chrpos)){ >> stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2])) >> stack.le<-length(stack) >> if(stack.le<=c.thr){ >> #monomorphic: >> if (length(unique(stack))==1){ >> #well covered: >> if(stack.le>=sum.1.2){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ") >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #poorly covered: >> if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ") >> cons<-rbind(cons, gtp.X) >> } >> } >> #polymorphic: >> if(length(unique(stack))>1){ >> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] >> #no repeat issue: >> if((sum(cnts)/stack.le)>prop.1.2){ >> #heterozygote: >> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ") >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #homozygote (minor variant too rare, artifact): >> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ") >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #coverage poor, gtp unclear, haploid call: >> if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ") >> cons<-rbind(cons, gtp.X) >> } >> } >> } >> } >> } >> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file >> ### END >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Computational Biologist > Department of Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.0 years ago Hervé Pagès 16k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

On 04/29/2013 11:52 PM, Daniel Berner wrote: > Martin, Steve > > Thanks for your suggestions, much appreciated! My core problem is that the > operations inside the loop are too complex for me to allow implementing any > function() and apply() combination (there are multiple interleaved if() lines > etc). and I was not aware of the data.table package, but this also looks > hard to me, given the many steps performed within the loop. > > to be more specific, I want to reduce a number of Illumina short reads at > genomic loci to diploid or haploid (depending on coverage) consensus > genotypes, by evaluating several test criteria. Below I am copying the full > code. the input is an alignment in BAM format. > > Any suggestion to make this faster is most welcome, as the code takes more > than a week to run for each individual, and I have to process 96 of them (I > have multicore, but still)! It seems to me that the rate-limiting step is the > subsetting; the other manipulations are fast enough to work in such a > primitive loop. > > cheers, > Daniel Berner > Zoological Institute > University of Basel > > ######################### > library(Rsamtools) > rm(list=ls()) > M<-"GAAGC" ### specify MID (individual) > lib<-25 ### specify library > > # genotyping settings: > foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats > prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats > sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2 > het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het > min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2) > > # infile upload: > #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs > folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs > infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path > outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path > #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="") > param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) > f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute > > # extract the relevant data: > chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.charact er(f$rname[grep("-",f$strand,invert=F)])) > posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$str and,invert=F)]) I was wondering if that 'f$tag' was a typo, it seems to break the symmetry of these commands...? > seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand, invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)])) > chrpos<-paste(chr, posi, sep=' ') > u.chrpos<-sort(unique(chrpos)) > length(posi) #total number of alignments > length(u.chrpos) #this gives the number of unique loci > length(posi)/length(u.chrpos) #mean coverage per locus > c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage > > # derive the data object to process: > dat<-data.frame(cbind(chrpos, seqs)) probably want data.frame(<...>, stringsAsFactors=FALSE) > > # clean the environment and release memory: > rm(f, chr, posi, seqs, chrpos) #no longer used > gc() > > # process the dat object while writing to 'cons' > # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq > cons<-NULL A variant of Herve's suggestion to use append=TRUE (his idea is better!) is con <- file(outfile, "w") which opens the file, and then instead of accumulating the results in memory you could, inside your loop, write.table(<intermediate result="">, con) and then when done remember to close(con) you'll need to make sure not to write headers in the write.table() statement. But I think this is mostly not relevant in your revised work flow... > for (i in 1:length(u.chrpos)){ My suggestion is a little complicated, but you're really interested in turning this loop into a vector operation. I think you can do this by sorting your data frame dat <- dat[order(dat$chrpos, dat$seq),] It's helpful to have a 'view' onto dat, where the view is a data.frame with indices marking the beginning and end of each 'chrpos' group. I created some helper functions ends <- function(x) c([-length(x)] != x[-1], TRUE) starts <- function(x) c(TRUE, ends(x)[-length(x)]) which creates a logical vector of the same length as x, with a TRUE whenever a run of identical chrpos start or end (something similar could be done with the rle() or Rle() functions). So my view is then view <- data.frame(start = which(starts(dat$chrpos)), end = which(ends(dat$chrpos))) > stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2])) > stack.le<-length(stack) > if(stack.le<=c.thr){ for this if statement we need to know how many distinct sequences are in each view. I did this by numbering the sequences sequentially seqid <- cumsum(starts(dat$seqs)) and then finding the difference between the id of the last and first sequence view$n_seq <- seqid[view$end] - seqid[view$start] + 1L > #monomorphic: > if (length(unique(stack))==1){ The 'view' onto the data representing the monomorphic locations is mono <- view[view$n_seq == 1, , drop=FALSE] > #well covered: > if(stack.le>=sum.1.2){ and the 'well-covered' locations are idx <- (mono$end - mono$start + 1L) >= sum.1.2 ok <- mono[idx, , drop=FALSE] > gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ") and you're after the information gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N", dat[ok$start, "seqs"]) (although I think you're really just adding columns and entries to 'view'...). To emphasize, this processes _all_ the monomorphic, well-covered sites, without any looping. And so on through the rest of your code. This type of operation is well-suited to data.table, though I'm not sure enough of the syntax and implementation to know whether Steve's dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) result <- dat[, { list(n.reads=.N, n.unique=length(unique(seqs))) }, by=c('chr', 'pos')] is implemented efficiently -- I'm sure the .N is; just not whether clever thinking is used behind the scenes to avoid looping through function(x) length(unique(x)). The syntax is certainly clearer than my 'view' approach. Martin > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #poorly covered: > if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ") > cons<-rbind(cons, gtp.X) > } > } > #polymorphic: > if(length(unique(stack))>1){ > cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] > #no repeat issue: > if((sum(cnts)/stack.le)>prop.1.2){ > #heterozygote: > if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ") > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #homozygote (minor variant too rare, artifact): > if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote > gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ") > gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ") > cons<-rbind(cons, gtp.X, gtp.Y) > } > #coverage poor, gtp unclear, haploid call: > if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ > gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ") > cons<-rbind(cons, gtp.X) > } > } > } > } > } > write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file > ### END > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.0 years ago Martin Morgan 25k

Login before adding your answer.