Question: Finding mismatches between reads and reference sequence
0
6.6 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:
Hi David, Sorry for the delay in answering this. Was too busy trying to get a few things done before the new BioC release. On 03/26/2013 05:50 AM, David Greber wrote: > Hello, > > How can I find the mismatching positions between a read (e.g. in a > "GappedAlignments" object) and the reference sequence (a "BSgenome" > object)? In general, I am looking for an operation that maps the read > sequence against the reference genome (taking cigar operation into account) > and compares the DNAString objects. > > I tried this, but due to the different cigar string operations, this seems > to become difficult for complex alignments. The "Rsamtools" package offers > with "cigarToIRangesListByAlignment", but does not take soft clips into > account. > > Is there some functionality in bioconductor for this? I assume that it is a > common task, but could not find anything like it. I don't think there is an easy way at the moment to find the mismatches between the reads and reference sequences. Below is some code that solves this problem granted that (1) you have a BSgenome object containing the reference sequences, and (2) there are no insertions in your alignments (i.e. no I's in the cigars). Like in the following example: library(pasillaBamSubset) library(Rsamtools) ## Query sequences are in the SEQ field in the BAM file. param <- ScanBamParam(what="seq", tag=c("MD", "NM")) reads <- readGappedAlignments(untreated1_chr4(), param=param) Then: > colSums(cigarOpTable(cigar(reads))) M I D N S H P 15326625 0 0 21682582 0 0 0 No insertions (I), no deletions (D). What's important for the following code to work is that you have no insertions. Having deletions, or soft clipping (S) or hard clipping (H) is OK. It would probably not be too hard to adapt the code below to actually support insertions, but that would make it a little bit more complicated. We'll proceed in 3 steps: (a) Extract the query sequences, clip them, and flip them if needed. (b) Extract the reference sequences from the BSgenome object. (c) Compare query sequences to reference sequences. (a) Extract the query sequences, clip them, and flip them if needed. qseqs <- mcols(reads)seq Clip them. Only soft-clipping needs to be performed. The query sequences stored in BAM file are already hard-clipped: softClip <- function(qseq, cigar) { if (length(qseq) != length(cigar)) stop("'qseq' and 'cigar' must have the same length") split_cigar <- splitCigar(cigar) no_clipping_idx <- unlist(unname(lapply(split_cigar, function(x) { x1 <- x[[1L]] c(rawToChar(x1[1L]), rawToChar(x1[length(x1)])) != "S" }))) clipping <- unlist(unname(lapply(split_cigar, function(x) { x2 <- x[[2L]] x2[c(1L, length(x2))] }))) clipping[no_clipping_idx] <- 0L left_clipping <- clipping[c(TRUE, FALSE)] right_clipping <- clipping[c(FALSE, TRUE)] narrow(qseq, start=left_clipping+1L, end=-right_clipping- 1L) } qseqs <- softClip(qseqs, cigar(reads)) Because the BAM format imposes that the read sequence stored in the SEQ field is reverse complemented when the read is aligned to the minus strand, we reverse complement it again to get it in the original orientation: is_on_minus <- as.logical(strand(reads) == "-") qseqs[is_on_minus] <- reverseComplement(qseqs[is_on_minus]) (b) Extract corresponding reference sequences from BSgenome object. library(BSgenome.Dmelanogaster.UCSC.dm3) A trick here is to extract the ranges of the reads in a GRangesList object (each read will typically produce 1 range or more), and to treat that GRangesList object as if it was representing exons grouped by transcript: grl <- grglist(reads, order.as.in.query=TRUE, drop.D.ranges=TRUE) library(GenomicFeatures) ## Warning can be ignored. rseqs <- extractTranscriptsFromGenome(Dmelanogaster, grl) (c) Compare 'qseqs' and 'rseqs'. At this point 'qseqs' and 'rseqs' should have the same shape (i.e. same length and widths). Furthermore, if there was no mismatch in any of the alignments, 'qseqs' and 'rseqs' would contain exactly the same sequences. The mismatches can be found with: mismatches <- function(x, y) { if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet")) stop("'x' and 'y' must be DNAStringSet objects") x_width <- width(x) y_width <- width(y) if (!identical(x_width, y_width)) stop("'x' and 'y' must have the same shape ", "(i.e. same length and widths)") unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y))) breakpoints <- cumsum(x_width) ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L, breakpoints) + 1L, nbins=length(x)) skeleton <- PartitioningByEnd(cumsum(ans_eltlens)) ans <- relist(unlisted_ans, skeleton) offsets <- c(0L, breakpoints[-length(breakpoints)]) ans <- ans - offsets ans } mm <- mismatches(qseqs, rseqs) The result is an IntegerList: > mm IntegerList of length 204355 [[1]] 27 [[2]] 9 13 54 [[3]] 13 45 [[4]] 48 [[5]] 37 57 [[6]] 56 [[7]] integer(0) [[8]] integer(0) [[9]] 35 73 [[10]] 23 41 48 ... <204345 more elements> Nb of mismatches per alignment: elementLengths(mm) In the case of pasillaBamSubset, since there are not indels in the alignments, this should be identical to the edit distance reported in the NM tag of the BAM file: stopifnot(identical(elementLengths(mm), mcols(reads)NM)) Unfortunately, using 'mm' to subset 'qseqs' or 'rseqs' is not supported at the moment (we should add it). Here is a temporary workaround: subsetByIntegerList <- function(x, i) { breakpoints <- cumsum(width(x)) offsets <- c(0L, breakpoints[-length(breakpoints)]) i <- i + offsets unlisted_ans <- unlist(x)[unlist(i)] as(successiveViews(unlisted_ans, elementLengths(i)), class(x)) } Subsetting 'qseqs' and 'rseqs' produces 2 DNAStringSet objects with the shape of 'mm': > subsetByIntegerList(qseqs, mm) A DNAStringSet instance of length 204355 width seq [1] 1 G [2] 3 GAG [3] 2 AG [4] 1 A [5] 2 CA ... ... ... [204351] 1 A [204352] 1 A [204353] 2 AG [204354] 0 [204355] 0 > subsetByIntegerList(rseqs, mm) A DNAStringSet instance of length 204355 width seq [1] 1 A [2] 3 TGA [3] 2 GC [4] 1 G [5] 2 AC ... ... ... [204351] 1 G [204352] 1 G [204353] 2 GT [204354] 0 [204355] 0 Lightly tested only. Let me know if you run into problems. Lot of the above stuff will be added soon in one form or another to the basic infrastructure. Maybe a mismatches(x, y) generic with methods for x=GappedAlignments,y=BSgenome ; x=DNAStringSet,y=DNAStringSet ; and other useful combinations... Cheers, H. > > Cheers > David > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
infrastructure cancer bsgenome • 2.0k views
modified 6.6 years ago by David Greber30 • written 6.6 years ago by Hervé Pagès ♦♦ 14k
0
6.6 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:
On Tue, Mar 26, 2013 at 8:50 AM, David Greber <dav.greber at="" gmail.com=""> wrote: > Hello, > > How can I find the mismatching positions between a read (e.g. in a > "GappedAlignments" object) and the reference sequence (a "BSgenome" > object)? In general, I am looking for an operation that maps the read > sequence against the reference genome (taking cigar operation into account) > and compares the DNAString objects. > > I tried this, but due to the different cigar string operations, this seems > to become difficult for complex alignments. The "Rsamtools" package offers > with "cigarToIRangesListByAlignment", but does not take soft clips into > account. > > Is there some functionality in bioconductor for this? I assume that it is a > common task, but could not find anything like it. Hi, David. This doesn't answer your question directly, but you may want to look at pileup functionality in Rsamtools. In particular, check out the applyPileups() function. Sean > Cheers > David > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Sean, Thanks for the hint. In my case, the pileup functions do not help much, as I'm interested in finding the mismatches per read/alignment, whereas the pileup functions only return the mismatches by positions. Best David 2013/4/5 Sean Davis <sdavis2@mail.nih.gov> > On Tue, Mar 26, 2013 at 8:50 AM, David Greber <dav.greber@gmail.com> > wrote: > > Hello, > > > > How can I find the mismatching positions between a read (e.g. in a > > "GappedAlignments" object) and the reference sequence (a "BSgenome" > > object)? In general, I am looking for an operation that maps the read > > sequence against the reference genome (taking cigar operation into > account) > > and compares the DNAString objects. > > > > I tried this, but due to the different cigar string operations, this > seems > > to become difficult for complex alignments. The "Rsamtools" package > offers > > with "cigarToIRangesListByAlignment", but does not take soft clips into > > account. > > > > Is there some functionality in bioconductor for this? I assume that it > is a > > common task, but could not find anything like it. > > Hi, David. > > This doesn't answer your question directly, but you may want to look > at pileup functionality in Rsamtools. In particular, check out the > applyPileups() function. > > Sean > > > > Cheers > > David > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
On Fri, Apr 5, 2013 at 5:43 AM, David Greber <dav.greber at="" gmail.com=""> wrote: > Hi Sean, > > Thanks for the hint. In my case, the pileup functions do not help much, as > I'm interested in finding the mismatches per read/alignment, whereas the > pileup functions only return the mismatches by positions. In that case, you might want to use the MD tag rather than the CIGAR string. If your aligner does not generate an MD tag, you could use samtools calmd to generate one and then read the MD tag with the reads. From the SAM format specification: "The MD eld aims to achieve SNP/indel calling without looking at the reference. For example, a string `10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is di erent from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD eld ought to match the CIGAR string" Sean > Best > David > > > 2013/4/5 Sean Davis <sdavis2 at="" mail.nih.gov=""> > >> On Tue, Mar 26, 2013 at 8:50 AM, David Greber <dav.greber at="" gmail.com=""> >> wrote: >> > Hello, >> > >> > How can I find the mismatching positions between a read (e.g. in a >> > "GappedAlignments" object) and the reference sequence (a "BSgenome" >> > object)? In general, I am looking for an operation that maps the read >> > sequence against the reference genome (taking cigar operation into >> account) >> > and compares the DNAString objects. >> > >> > I tried this, but due to the different cigar string operations, this >> seems >> > to become difficult for complex alignments. The "Rsamtools" package >> offers >> > with "cigarToIRangesListByAlignment", but does not take soft clips into >> > account. >> > >> > Is there some functionality in bioconductor for this? I assume that it >> is a >> > common task, but could not find anything like it. >> >> Hi, David. >> >> This doesn't answer your question directly, but you may want to look >> at pileup functionality in Rsamtools. In particular, check out the >> applyPileups() function. >> >> Sean >> >> >> > Cheers >> > David >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
0
6.6 years ago by
David Greber30
David Greber30 wrote:
Hi Herve, Thank you for the nice and comprehensive example. I will check it out and will try to incorporate also insertions. Best David 2013/4/5 Hervé Pagès <hpages@fhcrc.org> > Hi David, > > Sorry for the delay in answering this. Was too busy trying to get > a few things done before the new BioC release. > > > On 03/26/2013 05:50 AM, David Greber wrote: > >> Hello, >> >> How can I find the mismatching positions between a read (e.g. in a >> "GappedAlignments" object) and the reference sequence (a "BSgenome" >> object)? In general, I am looking for an operation that maps the read >> sequence against the reference genome (taking cigar operation into >> account) >> and compares the DNAString objects. >> >> I tried this, but due to the different cigar string operations, this seems >> to become difficult for complex alignments. The "Rsamtools" package offers >> with "**cigarToIRangesListByAlignment"**, but does not take soft clips >> into >> account. >> >> Is there some functionality in bioconductor for this? I assume that it is >> a >> common task, but could not find anything like it. >> > > I don't think there is an easy way at the moment to find the mismatches > between the reads and reference sequences. > > Below is some code that solves this problem granted that (1) you have > a BSgenome object containing the reference sequences, and (2) there > are no insertions in your alignments (i.e. no I's in the cigars). > Like in the following example: > > library(pasillaBamSubset) > library(Rsamtools) > > ## Query sequences are in the SEQ field in the BAM file. > param <- ScanBamParam(what="seq", tag=c("MD", "NM")) > reads <- readGappedAlignments(**untreated1_chr4(), param=param) > > Then: > > > colSums(cigarOpTable(cigar(**reads))) > M I D N S H P > 15326625 0 0 21682582 0 0 0 > > No insertions (I), no deletions (D). What's important for the following > code to work is that you have no insertions. Having deletions, or soft > clipping (S) or hard clipping (H) is OK. > > It would probably not be too hard to adapt the code below to actually > support insertions, but that would make it a little bit more complicated. > > We'll proceed in 3 steps: > (a) Extract the query sequences, clip them, and flip them if needed. > (b) Extract the reference sequences from the BSgenome object. > (c) Compare query sequences to reference sequences. > > > (a) Extract the query sequences, clip them, and flip them if needed. > > qseqs <- mcols(reads)seq > > Clip them. Only soft-clipping needs to be performed. The > query sequences stored in BAM file are already hard-clipped: > > softClip <- function(qseq, cigar) > { > if (length(qseq) != length(cigar)) > stop("'qseq' and 'cigar' must have the same length") > split_cigar <- splitCigar(cigar) > no_clipping_idx <- unlist(unname(lapply(split_**cigar, > function(x) { > x1 <- x[[1L]] > c(rawToChar(x1[1L]), > rawToChar(x1[length(x1)])) != "S" > }))) > clipping <- unlist(unname(lapply(split_**cigar, > function(x) { > x2 <- x[[2L]] > x2[c(1L, length(x2))] > }))) > clipping[no_clipping_idx] <- 0L > left_clipping <- clipping[c(TRUE, FALSE)] > right_clipping <- clipping[c(FALSE, TRUE)] > narrow(qseq, start=left_clipping+1L, end=-right_clipping- 1L) > } > > qseqs <- softClip(qseqs, cigar(reads)) > > Because the BAM format imposes that the read sequence stored in > the SEQ field is reverse complemented when the read is aligned > to the minus strand, we reverse complement it again to get it in > the original orientation: > > is_on_minus <- as.logical(strand(reads) == "-") > qseqs[is_on_minus] <- reverseComplement(qseqs[is_on_**minus]) > > (b) Extract corresponding reference sequences from BSgenome object. > > library(BSgenome.**Dmelanogaster.UCSC.dm3) > > A trick here is to extract the ranges of the reads in a GRangesList > object (each read will typically produce 1 range or more), and to > treat that GRangesList object as if it was representing exons > grouped by transcript: > > grl <- grglist(reads, order.as.in.query=TRUE, drop.D.ranges=TRUE) > > library(GenomicFeatures) > ## Warning can be ignored. > rseqs <- extractTranscriptsFromGenome(**Dmelanogaster, grl) > > (c) Compare 'qseqs' and 'rseqs'. > > At this point 'qseqs' and 'rseqs' should have the same shape (i.e. > same length and widths). Furthermore, if there was no mismatch > in any of the alignments, 'qseqs' and 'rseqs' would contain > exactly the same sequences. The mismatches can be found with: > > mismatches <- function(x, y) > { > if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet")) > stop("'x' and 'y' must be DNAStringSet objects") > x_width <- width(x) > y_width <- width(y) > if (!identical(x_width, y_width)) > stop("'x' and 'y' must have the same shape ", > "(i.e. same length and widths)") > unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y))) > breakpoints <- cumsum(x_width) > ans_eltlens <- tabulate(findInterval(**unlisted_ans - 1L, > breakpoints) + 1L, > nbins=length(x)) > skeleton <- PartitioningByEnd(cumsum(ans_**eltlens)) > ans <- relist(unlisted_ans, skeleton) > offsets <- c(0L, breakpoints[-length(**breakpoints)]) > ans <- ans - offsets > ans > } > > mm <- mismatches(qseqs, rseqs) > > The result is an IntegerList: > > > mm > IntegerList of length 204355 > [[1]] 27 > [[2]] 9 13 54 > [[3]] 13 45 > [[4]] 48 > [[5]] 37 57 > [[6]] 56 > [[7]] integer(0) > [[8]] integer(0) > [[9]] 35 73 > [[10]] 23 41 48 > ... > <204345 more elements> > > Nb of mismatches per alignment: > > elementLengths(mm) > > In the case of pasillaBamSubset, since there are not indels in the > alignments, this should be identical to the edit distance reported > in the NM tag of the BAM file: > > stopifnot(identical(**elementLengths(mm), mcols(reads)NM)) > > Unfortunately, using 'mm' to subset 'qseqs' or 'rseqs' is not > supported at the moment (we should add it). Here is a temporary > workaround: > > subsetByIntegerList <- function(x, i) > { > breakpoints <- cumsum(width(x)) > offsets <- c(0L, breakpoints[-length(**breakpoints)]) > i <- i + offsets > unlisted_ans <- unlist(x)[unlist(i)] > as(successiveViews(unlisted_**ans, elementLengths(i)), class(x)) > } > > Subsetting 'qseqs' and 'rseqs' produces 2 DNAStringSet objects > with the shape of 'mm': > > > subsetByIntegerList(qseqs, mm) > A DNAStringSet instance of length 204355 > width seq > [1] 1 G > [2] 3 GAG > [3] 2 AG > [4] 1 A > [5] 2 CA > ... ... ... > [204351] 1 A > [204352] 1 A > [204353] 2 AG > [204354] 0 > [204355] 0 > > > subsetByIntegerList(rseqs, mm) > A DNAStringSet instance of length 204355 > width seq > [1] 1 A > [2] 3 TGA > [3] 2 GC > [4] 1 G > [5] 2 AC > ... ... ... > [204351] 1 G > [204352] 1 G > [204353] 2 GT > [204354] 0 > [204355] 0 > > Lightly tested only. Let me know if you run into problems. > > Lot of the above stuff will be added soon in one form or another to > the basic infrastructure. Maybe a mismatches(x, y) generic with methods > for x=GappedAlignments,y=BSgenome ; x=DNAStringSet,y=DNAStringSet ; and > other useful combinations... > > Cheers, > H. > > > >> Cheers >> David >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
0
6.6 years ago by
Paul Leo970
Paul Leo970 wrote:
HI David, Probably Herve has answered your question... But about a 1 1/2 ago I wrote some code that : Given a genomic location(s) (a SNP) and some BAM file(s) uses Rsamtools (to get the cigar and the reads in that location) and functions like cigarToQWidth cigarQNarrow to get the bases in each read that spans that location . Is that what you want? The code is truly horrible cause of soft-clipping was not accounted for in the cigar utilities in the way I would have expected , but it seems to test ok, least it seems to a 1 1/2 ago... The dump of that code is below .... if you clean it up or if Herves' does the same thing better let me know! , it's something I've been meaning to fix up for a while (this bit also filters reads based on Phred sore.... Like I say it's really really rough!...with all my development comments... Cheers Paul ## data.gr<-GRanges(seqnames =paste("chr",3,sep=""),ranges = IRanges(start=as.numeric(97851689),end=as.numeric(97851689)),strand="+ ") which<- data.gr ## which<- data.gr[1,] #which ## bam.dir<-"/media/Bioinform-D/Research/Color_space/Pathology 01" ## bam.file<-"tumor_all.bam" ## bam.file setwd(bam.dir) files<-dir(getwd()) files[grepl("bam",files)] ## files[match(bam.file,files)] ### bam file must exist ## scanBamWhat() ## scanBamFlag() params<-ScanBamParam(which=which,flag=scanBamFlag(isUnmappedQuery=FALS E,isDuplicate=NA,isValidVendorRead=TRUE),simpleCigar = FALSE,reverseComplement = FALSE,what=c("qname","flag","rname","seq","s trand","pos","qwidth","cigar","qual","mapq") ) ### NOTE isValidVendorRead=FALSE shoudl be TRUE params which[1:5] ######################## Phred scores DECODED are ascii 0-93 str<-as.character(PhredQuality(0:93)) str<-unlist(strsplit(str,split = character(0))) score<-0:93 names(score)<-str ## score ## the.qual<-unlist(strsplit(as.character(aln1[[1]]$qual[7]),split = character(0))) ## the.qual ## score[the.qual] # provides the quality ## library("ChIPsim") ### needed for decodeQuality ## -10*log10(decodeQuality(aln1[[1]]$qual[7], type = c("Sanger"))) ##################################################### for(ii in 1:length(sample.bams)){ aln1 <- scanBam(sample.bams[ii],param=params) ## aln1 lapply(aln1,function(x) length(x$seq)) regions<-names(aln1) regions<-strsplit(gsub("-",":",regions),split=":") ## regions print(paste("Doing",sample.bams[ii],sep=" ")) print(paste("Nunmber regions:",length(regions),sep=" ")) base.qual.filter<-15 # bases with a Phred less than 20 are ignored 20-99% 10-90% ok 15 appears to be about what GATK is using read.qual.filter<-20 # mapQ takes into account the paired read perhaps about 20 major.allele<-{} minor.allele<-{} major.allele.count<-{} minor.allele.count<-{} alleles<-{} alleles.unfiltered<-{} alleles.freq<-{} coverage<-{} for (i in 1:length(regions)){ loc.wanted<-as.numeric(regions[[i]][3]) # loc.wanted<-101401645 posns<-loc.wanted-aln1[[i]]$pos+1 if(length(posns)==0) {next} # no reads - i=160 ii=1 poor.read<- aln1[[i]]$mapq <= read.qual.filter # sort(aln1[[i]]$mapq,decreasing=TRUE) ######################################## map to location in read ## posns ## test<-cigarOpTable(aln1[[i]]$cigar) ## print(paste("test no deletions:",i,":",sum(apply(test[,c("I","D","N","H","P")],2,sum))),sep ="") shift<-rep(0,times=length(posns)) length<-cigarToQWidth(aln1[[i]]$cigar) ## length<-cigarToQWidth(aln1[[i]]$cigar, before.hard.clipping=TRUE) not tested true.length<-cigarToWidth(aln1[[i]]$cigar) true.length[true.length>length]<-length[true.length>length] ### problem is that softclips appear in the sequence but are not accounted for the cigar utilities ### hardclip are removed from the sequence but are also caught in the work flow here soft.has.S<-grepl("S",aln1[[i]]$cigar) ## cigar contains any S read.has.H<-grepl("H",aln1[[i]]$cigar) ## Hard clips are beginning are the only problem ## read.has.H.at.end<-grepl("H$",aln1[[i]]$cigar) ## Hard clips are beginning are the only problem soft.at.end<-grepl("S$",aln1[[i]]$cigar) shift[soft.has.S & !soft.at.end]<-(length-true.length)[soft.has.S & ! soft.at.end] # aln1[[i]]$cigar[soft.has.S & !soft.at.end] ##### count the hard clip and remove them the shift.other only ones with hardclips enter (length-true.length)[soft.has.S & !soft.at.end][9] ## if(sum( read.has.H)>0){ (soft.has.S & !soft.at.end)[9] ## hard.clips<-aln1[[i]]$cigar[read.has.H] # hard.clips<-c("4H20M","5H30M","4H20M2D2H","20M4H") ## splitCigars <- splitCigar(hard.clips) ## cigar.labels <- unname(lapply(splitCigars, "[[", 1L)) ## cigar.labels<- lapply(cigar.labels,function(x) grep("H",rawToChar(x,multiple=TRUE))) ## splitLengths <- unname(lapply(splitCigars, "[[", 2L)) ## for(ih in 1:length(cigar.labels)){ ## hard.clips[ih]<-as.integer(splitLengths[[ih]][1]) ## firt one is H unlees just a H at the end or 2 H's ## if( length(cigar.labels[[ih]]) <2 & (cigar.labels[[ih]][1] ! =1)){ hard.clips[ih]<-0} # hard clip is just at end ## }} ############################ finish processing hardclips posns[posns>=length]<-length[posns>=length] ## see path tumor "chr10" "126727619" "126727619" deletionx check also 9 139994187 139994187 ## shift<-(length-true.length) ## shift[soft.at.end]<-0 cigarQNarrow(aln1[[i]]$cigar[1],start=50,width=1) ## shift ## posns2<-cigarNarrow(aln1[[i]]$cigar, start=posns,width=1) posns2<-cigarQNarrow(aln1[[i]]$cigar, start=posns,width=1) posns2<-attr(posns2,"rshift") ## posns2 contains info id the indel is in the soft-clip position shift.other <- posns-posns2-1 ## shift.other[posns2==0]<-shift[posns2==0] ### posn can be within the softclip which causes a problem shift.other[shift!=shift.other & shift!=0]<-shift[shift!=shift.other & shift!=0] ## see path tumor "chr10" "126727619" "126727619" deletionx ## shift.other[read.has.H]<-shift.other[read.has.H]-as.integer(hard.clips ) #hard.clips is number of hardclips at read start read that were accounted for al ## posns<-posns+shift+(shift.other-shift) shift.other[!(soft.has.S & !soft.at.end)]<-shift[!(soft.has.S & ! soft.at.end)] # if there is no softclip then do nothing posns<-posns+shift.other ##maybe these will work someday ## test<-queryLocs2refLocs(loc.wanted, aln1[[i]]$cigar[1],aln1[[i]]$pos[1], flag=NULL) ## test<-queryLocs2refLoc(loc.wanted, aln1[[i]]$cigar[1],aln1[[i]]$pos[1], flag=NULL) ############### Work done below performed AFTER ponns determined # -subseq(aln1[[i]]$seq[9],start=posns[9],width=1) seq.at.posn<-subseq(aln1[[i]]$seq,start=posns,width=1) seq.at.posn<-as.character(seq.at.posn) # seq.at.posn qual.at.posn<-subseq(aln1[[i]]$qual,start=posns,width=1) ## as.character(qual.at.posn) qual.at.posn<-score[as.character(qual.at.posn)] bad.qual<-qual.at.posn <= base.qual.filter # as.character(seq.at.posn)[bad.qual] ## sort(qual.at.posn,decreasing=TRUE) ## sum(qual.at.posn> qual.filter) names(qual.at.posn)<-seq.at.posn ### needed to get median phred scores counts<-sort(tapply(seq.at.posn,seq.at.posn,length),decreasing=TRUE) mean.quals<-round(tapply(qual.at.posn, names(qual.at.posn),mean)) mean.quals<-mean.quals[names(counts)] alleles.unfiltered[i]<-toString(paste(names(counts),":",counts," (",mean.quals,")",sep="")) ######filter for base quality seq.at.posn<-seq.at.posn[!bad.qual & !poor.read] qual.at.posn<-qual.at.posn[!bad.qual & !poor.read] counts<-sort(tapply(seq.at.posn,seq.at.posn,length),decreasing=TRUE) mean.quals<-round(tapply(qual.at.posn, names(qual.at.posn),mean)) mean.quals<-mean.quals[names(counts)] ## alleles[i]<-toString(paste(unlist(labels(counts)),counts,sep=":")) if(length(counts)<2){if(length(counts)==0){counts[1:2]<-0;names(counts )[1:2]<- "O"}else{counts[2]<-0;names(counts)[2]<- "O"} } ### only one allele can break suff below... cause where poor reads mean that ther are no good reads major.allele[i]<-names(counts)[1] minor.allele[i]<-names(counts)[2] major.allele.count[i]<-counts[1] minor.allele.count[i]<-counts[2] alleles[i]<-toString(paste(names(counts),":",counts," (",mean.quals,")",sep="")) alleles.freq[i]<-counts[2]/(counts[1]+counts[2]) ##minor allele frequency coverage[i]<-counts[2]+counts[1] print(toString(paste(unlist(labels(counts)),":",counts," (",mean.quals,")",sep=""))) } ####### which and regions dife names(alleles)<-names(aln1) summary.key<-paste("chr",summary[,"chr"],":",summary[,"start"],"-",sum mary[,"end"],sep="") posns<-match(summary.key,names(alleles)) missing<-is.na(posns) sum(missing) posns summary[!missing,paste(names(sample.bams)[ii],"Allele Summary (<phred>) ALL",sep=".")]<-alleles.unfiltered[posns[!missing]] summary[!missing,paste(names(sample.bams)[ii],"Allele Summary (<phred>)",sep=".")]<-alleles[posns[!missing]] # summary[! missing,"Normal Coverage"]<-alleles[posns[!missing]] summary[!missing,paste(names(sample.bams)[ii],"Allele Freq",sep=".")]<-round(alleles.freq[posns[!missing]],digits=3) summary[!missing,paste(names(sample.bams)[ii],"Major Allele",sep=".")]<-major.allele[posns[!missing]] summary[!missing,paste(names(sample.bams)[ii],"Minor Allele",sep=".")]<-minor.allele[posns[!missing]] summary[!missing,paste(names(sample.bams)[ii],"Major Allele Count",sep=".")]<-major.allele.count[posns[!missing]] summary[!missing,paste(names(sample.bams)[ii],"Minor Allele Count",sep=".")]<-minor.allele.count[posns[!missing]] summary[! missing,paste(names(sample.bams)[ii],"Coverage",sep=".")]<-coverage[po sns[!missing]] } ## summary[!missing,"Normal Coverage"]<-coverage[posns[!missing]] dim(summary) summary[1:2,] Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level 5 Rm 5005 , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: HervÃ© PagÃ¨s <hpages@fhcrc.org> To: David Greber <dav.greber@gmail.com> Cc: bioconductor@r-project.org Subject: Re: [BioC] Finding mismatches between reads and reference sequence Date: Fri, 5 Apr 2013 00:43:09 -0700 Hi David, Sorry for the delay in answering this. Was too busy trying to get a few things done before the new BioC release. On 03/26/2013 05:50 AM, David Greber wrote: > Hello, > > How can I find the mismatching positions between a read (e.g. in a > "GappedAlignments" object) and the reference sequence (a "BSgenome" > object)? In general, I am looking for an operation that maps the read > sequence against the reference genome (taking cigar operation into account) > and compares the DNAString objects. > > I tried this, but due to the different cigar string operations, this seems > to become difficult for complex alignments. The "Rsamtools" package offers > with "cigarToIRangesListByAlignment", but does not take soft clips into > account. > > Is there some functionality in bioconductor for this? I assume that it is a > common task, but could not find anything like it. I don't think there is an easy way at the moment to find the mismatches between the reads and reference sequences. Below is some code that solves this problem granted that (1) you have a BSgenome object containing the reference sequences, and (2) there are no insertions in your alignments (i.e. no I's in the cigars). Like in the following example: library(pasillaBamSubset) library(Rsamtools) ## Query sequences are in the SEQ field in the BAM file. param <- ScanBamParam(what="seq", tag=c("MD", "NM")) reads <- readGappedAlignments(untreated1_chr4(), param=param) Then: > colSums(cigarOpTable(cigar(reads))) M I D N S H P 15326625 0 0 21682582 0 0 0 No insertions (I), no deletions (D). What's important for the following code to work is that you have no insertions. Having deletions, or soft clipping (S) or hard clipping (H) is OK. It would probably not be too hard to adapt the code below to actually support insertions, but that would make it a little bit more complicated. We'll proceed in 3 steps: (a) Extract the query sequences, clip them, and flip them if needed. (b) Extract the reference sequences from the BSgenome object. (c) Compare query sequences to reference sequences. (a) Extract the query sequences, clip them, and flip them if needed. qseqs <- mcols(reads)seq Clip them. Only soft-clipping needs to be performed. The query sequences stored in BAM file are already hard-clipped: softClip <- function(qseq, cigar) { if (length(qseq) != length(cigar)) stop("'qseq' and 'cigar' must have the same length") split_cigar <- splitCigar(cigar) no_clipping_idx <- unlist(unname(lapply(split_cigar, function(x) { x1 <- x[[1L]] c(rawToChar(x1[1L]), rawToChar(x1[length(x1)])) != "S" }))) clipping <- unlist(unname(lapply(split_cigar, function(x) { x2 <- x[[2L]] x2[c(1L, length(x2))] }))) clipping[no_clipping_idx] <- 0L left_clipping <- clipping[c(TRUE, FALSE)] right_clipping <- clipping[c(FALSE, TRUE)] narrow(qseq, start=left_clipping+1L, end=-right_clipping- 1L) } qseqs <- softClip(qseqs, cigar(reads)) Because the BAM format imposes that the read sequence stored in the SEQ field is reverse complemented when the read is aligned to the minus strand, we reverse complement it again to get it in the original orientation: is_on_minus <- as.logical(strand(reads) == "-") qseqs[is_on_minus] <- reverseComplement(qseqs[is_on_minus]) (b) Extract corresponding reference sequences from BSgenome object. library(BSgenome.Dmelanogaster.UCSC.dm3) A trick here is to extract the ranges of the reads in a GRangesList object (each read will typically produce 1 range or more), and to treat that GRangesList object as if it was representing exons grouped by transcript: grl <- grglist(reads, order.as.in.query=TRUE, drop.D.ranges=TRUE) library(GenomicFeatures) ## Warning can be ignored. rseqs <- extractTranscriptsFromGenome(Dmelanogaster, grl) (c) Compare 'qseqs' and 'rseqs'. At this point 'qseqs' and 'rseqs' should have the same shape (i.e. same length and widths). Furthermore, if there was no mismatch in any of the alignments, 'qseqs' and 'rseqs' would contain exactly the same sequences. The mismatches can be found with: mismatches <- function(x, y) { if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet")) stop("'x' and 'y' must be DNAStringSet objects") x_width <- width(x) y_width <- width(y) if (!identical(x_width, y_width)) stop("'x' and 'y' must have the same shape ", "(i.e. same length and widths)") unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y))) breakpoints <- cumsum(x_width) ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L, breakpoints) + 1L, nbins=length(x)) skeleton <- PartitioningByEnd(cumsum(ans_eltlens)) ans <- relist(unlisted_ans, skeleton) offsets <- c(0L, breakpoints[-length(breakpoints)]) ans <- ans - offsets ans } mm <- mismatches(qseqs, rseqs) The result is an IntegerList: > mm IntegerList of length 204355 [[1]] 27 [[2]] 9 13 54 [[3]] 13 45 [[4]] 48 [[5]] 37 57 [[6]] 56 [[7]] integer(0) [[8]] integer(0) [[9]] 35 73 [[10]] 23 41 48 ... <204345 more elements> Nb of mismatches per alignment: elementLengths(mm) In the case of pasillaBamSubset, since there are not indels in the alignments, this should be identical to the edit distance reported in the NM tag of the BAM file: stopifnot(identical(elementLengths(mm), mcols(reads)NM)) Unfortunately, using 'mm' to subset 'qseqs' or 'rseqs' is not supported at the moment (we should add it). Here is a temporary workaround: subsetByIntegerList <- function(x, i) { breakpoints <- cumsum(width(x)) offsets <- c(0L, breakpoints[-length(breakpoints)]) i <- i + offsets unlisted_ans <- unlist(x)[unlist(i)] as(successiveViews(unlisted_ans, elementLengths(i)), class(x)) } Subsetting 'qseqs' and 'rseqs' produces 2 DNAStringSet objects with the shape of 'mm': > subsetByIntegerList(qseqs, mm) A DNAStringSet instance of length 204355 width seq [1] 1 G [2] 3 GAG [3] 2 AG [4] 1 A [5] 2 CA ... ... ... [204351] 1 A [204352] 1 A [204353] 2 AG [204354] 0 [204355] 0 > subsetByIntegerList(rseqs, mm) A DNAStringSet instance of length 204355 width seq [1] 1 A [2] 3 TGA [3] 2 GC [4] 1 G [5] 2 AC ... ... ... [204351] 1 G [204352] 1 G [204353] 2 GT [204354] 0 [204355] 0 Lightly tested only. Let me know if you run into problems. Lot of the above stuff will be added soon in one form or another to the basic infrastructure. Maybe a mismatches(x, y) generic with methods for x=GappedAlignments,y=BSgenome ; x=DNAStringSet,y=DNAStringSet ; and other useful combinations... Cheers, H. > > Cheers > David > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]