matchPDict with mismatches allowed appears to drop names

0

Entering edit mode

Ian Henry ▴ 50

@ian-henry-4772

Last seen 11.4 years ago

Hi, I have a question regarding the inheritance of the names attribute when using matchPDict. If I use matchPDict as follows: #Get transcript information > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene") > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) #Create DNAStringSet with names associated with each probe > probeset <- DNAStringSet(probelist$sequence) > names(probeset)<-probelist$probenames #Create PDict object and match against human transcript 14 (I know it should match) > ps_pdict<-PDict(probeset) > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) this compares the probes in ps_pdict to transcript 14 in hg19 and gives: >unlist(txmatches): start end width names [1] 749 773 25 HW:6 [2] 569 593 25 HW:16 [3] 804 828 25 HW:26 [4] 757 781 25 HW:36 which works :) However, if I search allowing for mismatches then the names appear to be lost: > ps_pdict1<-PDict(probeset, max.mismatch=1) > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], max.mismatch=1, min.mismatch=0) > unlist(txmatches1) IRanges of length 4 start end width [1] 749 773 25 [2] 569 593 25 [3] 804 828 25 [4] 757 781 25 The result of matchPDict is a MIndex object that I named txmatches with exact matches, and txmatches1 with 1 mismatch > names(txmatches) #gives character vector containing probe names > names(txmatches1) #returns NULL So it appears the names are not inherited. I tried to added them manually to my MIndex object >names(txmatches1)<-names(probeset) but I get Error: attempt to modify the names of a ByPos_MIndex instance Therefore I'm not sure how to keep my probe names associated with the Transcript match, which is important for inexact matching. Any help would be greatly appreciated, Thanks, Ian >sessionInfo() R version 2.13.0 beta (2011-03-31 r55221) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] C/UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 [3] BSgenome_1.19.5 Biostrings_2.19.17 [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 [7] IRanges_1.9.28 loaded via a namespace (and not attached): [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 [7] rtracklayer_1.11.12 tools_2.13.0

probe probe • 1.8k views

ADD COMMENT • link updated 14.5 years ago by Hervé Pagès 16k • written 14.5 years ago by Ian Henry ▴ 50

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 11.2 years ago

United States

The short answer for the fuzzy matching case appears to be that, even though your names are maintained through most of the function matchPDict.R:.match.MTB_PDict, and in fact they are still present in the list 'ans_compons', they get dropped in this call: ans <- ByPos_MIndex.combine(ans_compons) By contrast, for exact matching, the function matchPDict.R:.match.TB_PDict returns your names here: new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), ends=C_ans) So my naive guess is that MIndex-class.R:ByPos_MIndex.combine should do that also, something like ans_names <- mi_list[[1]]@NAMES new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, NAMES=ans_names) On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: > Hi, > > I have a question regarding the inheritance of the names attribute > when using matchPDict. > > If I use matchPDict as follows: > > #Get transcript information > > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = > "refGene") > > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) > > #Create DNAStringSet with names associated with each probe > > probeset <- DNAStringSet(probelist$sequence) > > names(probeset)<-probelist$probenames > > #Create PDict object and match against human transcript 14 (I know > it should match) > > ps_pdict<-PDict(probeset) > > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) > > this compares the probes in ps_pdict to transcript 14 in hg19 and > gives: > >unlist(txmatches): > > start end width names > [1] 749 773 25 HW:6 > [2] 569 593 25 HW:16 > [3] 804 828 25 HW:26 > [4] 757 781 25 HW:36 > > which works :) > > However, if I search allowing for mismatches then the names appear > to be lost: > > > ps_pdict1<-PDict(probeset, max.mismatch=1) > > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], > max.mismatch=1, min.mismatch=0) > > unlist(txmatches1) > > IRanges of length 4 > start end width > [1] 749 773 25 > [2] 569 593 25 > [3] 804 828 25 > [4] 757 781 25 > > The result of matchPDict is a MIndex object that I named txmatches > with exact matches, and txmatches1 with 1 mismatch > > names(txmatches) #gives character vector > containing probe names > > names(txmatches1) #returns NULL > > So it appears the names are not inherited. I tried to added them > manually to my MIndex object > >names(txmatches1)<-names(probeset) > > but I get Error: > attempt to modify the names of a ByPos_MIndex instance > > Therefore I'm not sure how to keep my probe names associated with > the Transcript match, which is important for inexact matching. > > Any help would be greatly appreciated, > > Thanks, > > Ian > > > >sessionInfo() > > R version 2.13.0 beta (2011-03-31 r55221) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] C/UTF-8/C/C/C/C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] plyr_1.5.2 > BSgenome.Hsapiens.UCSC.hg19_1.3.17 > [3] BSgenome_1.19.5 Biostrings_2.19.17 > [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 > [7] IRanges_1.9.28 > > loaded via a namespace (and not attached): > [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 > [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 > [7] rtracklayer_1.11.12 tools_2.13.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.5 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

>> So it appears the names are not inherited. I tried to added them >> manually to my MIndex object names(txmatches1)<-names(probeset) >> but I get Error: >> attempt to modify the names of a ByPos_MIndex instance From MIndex-class.R: setMethod("names", "MIndex", function(x) x at NAMES) setReplaceMethod("names", "MIndex", function(x, value) stop("attempt to modify the names of a ", class(x), " instance") ) For now, a pretty safe workaround for you is txmatches1 at NAMES <- names(probeset) On Aug 2, 2011, at 11:45 AM, Harris A. Jaffee wrote: > The short answer for the fuzzy matching case appears to be that, even > though your names are maintained through most of the function > > matchPDict.R:.match.MTB_PDict, > > and in fact they are still present in the list 'ans_compons', they get > dropped in this call: > > ans <- ByPos_MIndex.combine(ans_compons) > > By contrast, for exact matching, the function > > matchPDict.R:.match.TB_PDict > > returns your names here: > > new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), > ends=C_ans) > > So my naive guess is that > > MIndex-class.R:ByPos_MIndex.combine > > should do that also, something like > > ans_names <- mi_list[[1]]@NAMES > new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, > NAMES=ans_names) > > On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: > >> Hi, >> >> I have a question regarding the inheritance of the names attribute >> when using matchPDict. >> >> If I use matchPDict as follows: >> >> #Get transcript information >> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename >> = "refGene") >> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >> >> #Create DNAStringSet with names associated with each probe >> > probeset <- DNAStringSet(probelist$sequence) >> > names(probeset)<-probelist$probenames >> >> #Create PDict object and match against human transcript 14 (I know >> it should match) >> > ps_pdict<-PDict(probeset) >> > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) >> >> this compares the probes in ps_pdict to transcript 14 in hg19 and >> gives: >> >unlist(txmatches): >> >> start end width names >> [1] 749 773 25 HW:6 >> [2] 569 593 25 HW:16 >> [3] 804 828 25 HW:26 >> [4] 757 781 25 HW:36 >> >> which works :) >> >> However, if I search allowing for mismatches then the names appear >> to be lost: >> >> > ps_pdict1<-PDict(probeset, max.mismatch=1) >> > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], >> max.mismatch=1, min.mismatch=0) >> > unlist(txmatches1) >> >> IRanges of length 4 >> start end width >> [1] 749 773 25 >> [2] 569 593 25 >> [3] 804 828 25 >> [4] 757 781 25 >> >> The result of matchPDict is a MIndex object that I named txmatches >> with exact matches, and txmatches1 with 1 mismatch >> > names(txmatches) #gives character vector >> containing probe names >> > names(txmatches1) #returns NULL >> >> So it appears the names are not inherited. I tried to added them >> manually to my MIndex object >> >names(txmatches1)<-names(probeset) >> >> but I get Error: >> attempt to modify the names of a ByPos_MIndex instance >> >> Therefore I'm not sure how to keep my probe names associated with >> the Transcript match, which is important for inexact matching. >> >> Any help would be greatly appreciated, >> >> Thanks, >> >> Ian >> >> >> >sessionInfo() >> >> R version 2.13.0 beta (2011-03-31 r55221) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] C/UTF-8/C/C/C/C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] plyr_1.5.2 >> BSgenome.Hsapiens.UCSC.hg19_1.3.17 >> [3] BSgenome_1.19.5 Biostrings_2.19.17 >> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >> [7] IRanges_1.9.28 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >> [7] rtracklayer_1.11.12 tools_2.13.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor

ADD REPLY • link 14.5 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

Hi Ian, Harris, On 11-08-02 08:45 AM, Harris A. Jaffee wrote: > The short answer for the fuzzy matching case appears to be that, even > though your names are maintained through most of the function > > matchPDict.R:.match.MTB_PDict, > > and in fact they are still present in the list 'ans_compons', they get > dropped in this call: > > ans <- ByPos_MIndex.combine(ans_compons) > > By contrast, for exact matching, the function > > matchPDict.R:.match.TB_PDict > > returns your names here: > > new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), ends=C_ans) > > So my naive guess is that > > MIndex-class.R:ByPos_MIndex.combine > > should do that also, something like > > ans_names <- mi_list[[1]]@NAMES > new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, NAMES=ans_names) Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and will be soon in Biostrings devel. With Biostrings 2.20.2: library(Biostrings) probes <- DNAStringSet(c(probe1="aaaaaa", probe2="accacc")) pdict1 <- PDict(probes, max.mismatch=1) m1 <- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1) Then: > names(m1) [1] "probe1" "probe2" > unlist(m1) IRanges of length 3 start end width names [1] 2 7 6 probe1 [2] 3 8 6 probe1 [3] 7 12 6 probe2 Thanks guys for the report and for the fix! H. > > On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: > >> Hi, >> >> I have a question regarding the inheritance of the names attribute >> when using matchPDict. >> >> If I use matchPDict as follows: >> >> #Get transcript information >> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >> "refGene") >> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >> >> #Create DNAStringSet with names associated with each probe >> > probeset <- DNAStringSet(probelist$sequence) >> > names(probeset)<-probelist$probenames >> >> #Create PDict object and match against human transcript 14 (I know it >> should match) >> > ps_pdict<-PDict(probeset) >> > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) >> >> this compares the probes in ps_pdict to transcript 14 in hg19 and gives: >> >unlist(txmatches): >> >> start end width names >> [1] 749 773 25 HW:6 >> [2] 569 593 25 HW:16 >> [3] 804 828 25 HW:26 >> [4] 757 781 25 HW:36 >> >> which works :) >> >> However, if I search allowing for mismatches then the names appear to >> be lost: >> >> > ps_pdict1<-PDict(probeset, max.mismatch=1) >> > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], max.mismatch=1, >> min.mismatch=0) >> > unlist(txmatches1) >> >> IRanges of length 4 >> start end width >> [1] 749 773 25 >> [2] 569 593 25 >> [3] 804 828 25 >> [4] 757 781 25 >> >> The result of matchPDict is a MIndex object that I named txmatches >> with exact matches, and txmatches1 with 1 mismatch >> > names(txmatches) #gives character vector containing probe names >> > names(txmatches1) #returns NULL >> >> So it appears the names are not inherited. I tried to added them >> manually to my MIndex object >> >names(txmatches1)<-names(probeset) >> >> but I get Error: >> attempt to modify the names of a ByPos_MIndex instance >> >> Therefore I'm not sure how to keep my probe names associated with the >> Transcript match, which is important for inexact matching. >> >> Any help would be greatly appreciated, >> >> Thanks, >> >> Ian >> >> >> >sessionInfo() >> >> R version 2.13.0 beta (2011-03-31 r55221) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] C/UTF-8/C/C/C/C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 >> [3] BSgenome_1.19.5 Biostrings_2.19.17 >> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >> [7] IRanges_1.9.28 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >> [7] rtracklayer_1.11.12 tools_2.13.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.5 years ago Hervé Pagès 16k

0

Entering edit mode

Excellent thanks for the help and the fix! The workaround works well for me for now. I did spend a while yesterday writing my own workaround but Harris' workaround is much simpler for now than my solution, which is given below but is rather dirty. > txmatches <- matchPDict(ps_pdict, transcript[[1]], max.mismatch=max_mis, min.mismatch=min_mis) > tdf<-as.data.frame(unlist(txmatches)) > patternIndex<-which(countIndex(txmatches)>0) > duptimes <- function(duptimes) rep(duptimes,length(txmatches[[duptimes]])) > tmp<-sapply(patternIndex,duptimes) > tdf$patternIndex<-unlist(tmp) > getNames <- function(getNames) names(ps_pdict)[getNames] > tdf$name<-sapply(tdf$patternIndex, getNames) On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote: > Hi Ian, Harris, > > On 11-08-02 08:45 AM, Harris A. Jaffee wrote: >> The short answer for the fuzzy matching case appears to be that, even >> though your names are maintained through most of the function >> >> matchPDict.R:.match.MTB_PDict, >> >> and in fact they are still present in the list 'ans_compons', they >> get >> dropped in this call: >> >> ans <- ByPos_MIndex.combine(ans_compons) >> >> By contrast, for exact matching, the function >> >> matchPDict.R:.match.TB_PDict >> >> returns your names here: >> >> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), >> ends=C_ans) >> >> So my naive guess is that >> >> MIndex-class.R:ByPos_MIndex.combine >> >> should do that also, something like >> >> ans_names <- mi_list[[1]]@NAMES >> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, >> NAMES=ans_names) > > Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and > will be soon in Biostrings devel. With Biostrings 2.20.2: > > library(Biostrings) > probes <- DNAStringSet(c(probe1="aaaaaa", probe2="accacc")) > pdict1 <- PDict(probes, max.mismatch=1) > m1 <- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1) > > Then: > > > names(m1) > [1] "probe1" "probe2" > > > unlist(m1) > IRanges of length 3 > start end width names > [1] 2 7 6 probe1 > [2] 3 8 6 probe1 > [3] 7 12 6 probe2 > > Thanks guys for the report and for the fix! > H. > >> >> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: >> >>> Hi, >>> >>> I have a question regarding the inheritance of the names attribute >>> when using matchPDict. >>> >>> If I use matchPDict as follows: >>> >>> #Get transcript information >>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>> "refGene") >>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>> >>> #Create DNAStringSet with names associated with each probe >>> > probeset <- DNAStringSet(probelist$sequence) >>> > names(probeset)<-probelist$probenames >>> >>> #Create PDict object and match against human transcript 14 (I know >>> it >>> should match) >>> > ps_pdict<-PDict(probeset) >>> > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) >>> >>> this compares the probes in ps_pdict to transcript 14 in hg19 and >>> gives: >>> >unlist(txmatches): >>> >>> start end width names >>> [1] 749 773 25 HW:6 >>> [2] 569 593 25 HW:16 >>> [3] 804 828 25 HW:26 >>> [4] 757 781 25 HW:36 >>> >>> which works :) >>> >>> However, if I search allowing for mismatches then the names appear >>> to >>> be lost: >>> >>> > ps_pdict1<-PDict(probeset, max.mismatch=1) >>> > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], max.mismatch=1, >>> min.mismatch=0) >>> > unlist(txmatches1) >>> >>> IRanges of length 4 >>> start end width >>> [1] 749 773 25 >>> [2] 569 593 25 >>> [3] 804 828 25 >>> [4] 757 781 25 >>> >>> The result of matchPDict is a MIndex object that I named txmatches >>> with exact matches, and txmatches1 with 1 mismatch >>> > names(txmatches) #gives character vector containing probe names >>> > names(txmatches1) #returns NULL >>> >>> So it appears the names are not inherited. I tried to added them >>> manually to my MIndex object >>> >names(txmatches1)<-names(probeset) >>> >>> but I get Error: >>> attempt to modify the names of a ByPos_MIndex instance >>> >>> Therefore I'm not sure how to keep my probe names associated with >>> the >>> Transcript match, which is important for inexact matching. >>> >>> Any help would be greatly appreciated, >>> >>> Thanks, >>> >>> Ian >>> >>> >>> >sessionInfo() >>> >>> R version 2.13.0 beta (2011-03-31 r55221) >>> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >>> >>> locale: >>> [1] C/UTF-8/C/C/C/C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 >>> [3] BSgenome_1.19.5 Biostrings_2.19.17 >>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >>> [7] IRanges_1.9.28 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >>> [7] rtracklayer_1.11.12 tools_2.13.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 Dr. Ian Henry Bioinformatics Service Leader MPI-CBG Dresden Pfotenhauerstrasse 108 01307 Dresden Germany phone: +49 351 210 2638 email: henry@mpi-cbg.de web: http://people.mpi-cbg.de/henry Check out the Bioinformatics Service wiki site: https://wiki.mpi- cbg.de/wiki/bioinfo/ [[alternative HTML version deleted]]

ADD REPLY • link 14.5 years ago Ian Henry ▴ 50

0

Entering edit mode

Hello again, I'm not sure whether to append this or add a new thread but I have another issue with matchPDict. I can create my PDict object: ps_pdict<-PDict(probeset, max.mismatch=max_mis) I then use vwhichPDict to filter for transcripts that match one of my probes tx_matches <- vwhichPDict(ps_pdict, transcriptLibrary, max.mismatch=max_mis, min.mismatch=min_mis) and then run matchPDict over my list of transcripts that have a probe match using various numbers of mismatches for fuzzy matching. txmatches <- matchPDict(ps_pdict, transcript[[1]], max.mismatch=max_mis, min.mismatch=min_mis) Exact matches work well, as do 1 and 2 mismatches, but above 3 things go a bit wrong and I get back results with only 2 mismatches with max.mismatch=3, and min.mismatch=3 Exact Matches: "names " "start " "end " "width " "TranscriptID " "Matched_Sequence " "probe_sequence " "maxMismatchSet " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" "HW:11" 169 193 25 "NM_001144941.1 " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" "HW:11" 169 193 25 "NM_001144940.1 " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" "HW:11" 169 193 25 "NM_182566.2" "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" "HW:11" 169 193 25 "NM_001144939.1 " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "HW:13" 120 144 25 "NM_007128.2" "GCCCTTGGAACCACAATCCGCCTCA" "GCCCTTGGAACCACAATCCGCCTCA" 0 0 "" "0" One MisMatch: "names " "start " "end " "width " "TranscriptID " "Matched_Sequence " "probe_sequence " "maxMismatchSet " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" "HW:9" 826 850 25 "NM_198152.3" "CCTAACTTTGTTATCCGTGTTGAGT" "CCTAACTTTGTTATCCGTGTTGATT" 1 1 "24" "1" "HW:18" 551 575 25 "NR_037144.1" "GACTCTGCTGTGACCTCCTTGTTCA" "GACTCTACTGTGACCTCCTTGTTCA" 1 1 "7" "1" Two MisMatches: names start end width TranscriptID Matched_Sequence probe_sequence maxMismatchSet minMismatchSet MisMatch_Position(s) NumberOfMismatchesObserved HW:22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT GGAGGCCGAGGTAGGAGGATTGCTT 2 2 12; 17 2 IV:14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC CGCACACACACACACACACACACAT 2 2 19; 25 2 HW::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT GGAGGCCGAGGTAGGAGGATTGCTT 2 2 7; 13 2 IV:14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC CGCACACACACACACACACACACAT 2 2 2; 25 2 IV::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC CGCACACACACACACACACACACAT 2 2 4; 25 2 IV:14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC CGCACACACACACACACACACACAT 2 2 2; 25 2 Three MisMatches: names start end width TranscriptID Matched_Sequence probe_sequence maxMismatchSet minMismatchSet MisMatch_Position(s) NumberOfMismatchesObserved HGW::17::7::22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT GGAGGCCGAGGTAGGAGGATTGCTT 3 3 12; 17 2 INV::5::14::14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC CGCACACACACACACACACACACAT 3 3 19; 25 2 HGW::17::7::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT GGAGGCCGAGGTAGGAGGATTGCTT 3 3 7; 13 2 INV::5::14::14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC CGCACACACACACACACACACACAT 3 3 2; 25 2 INV::5::14::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC CGCACACACACACACACACACACAT 3 3 4; 25 2 INV::5::14::14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC CGCACACACACACACACACACACAT 3 3 2; 25 2 When running vwhichPDict with three mismatches I do get a warning that the algorithm may not be efficient, but this seems to indicate a time problem rather than an accuracy problem. Warning message: In .MTB_PDict(x, max.mismatch, algo) : given the characteristics of dictionary 'x', this value of 'max.mismatch' will give poor performance when you call matchPDict() on this MTB_PDict object (it will of course depend ultimately on the length of the subject) Indeed the output looks OK: Indeed the results of vwhichPDict gives: 6428 tx_matches for probes allowing max 0 min 0 mismatches 1077 tx_matches for probes allowing max 1 min 1 mismatches 1953 tx_matches for probes allowing max 2 min 2 mismatches 3729 tx_matches for probes allowing max 3 min 3 mismatches However, If I compare this to the output of looping over matchPDict for transcripts with probe matches (from vwhichPDict) I get: 6428 unique tx_matches for probes allowing max 0 min 0 mismatches 1077 unique tx_matches for probes allowing max 1 min 1 mismatches 1953 unique tx_matches for probes allowing max 2 min 2 mismatches 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but the results show only 2 mismatches Suggesting that matchPDict reverted back to 2 mismatch settings (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3, min.mismatch=3. (It also seems to be the case for max=3, min=0) Probably 3 mismatches is excessive for what I want to do, but I thought I'd report the issue. It seems like matchPDict is reverting back to 2 mismatches in this case but vwhichPDict probably does 3. Thanks, Ian On Aug 3, 2011, at 11:06 AM, Ian Henry wrote: > Excellent thanks for the help and the fix! > > The workaround works well for me for now. I did spend a while > yesterday writing my own workaround but Harris' workaround is much > simpler for now than my solution, which is given below but is rather > dirty. > > > txmatches <- matchPDict(ps_pdict, transcript[[1]], > max.mismatch=max_mis, min.mismatch=min_mis) > > tdf<-as.data.frame(unlist(txmatches)) > > patternIndex<-which(countIndex(txmatches)>0) > > duptimes <- function(duptimes) > rep(duptimes,length(txmatches[[duptimes]])) > > tmp<-sapply(patternIndex,duptimes) > > tdf$patternIndex<-unlist(tmp) > > getNames <- function(getNames) names(ps_pdict)[getNames] > > tdf$name<-sapply(tdf$patternIndex, getNames) > > > > On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote: > >> Hi Ian, Harris, >> >> On 11-08-02 08:45 AM, Harris A. Jaffee wrote: >>> The short answer for the fuzzy matching case appears to be that, >>> even >>> though your names are maintained through most of the function >>> >>> matchPDict.R:.match.MTB_PDict, >>> >>> and in fact they are still present in the list 'ans_compons', they >>> get >>> dropped in this call: >>> >>> ans <- ByPos_MIndex.combine(ans_compons) >>> >>> By contrast, for exact matching, the function >>> >>> matchPDict.R:.match.TB_PDict >>> >>> returns your names here: >>> >>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), >>> ends=C_ans) >>> >>> So my naive guess is that >>> >>> MIndex-class.R:ByPos_MIndex.combine >>> >>> should do that also, something like >>> >>> ans_names <- mi_list[[1]]@NAMES >>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, >>> NAMES=ans_names) >> >> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and >> will be soon in Biostrings devel. With Biostrings 2.20.2: >> >> library(Biostrings) >> probes <- DNAStringSet(c(probe1="aaaaaa", probe2="accacc")) >> pdict1 <- PDict(probes, max.mismatch=1) >> m1 <- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1) >> >> Then: >> >> > names(m1) >> [1] "probe1" "probe2" >> >> > unlist(m1) >> IRanges of length 3 >> start end width names >> [1] 2 7 6 probe1 >> [2] 3 8 6 probe1 >> [3] 7 12 6 probe2 >> >> Thanks guys for the report and for the fix! >> H. >> >>> >>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: >>> >>>> Hi, >>>> >>>> I have a question regarding the inheritance of the names attribute >>>> when using matchPDict. >>>> >>>> If I use matchPDict as follows: >>>> >>>> #Get transcript information >>>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>>> "refGene") >>>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>>> >>>> #Create DNAStringSet with names associated with each probe >>>> > probeset <- DNAStringSet(probelist$sequence) >>>> > names(probeset)<-probelist$probenames >>>> >>>> #Create PDict object and match against human transcript 14 (I >>>> know it >>>> should match) >>>> > ps_pdict<-PDict(probeset) >>>> > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]]) >>>> >>>> this compares the probes in ps_pdict to transcript 14 in hg19 and >>>> gives: >>>> >unlist(txmatches): >>>> >>>> start end width names >>>> [1] 749 773 25 HW:6 >>>> [2] 569 593 25 HW:16 >>>> [3] 804 828 25 HW:26 >>>> [4] 757 781 25 HW:36 >>>> >>>> which works :) >>>> >>>> However, if I search allowing for mismatches then the names >>>> appear to >>>> be lost: >>>> >>>> > ps_pdict1<-PDict(probeset, max.mismatch=1) >>>> > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], >>>> max.mismatch=1, >>>> min.mismatch=0) >>>> > unlist(txmatches1) >>>> >>>> IRanges of length 4 >>>> start end width >>>> [1] 749 773 25 >>>> [2] 569 593 25 >>>> [3] 804 828 25 >>>> [4] 757 781 25 >>>> >>>> The result of matchPDict is a MIndex object that I named txmatches >>>> with exact matches, and txmatches1 with 1 mismatch >>>> > names(txmatches) #gives character vector containing probe names >>>> > names(txmatches1) #returns NULL >>>> >>>> So it appears the names are not inherited. I tried to added them >>>> manually to my MIndex object >>>> >names(txmatches1)<-names(probeset) >>>> >>>> but I get Error: >>>> attempt to modify the names of a ByPos_MIndex instance >>>> >>>> Therefore I'm not sure how to keep my probe names associated with >>>> the >>>> Transcript match, which is important for inexact matching. >>>> >>>> Any help would be greatly appreciated, >>>> >>>> Thanks, >>>> >>>> Ian >>>> >>>> >>>> >sessionInfo() >>>> >>>> R version 2.13.0 beta (2011-03-31 r55221) >>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >>>> >>>> locale: >>>> [1] C/UTF-8/C/C/C/C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 >>>> [3] BSgenome_1.19.5 Biostrings_2.19.17 >>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >>>> [7] IRanges_1.9.28 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >>>> [7] rtracklayer_1.11.12 tools_2.13.0 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 14.5 years ago Ian Henry ▴ 50

0

Entering edit mode

Hi Ian, On 11-08-04 08:11 AM, Ian Henry wrote: > Hello again, > > I'm not sure whether to append this or add a new thread but I have > another issue with matchPDict. > > I can create my PDict object: > ps_pdict<-PDict(probeset, max.mismatch=max_mis) > > > I then use vwhichPDict to filter for transcripts that match one of my > probes > tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary, > max.mismatch=max_mis, min.mismatch=min_mis) > > and then run matchPDict over my list of transcripts that have a probe > match using various numbers of mismatches for fuzzy matching. > txmatches<- matchPDict(ps_pdict, transcript[[1]], > max.mismatch=max_mis, min.mismatch=min_mis) > > Exact matches work well, as do 1 and 2 mismatches, but above 3 things > go a bit wrong and I get back results with only 2 mismatches with > max.mismatch=3, and min.mismatch=3 > > Exact Matches: > "names > " "start > " "end > " "width > " "TranscriptID > " "Matched_Sequence > " "probe_sequence > " "maxMismatchSet > " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" > "HW:11" 169 193 > 25 > "NM_001144941.1 > " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" > "HW:11" 169 193 > 25 > "NM_001144940.1 > " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" > "HW:11" 169 193 > 25 > "NM_182566.2" "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" > 0 0 "" "0" > "HW:11" 169 193 > 25 > "NM_001144939.1 > " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" > "HW:13" 120 144 > 25 > "NM_007128.2" "GCCCTTGGAACCACAATCCGCCTCA" "GCCCTTGGAACCACAATCCGCCTCA" > 0 0 "" "0" > > One MisMatch: > "names > " "start > " "end > " "width > " "TranscriptID > " "Matched_Sequence > " "probe_sequence > " "maxMismatchSet > " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" > "HW:9" 826 850 > 25 > "NM_198152.3" "CCTAACTTTGTTATCCGTGTTGAGT" "CCTAACTTTGTTATCCGTGTTGATT" > 1 1 "24" "1" > "HW:18" 551 575 > 25 > "NR_037144.1" "GACTCTGCTGTGACCTCCTTGTTCA" "GACTCTACTGTGACCTCCTTGTTCA" > 1 1 "7" "1" > > Two MisMatches: > names start end width TranscriptID Matched_Sequence probe_sequence > maxMismatchSet minMismatchSet MisMatch_Position(s) > NumberOfMismatchesObserved > HW:22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT > GGAGGCCGAGGTAGGAGGATTGCTT 2 2 12; 17 2 > IV:14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC > CGCACACACACACACACACACACAT 2 2 19; 25 2 > HW::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT > GGAGGCCGAGGTAGGAGGATTGCTT 2 2 7; 13 2 > IV:14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC > CGCACACACACACACACACACACAT 2 2 2; 25 2 > IV::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC > CGCACACACACACACACACACACAT 2 2 4; 25 2 > IV:14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC > CGCACACACACACACACACACACAT 2 2 2; 25 2 > > Three MisMatches: > names start end width TranscriptID Matched_Sequence probe_sequence > maxMismatchSet minMismatchSet MisMatch_Position(s) > NumberOfMismatchesObserved > HGW::17::7::22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT > GGAGGCCGAGGTAGGAGGATTGCTT 3 3 12; 17 2 > INV::5::14::14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC > CGCACACACACACACACACACACAT 3 3 19; 25 2 > HGW::17::7::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT > GGAGGCCGAGGTAGGAGGATTGCTT 3 3 7; 13 2 > INV::5::14::14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC > CGCACACACACACACACACACACAT 3 3 2; 25 2 > INV::5::14::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC > CGCACACACACACACACACACACAT 3 3 4; 25 2 > INV::5::14::14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC > CGCACACACACACACACACACACAT 3 3 2; 25 2 Please provide a self-contained reproducible example. Below I try to run something only *very* approximate to what you seem to have done (based on the information you provide above), but I can't reproduce the problem. library(Biostrings) ## Using the 6 probes and 6 matched sequences from your "Three ## MisMatches" example: probes <- DNAStringSet(c("GGAGGCCGAGGTAGGAGGATTGCTT", "CGCACACACACACACACACACACAT", "GGAGGCCGAGGTAGGAGGATTGCTT", "CGCACACACACACACACACACACAT", "CGCACACACACACACACACACACAT", "CGCACACACACACACACACACACAT")) matched_sequences <- DNAStringSet(c("GGAGGCCGAGGCAGGATGATTGCTT", "CGCACACACACACACACATACACAC", "GGAGGCTGAGGTGGGAGGATTGCTT", "CACACACACACACACACACACACAC", "CGCGCACACACACACACACACACAC", "CACACACACACACACACACACACAC")) ## Computing the Hamming distances: > sapply(1:6, function(i) neditStartingAt(matched_sequences[[i]], probes[[i]])) [1] 2 2 2 2 2 2 ## Using matchPDict() with max.mismatch=3 and min.mismatch=3. ps_pdict <- PDict(probeset, max.mismatch=3) ## probe 1 vs sequence 1: > m1 <- matchPDict(ps_pdict, matched_sequences[[1]], max.mismatch=3, min.mismatch=3) > m1[[1]] IRanges of length 0 ## --> no hit ## probe 2 vs sequence 2: > m2 <- matchPDict(ps_pdict, matched_sequences[[2]], max.mismatch=3, min.mismatch=3) > m2[[2]] IRanges of length 0 ## --> no hit ## probe 3 vs sequence 3: > m3 <- matchPDict(ps_pdict, matched_sequences[[3]], max.mismatch=3, min.mismatch=3) > m3[[3]] IRanges of length 0 ## --> no hit ## probe 4 vs sequence 4: > m4 <- matchPDict(ps_pdict, matched_sequences[[4]], max.mismatch=3, min.mismatch=3) > m4[[4]] IRanges of length 2 start end width [1] -1 23 25 [2] 3 27 25 ## --> probe 4 hits sequence 4 twice but those 2 hits ## are "out-of-limit" hits. The reason those hits are ## considered to have 3 mismatches is that letters in ## the pattern that are not facing a letter in the subject ## are counted as mismatches: > neditStartingAt(probes[[4]], matched_sequences[[4]], starting.at=-2:4) [1] 25 3 25 2 25 3 25 ## probe 5 vs sequence 5: > m5 <- matchPDict(ps_pdict, matched_sequences[[5]], max.mismatch=3, min.mismatch=3) > m5[[5]] IRanges of length 0 ## --> no hit ## probe 6 vs sequence 6: > m6 <- matchPDict(ps_pdict, matched_sequences[[6]], max.mismatch=3, min.mismatch=3) > m6[[6]] IRanges of length 2 start end width [1] -1 23 25 [2] 3 27 25 ## -> same as probe 4 vs sequence 4 ## Consistency with vwhichPDict(): > countIndex(m1) [1] 0 0 0 0 0 0 > countIndex(m2) [1] 0 0 0 0 0 0 > countIndex(m3) [1] 0 0 0 0 0 0 > countIndex(m4) [1] 0 2 0 2 2 2 > countIndex(m5) [1] 0 0 0 0 0 0 > countIndex(m6) [1] 0 2 0 2 2 2 > vwhichPDict(ps_pdict, matched_sequences, max.mismatch=3, min.mismatch=3) [[1]] integer(0) [[2]] integer(0) [[3]] integer(0) [[4]] [1] 2 4 5 6 [[5]] integer(0) [[6]] [1] 2 4 5 6 ## --> things are consistent > sessionInfo() R version 2.13.0 (2011-04-13) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.20.2 IRanges_1.10.4 loaded via a namespace (and not attached): [1] tools_2.13.0 H. > > > When running vwhichPDict with three mismatches I do get a warning that > the algorithm may not be efficient, but this seems to indicate a time > problem rather than an accuracy problem. > Warning message: > In .MTB_PDict(x, max.mismatch, algo) : > given the characteristics of dictionary 'x', this value of > 'max.mismatch' will > give poor performance when you call matchPDict() on this MTB_PDict > object > (it will of course depend ultimately on the length of the subject) > > Indeed the output looks OK: > > Indeed the results of vwhichPDict gives: > 6428 tx_matches for probes allowing max 0 min 0 mismatches > 1077 tx_matches for probes allowing max 1 min 1 mismatches > 1953 tx_matches for probes allowing max 2 min 2 mismatches > 3729 tx_matches for probes allowing max 3 min 3 mismatches > > However, If I compare this to the output of looping over matchPDict > for transcripts with probe matches (from vwhichPDict) I get: > 6428 unique tx_matches for probes allowing max 0 min 0 mismatches > 1077 unique tx_matches for probes allowing max 1 min 1 mismatches > 1953 unique tx_matches for probes allowing max 2 min 2 mismatches > 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but > the results show only 2 mismatches > > Suggesting that matchPDict reverted back to 2 mismatch settings > (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3, > min.mismatch=3. (It also seems to be the case for max=3, min=0) > > Probably 3 mismatches is excessive for what I want to do, but I > thought I'd report the issue. It seems like matchPDict is reverting > back to 2 mismatches in this case but vwhichPDict probably does 3. > > Thanks, > > Ian > > > > On Aug 3, 2011, at 11:06 AM, Ian Henry wrote: > >> Excellent thanks for the help and the fix! >> >> The workaround works well for me for now. I did spend a while >> yesterday writing my own workaround but Harris' workaround is much >> simpler for now than my solution, which is given below but is rather >> dirty. >> >>> txmatches<- matchPDict(ps_pdict, transcript[[1]], >> max.mismatch=max_mis, min.mismatch=min_mis) >>> tdf<-as.data.frame(unlist(txmatches)) >>> patternIndex<-which(countIndex(txmatches)>0) >>> duptimes<- function(duptimes) >> rep(duptimes,length(txmatches[[duptimes]])) >>> tmp<-sapply(patternIndex,duptimes) >>> tdf$patternIndex<-unlist(tmp) >>> getNames<- function(getNames) names(ps_pdict)[getNames] >>> tdf$name<-sapply(tdf$patternIndex, getNames) >> >> >> >> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote: >> >>> Hi Ian, Harris, >>> >>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote: >>>> The short answer for the fuzzy matching case appears to be that, >>>> even >>>> though your names are maintained through most of the function >>>> >>>> matchPDict.R:.match.MTB_PDict, >>>> >>>> and in fact they are still present in the list 'ans_compons', they >>>> get >>>> dropped in this call: >>>> >>>> ans<- ByPos_MIndex.combine(ans_compons) >>>> >>>> By contrast, for exact matching, the function >>>> >>>> matchPDict.R:.match.TB_PDict >>>> >>>> returns your names here: >>>> >>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), >>>> ends=C_ans) >>>> >>>> So my naive guess is that >>>> >>>> MIndex-class.R:ByPos_MIndex.combine >>>> >>>> should do that also, something like >>>> >>>> ans_names<- mi_list[[1]]@NAMES >>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, >>>> NAMES=ans_names) >>> >>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and >>> will be soon in Biostrings devel. With Biostrings 2.20.2: >>> >>> library(Biostrings) >>> probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc")) >>> pdict1<- PDict(probes, max.mismatch=1) >>> m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1) >>> >>> Then: >>> >>> > names(m1) >>> [1] "probe1" "probe2" >>> >>> > unlist(m1) >>> IRanges of length 3 >>> start end width names >>> [1] 2 7 6 probe1 >>> [2] 3 8 6 probe1 >>> [3] 7 12 6 probe2 >>> >>> Thanks guys for the report and for the fix! >>> H. >>> >>>> >>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: >>>> >>>>> Hi, >>>>> >>>>> I have a question regarding the inheritance of the names attribute >>>>> when using matchPDict. >>>>> >>>>> If I use matchPDict as follows: >>>>> >>>>> #Get transcript information >>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>>>> "refGene") >>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>>>> >>>>> #Create DNAStringSet with names associated with each probe >>>>>> probeset<- DNAStringSet(probelist$sequence) >>>>>> names(probeset)<-probelist$probenames >>>>> >>>>> #Create PDict object and match against human transcript 14 (I >>>>> know it >>>>> should match) >>>>>> ps_pdict<-PDict(probeset) >>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]]) >>>>> >>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and >>>>> gives: >>>>>> unlist(txmatches): >>>>> >>>>> start end width names >>>>> [1] 749 773 25 HW:6 >>>>> [2] 569 593 25 HW:16 >>>>> [3] 804 828 25 HW:26 >>>>> [4] 757 781 25 HW:36 >>>>> >>>>> which works :) >>>>> >>>>> However, if I search allowing for mismatches then the names >>>>> appear to >>>>> be lost: >>>>> >>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1) >>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]], >>>>> max.mismatch=1, >>>>> min.mismatch=0) >>>>>> unlist(txmatches1) >>>>> >>>>> IRanges of length 4 >>>>> start end width >>>>> [1] 749 773 25 >>>>> [2] 569 593 25 >>>>> [3] 804 828 25 >>>>> [4] 757 781 25 >>>>> >>>>> The result of matchPDict is a MIndex object that I named txmatches >>>>> with exact matches, and txmatches1 with 1 mismatch >>>>>> names(txmatches) #gives character vector containing probe names >>>>>> names(txmatches1) #returns NULL >>>>> >>>>> So it appears the names are not inherited. I tried to added them >>>>> manually to my MIndex object >>>>>> names(txmatches1)<-names(probeset) >>>>> >>>>> but I get Error: >>>>> attempt to modify the names of a ByPos_MIndex instance >>>>> >>>>> Therefore I'm not sure how to keep my probe names associated with >>>>> the >>>>> Transcript match, which is important for inexact matching. >>>>> >>>>> Any help would be greatly appreciated, >>>>> >>>>> Thanks, >>>>> >>>>> Ian >>>>> >>>>> >>>>>> sessionInfo() >>>>> >>>>> R version 2.13.0 beta (2011-03-31 r55221) >>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >>>>> >>>>> locale: >>>>> [1] C/UTF-8/C/C/C/C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 >>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17 >>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >>>>> [7] IRanges_1.9.28 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >>>>> [7] rtracklayer_1.11.12 tools_2.13.0 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >> > > > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.5 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 hours ago

Seattle, WA, United States

Answering to the list... On 11-08-04 12:28 PM, Ian Henry wrote: > I did still see the bug with your code but I fixed the problem: > > I have been making a new pdict object with the specific number of mismatches before running matchPDict. > > However, when preparing you a small example to illustrate my point I discovered the code worked correctly when using just one transcript. > > When I ran this again iterating over the whole transcriptome I noticed my code was erroring. > > I think the real issue is with transcript XM_003119554 > > One probe matches it with 3 mismatches but the match starts at -2 which screwed up my later subseq() call and then failed causing corruption between my two runs (with 2 and 3 mismatches) > > start end width names TranscriptID > 1 -2 22 25 KP:5 XM_003119554.1 > > I now don't think there is an issue with matchPDict, it was just this unexpected error that I missed in my codeblock. OK I see. Yes when doing fuzzy matching you can see "out-of-limits" hits which can then cause you troubles downstream if not handled carefully. Cheers, H. > > I'm still getting to grips with debugging my R code effectively and I guess I missed this one. > > Sorry Harris and Herve thanks for your help. > > Ian > > > > On Aug 4, 2011, at 6:29 PM, Harris A. Jaffee wrote: > >> Do you still see a bug with this, i.e. some matches with only 2 errors? >> >>>> pdict3 = PDict(probeset, max.mismatch=3) >>>> m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3) >> >> If so, can you give a small example? >> >> Or did my advice just avoid your original problem? >> >> On Aug 4, 2011, at 11:43 AM, Ian Henry wrote: >> >>> Sorry I wasn't clear, I am doing that as I thought it best too. >>> >>> On Aug 4, 2011, at 5:38 PM, Harris A. Jaffee wrote: >>> >>>> I'm not certain, but I think it best to build and use a different >>>> PDict object for each 'max_mis' value for which you are going to >>>> call matchPDict: >>>> >>>> pdict0 = PDict(probeset) >>>> pdict1 = PDict(probeset, max.mismatch=1) >>>> pdict2 = PDict(probeset, max.mismatch=2) >>>> pdict3 = PDict(probeset, max.mismatch=3) >>>> >>>> m0 = matchPDict(pdict0, subject0) >>>> m1 = matchPDict(pdict1, subject1, max.mismatch=1, min.mismatch=1) >>>> m2 = matchPDict(pdict2, subject2, max.mismatch=2, min.mismatch=2) >>>> m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3) >>>> >>>> But you still may have found a bug. >>>> >>>> On Aug 4, 2011, at 11:11 AM, Ian Henry wrote: >>>> >>>>> Hello again, >>>>> >>>>> I'm not sure whether to append this or add a new thread but I have >>>>> another issue with matchPDict. >>>>> >>>>> I can create my PDict object: >>>>> ps_pdict<-PDict(probeset, max.mismatch=max_mis) >>>>> >>>>> >>>>> I then use vwhichPDict to filter for transcripts that match one of my >>>>> probes >>>>> tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary, >>>>> max.mismatch=max_mis, min.mismatch=min_mis) >>>>> >>>>> and then run matchPDict over my list of transcripts that have a probe >>>>> match using various numbers of mismatches for fuzzy matching. >>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]], >>>>> max.mismatch=max_mis, min.mismatch=min_mis) >>>>> >>>>> Exact matches work well, as do 1 and 2 mismatches, but above 3 things >>>>> go a bit wrong and I get back results with only 2 mismatches with >>>>> max.mismatch=3, and min.mismatch=3 >>>>> >>>>> Exact Matches: >>>>> "names >>>>> " "start >>>>> " "end >>>>> " "width >>>>> " "TranscriptID >>>>> " "Matched_Sequence >>>>> " "probe_sequence >>>>> " "maxMismatchSet >>>>> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" >>>>> "HW:11" 169 193 >>>>> 25 >>>>> "NM_001144941.1 >>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" >>>>> "HW:11" 169 193 >>>>> 25 >>>>> "NM_001144940.1 >>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0" >>>>> "HW:11" 169 193 >>>>> 25 >>>>> "NM_182566.2" "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" >>>>> 0 0 "" "0" >>>>> "HW:11" 169 193 >>>>> 25 >>>>> "NM_001144939.1 >>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" >>>>> "HW:13" 120 144 >>>>> 25 >>>>> "NM_007128.2" "GCCCTTGGAACCACAATCCGCCTCA" "GCCCTTGGAACCACAATCCGCCTCA" >>>>> 0 0 "" "0" >>>>> >>>>> One MisMatch: >>>>> "names >>>>> " "start >>>>> " "end >>>>> " "width >>>>> " "TranscriptID >>>>> " "Matched_Sequence >>>>> " "probe_sequence >>>>> " "maxMismatchSet >>>>> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved" >>>>> "HW:9" 826 850 >>>>> 25 >>>>> "NM_198152.3" "CCTAACTTTGTTATCCGTGTTGAGT" "CCTAACTTTGTTATCCGTGTTGATT" >>>>> 1 1 "24" "1" >>>>> "HW:18" 551 575 >>>>> 25 >>>>> "NR_037144.1" "GACTCTGCTGTGACCTCCTTGTTCA" "GACTCTACTGTGACCTCCTTGTTCA" >>>>> 1 1 "7" "1" >>>>> >>>>> Two MisMatches: >>>>> names start end width TranscriptID Matched_Sequence probe_sequence >>>>> maxMismatchSet minMismatchSet MisMatch_Position(s) >>>>> NumberOfMismatchesObserved >>>>> HW:22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT >>>>> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 12; 17 2 >>>>> IV:14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC >>>>> CGCACACACACACACACACACACAT 2 2 19; 25 2 >>>>> HW::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT >>>>> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 7; 13 2 >>>>> IV:14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 2 2 2; 25 2 >>>>> IV::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 2 2 4; 25 2 >>>>> IV:14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 2 2 2; 25 2 >>>>> >>>>> Three MisMatches: >>>>> names start end width TranscriptID Matched_Sequence probe_sequence >>>>> maxMismatchSet minMismatchSet MisMatch_Position(s) >>>>> NumberOfMismatchesObserved >>>>> HGW::17::7::22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT >>>>> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 12; 17 2 >>>>> INV::5::14::14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC >>>>> CGCACACACACACACACACACACAT 3 3 19; 25 2 >>>>> HGW::17::7::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT >>>>> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 7; 13 2 >>>>> INV::5::14::14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 3 3 2; 25 2 >>>>> INV::5::14::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 3 3 4; 25 2 >>>>> INV::5::14::14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC >>>>> CGCACACACACACACACACACACAT 3 3 2; 25 2 >>>>> >>>>> >>>>> When running vwhichPDict with three mismatches I do get a warning that >>>>> the algorithm may not be efficient, but this seems to indicate a time >>>>> problem rather than an accuracy problem. >>>>> Warning message: >>>>> In .MTB_PDict(x, max.mismatch, algo) : >>>>> given the characteristics of dictionary 'x', this value of >>>>> 'max.mismatch' will >>>>> give poor performance when you call matchPDict() on this MTB_PDict >>>>> object >>>>> (it will of course depend ultimately on the length of the subject) >>>>> >>>>> Indeed the output looks OK: >>>>> >>>>> Indeed the results of vwhichPDict gives: >>>>> 6428 tx_matches for probes allowing max 0 min 0 mismatches >>>>> 1077 tx_matches for probes allowing max 1 min 1 mismatches >>>>> 1953 tx_matches for probes allowing max 2 min 2 mismatches >>>>> 3729 tx_matches for probes allowing max 3 min 3 mismatches >>>>> >>>>> However, If I compare this to the output of looping over matchPDict >>>>> for transcripts with probe matches (from vwhichPDict) I get: >>>>> 6428 unique tx_matches for probes allowing max 0 min 0 mismatches >>>>> 1077 unique tx_matches for probes allowing max 1 min 1 mismatches >>>>> 1953 unique tx_matches for probes allowing max 2 min 2 mismatches >>>>> 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but >>>>> the results show only 2 mismatches >>>>> >>>>> Suggesting that matchPDict reverted back to 2 mismatch settings >>>>> (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3, >>>>> min.mismatch=3. (It also seems to be the case for max=3, min=0) >>>>> >>>>> Probably 3 mismatches is excessive for what I want to do, but I >>>>> thought I'd report the issue. It seems like matchPDict is reverting >>>>> back to 2 mismatches in this case but vwhichPDict probably does 3. >>>>> >>>>> Thanks, >>>>> >>>>> Ian >>>>> >>>>> >>>>> >>>>> On Aug 3, 2011, at 11:06 AM, Ian Henry wrote: >>>>> >>>>>> Excellent thanks for the help and the fix! >>>>>> >>>>>> The workaround works well for me for now. I did spend a while >>>>>> yesterday writing my own workaround but Harris' workaround is much >>>>>> simpler for now than my solution, which is given below but is rather >>>>>> dirty. >>>>>> >>>>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]], >>>>>> max.mismatch=max_mis, min.mismatch=min_mis) >>>>>>> tdf<-as.data.frame(unlist(txmatches)) >>>>>>> patternIndex<-which(countIndex(txmatches)>0) >>>>>>> duptimes<- function(duptimes) >>>>>> rep(duptimes,length(txmatches[[duptimes]])) >>>>>>> tmp<-sapply(patternIndex,duptimes) >>>>>>> tdf$patternIndex<-unlist(tmp) >>>>>>> getNames<- function(getNames) names(ps_pdict)[getNames] >>>>>>> tdf$name<-sapply(tdf$patternIndex, getNames) >>>>>> >>>>>> >>>>>> >>>>>> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote: >>>>>> >>>>>>> Hi Ian, Harris, >>>>>>> >>>>>>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote: >>>>>>>> The short answer for the fuzzy matching case appears to be that, >>>>>>>> even >>>>>>>> though your names are maintained through most of the function >>>>>>>> >>>>>>>> matchPDict.R:.match.MTB_PDict, >>>>>>>> >>>>>>>> and in fact they are still present in the list 'ans_compons', they >>>>>>>> get >>>>>>>> dropped in this call: >>>>>>>> >>>>>>>> ans<- ByPos_MIndex.combine(ans_compons) >>>>>>>> >>>>>>>> By contrast, for exact matching, the function >>>>>>>> >>>>>>>> matchPDict.R:.match.TB_PDict >>>>>>>> >>>>>>>> returns your names here: >>>>>>>> >>>>>>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), >>>>>>>> ends=C_ans) >>>>>>>> >>>>>>>> So my naive guess is that >>>>>>>> >>>>>>>> MIndex-class.R:ByPos_MIndex.combine >>>>>>>> >>>>>>>> should do that also, something like >>>>>>>> >>>>>>>> ans_names<- mi_list[[1]]@NAMES >>>>>>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, >>>>>>>> NAMES=ans_names) >>>>>>> >>>>>>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and >>>>>>> will be soon in Biostrings devel. With Biostrings 2.20.2: >>>>>>> >>>>>>> library(Biostrings) >>>>>>> probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc")) >>>>>>> pdict1<- PDict(probes, max.mismatch=1) >>>>>>> m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1) >>>>>>> >>>>>>> Then: >>>>>>> >>>>>>>> names(m1) >>>>>>> [1] "probe1" "probe2" >>>>>>> >>>>>>>> unlist(m1) >>>>>>> IRanges of length 3 >>>>>>> start end width names >>>>>>> [1] 2 7 6 probe1 >>>>>>> [2] 3 8 6 probe1 >>>>>>> [3] 7 12 6 probe2 >>>>>>> >>>>>>> Thanks guys for the report and for the fix! >>>>>>> H. >>>>>>> >>>>>>>> >>>>>>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I have a question regarding the inheritance of the names attribute >>>>>>>>> when using matchPDict. >>>>>>>>> >>>>>>>>> If I use matchPDict as follows: >>>>>>>>> >>>>>>>>> #Get transcript information >>>>>>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>>>>>>>> "refGene") >>>>>>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>>>>>>>> >>>>>>>>> #Create DNAStringSet with names associated with each probe >>>>>>>>>> probeset<- DNAStringSet(probelist$sequence) >>>>>>>>>> names(probeset)<-probelist$probenames >>>>>>>>> >>>>>>>>> #Create PDict object and match against human transcript 14 (I >>>>>>>>> know it >>>>>>>>> should match) >>>>>>>>>> ps_pdict<-PDict(probeset) >>>>>>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]]) >>>>>>>>> >>>>>>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and >>>>>>>>> gives: >>>>>>>>>> unlist(txmatches): >>>>>>>>> >>>>>>>>> start end width names >>>>>>>>> [1] 749 773 25 HW:6 >>>>>>>>> [2] 569 593 25 HW:16 >>>>>>>>> [3] 804 828 25 HW:26 >>>>>>>>> [4] 757 781 25 HW:36 >>>>>>>>> >>>>>>>>> which works :) >>>>>>>>> >>>>>>>>> However, if I search allowing for mismatches then the names >>>>>>>>> appear to >>>>>>>>> be lost: >>>>>>>>> >>>>>>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1) >>>>>>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]], >>>>>>>>> max.mismatch=1, >>>>>>>>> min.mismatch=0) >>>>>>>>>> unlist(txmatches1) >>>>>>>>> >>>>>>>>> IRanges of length 4 >>>>>>>>> start end width >>>>>>>>> [1] 749 773 25 >>>>>>>>> [2] 569 593 25 >>>>>>>>> [3] 804 828 25 >>>>>>>>> [4] 757 781 25 >>>>>>>>> >>>>>>>>> The result of matchPDict is a MIndex object that I named txmatches >>>>>>>>> with exact matches, and txmatches1 with 1 mismatch >>>>>>>>>> names(txmatches) #gives character vector containing probe names >>>>>>>>>> names(txmatches1) #returns NULL >>>>>>>>> >>>>>>>>> So it appears the names are not inherited. I tried to added them >>>>>>>>> manually to my MIndex object >>>>>>>>>> names(txmatches1)<-names(probeset) >>>>>>>>> >>>>>>>>> but I get Error: >>>>>>>>> attempt to modify the names of a ByPos_MIndex instance >>>>>>>>> >>>>>>>>> Therefore I'm not sure how to keep my probe names associated with >>>>>>>>> the >>>>>>>>> Transcript match, which is important for inexact matching. >>>>>>>>> >>>>>>>>> Any help would be greatly appreciated, >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Ian >>>>>>>>> >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> >>>>>>>>> R version 2.13.0 beta (2011-03-31 r55221) >>>>>>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] C/UTF-8/C/C/C/C >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17 >>>>>>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17 >>>>>>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31 >>>>>>>>> [7] IRanges_1.9.28 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0 >>>>>>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1 >>>>>>>>> [7] rtracklayer_1.11.12 tools_2.13.0 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Hervé Pagès >>>>>>> >>>>>>> Program in Computational Biology >>>>>>> Division of Public Health Sciences >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>> P.O. Box 19024 >>>>>>> Seattle, WA 98109-1024 >>>>>>> >>>>>>> E-mail: hpages at fhcrc.org >>>>>>> Phone: (206) 667-5791 >>>>>>> Fax: (206) 667-1319 >>>>>> >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 14.5 years ago Hervé Pagès 16k

Login before adding your answer.