Finding coding SNPs with predictCoding
2
0
Entering edit mode
@alex-gutteridge-2935
Last seen 10.3 years ago
United States
Hi, I'm trying to find coding SNPs for a list of genes. I'm trying to use predictCoding() from VariantAnnotation in the devel version, but I'm getting an error I can't get to the bottom of. Here is the script: library(BSgenome.Hsapiens.UCSC.hg19) library(VariantAnnotation) library(SNPlocs.Hsapiens.dbSNP.20110815) library(TxDb.Hsapiens.UCSC.hg19.knownGene) #List of genes to retrieve coding snps for entrez.ids = c('6233') #Transcriptdb txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene #SNPs - renamed seq levels to be compatible with txdb snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) renamed.levels = gsub("ch","chr",seqlevels(snps)) names(renamed.levels) = seqlevels(snps) snps = renameSeqlevels(snps,renamed.levels) #CDS grouped by gene gene.list = cdsBy(txdb19, by="gene") vsd.list = gene.list[entrez.ids] #For each gene for(eg in entrez.ids){ #Find all SNPs snp.idx = queryHits(findOverlaps(snps, vsd.list[eg][[1]])) eg.snps = snps[snp.idx] #Find variant alleles as iupac iupac = values(eg.snps)[,"alleles_as_ambig"] #Replicate snps according to number of variant alleles eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) #Create list of all variants variant.alleles = DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1] ]) #Find coding SNPs aa = predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant. alleles) } ######## But the predictCoding call gives this error: Error in .setSeqNames(x, value) : The replacement value for isActiveSeq must be a logical vector, with names that match the seqlevels of the object As you can see I had to redo the seqlevels for the SNPs because they were ch1, ch2, etc... rather than chr1, chr2 etc... Have I done that correctly or introduced some inconsistency there? ###### > sessionInfo() R Under development (unstable) (2012-02-19 r58418) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.6.4 [2] GenomicFeatures_1.7.25 [3] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 [4] VariantAnnotation_1.1.51 [5] Rsamtools_1.7.35 [6] AnnotationDbi_1.17.21 [7] Biobase_2.15.3 [8] BSgenome.Hsapiens.UCSC.hg19_1.3.17 [9] BSgenome_1.23.2 [10] Biostrings_2.23.6 [11] GenomicRanges_1.7.25 [12] IRanges_1.13.24 [13] BiocGenerics_0.1.4 [14] Gviz_0.99.0 [15] BiocInstaller_1.3.7 loaded via a namespace (and not attached): [1] biomaRt_2.11.1 bitops_1.0-4.1 DBI_0.2-5 lattice_0.20-0 [5] Matrix_1.0-3 RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 [9] rtracklayer_1.15.7 snpStats_1.5.4 splines_2.15.0 survival_2.36-12 [13] tools_2.15.0 XML_3.9-4 zlibbioc_1.1.1 -- Alex Gutteridge
SNP VariantAnnotation BSgenome SNPlocs BSgenome VariantAnnotation SNP VariantAnnotation • 1.6k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Alex, On 02/22/2012 03:56 AM, Alex Gutteridge wrote: > Hi, > > I'm trying to find coding SNPs for a list of genes. I'm trying to use > predictCoding() from VariantAnnotation in the devel version, but I'm > getting an error I can't get to the bottom of. Here is the script: > > library(BSgenome.Hsapiens.UCSC.hg19) > library(VariantAnnotation) > library(SNPlocs.Hsapiens.dbSNP.20110815) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > #List of genes to retrieve coding snps for > entrez.ids = c('6233') > > #Transcriptdb > txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene > > #SNPs - renamed seq levels to be compatible with txdb > snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) > renamed.levels = gsub("ch","chr",seqlevels(snps)) > names(renamed.levels) = seqlevels(snps) > snps = renameSeqlevels(snps,renamed.levels) Or, more concisely, instead of the 3 above lines: seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) > > #CDS grouped by gene > gene.list = cdsBy(txdb19, by="gene") > vsd.list = gene.list[entrez.ids] > > #For each gene > for(eg in entrez.ids){ > #Find all SNPs > snp.idx = queryHits(findOverlaps(snps, vsd.list[eg][[1]])) Use vsd.list[[eg]] instead of vsd.list[eg][[1]]. It gives the same result but is more concise. Also here, in your particular example, your snp doesn't hit multiple CDS but it could. So you might want to use unique: snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) > eg.snps = snps[snp.idx] > #Find variant alleles as iupac > iupac = values(eg.snps)[,"alleles_as_ambig"] > #Replicate snps according to number of variant alleles > eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) > #Create list of all variants > variant.alleles = > DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[ 1]]) > #Find coding SNPs > aa = > predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=varian t.alleles) > > } > > ######## > > But the predictCoding call gives this error: > > Error in .setSeqNames(x, value) : > The replacement value for isActiveSeq must be a logical vector, with > names that match the seqlevels of the object > > As you can see I had to redo the seqlevels for the SNPs because they > were ch1, ch2, etc... rather than chr1, chr2 etc... Have I done that > correctly or introduced some inconsistency there? The error message doesn't help much but I think the pb is that you didn't rename chMT properly. Try to do this: seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) before you start the for(eg in entrez.ids){..} loop again. Cheers, H. > > ###### > >> sessionInfo() > R Under development (unstable) (2012-02-19 r58418) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.6.4 > [2] GenomicFeatures_1.7.25 > [3] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 > [4] VariantAnnotation_1.1.51 > [5] Rsamtools_1.7.35 > [6] AnnotationDbi_1.17.21 > [7] Biobase_2.15.3 > [8] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > [9] BSgenome_1.23.2 > [10] Biostrings_2.23.6 > [11] GenomicRanges_1.7.25 > [12] IRanges_1.13.24 > [13] BiocGenerics_0.1.4 > [14] Gviz_0.99.0 > [15] BiocInstaller_1.3.7 > > loaded via a namespace (and not attached): > [1] biomaRt_2.11.1 bitops_1.0-4.1 DBI_0.2-5 lattice_0.20-0 > [5] Matrix_1.0-3 RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 > [9] rtracklayer_1.15.7 snpStats_1.5.4 splines_2.15.0 survival_2.36-12 > [13] tools_2.15.0 XML_3.9-4 zlibbioc_1.1.1 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
On 22.02.2012 18:58, Hervé Pagès wrote: > Hi Alex, > > On 02/22/2012 03:56 AM, Alex Gutteridge wrote: [...] >> But the predictCoding call gives this error: >> >> Error in .setSeqNames(x, value) : >> The replacement value for isActiveSeq must be a logical vector, with >> names that match the seqlevels of the object > > The error message doesn't help much but I think the pb is that you > didn't rename chMT properly. Try to do this: > > seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > > before you start the for(eg in entrez.ids){..} loop again. > > Cheers, > H. Thanks Herv? that nailed it. I'm having some difficulty joining up the output of predictCoding() with the query SNPs though. If someone could point out where the disconnect in my thinking is I would appreciate it! Here's my (now edited down) script: library(BSgenome.Hsapiens.UCSC.hg19) library(VariantAnnotation) library(SNPlocs.Hsapiens.dbSNP.20110815) library(TxDb.Hsapiens.UCSC.hg19.knownGene) entrez.ids = c('6335') txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) gene.list = cdsBy(txdb19, by="gene") vsd.list = gene.list[entrez.ids] cds.list = cdsBy(txdb19,by="tx") eg = entrez.ids[1] snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) eg.snps = snps[snp.idx] iupac = values(eg.snps)[,"alleles_as_ambig"] eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) variant.alleles = DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1] ]) aa = predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant. alleles) ##### Then if I query the predictCoding results in aa in an interactive session I get the following (see inline comments for what I think should be happening, but I must be misinterpreting what queryID means) The docs for predictCoding() contain a small typo (s/queryHits/queryID), but otherwise seem clear? Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. ?queryHits? The ?queryHits? column provides a map back to the variants in the original ?query?. If the ?query? was a ?VCF? object this index corresponds to the row in the ?GRanges? in the ?rowData? slot. If ?query? was an expanded ?GRanges?, ?RangedData? or ?RangesList? the index corresponds to the row in the expanded object. ##### > aa[1,] DataFrame with 1 row and 9 columns queryID consequence refSeq varSeq refAA <integer> <factor> <dnastringset> <dnastringset> <aastringset> 1 1 nonsynonymous CTC ATC L varAA txID geneID cdsID <aastringset> <character> <factor> <integer> 1 I 10921 6335 33668 >#So the first SNP (queryID: 1) is nonsynonymous and maps to tx '10921' > and cds '33668'. >#If I look at the first query SNP I get this: > eg.snps.exp[aa[1,'queryID'],] GRanges with 1 range and 2 elementMetadata values: seqnames ranges strand | RefSNP_id alleles_as_ambig <rle> <iranges> <rle> | <character> <character> [1] chr2 [167055370, 167055370] * | 111558968 R --- seqlengths: chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM NA NA NA NA NA NA ... NA NA NA NA NA NA >#So SNP 1 is at 167055370 on chr2 >#But if I check tx '10921' I see that the cds overlapping 167055370 is > actually '33651' >#And cds '33668' is at the other end of the tx: > cds.list[[aa[1,'txID']]] GRanges with 26 ranges and 3 elementMetadata values: seqnames ranges strand | cds_id cds_name <rle> <iranges> <rle> | <integer> <character> [1] chr2 [167168009, 167168266] - | 33668 <na> [2] chr2 [167163466, 167163584] - | 33667 <na> [3] chr2 [167163020, 167163109] - | 33666 <na> [4] chr2 [167162302, 167162430] - | 33647 <na> [5] chr2 [167160748, 167160839] - | 33646 <na> [6] chr2 [167159600, 167159812] - | 33645 <na> [7] chr2 [167151109, 167151172] - | 33644 <na> [8] chr2 [167149741, 167149882] - | 33643 <na> [9] chr2 [167144947, 167145153] - | 33642 <na> ... ... ... ... ... ... ... [18] chr2 [167099012, 167099166] - | 33659 <na> [19] chr2 [167094604, 167094777] - | 33658 <na> [20] chr2 [167089850, 167089972] - | 33657 <na> [21] chr2 [167085201, 167085482] - | 33656 <na> [22] chr2 [167084180, 167084233] - | 33655 <na> [23] chr2 [167083077, 167083214] - | 33654 <na> [24] chr2 [167060870, 167060974] - | 33653 <na> [25] chr2 [167060465, 167060735] - | 33652 <na> [26] chr2 [167055182, 167056374] - | 33651 <na> exon_rank <integer> [1] 2 [2] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] 10 ... ... [18] 19 [19] 20 [20] 21 [21] 22 [22] 23 [23] 24 [24] 25 [25] 26 [26] 27 --- seqlengths: chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 -- Alex Gutteridge
ADD REPLY
0
Entering edit mode
Hi Alex, Thanks for the bug report. The cdsID was taken from an overlap between the query and GRangesList of cds by transcripts. This gave the correct transcript number but (incorrectly) took the first cds number in the list by default. Now fixed in devel 1.1.55. I've also updated the man page. Valerie On 02/24/2012 02:08 AM, Alex Gutteridge wrote: > On 22.02.2012 18:58, Hervé Pagès wrote: >> Hi Alex, >> >> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: > > [...] > >>> But the predictCoding call gives this error: >>> >>> Error in .setSeqNames(x, value) : >>> The replacement value for isActiveSeq must be a logical vector, with >>> names that match the seqlevels of the object >> >> The error message doesn't help much but I think the pb is that you >> didn't rename chMT properly. Try to do this: >> >> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >> >> before you start the for(eg in entrez.ids){..} loop again. >> >> Cheers, >> H. > > Thanks Herv? that nailed it. I'm having some difficulty joining up the > output of predictCoding() with the query SNPs though. If someone could > point out where the disconnect in my thinking is I would appreciate it! > > Here's my (now edited down) script: > > library(BSgenome.Hsapiens.UCSC.hg19) > library(VariantAnnotation) > library(SNPlocs.Hsapiens.dbSNP.20110815) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > entrez.ids = c('6335') > txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene > > snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) > seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) > seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > > gene.list = cdsBy(txdb19, by="gene") > vsd.list = gene.list[entrez.ids] > cds.list = cdsBy(txdb19,by="tx") > > eg = entrez.ids[1] > > snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) > eg.snps = snps[snp.idx] > iupac = values(eg.snps)[,"alleles_as_ambig"] > eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) > variant.alleles = > DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[ 1]]) > > aa = > predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=varian t.alleles) > > ##### > > Then if I query the predictCoding results in aa in an interactive > session I get the following (see inline comments for what I think > should be happening, but I must be misinterpreting what queryID means) > > The docs for predictCoding() contain a small typo > (s/queryHits/queryID), but otherwise seem clear? > > Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, > ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. > > ?queryHits? The ?queryHits? column provides a map back to the > variants in the original ?query?. If the ?query? was a ?VCF? > object this index corresponds to the row in the ?GRanges? in > the ?rowData? slot. If ?query? was an expanded ?GRanges?, > ?RangedData? or ?RangesList? the index corresponds to the row > in the expanded object. > > ##### > >> aa[1,] > DataFrame with 1 row and 9 columns > queryID consequence refSeq varSeq refAA > <integer> <factor> <dnastringset> <dnastringset> <aastringset> > 1 1 nonsynonymous CTC ATC L > varAA txID geneID cdsID > <aastringset> <character> <factor> <integer> > 1 I 10921 6335 33668 >> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >> '10921' and cds '33668'. >> #If I look at the first query SNP I get this: >> eg.snps.exp[aa[1,'queryID'],] > GRanges with 1 range and 2 elementMetadata values: > seqnames ranges strand | RefSNP_id > alleles_as_ambig > <rle> <iranges> <rle> | <character> <character> > [1] chr2 [167055370, 167055370] * | > 111558968 R > --- > seqlengths: > chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX > chrY chrM > NA NA NA NA NA NA ... NA NA NA NA > NA NA >> #So SNP 1 is at 167055370 on chr2 >> #But if I check tx '10921' I see that the cds overlapping 167055370 >> is actually '33651' >> #And cds '33668' is at the other end of the tx: >> cds.list[[aa[1,'txID']]] > GRanges with 26 ranges and 3 elementMetadata values: > seqnames ranges strand | cds_id cds_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr2 [167168009, 167168266] - | 33668 <na> > [2] chr2 [167163466, 167163584] - | 33667 <na> > [3] chr2 [167163020, 167163109] - | 33666 <na> > [4] chr2 [167162302, 167162430] - | 33647 <na> > [5] chr2 [167160748, 167160839] - | 33646 <na> > [6] chr2 [167159600, 167159812] - | 33645 <na> > [7] chr2 [167151109, 167151172] - | 33644 <na> > [8] chr2 [167149741, 167149882] - | 33643 <na> > [9] chr2 [167144947, 167145153] - | 33642 <na> > ... ... ... ... ... ... ... > [18] chr2 [167099012, 167099166] - | 33659 <na> > [19] chr2 [167094604, 167094777] - | 33658 <na> > [20] chr2 [167089850, 167089972] - | 33657 <na> > [21] chr2 [167085201, 167085482] - | 33656 <na> > [22] chr2 [167084180, 167084233] - | 33655 <na> > [23] chr2 [167083077, 167083214] - | 33654 <na> > [24] chr2 [167060870, 167060974] - | 33653 <na> > [25] chr2 [167060465, 167060735] - | 33652 <na> > [26] chr2 [167055182, 167056374] - | 33651 <na> > exon_rank > <integer> > [1] 2 > [2] 3 > [3] 4 > [4] 5 > [5] 6 > [6] 7 > [7] 8 > [8] 9 > [9] 10 > ... ... > [18] 19 > [19] 20 > [20] 21 > [21] 22 > [22] 23 > [23] 24 > [24] 25 > [25] 26 > [26] 27 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > >
ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States
Good suggestion. Yes, predictCoding is does this internally. I'll post back here when this has been added. Valerie On 02/28/2012 01:49 AM, Alex Gutteridge wrote: > Hi Valerie, > > Thanks everything works great now. One small feature request - would > it be hard to output the protein sequence position of the coding SNPs? > At the moment once I've run predictCoding I'm re-extracting the cds > and working out the position of each coding SNP so I can see where in > the protein sequence it is, but it seems like this is probably just > replicating what predictCoding must be doing internally anyway? > > Alex Gutteridge On 02/24/2012 10:39 AM, Valerie Obenchain wrote: > Hi Alex, > > Thanks for the bug report. The cdsID was taken from an overlap between > the query and GRangesList of cds by transcripts. This gave the correct > transcript number but (incorrectly) took the first cds number in the > list by default. Now fixed in devel 1.1.55. > > I've also updated the man page. > > Valerie > > > > On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >> On 22.02.2012 18:58, Hervé Pagès wrote: >>> Hi Alex, >>> >>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >> >> [...] >> >>>> But the predictCoding call gives this error: >>>> >>>> Error in .setSeqNames(x, value) : >>>> The replacement value for isActiveSeq must be a logical vector, with >>>> names that match the seqlevels of the object >>> >>> The error message doesn't help much but I think the pb is that you >>> didn't rename chMT properly. Try to do this: >>> >>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>> >>> before you start the for(eg in entrez.ids){..} loop again. >>> >>> Cheers, >>> H. >> >> Thanks Herv? that nailed it. I'm having some difficulty joining up >> the output of predictCoding() with the query SNPs though. If someone >> could point out where the disconnect in my thinking is I would >> appreciate it! >> >> Here's my (now edited down) script: >> >> library(BSgenome.Hsapiens.UCSC.hg19) >> library(VariantAnnotation) >> library(SNPlocs.Hsapiens.dbSNP.20110815) >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> >> entrez.ids = c('6335') >> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >> >> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >> >> gene.list = cdsBy(txdb19, by="gene") >> vsd.list = gene.list[entrez.ids] >> cds.list = cdsBy(txdb19,by="tx") >> >> eg = entrez.ids[1] >> >> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >> eg.snps = snps[snp.idx] >> iupac = values(eg.snps)[,"alleles_as_ambig"] >> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >> variant.alleles = >> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[ [1]]) >> >> aa = >> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=varia nt.alleles) >> >> ##### >> >> Then if I query the predictCoding results in aa in an interactive >> session I get the following (see inline comments for what I think >> should be happening, but I must be misinterpreting what queryID means) >> >> The docs for predictCoding() contain a small typo >> (s/queryHits/queryID), but otherwise seem clear? >> >> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >> >> ?queryHits? The ?queryHits? column provides a map back to the >> variants in the original ?query?. If the ?query? was a ?VCF? >> object this index corresponds to the row in the ?GRanges? in >> the ?rowData? slot. If ?query? was an expanded ?GRanges?, >> ?RangedData? or ?RangesList? the index corresponds to the row >> in the expanded object. >> >> ##### >> >>> aa[1,] >> DataFrame with 1 row and 9 columns >> queryID consequence refSeq varSeq refAA >> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >> 1 1 nonsynonymous CTC ATC L >> varAA txID geneID cdsID >> <aastringset> <character> <factor> <integer> >> 1 I 10921 6335 33668 >>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>> '10921' and cds '33668'. >>> #If I look at the first query SNP I get this: >>> eg.snps.exp[aa[1,'queryID'],] >> GRanges with 1 range and 2 elementMetadata values: >> seqnames ranges strand | RefSNP_id >> alleles_as_ambig >> <rle> <iranges> <rle> | <character> <character> >> [1] chr2 [167055370, 167055370] * | >> 111558968 R >> --- >> seqlengths: >> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX >> chrY chrM >> NA NA NA NA NA NA ... NA NA NA NA >> NA NA >>> #So SNP 1 is at 167055370 on chr2 >>> #But if I check tx '10921' I see that the cds overlapping 167055370 >>> is actually '33651' >>> #And cds '33668' is at the other end of the tx: >>> cds.list[[aa[1,'txID']]] >> GRanges with 26 ranges and 3 elementMetadata values: >> seqnames ranges strand | cds_id cds_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr2 [167168009, 167168266] - | 33668 <na> >> [2] chr2 [167163466, 167163584] - | 33667 <na> >> [3] chr2 [167163020, 167163109] - | 33666 <na> >> [4] chr2 [167162302, 167162430] - | 33647 <na> >> [5] chr2 [167160748, 167160839] - | 33646 <na> >> [6] chr2 [167159600, 167159812] - | 33645 <na> >> [7] chr2 [167151109, 167151172] - | 33644 <na> >> [8] chr2 [167149741, 167149882] - | 33643 <na> >> [9] chr2 [167144947, 167145153] - | 33642 <na> >> ... ... ... ... ... ... ... >> [18] chr2 [167099012, 167099166] - | 33659 <na> >> [19] chr2 [167094604, 167094777] - | 33658 <na> >> [20] chr2 [167089850, 167089972] - | 33657 <na> >> [21] chr2 [167085201, 167085482] - | 33656 <na> >> [22] chr2 [167084180, 167084233] - | 33655 <na> >> [23] chr2 [167083077, 167083214] - | 33654 <na> >> [24] chr2 [167060870, 167060974] - | 33653 <na> >> [25] chr2 [167060465, 167060735] - | 33652 <na> >> [26] chr2 [167055182, 167056374] - | 33651 <na> >> exon_rank >> <integer> >> [1] 2 >> [2] 3 >> [3] 4 >> [4] 5 >> [5] 6 >> [6] 7 >> [7] 8 >> [8] 9 >> [9] 10 >> ... ... >> [18] 19 >> [19] 20 >> [20] 21 >> [21] 22 >> [22] 23 >> [23] 24 >> [24] 25 >> [25] 26 >> [26] 27 >> --- >> seqlengths: >> chr1 chr2 ... chr18_gl000207_random >> 249250621 243199373 ... 4262 >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT

Login before adding your answer.

Traffic: 653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6