Finding coding SNPs with predictCoding

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

A 'txLoc' column has been added to the output of predictCoding. Available in devel version 1.1.57. Valerie On 02/28/2012 08:20 AM, Valerie Obenchain wrote: > Good suggestion. Yes, predictCoding is does this internally. I'll post > back here when this has been added. > > Valerie > > > > On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >> Hi Valerie, >> >> Thanks everything works great now. One small feature request - would >> it be hard to output the protein sequence position of the coding >> SNPs? At the moment once I've run predictCoding I'm re-extracting the >> cds and working out the position of each coding SNP so I can see >> where in the protein sequence it is, but it seems like this is >> probably just replicating what predictCoding must be doing internally >> anyway? >> >> Alex Gutteridge > > > On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >> Hi Alex, >> >> Thanks for the bug report. The cdsID was taken from an overlap >> between the query and GRangesList of cds by transcripts. This gave >> the correct transcript number but (incorrectly) took the first cds >> number in the list by default. Now fixed in devel 1.1.55. >> >> I've also updated the man page. >> >> Valerie >> >> >> >> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>> Hi Alex, >>>> >>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>> >>> [...] >>> >>>>> But the predictCoding call gives this error: >>>>> >>>>> Error in .setSeqNames(x, value) : >>>>> The replacement value for isActiveSeq must be a logical vector, with >>>>> names that match the seqlevels of the object >>>> >>>> The error message doesn't help much but I think the pb is that you >>>> didn't rename chMT properly. Try to do this: >>>> >>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>> before you start the for(eg in entrez.ids){..} loop again. >>>> >>>> Cheers, >>>> H. >>> >>> Thanks Herv? that nailed it. I'm having some difficulty joining up >>> the output of predictCoding() with the query SNPs though. If someone >>> could point out where the disconnect in my thinking is I would >>> appreciate it! >>> >>> Here's my (now edited down) script: >>> >>> library(BSgenome.Hsapiens.UCSC.hg19) >>> library(VariantAnnotation) >>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>> >>> entrez.ids = c('6335') >>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>> >>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>> >>> gene.list = cdsBy(txdb19, by="gene") >>> vsd.list = gene.list[entrez.ids] >>> cds.list = cdsBy(txdb19,by="tx") >>> >>> eg = entrez.ids[1] >>> >>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>> eg.snps = snps[snp.idx] >>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>> variant.alleles = >>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"") [[1]]) >>> >>> >>> aa = >>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=vari ant.alleles) >>> >>> ##### >>> >>> Then if I query the predictCoding results in aa in an interactive >>> session I get the following (see inline comments for what I think >>> should be happening, but I must be misinterpreting what queryID means) >>> >>> The docs for predictCoding() contain a small typo >>> (s/queryHits/queryID), but otherwise seem clear? >>> >>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>> >>> ?queryHits? The ?queryHits? column provides a map back to the >>> variants in the original ?query?. If the ?query? was a ?VCF? >>> object this index corresponds to the row in the ?GRanges? in >>> the ?rowData? slot. If ?query? was an expanded ?GRanges?, >>> ?RangedData? or ?RangesList? the index corresponds to the row >>> in the expanded object. >>> >>> ##### >>> >>>> aa[1,] >>> DataFrame with 1 row and 9 columns >>> queryID consequence refSeq varSeq refAA >>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>> 1 1 nonsynonymous CTC ATC L >>> varAA txID geneID cdsID >>> <aastringset> <character> <factor> <integer> >>> 1 I 10921 6335 33668 >>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>> '10921' and cds '33668'. >>>> #If I look at the first query SNP I get this: >>>> eg.snps.exp[aa[1,'queryID'],] >>> GRanges with 1 range and 2 elementMetadata values: >>> seqnames ranges strand | RefSNP_id >>> alleles_as_ambig >>> <rle> <iranges> <rle> | <character> <character> >>> [1] chr2 [167055370, 167055370] * | >>> 111558968 R >>> --- >>> seqlengths: >>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX >>> chrY chrM >>> NA NA NA NA NA NA ... NA NA NA >>> NA NA NA >>>> #So SNP 1 is at 167055370 on chr2 >>>> #But if I check tx '10921' I see that the cds overlapping 167055370 >>>> is actually '33651' >>>> #And cds '33668' is at the other end of the tx: >>>> cds.list[[aa[1,'txID']]] >>> GRanges with 26 ranges and 3 elementMetadata values: >>> seqnames ranges strand | cds_id cds_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr2 [167168009, 167168266] - | 33668 <na> >>> [2] chr2 [167163466, 167163584] - | 33667 <na> >>> [3] chr2 [167163020, 167163109] - | 33666 <na> >>> [4] chr2 [167162302, 167162430] - | 33647 <na> >>> [5] chr2 [167160748, 167160839] - | 33646 <na> >>> [6] chr2 [167159600, 167159812] - | 33645 <na> >>> [7] chr2 [167151109, 167151172] - | 33644 <na> >>> [8] chr2 [167149741, 167149882] - | 33643 <na> >>> [9] chr2 [167144947, 167145153] - | 33642 <na> >>> ... ... ... ... ... ... ... >>> [18] chr2 [167099012, 167099166] - | 33659 <na> >>> [19] chr2 [167094604, 167094777] - | 33658 <na> >>> [20] chr2 [167089850, 167089972] - | 33657 <na> >>> [21] chr2 [167085201, 167085482] - | 33656 <na> >>> [22] chr2 [167084180, 167084233] - | 33655 <na> >>> [23] chr2 [167083077, 167083214] - | 33654 <na> >>> [24] chr2 [167060870, 167060974] - | 33653 <na> >>> [25] chr2 [167060465, 167060735] - | 33652 <na> >>> [26] chr2 [167055182, 167056374] - | 33651 <na> >>> exon_rank >>> <integer> >>> [1] 2 >>> [2] 3 >>> [3] 4 >>> [4] 5 >>> [5] 6 >>> [6] 7 >>> [7] 8 >>> [8] 9 >>> [9] 10 >>> ... ... >>> [18] 19 >>> [19] 20 >>> [20] 21 >>> [21] 22 >>> [22] 23 >>> [23] 24 >>> [24] 25 >>> [25] 26 >>> [26] 27 >>> --- >>> seqlengths: >>> chr1 chr2 ... >>> chr18_gl000207_random >>> 249250621 243199373 ... >>> 4262 >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

SNP ctc SNP ctc • 1.4k views

ADD COMMENT • link updated 12.2 years ago by Thomas Girke ★ 1.7k • written 12.2 years ago by Valerie Obenchain ★ 6.8k

0

Entering edit mode

Alex Gutteridge ▴ 650

@alex-gutteridge-2935

Last seen 9.6 years ago

United States

Thanks Valerie - much appreciated! On 01.03.2012 21:30, Valerie Obenchain wrote: > A 'txLoc' column has been added to the output of predictCoding. > Available in devel version 1.1.57. > > Valerie > > > On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >> Good suggestion. Yes, predictCoding is does this internally. I'll >> post back here when this has been added. >> >> Valerie >> >> >> >> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >>> Hi Valerie, >>> >>> Thanks everything works great now. One small feature request - >>> would it be hard to output the protein sequence position of the >>> coding SNPs? At the moment once I've run predictCoding I'm >>> re-extracting the cds and working out the position of each coding SNP >>> so I can see where in the protein sequence it is, but it seems like >>> this is probably just replicating what predictCoding must be doing >>> internally anyway? >>> >>> Alex Gutteridge >> >> >> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >>> Hi Alex, >>> >>> Thanks for the bug report. The cdsID was taken from an overlap >>> between the query and GRangesList of cds by transcripts. This gave >>> the correct transcript number but (incorrectly) took the first cds >>> number in the list by default. Now fixed in devel 1.1.55. >>> >>> I've also updated the man page. >>> >>> Valerie >>> >>> >>> >>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>>> Hi Alex, >>>>> >>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>>> >>>> [...] >>>> >>>>>> But the predictCoding call gives this error: >>>>>> >>>>>> Error in .setSeqNames(x, value) : >>>>>> The replacement value for isActiveSeq must be a logical vector, >>>>>> with >>>>>> names that match the seqlevels of the object >>>>> >>>>> The error message doesn't help much but I think the pb is that >>>>> you >>>>> didn't rename chMT properly. Try to do this: >>>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>>> >>>>> before you start the for(eg in entrez.ids){..} loop again. >>>>> >>>>> Cheers, >>>>> H. >>>> >>>> Thanks Herv? that nailed it. I'm having some difficulty joining up >>>> the output of predictCoding() with the query SNPs though. If someone >>>> could point out where the disconnect in my thinking is I would >>>> appreciate it! >>>> >>>> Here's my (now edited down) script: >>>> >>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>> library(VariantAnnotation) >>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> >>>> entrez.ids = c('6335') >>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>>> >>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>> gene.list = cdsBy(txdb19, by="gene") >>>> vsd.list = gene.list[entrez.ids] >>>> cds.list = cdsBy(txdb19,by="tx") >>>> >>>> eg = entrez.ids[1] >>>> >>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>>> eg.snps = snps[snp.idx] >>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>>> variant.alleles = >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"" )[[1]]) >>>> >>>> >>>> aa = >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=var iant.alleles) >>>> >>>> ##### >>>> >>>> Then if I query the predictCoding results in aa in an interactive >>>> session I get the following (see inline comments for what I think >>>> should be happening, but I must be misinterpreting what queryID >>>> means) >>>> >>>> The docs for predictCoding() contain a small typo >>>> (s/queryHits/queryID), but otherwise seem clear? >>>> >>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>>> >>>> ?queryHits? The ?queryHits? column provides a map back to the >>>> variants in the original ?query?. If the ?query? was a >>>> ?VCF? >>>> object this index corresponds to the row in the >>>> ?GRanges? in >>>> the ?rowData? slot. If ?query? was an expanded >>>> ?GRanges?, >>>> ?RangedData? or ?RangesList? the index corresponds to >>>> the row >>>> in the expanded object. >>>> >>>> ##### >>>> >>>>> aa[1,] >>>> DataFrame with 1 row and 9 columns >>>> queryID consequence refSeq varSeq >>>> refAA >>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>>> 1 1 nonsynonymous CTC ATC >>>> L >>>> varAA txID geneID cdsID >>>> <aastringset> <character> <factor> <integer> >>>> 1 I 10921 6335 33668 >>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>>> '10921' and cds '33668'. >>>>> #If I look at the first query SNP I get this: >>>>> eg.snps.exp[aa[1,'queryID'],] >>>> GRanges with 1 range and 2 elementMetadata values: >>>> seqnames ranges strand | RefSNP_id >>>> alleles_as_ambig >>>> <rle> <iranges> <rle> | <character> <character> >>>> [1] chr2 [167055370, 167055370] * | 111558968 >>>> R >>>> --- >>>> seqlengths: >>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX >>>> chrY chrM >>>> NA NA NA NA NA NA ... NA NA NA NA >>>> NA NA >>>>> #So SNP 1 is at 167055370 on chr2 >>>>> #But if I check tx '10921' I see that the cds overlapping >>>>> 167055370 is actually '33651' >>>>> #And cds '33668' is at the other end of the tx: >>>>> cds.list[[aa[1,'txID']]] >>>> GRanges with 26 ranges and 3 elementMetadata values: >>>> seqnames ranges strand | cds_id >>>> cds_name >>>> <rle> <iranges> <rle> | <integer> <character> >>>> [1] chr2 [167168009, 167168266] - | 33668 <na> >>>> [2] chr2 [167163466, 167163584] - | 33667 <na> >>>> [3] chr2 [167163020, 167163109] - | 33666 <na> >>>> [4] chr2 [167162302, 167162430] - | 33647 <na> >>>> [5] chr2 [167160748, 167160839] - | 33646 <na> >>>> [6] chr2 [167159600, 167159812] - | 33645 <na> >>>> [7] chr2 [167151109, 167151172] - | 33644 <na> >>>> [8] chr2 [167149741, 167149882] - | 33643 <na> >>>> [9] chr2 [167144947, 167145153] - | 33642 <na> >>>> ... ... ... ... ... ... >>>> ... >>>> [18] chr2 [167099012, 167099166] - | 33659 <na> >>>> [19] chr2 [167094604, 167094777] - | 33658 <na> >>>> [20] chr2 [167089850, 167089972] - | 33657 <na> >>>> [21] chr2 [167085201, 167085482] - | 33656 <na> >>>> [22] chr2 [167084180, 167084233] - | 33655 <na> >>>> [23] chr2 [167083077, 167083214] - | 33654 <na> >>>> [24] chr2 [167060870, 167060974] - | 33653 <na> >>>> [25] chr2 [167060465, 167060735] - | 33652 <na> >>>> [26] chr2 [167055182, 167056374] - | 33651 <na> >>>> exon_rank >>>> <integer> >>>> [1] 2 >>>> [2] 3 >>>> [3] 4 >>>> [4] 5 >>>> [5] 6 >>>> [6] 7 >>>> [7] 8 >>>> [8] 9 >>>> [9] 10 >>>> ... ... >>>> [18] 19 >>>> [19] 20 >>>> [20] 21 >>>> [21] 22 >>>> [22] 23 >>>> [23] 24 >>>> [24] 25 >>>> [25] 26 >>>> [26] 27 >>>> --- >>>> seqlengths: >>>> chr1 chr2 ... >>>> chr18_gl000207_random >>>> 249250621 243199373 ... >>>> 4262 >>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor -- Alex Gutteridge

ADD COMMENT • link 12.2 years ago Alex Gutteridge ▴ 650

0

Entering edit mode

Slight change to this - I'm now returning the following new columns, \item{\code{seqTxLoc}}{ Location in transcript-based coordinates of the first nucleotide in the codon sequence to be translated. This position corresponds to the first nucleotide in both the \code{refSeq} and \code{varSeq} columns. } \item{\code{varTxLoc}}{ Location in transcript-based coordinates of the first nucleotide in the variant. This value will be the same as \code{seqTxLoc} when the variant starts exactly at the beginning of a codon. } \item{\code{varCdsLoc}}{ Location in cds-based coordinates of the first nucleotide in the variant. This position is relative to the start of the cds region defined in the \code{subject} annotation. } \item{\code{subjStrand}}{ The strand of the \code{subject} the variant matched. \code{predictCoding} determines which variants fall in a coding region by finding the overlaps between the \code{query} and \code{subject}. The \code{query} may be un-stranded \sQuote{*} but the \code{subject} annotation will have a strand. } You are interested in 'protein coordinates'. Does 'varCdsLoc' described above meet the need or are you looking for the actual codon number in the coding sequence? I am interested in hearing more about what you are doing with the protein coordinates, how you are using them. It would help us better design future functions. Thanks, Valerie On 03/02/2012 01:11 AM, Alex Gutteridge wrote: > Thanks Valerie - much appreciated! > > On 01.03.2012 21:30, Valerie Obenchain wrote: >> A 'txLoc' column has been added to the output of predictCoding. >> Available in devel version 1.1.57. >> >> Valerie >> >> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >>> Good suggestion. Yes, predictCoding is does this internally. I'll >>> post back here when this has been added. >>> >>> Valerie >>> >>> >>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >>>> Hi Valerie, >>>> >>>> Thanks everything works great now. One small feature request - >>>> would it be hard to output the protein sequence position of the >>>> coding SNPs? At the moment once I've run predictCoding I'm >>>> re-extracting the cds and working out the position of each coding >>>> SNP so I can see where in the protein sequence it is, but it seems >>>> like this is probably just replicating what predictCoding must be >>>> doing internally anyway? >>>> >>>> Alex Gutteridge >>> >>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >>>> Hi Alex, >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap >>>> between the query and GRangesList of cds by transcripts. This gave >>>> the correct transcript number but (incorrectly) took the first cds >>>> number in the list by default. Now fixed in devel 1.1.55. >>>> >>>> I've also updated the man page. >>>> >>>> Valerie >>>> >>>> >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>>>> Hi Alex, >>>>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>>>> >>>>> [...] >>>>> >>>>>>> But the predictCoding call gives this error: >>>>>>> >>>>>>> Error in .setSeqNames(x, value) : >>>>>>> The replacement value for isActiveSeq must be a logical vector, >>>>>>> with >>>>>>> names that match the seqlevels of the object >>>>>> >>>>>> The error message doesn't help much but I think the pb is that you >>>>>> didn't rename chMT properly. Try to do this: >>>>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again. >>>>>> >>>>>> Cheers, >>>>>> H. >>>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty joining up >>>>> the output of predictCoding() with the query SNPs though. If >>>>> someone could point out where the disconnect in my thinking is I >>>>> would appreciate it! >>>>> >>>>> Here's my (now edited down) script: >>>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>>> library(VariantAnnotation) >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>> >>>>> entrez.ids = c('6335') >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>>> >>>>> gene.list = cdsBy(txdb19, by="gene") >>>>> vsd.list = gene.list[entrez.ids] >>>>> cds.list = cdsBy(txdb19,by="tx") >>>>> >>>>> eg = entrez.ids[1] >>>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>>>> eg.snps = snps[snp.idx] >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>>>> variant.alleles = >>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse="")," ")[[1]]) >>>>> >>>>> >>>>> >>>>> aa = >>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=va riant.alleles) >>>>> >>>>> ##### >>>>> >>>>> Then if I query the predictCoding results in aa in an interactive >>>>> session I get the following (see inline comments for what I think >>>>> should be happening, but I must be misinterpreting what queryID >>>>> means) >>>>> >>>>> The docs for predictCoding() contain a small typo >>>>> (s/queryHits/queryID), but otherwise seem clear? >>>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>>>> >>>>> ?queryHits? The ?queryHits? column provides a map back to the >>>>> variants in the original ?query?. If the ?query? was a >>>>> ?VCF? >>>>> object this index corresponds to the row in the >>>>> ?GRanges? in >>>>> the ?rowData? slot. If ?query? was an expanded ?GRanges?, >>>>> ?RangedData? or ?RangesList? the index corresponds to >>>>> the row >>>>> in the expanded object. >>>>> >>>>> ##### >>>>> >>>>>> aa[1,] >>>>> DataFrame with 1 row and 9 columns >>>>> queryID consequence refSeq varSeq refAA >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>>>> 1 1 nonsynonymous CTC ATC L >>>>> varAA txID geneID cdsID >>>>> <aastringset> <character> <factor> <integer> >>>>> 1 I 10921 6335 33668 >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>>>> '10921' and cds '33668'. >>>>>> #If I look at the first query SNP I get this: >>>>>> eg.snps.exp[aa[1,'queryID'],] >>>>> GRanges with 1 range and 2 elementMetadata values: >>>>> seqnames ranges strand | RefSNP_id >>>>> alleles_as_ambig >>>>> <rle> <iranges> <rle> | <character> <character> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 >>>>> R >>>>> --- >>>>> seqlengths: >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX >>>>> chrY chrM >>>>> NA NA NA NA NA NA ... NA NA NA NA >>>>> NA NA >>>>>> #So SNP 1 is at 167055370 on chr2 >>>>>> #But if I check tx '10921' I see that the cds overlapping >>>>>> 167055370 is actually '33651' >>>>>> #And cds '33668' is at the other end of the tx: >>>>>> cds.list[[aa[1,'txID']]] >>>>> GRanges with 26 ranges and 3 elementMetadata values: >>>>> seqnames ranges strand | cds_id >>>>> cds_name >>>>> <rle> <iranges> <rle> | <integer> <character> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> >>>>> ... ... ... ... ... ... >>>>> ... >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> >>>>> exon_rank >>>>> <integer> >>>>> [1] 2 >>>>> [2] 3 >>>>> [3] 4 >>>>> [4] 5 >>>>> [5] 6 >>>>> [6] 7 >>>>> [7] 8 >>>>> [8] 9 >>>>> [9] 10 >>>>> ... ... >>>>> [18] 19 >>>>> [19] 20 >>>>> [20] 21 >>>>> [21] 22 >>>>> [22] 23 >>>>> [23] 24 >>>>> [24] 25 >>>>> [25] 26 >>>>> [26] 27 >>>>> --- >>>>> seqlengths: >>>>> chr1 chr2 ... >>>>> chr18_gl000207_random >>>>> 249250621 243199373 ... >>>>> 4262 >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 12.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 5 weeks ago

United States

Hi Valerie, Based on my experience the position in the complete protein (rather than CDS) sequence would be the most important piece of information to record here. For instance, if a SNP changes the 88th triplet CAT (His) in the ORF of a transcript to CAA (Gln) then you want to record it like this: His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a protein structure or inject it into the corresponding protein sequence without any additional remapping or coding. Another feature to consider are SNPs affecting splice sites (commonly first and last two nucleotides of an intron). If possible it would be also very useful to support users who want to work with their custom genomes and annotations files provided as FASTA and GFF/GTF files, respectively. Best, Thomas On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: > Slight change to this - > > I'm now returning the following new columns, > > \item{\code{seqTxLoc}}{ > Location in transcript-based coordinates of the first nucleotide in > the codon sequence to be translated. This position corresponds to the > first nucleotide in both the \code{refSeq} and \code{varSeq} columns. > } > \item{\code{varTxLoc}}{ > Location in transcript-based coordinates of the first nucleotide in > the variant. This value will be the same as \code{seqTxLoc} when the > variant starts exactly at the beginning of a codon. > } > \item{\code{varCdsLoc}}{ > Location in cds-based coordinates of the first nucleotide in > the variant. This position is relative to the start of the cds region > defined in the \code{subject} annotation. > } > \item{\code{subjStrand}}{ > The strand of the \code{subject} the variant matched. > \code{predictCoding} > determines which variants fall in a coding region by finding the > overlaps > between the \code{query} and \code{subject}. The \code{query} may be > un-stranded \sQuote{*} but the \code{subject} annotation will > have a strand. > } > > > You are interested in 'protein coordinates'. Does 'varCdsLoc' described > above meet the need or are you looking for the actual codon number in > the coding sequence? I am interested in hearing more about what you are > doing with the protein coordinates, how you are using them. It would > help us better design future functions. > > Thanks, > Valerie > > On 03/02/2012 01:11 AM, Alex Gutteridge wrote: > > Thanks Valerie - much appreciated! > > > > On 01.03.2012 21:30, Valerie Obenchain wrote: > >> A 'txLoc' column has been added to the output of predictCoding. > >> Available in devel version 1.1.57. > >> > >> Valerie > >> > >> > >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: > >>> Good suggestion. Yes, predictCoding is does this internally. I'll > >>> post back here when this has been added. > >>> > >>> Valerie > >>> > >>> > >>> > >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: > >>>> Hi Valerie, > >>>> > >>>> Thanks everything works great now. One small feature request - > >>>> would it be hard to output the protein sequence position of the > >>>> coding SNPs? At the moment once I've run predictCoding I'm > >>>> re-extracting the cds and working out the position of each coding > >>>> SNP so I can see where in the protein sequence it is, but it seems > >>>> like this is probably just replicating what predictCoding must be > >>>> doing internally anyway? > >>>> > >>>> Alex Gutteridge > >>> > >>> > >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: > >>>> Hi Alex, > >>>> > >>>> Thanks for the bug report. The cdsID was taken from an overlap > >>>> between the query and GRangesList of cds by transcripts. This gave > >>>> the correct transcript number but (incorrectly) took the first cds > >>>> number in the list by default. Now fixed in devel 1.1.55. > >>>> > >>>> I've also updated the man page. > >>>> > >>>> Valerie > >>>> > >>>> > >>>> > >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: > >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: > >>>>>> Hi Alex, > >>>>>> > >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: > >>>>> > >>>>> [...] > >>>>> > >>>>>>> But the predictCoding call gives this error: > >>>>>>> > >>>>>>> Error in .setSeqNames(x, value) : > >>>>>>> The replacement value for isActiveSeq must be a logical vector, > >>>>>>> with > >>>>>>> names that match the seqlevels of the object > >>>>>> > >>>>>> The error message doesn't help much but I think the pb is that you > >>>>>> didn't rename chMT properly. Try to do this: > >>>>>> > >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>>>> > >>>>>> before you start the for(eg in entrez.ids){..} loop again. > >>>>>> > >>>>>> Cheers, > >>>>>> H. > >>>>> > >>>>> Thanks Herv? that nailed it. I'm having some difficulty joining up > >>>>> the output of predictCoding() with the query SNPs though. If > >>>>> someone could point out where the disconnect in my thinking is I > >>>>> would appreciate it! > >>>>> > >>>>> Here's my (now edited down) script: > >>>>> > >>>>> library(BSgenome.Hsapiens.UCSC.hg19) > >>>>> library(VariantAnnotation) > >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) > >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >>>>> > >>>>> entrez.ids = c('6335') > >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene > >>>>> > >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) > >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) > >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>>> > >>>>> gene.list = cdsBy(txdb19, by="gene") > >>>>> vsd.list = gene.list[entrez.ids] > >>>>> cds.list = cdsBy(txdb19,by="tx") > >>>>> > >>>>> eg = entrez.ids[1] > >>>>> > >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) > >>>>> eg.snps = snps[snp.idx] > >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] > >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) > >>>>> variant.alleles = > >>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse="") ,"")[[1]]) > >>>>> > >>>>> > >>>>> > >>>>> aa = > >>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele= variant.alleles) > >>>>> > >>>>> ##### > >>>>> > >>>>> Then if I query the predictCoding results in aa in an interactive > >>>>> session I get the following (see inline comments for what I think > >>>>> should be happening, but I must be misinterpreting what queryID > >>>>> means) > >>>>> > >>>>> The docs for predictCoding() contain a small typo > >>>>> (s/queryHits/queryID), but otherwise seem clear? > >>>>> > >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, > >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. > >>>>> > >>>>> ?queryHits? The ?queryHits? column provides a map back to the > >>>>> variants in the original ?query?. If the ?query? was a > >>>>> ?VCF? > >>>>> object this index corresponds to the row in the > >>>>> ?GRanges? in > >>>>> the ?rowData? slot. If ?query? was an expanded ?GRanges?, > >>>>> ?RangedData? or ?RangesList? the index corresponds to > >>>>> the row > >>>>> in the expanded object. > >>>>> > >>>>> ##### > >>>>> > >>>>>> aa[1,] > >>>>> DataFrame with 1 row and 9 columns > >>>>> queryID consequence refSeq varSeq refAA > >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> > >>>>> 1 1 nonsynonymous CTC ATC L > >>>>> varAA txID geneID cdsID > >>>>> <aastringset> <character> <factor> <integer> > >>>>> 1 I 10921 6335 33668 > >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx > >>>>>> '10921' and cds '33668'. > >>>>>> #If I look at the first query SNP I get this: > >>>>>> eg.snps.exp[aa[1,'queryID'],] > >>>>> GRanges with 1 range and 2 elementMetadata values: > >>>>> seqnames ranges strand | RefSNP_id > >>>>> alleles_as_ambig > >>>>> <rle> <iranges> <rle> | <character> <character> > >>>>> [1] chr2 [167055370, 167055370] * | 111558968 > >>>>> R > >>>>> --- > >>>>> seqlengths: > >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX > >>>>> chrY chrM > >>>>> NA NA NA NA NA NA ... NA NA NA NA > >>>>> NA NA > >>>>>> #So SNP 1 is at 167055370 on chr2 > >>>>>> #But if I check tx '10921' I see that the cds overlapping > >>>>>> 167055370 is actually '33651' > >>>>>> #And cds '33668' is at the other end of the tx: > >>>>>> cds.list[[aa[1,'txID']]] > >>>>> GRanges with 26 ranges and 3 elementMetadata values: > >>>>> seqnames ranges strand | cds_id > >>>>> cds_name > >>>>> <rle> <iranges> <rle> | <integer> <character> > >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> > >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> > >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> > >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> > >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> > >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> > >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> > >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> > >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> > >>>>> ... ... ... ... ... ... > >>>>> ... > >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> > >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> > >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> > >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> > >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> > >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> > >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> > >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> > >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> > >>>>> exon_rank > >>>>> <integer> > >>>>> [1] 2 > >>>>> [2] 3 > >>>>> [3] 4 > >>>>> [4] 5 > >>>>> [5] 6 > >>>>> [6] 7 > >>>>> [7] 8 > >>>>> [8] 9 > >>>>> [9] 10 > >>>>> ... ... > >>>>> [18] 19 > >>>>> [19] 20 > >>>>> [20] 21 > >>>>> [21] 22 > >>>>> [22] 23 > >>>>> [23] 24 > >>>>> [24] 25 > >>>>> [25] 26 > >>>>> [26] 27 > >>>>> --- > >>>>> seqlengths: > >>>>> chr1 chr2 ... > >>>>> chr18_gl000207_random > >>>>> 249250621 243199373 ... > >>>>> 4262 > >>>>> > >>>>> > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.2 years ago Thomas Girke ★ 1.7k

0

Entering edit mode

Hi Thomas, Thanks very much for the ideas and specific examples. These are very helpful. Take care, Valerie On 03/02/12 16:58, Thomas Girke wrote: > Hi Valerie, > > Based on my experience the position in the complete protein (rather than > CDS) sequence would be the most important piece of information to record > here. For instance, if a SNP changes the 88th triplet CAT (His) in the > ORF of a transcript to CAA (Gln) then you want to record it like this: > His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a > protein structure or inject it into the corresponding protein sequence > without any additional remapping or coding. > > Another feature to consider are SNPs affecting splice sites (commonly > first and last two nucleotides of an intron). > > If possible it would be also very useful to support users who want to > work with their custom genomes and annotations files provided as FASTA > and GFF/GTF files, respectively. > > Best, > > Thomas > > > On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: >> Slight change to this - >> >> I'm now returning the following new columns, >> >> \item{\code{seqTxLoc}}{ >> Location in transcript-based coordinates of the first nucleotide in >> the codon sequence to be translated. This position corresponds to the >> first nucleotide in both the \code{refSeq} and \code{varSeq} columns. >> } >> \item{\code{varTxLoc}}{ >> Location in transcript-based coordinates of the first nucleotide in >> the variant. This value will be the same as \code{seqTxLoc} when the >> variant starts exactly at the beginning of a codon. >> } >> \item{\code{varCdsLoc}}{ >> Location in cds-based coordinates of the first nucleotide in >> the variant. This position is relative to the start of the cds region >> defined in the \code{subject} annotation. >> } >> \item{\code{subjStrand}}{ >> The strand of the \code{subject} the variant matched. >> \code{predictCoding} >> determines which variants fall in a coding region by finding the >> overlaps >> between the \code{query} and \code{subject}. The \code{query} may be >> un-stranded \sQuote{*} but the \code{subject} annotation will >> have a strand. >> } >> >> >> You are interested in 'protein coordinates'. Does 'varCdsLoc' described >> above meet the need or are you looking for the actual codon number in >> the coding sequence? I am interested in hearing more about what you are >> doing with the protein coordinates, how you are using them. It would >> help us better design future functions. >> >> Thanks, >> Valerie >> >> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: >>> Thanks Valerie - much appreciated! >>> >>> On 01.03.2012 21:30, Valerie Obenchain wrote: >>>> A 'txLoc' column has been added to the output of predictCoding. >>>> Available in devel version 1.1.57. >>>> >>>> Valerie >>>> >>>> >>>> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >>>>> Good suggestion. Yes, predictCoding is does this internally. I'll >>>>> post back here when this has been added. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >>>>>> Hi Valerie, >>>>>> >>>>>> Thanks everything works great now. One small feature request - >>>>>> would it be hard to output the protein sequence position of the >>>>>> coding SNPs? At the moment once I've run predictCoding I'm >>>>>> re-extracting the cds and working out the position of each coding >>>>>> SNP so I can see where in the protein sequence it is, but it seems >>>>>> like this is probably just replicating what predictCoding must be >>>>>> doing internally anyway? >>>>>> >>>>>> Alex Gutteridge >>>>> >>>>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >>>>>> Hi Alex, >>>>>> >>>>>> Thanks for the bug report. The cdsID was taken from an overlap >>>>>> between the query and GRangesList of cds by transcripts. This gave >>>>>> the correct transcript number but (incorrectly) took the first cds >>>>>> number in the list by default. Now fixed in devel 1.1.55. >>>>>> >>>>>> I've also updated the man page. >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>> >>>>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>>>>>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>>>>>> Hi Alex, >>>>>>>> >>>>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>>>>>> [...] >>>>>>> >>>>>>>>> But the predictCoding call gives this error: >>>>>>>>> >>>>>>>>> Error in .setSeqNames(x, value) : >>>>>>>>> The replacement value for isActiveSeq must be a logical vector, >>>>>>>>> with >>>>>>>>> names that match the seqlevels of the object >>>>>>>> The error message doesn't help much but I think the pb is that you >>>>>>>> didn't rename chMT properly. Try to do this: >>>>>>>> >>>>>>>> seqlevels(snps)<- gsub("chrMT", "chrM", seqlevels(snps)) >>>>>>>> >>>>>>>> before you start the for(eg in entrez.ids){..} loop again. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> H. >>>>>>> Thanks Herv? that nailed it. I'm having some difficulty joining up >>>>>>> the output of predictCoding() with the query SNPs though. If >>>>>>> someone could point out where the disconnect in my thinking is I >>>>>>> would appreciate it! >>>>>>> >>>>>>> Here's my (now edited down) script: >>>>>>> >>>>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>>>>> library(VariantAnnotation) >>>>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>>>> >>>>>>> entrez.ids = c('6335') >>>>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>>>>>> >>>>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>>>>>> seqlevels(snps)<- gsub("ch", "chr", seqlevels(snps)) >>>>>>> seqlevels(snps)<- gsub("chrMT", "chrM", seqlevels(snps)) >>>>>>> >>>>>>> gene.list = cdsBy(txdb19, by="gene") >>>>>>> vsd.list = gene.list[entrez.ids] >>>>>>> cds.list = cdsBy(txdb19,by="tx") >>>>>>> >>>>>>> eg = entrez.ids[1] >>>>>>> >>>>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>>>>>> eg.snps = snps[snp.idx] >>>>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>>>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>>>>>> variant.alleles = >>>>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse="") ,"")[[1]]) >>>>>>> >>>>>>> >>>>>>> >>>>>>> aa = >>>>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele= variant.alleles) >>>>>>> >>>>>>> ##### >>>>>>> >>>>>>> Then if I query the predictCoding results in aa in an interactive >>>>>>> session I get the following (see inline comments for what I think >>>>>>> should be happening, but I must be misinterpreting what queryID >>>>>>> means) >>>>>>> >>>>>>> The docs for predictCoding() contain a small typo >>>>>>> (s/queryHits/queryID), but otherwise seem clear? >>>>>>> >>>>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>>>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>>>>>> >>>>>>> ?queryHits? The ?queryHits? column provides a map back to the >>>>>>> variants in the original ?query?. If the ?query? was a >>>>>>> ?VCF? >>>>>>> object this index corresponds to the row in the >>>>>>> ?GRanges? in >>>>>>> the ?rowData? slot. If ?query? was an expanded ?GRanges?, >>>>>>> ?RangedData? or ?RangesList? the index corresponds to >>>>>>> the row >>>>>>> in the expanded object. >>>>>>> >>>>>>> ##### >>>>>>> >>>>>>>> aa[1,] >>>>>>> DataFrame with 1 row and 9 columns >>>>>>> queryID consequence refSeq varSeq refAA >>>>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>>>>>> 1 1 nonsynonymous CTC ATC L >>>>>>> varAA txID geneID cdsID >>>>>>> <aastringset> <character> <factor> <integer> >>>>>>> 1 I 10921 6335 33668 >>>>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>>>>>> '10921' and cds '33668'. >>>>>>>> #If I look at the first query SNP I get this: >>>>>>>> eg.snps.exp[aa[1,'queryID'],] >>>>>>> GRanges with 1 range and 2 elementMetadata values: >>>>>>> seqnames ranges strand | RefSNP_id >>>>>>> alleles_as_ambig >>>>>>> <rle> <iranges> <rle> |<character> <character> >>>>>>> [1] chr2 [167055370, 167055370] * | 111558968 >>>>>>> R >>>>>>> --- >>>>>>> seqlengths: >>>>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX >>>>>>> chrY chrM >>>>>>> NA NA NA NA NA NA ... NA NA NA NA >>>>>>> NA NA >>>>>>>> #So SNP 1 is at 167055370 on chr2 >>>>>>>> #But if I check tx '10921' I see that the cds overlapping >>>>>>>> 167055370 is actually '33651' >>>>>>>> #And cds '33668' is at the other end of the tx: >>>>>>>> cds.list[[aa[1,'txID']]] >>>>>>> GRanges with 26 ranges and 3 elementMetadata values: >>>>>>> seqnames ranges strand | cds_id >>>>>>> cds_name >>>>>>> <rle> <iranges> <rle> |<integer> <character> >>>>>>> [1] chr2 [167168009, 167168266] - | 33668<na> >>>>>>> [2] chr2 [167163466, 167163584] - | 33667<na> >>>>>>> [3] chr2 [167163020, 167163109] - | 33666<na> >>>>>>> [4] chr2 [167162302, 167162430] - | 33647<na> >>>>>>> [5] chr2 [167160748, 167160839] - | 33646<na> >>>>>>> [6] chr2 [167159600, 167159812] - | 33645<na> >>>>>>> [7] chr2 [167151109, 167151172] - | 33644<na> >>>>>>> [8] chr2 [167149741, 167149882] - | 33643<na> >>>>>>> [9] chr2 [167144947, 167145153] - | 33642<na> >>>>>>> ... ... ... ... ... ... >>>>>>> ... >>>>>>> [18] chr2 [167099012, 167099166] - | 33659<na> >>>>>>> [19] chr2 [167094604, 167094777] - | 33658<na> >>>>>>> [20] chr2 [167089850, 167089972] - | 33657<na> >>>>>>> [21] chr2 [167085201, 167085482] - | 33656<na> >>>>>>> [22] chr2 [167084180, 167084233] - | 33655<na> >>>>>>> [23] chr2 [167083077, 167083214] - | 33654<na> >>>>>>> [24] chr2 [167060870, 167060974] - | 33653<na> >>>>>>> [25] chr2 [167060465, 167060735] - | 33652<na> >>>>>>> [26] chr2 [167055182, 167056374] - | 33651<na> >>>>>>> exon_rank >>>>>>> <integer> >>>>>>> [1] 2 >>>>>>> [2] 3 >>>>>>> [3] 4 >>>>>>> [4] 5 >>>>>>> [5] 6 >>>>>>> [6] 7 >>>>>>> [7] 8 >>>>>>> [8] 9 >>>>>>> [9] 10 >>>>>>> ... ... >>>>>>> [18] 19 >>>>>>> [19] 20 >>>>>>> [20] 21 >>>>>>> [21] 22 >>>>>>> [22] 23 >>>>>>> [23] 24 >>>>>>> [24] 25 >>>>>>> [25] 26 >>>>>>> [26] 27 >>>>>>> --- >>>>>>> seqlengths: >>>>>>> chr1 chr2 ... >>>>>>> chr18_gl000207_random >>>>>>> 249250621 243199373 ... >>>>>>> 4262 >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use case is mapping coding SNPs to PDB protein structures, so something that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need. Alex Gutteridge. On 03.03.2012 00:58, Thomas Girke wrote: > Hi Valerie, > > Based on my experience the position in the complete protein (rather > than > CDS) sequence would be the most important piece of information to > record > here. For instance, if a SNP changes the 88th triplet CAT (His) in > the > ORF of a transcript to CAA (Gln) then you want to record it like > this: > His88 (CAT) -> Gln88 (CAA). This way the user can map this change to > a > protein structure or inject it into the corresponding protein > sequence > without any additional remapping or coding. > > Another feature to consider are SNPs affecting splice sites (commonly > first and last two nucleotides of an intron). > > If possible it would be also very useful to support users who want to > work with their custom genomes and annotations files provided as > FASTA > and GFF/GTF files, respectively. > > Best, > > Thomas > > > On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: >> Slight change to this - >> >> I'm now returning the following new columns, >> >> \item{\code{seqTxLoc}}{ >> Location in transcript-based coordinates of the first >> nucleotide in >> the codon sequence to be translated. This position >> corresponds to the >> first nucleotide in both the \code{refSeq} and \code{varSeq} >> columns. >> } >> \item{\code{varTxLoc}}{ >> Location in transcript-based coordinates of the first >> nucleotide in >> the variant. This value will be the same as \code{seqTxLoc} >> when the >> variant starts exactly at the beginning of a codon. >> } >> \item{\code{varCdsLoc}}{ >> Location in cds-based coordinates of the first nucleotide in >> the variant. This position is relative to the start of the >> cds region >> defined in the \code{subject} annotation. >> } >> \item{\code{subjStrand}}{ >> The strand of the \code{subject} the variant matched. >> \code{predictCoding} >> determines which variants fall in a coding region by finding >> the >> overlaps >> between the \code{query} and \code{subject}. The \code{query} >> may be >> un-stranded \sQuote{*} but the \code{subject} annotation will >> have a strand. >> } >> >> >> You are interested in 'protein coordinates'. Does 'varCdsLoc' >> described >> above meet the need or are you looking for the actual codon number >> in >> the coding sequence? I am interested in hearing more about what you >> are >> doing with the protein coordinates, how you are using them. It would >> help us better design future functions. >> >> Thanks, >> Valerie >> >> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: >> > Thanks Valerie - much appreciated! >> > >> > On 01.03.2012 21:30, Valerie Obenchain wrote: >> >> A 'txLoc' column has been added to the output of predictCoding. >> >> Available in devel version 1.1.57. >> >> >> >> Valerie >> >> >> >> >> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >> >>> Good suggestion. Yes, predictCoding is does this internally. >> I'll >> >>> post back here when this has been added. >> >>> >> >>> Valerie >> >>> >> >>> >> >>> >> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >> >>>> Hi Valerie, >> >>>> >> >>>> Thanks everything works great now. One small feature request - >> >>>> would it be hard to output the protein sequence position of the >> >>>> coding SNPs? At the moment once I've run predictCoding I'm >> >>>> re-extracting the cds and working out the position of each >> coding >> >>>> SNP so I can see where in the protein sequence it is, but it >> seems >> >>>> like this is probably just replicating what predictCoding must >> be >> >>>> doing internally anyway? >> >>>> >> >>>> Alex Gutteridge >> >>> >> >>> >> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >> >>>> Hi Alex, >> >>>> >> >>>> Thanks for the bug report. The cdsID was taken from an overlap >> >>>> between the query and GRangesList of cds by transcripts. This >> gave >> >>>> the correct transcript number but (incorrectly) took the first >> cds >> >>>> number in the list by default. Now fixed in devel 1.1.55. >> >>>> >> >>>> I've also updated the man page. >> >>>> >> >>>> Valerie >> >>>> >> >>>> >> >>>> >> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: >> >>>>>> Hi Alex, >> >>>>>> >> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >> >>>>> >> >>>>> [...] >> >>>>> >> >>>>>>> But the predictCoding call gives this error: >> >>>>>>> >> >>>>>>> Error in .setSeqNames(x, value) : >> >>>>>>> The replacement value for isActiveSeq must be a logical >> vector, >> >>>>>>> with >> >>>>>>> names that match the seqlevels of the object >> >>>>>> >> >>>>>> The error message doesn't help much but I think the pb is >> that you >> >>>>>> didn't rename chMT properly. Try to do this: >> >>>>>> >> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >> >>>>>> >> >>>>>> before you start the for(eg in entrez.ids){..} loop again. >> >>>>>> >> >>>>>> Cheers, >> >>>>>> H. >> >>>>> >> >>>>> Thanks Herv? that nailed it. I'm having some difficulty >> joining up >> >>>>> the output of predictCoding() with the query SNPs though. If >> >>>>> someone could point out where the disconnect in my thinking is >> I >> >>>>> would appreciate it! >> >>>>> >> >>>>> Here's my (now edited down) script: >> >>>>> >> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >> >>>>> library(VariantAnnotation) >> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> >>>>> >> >>>>> entrez.ids = c('6335') >> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >> >>>>> >> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >> >>>>> >> >>>>> gene.list = cdsBy(txdb19, by="gene") >> >>>>> vsd.list = gene.list[entrez.ids] >> >>>>> cds.list = cdsBy(txdb19,by="tx") >> >>>>> >> >>>>> eg = entrez.ids[1] >> >>>>> >> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, >> vsd.list[[eg]]))) >> >>>>> eg.snps = snps[snp.idx] >> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >> >>>>> variant.alleles = >> >>>>> >> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[ [1]]) >> >>>>> >> >>>>> >> >>>>> >> >>>>> aa = >> >>>>> >> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=varia nt.alleles) >> >>>>> >> >>>>> ##### >> >>>>> >> >>>>> Then if I query the predictCoding results in aa in an >> interactive >> >>>>> session I get the following (see inline comments for what I >> think >> >>>>> should be happening, but I must be misinterpreting what >> queryID >> >>>>> means) >> >>>>> >> >>>>> The docs for predictCoding() contain a small typo >> >>>>> (s/queryHits/queryID), but otherwise seem clear? >> >>>>> >> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >> >>>>> >> >>>>> ?queryHits? The ?queryHits? column provides a map back to >> the >> >>>>> variants in the original ?query?. If the ?query? was >> a >> >>>>> ?VCF? >> >>>>> object this index corresponds to the row in the >> >>>>> ?GRanges? in >> >>>>> the ?rowData? slot. If ?query? was an expanded >> ?GRanges?, >> >>>>> ?RangedData? or ?RangesList? the index corresponds >> to >> >>>>> the row >> >>>>> in the expanded object. >> >>>>> >> >>>>> ##### >> >>>>> >> >>>>>> aa[1,] >> >>>>> DataFrame with 1 row and 9 columns >> >>>>> queryID consequence refSeq varSeq >> refAA >> >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >> >>>>> 1 1 nonsynonymous CTC ATC >> L >> >>>>> varAA txID geneID cdsID >> >>>>> <aastringset> <character> <factor> <integer> >> >>>>> 1 I 10921 6335 33668 >> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to >> tx >> >>>>>> '10921' and cds '33668'. >> >>>>>> #If I look at the first query SNP I get this: >> >>>>>> eg.snps.exp[aa[1,'queryID'],] >> >>>>> GRanges with 1 range and 2 elementMetadata values: >> >>>>> seqnames ranges strand | RefSNP_id >> >>>>> alleles_as_ambig >> >>>>> <rle> <iranges> <rle> | <character> <character> >> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 >> >>>>> R >> >>>>> --- >> >>>>> seqlengths: >> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 >> chrX >> >>>>> chrY chrM >> >>>>> NA NA NA NA NA NA ... NA NA NA >> NA >> >>>>> NA NA >> >>>>>> #So SNP 1 is at 167055370 on chr2 >> >>>>>> #But if I check tx '10921' I see that the cds overlapping >> >>>>>> 167055370 is actually '33651' >> >>>>>> #And cds '33668' is at the other end of the tx: >> >>>>>> cds.list[[aa[1,'txID']]] >> >>>>> GRanges with 26 ranges and 3 elementMetadata values: >> >>>>> seqnames ranges strand | cds_id >> >>>>> cds_name >> >>>>> <rle> <iranges> <rle> | <integer> <character> >> >>>>> [1] chr2 [167168009, 167168266] - | 33668 >> <na> >> >>>>> [2] chr2 [167163466, 167163584] - | 33667 >> <na> >> >>>>> [3] chr2 [167163020, 167163109] - | 33666 >> <na> >> >>>>> [4] chr2 [167162302, 167162430] - | 33647 >> <na> >> >>>>> [5] chr2 [167160748, 167160839] - | 33646 >> <na> >> >>>>> [6] chr2 [167159600, 167159812] - | 33645 >> <na> >> >>>>> [7] chr2 [167151109, 167151172] - | 33644 >> <na> >> >>>>> [8] chr2 [167149741, 167149882] - | 33643 >> <na> >> >>>>> [9] chr2 [167144947, 167145153] - | 33642 >> <na> >> >>>>> ... ... ... ... ... ... >> >>>>> ... >> >>>>> [18] chr2 [167099012, 167099166] - | 33659 >> <na> >> >>>>> [19] chr2 [167094604, 167094777] - | 33658 >> <na> >> >>>>> [20] chr2 [167089850, 167089972] - | 33657 >> <na> >> >>>>> [21] chr2 [167085201, 167085482] - | 33656 >> <na> >> >>>>> [22] chr2 [167084180, 167084233] - | 33655 >> <na> >> >>>>> [23] chr2 [167083077, 167083214] - | 33654 >> <na> >> >>>>> [24] chr2 [167060870, 167060974] - | 33653 >> <na> >> >>>>> [25] chr2 [167060465, 167060735] - | 33652 >> <na> >> >>>>> [26] chr2 [167055182, 167056374] - | 33651 >> <na> >> >>>>> exon_rank >> >>>>> <integer> >> >>>>> [1] 2 >> >>>>> [2] 3 >> >>>>> [3] 4 >> >>>>> [4] 5 >> >>>>> [5] 6 >> >>>>> [6] 7 >> >>>>> [7] 8 >> >>>>> [8] 9 >> >>>>> [9] 10 >> >>>>> ... ... >> >>>>> [18] 19 >> >>>>> [19] 20 >> >>>>> [20] 21 >> >>>>> [21] 22 >> >>>>> [22] 23 >> >>>>> [23] 24 >> >>>>> [24] 25 >> >>>>> [25] 26 >> >>>>> [26] 27 >> >>>>> --- >> >>>>> seqlengths: >> >>>>> chr1 chr2 ... >> >>>>> chr18_gl000207_random >> >>>>> 249250621 243199373 ... >> >>>>> 4262 >> >>>>> >> >>>>> >> >>>> >> >>>> _______________________________________________ >> >>>> Bioconductor mailing list >> >>>> Bioconductor at r-project.org >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>>> Search the archives: >> >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor at r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor -- Alex Gutteridge

ADD REPLY • link 12.2 years ago Alex Gutteridge ▴ 650

Login before adding your answer.