Finding coding SNPs with predictCoding

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Thomas, Alex, Marc has been working on a makeTranscriptDbFromGFF() function which is now in GenomicFeatures v 1.9.11 in devel. This addresses the last of Thomas' points below wrt the user providing annotations in GFF or GTF format. The function creates a TxDb which can be given to locateVariants() and predictCoding(). Let us know how it goes. Valerie On 04/05/2012 01:47 PM, Valerie Obenchain wrote: > Hi Thomas, Alex, > > Changes below have been made in release 1.2.4 and devel 1.3.4. > > predictCoding : > > - Now returns a proteinID that is the triplet number in the protein > > - The refAA and varAA are the 1-letter amino acid code but not the > 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a > function that returns the 3-letter code. Is this of interest/use? > > - Over the next dev cycle I will implement methods for annotations > (subject argument) to be gff/gtf. Currently genomes (seqSource > argument) can be a BSgenome or fasta file. > > > locateVariants: > > To add flexibility for finding variants in a particular region the > function now dispatches on a 'region' argument (e.g., > CodingVariants(), IntronVariants(), etc.). I've included a > SpliceSiteVariants() that counts ranges that overlap with any portion > of the first 2 or last 2 nucleotides of an intron. Called as, > > locateVariants(query, subject, range=SpliceSiteVariants()) > > Thanks again for the suggestions. > > Valerie > > > On 03/05/2012 01:25 AM, Alex Gutteridge wrote: >> Hi Valerie, >> >> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use >> case is mapping coding SNPs to PDB protein structures, so something >> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need. >> >> Alex Gutteridge. >> >> On 03.03.2012 00:58, Thomas Girke wrote: >>> Hi Valerie, >>> >>> Based on my experience the position in the complete protein (rather >>> than >>> CDS) sequence would be the most important piece of information to >>> record >>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the >>> ORF of a transcript to CAA (Gln) then you want to record it like this: >>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a >>> protein structure or inject it into the corresponding protein sequence >>> without any additional remapping or coding. >>> >>> Another feature to consider are SNPs affecting splice sites (commonly >>> first and last two nucleotides of an intron). >>> >>> If possible it would be also very useful to support users who want to >>> work with their custom genomes and annotations files provided as FASTA >>> and GFF/GTF files, respectively. >>> >>> Best, >>> >>> Thomas >>> >>> >>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: >>>> Slight change to this - >>>> >>>> I'm now returning the following new columns, >>>> >>>> \item{\code{seqTxLoc}}{ >>>> Location in transcript-based coordinates of the first >>>> nucleotide in >>>> the codon sequence to be translated. This position >>>> corresponds to the >>>> first nucleotide in both the \code{refSeq} and \code{varSeq} >>>> columns. >>>> } >>>> \item{\code{varTxLoc}}{ >>>> Location in transcript-based coordinates of the first >>>> nucleotide in >>>> the variant. This value will be the same as \code{seqTxLoc} >>>> when the >>>> variant starts exactly at the beginning of a codon. >>>> } >>>> \item{\code{varCdsLoc}}{ >>>> Location in cds-based coordinates of the first nucleotide in >>>> the variant. This position is relative to the start of the >>>> cds region >>>> defined in the \code{subject} annotation. >>>> } >>>> \item{\code{subjStrand}}{ >>>> The strand of the \code{subject} the variant matched. >>>> \code{predictCoding} >>>> determines which variants fall in a coding region by finding >>>> the >>>> overlaps >>>> between the \code{query} and \code{subject}. The >>>> \code{query} may be >>>> un-stranded \sQuote{*} but the \code{subject} annotation will >>>> have a strand. >>>> } >>>> >>>> >>>> You are interested in 'protein coordinates'. Does 'varCdsLoc' >>>> described >>>> above meet the need or are you looking for the actual codon number in >>>> the coding sequence? I am interested in hearing more about what you >>>> are >>>> doing with the protein coordinates, how you are using them. It would >>>> help us better design future functions. >>>> >>>> Thanks, >>>> Valerie >>>> >>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: >>>> > Thanks Valerie - much appreciated! >>>> > >>>> > On 01.03.2012 21:30, Valerie Obenchain wrote: >>>> >> A 'txLoc' column has been added to the output of predictCoding. >>>> >> Available in devel version 1.1.57. >>>> >> >>>> >> Valerie >>>> >> >>>> >> >>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll >>>> >>> post back here when this has been added. >>>> >>> >>>> >>> Valerie >>>> >>> >>>> >>> >>>> >>> >>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >>>> >>>> Hi Valerie, >>>> >>>> >>>> >>>> Thanks everything works great now. One small feature request - >>>> >>>> would it be hard to output the protein sequence position of the >>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm >>>> >>>> re-extracting the cds and working out the position of each coding >>>> >>>> SNP so I can see where in the protein sequence it is, but it >>>> seems >>>> >>>> like this is probably just replicating what predictCoding must be >>>> >>>> doing internally anyway? >>>> >>>> >>>> >>>> Alex Gutteridge >>>> >>> >>>> >>> >>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >>>> >>>> Hi Alex, >>>> >>>> >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap >>>> >>>> between the query and GRangesList of cds by transcripts. This >>>> gave >>>> >>>> the correct transcript number but (incorrectly) took the first >>>> cds >>>> >>>> number in the list by default. Now fixed in devel 1.1.55. >>>> >>>> >>>> >>>> I've also updated the man page. >>>> >>>> >>>> >>>> Valerie >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>>> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>> >>>>>> Hi Alex, >>>> >>>>>> >>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>>> >>>>> >>>> >>>>> [...] >>>> >>>>> >>>> >>>>>>> But the predictCoding call gives this error: >>>> >>>>>>> >>>> >>>>>>> Error in .setSeqNames(x, value) : >>>> >>>>>>> The replacement value for isActiveSeq must be a logical >>>> vector, >>>> >>>>>>> with >>>> >>>>>>> names that match the seqlevels of the object >>>> >>>>>> >>>> >>>>>> The error message doesn't help much but I think the pb is >>>> that you >>>> >>>>>> didn't rename chMT properly. Try to do this: >>>> >>>>>> >>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>>>> >>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again. >>>> >>>>>> >>>> >>>>>> Cheers, >>>> >>>>>> H. >>>> >>>>> >>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty >>>> joining up >>>> >>>>> the output of predictCoding() with the query SNPs though. If >>>> >>>>> someone could point out where the disconnect in my thinking is I >>>> >>>>> would appreciate it! >>>> >>>>> >>>> >>>>> Here's my (now edited down) script: >>>> >>>>> >>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>> >>>>> library(VariantAnnotation) >>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> >>>>> >>>> >>>>> entrez.ids = c('6335') >>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>>> >>>>> >>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>>> >>>> >>>>> gene.list = cdsBy(txdb19, by="gene") >>>> >>>>> vsd.list = gene.list[entrez.ids] >>>> >>>>> cds.list = cdsBy(txdb19,by="tx") >>>> >>>>> >>>> >>>>> eg = entrez.ids[1] >>>> >>>>> >>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>>> >>>>> eg.snps = snps[snp.idx] >>>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>>> >>>>> variant.alleles = >>>> >>>>> >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"" )[[1]]) >>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> aa = >>>> >>>>> >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=var iant.alleles) >>>> >>>>> >>>> >>>>> ##### >>>> >>>>> >>>> >>>>> Then if I query the predictCoding results in aa in an >>>> interactive >>>> >>>>> session I get the following (see inline comments for what I >>>> think >>>> >>>>> should be happening, but I must be misinterpreting what queryID >>>> >>>>> means) >>>> >>>>> >>>> >>>>> The docs for predictCoding() contain a small typo >>>> >>>>> (s/queryHits/queryID), but otherwise seem clear? >>>> >>>>> >>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>>> >>>>> >>>> >>>>> ?queryHits? The ?queryHits? column provides a map back >>>> to the >>>> >>>>> variants in the original ?query?. If the ?query? was a >>>> >>>>> ?VCF? >>>> >>>>> object this index corresponds to the row in the >>>> >>>>> ?GRanges? in >>>> >>>>> the ?rowData? slot. If ?query? was an expanded >>>> ?GRanges?, >>>> >>>>> ?RangedData? or ?RangesList? the index corresponds to >>>> >>>>> the row >>>> >>>>> in the expanded object. >>>> >>>>> >>>> >>>>> ##### >>>> >>>>> >>>> >>>>>> aa[1,] >>>> >>>>> DataFrame with 1 row and 9 columns >>>> >>>>> queryID consequence refSeq >>>> varSeq refAA >>>> >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>>> >>>>> 1 1 nonsynonymous CTC >>>> ATC L >>>> >>>>> varAA txID geneID cdsID >>>> >>>>> <aastringset> <character> <factor> <integer> >>>> >>>>> 1 I 10921 6335 33668 >>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>> >>>>>> '10921' and cds '33668'. >>>> >>>>>> #If I look at the first query SNP I get this: >>>> >>>>>> eg.snps.exp[aa[1,'queryID'],] >>>> >>>>> GRanges with 1 range and 2 elementMetadata values: >>>> >>>>> seqnames ranges strand | RefSNP_id >>>> >>>>> alleles_as_ambig >>>> >>>>> <rle> <iranges> <rle> | <character> <character> >>>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 >>>> >>>>> R >>>> >>>>> --- >>>> >>>>> seqlengths: >>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 >>>> chrX >>>> >>>>> chrY chrM >>>> >>>>> NA NA NA NA NA NA ... NA NA >>>> NA NA >>>> >>>>> NA NA >>>> >>>>>> #So SNP 1 is at 167055370 on chr2 >>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping >>>> >>>>>> 167055370 is actually '33651' >>>> >>>>>> #And cds '33668' is at the other end of the tx: >>>> >>>>>> cds.list[[aa[1,'txID']]] >>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values: >>>> >>>>> seqnames ranges strand | cds_id >>>> >>>>> cds_name >>>> >>>>> <rle> <iranges> <rle> | <integer> <character> >>>> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> >>>> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> >>>> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> >>>> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> >>>> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> >>>> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> >>>> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> >>>> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> >>>> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> >>>> >>>>> ... ... ... ... ... ... >>>> >>>>> ... >>>> >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> >>>> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> >>>> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> >>>> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> >>>> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> >>>> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> >>>> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> >>>> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> >>>> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> >>>> >>>>> exon_rank >>>> >>>>> <integer> >>>> >>>>> [1] 2 >>>> >>>>> [2] 3 >>>> >>>>> [3] 4 >>>> >>>>> [4] 5 >>>> >>>>> [5] 6 >>>> >>>>> [6] 7 >>>> >>>>> [7] 8 >>>> >>>>> [8] 9 >>>> >>>>> [9] 10 >>>> >>>>> ... ... >>>> >>>>> [18] 19 >>>> >>>>> [19] 20 >>>> >>>>> [20] 21 >>>> >>>>> [21] 22 >>>> >>>>> [22] 23 >>>> >>>>> [23] 24 >>>> >>>>> [24] 25 >>>> >>>>> [25] 26 >>>> >>>>> [26] 27 >>>> >>>>> --- >>>> >>>>> seqlengths: >>>> >>>>> chr1 chr2 ... >>>> >>>>> chr18_gl000207_random >>>> >>>>> 249250621 243199373 ... >>>> >>>>> 4262 >>>> >>>>> >>>> >>>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> >>>> Bioconductor mailing list >>>> >>>> Bioconductor at r-project.org >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >>>> Search the archives: >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>>> >>> _______________________________________________ >>>> >>> Bioconductor mailing list >>>> >>> Bioconductor at r-project.org >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >>> Search the archives: >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> > >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

SNP Annotation BSgenome BSgenome GenomicFeatures cycle genomes SNP Annotation BSgenome • 915 views

ADD COMMENT • link updated 11.9 years ago by Thomas Girke ★ 1.7k • written 11.9 years ago by Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 4 weeks ago

United States

Great, and thanks for doing this! Creating TxDb objects directly from gff/gtf files has been for a long time on my wish list. If this works now for most of the GFF/GTF files provided by ncbi (ftp://ftp.ncbi.nih.gov/genomes/), ensembl (http://useast.ensembl.org/info/data/ftp/index.html) or other genome databases, then this will save many of us a lot of time and headache. Thomas On Tue, May 15, 2012 at 08:17:52PM +0000, Valerie Obenchain wrote: > Hi Thomas, Alex, > > Marc has been working on a makeTranscriptDbFromGFF() function which is > now in GenomicFeatures v 1.9.11 in devel. This addresses the last of > Thomas' points below wrt the user providing annotations in GFF or GTF > format. The function creates a TxDb which can be given to > locateVariants() and predictCoding(). Let us know how it goes. > > Valerie > > > > On 04/05/2012 01:47 PM, Valerie Obenchain wrote: > > Hi Thomas, Alex, > > > > Changes below have been made in release 1.2.4 and devel 1.3.4. > > > > predictCoding : > > > > - Now returns a proteinID that is the triplet number in the protein > > > > - The refAA and varAA are the 1-letter amino acid code but not the > > 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a > > function that returns the 3-letter code. Is this of interest/use? > > > > - Over the next dev cycle I will implement methods for annotations > > (subject argument) to be gff/gtf. Currently genomes (seqSource > > argument) can be a BSgenome or fasta file. > > > > > > locateVariants: > > > > To add flexibility for finding variants in a particular region the > > function now dispatches on a 'region' argument (e.g., > > CodingVariants(), IntronVariants(), etc.). I've included a > > SpliceSiteVariants() that counts ranges that overlap with any portion > > of the first 2 or last 2 nucleotides of an intron. Called as, > > > > locateVariants(query, subject, range=SpliceSiteVariants()) > > > > Thanks again for the suggestions. > > > > Valerie > > > > > > On 03/05/2012 01:25 AM, Alex Gutteridge wrote: > >> Hi Valerie, > >> > >> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use > >> case is mapping coding SNPs to PDB protein structures, so something > >> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need. > >> > >> Alex Gutteridge. > >> > >> On 03.03.2012 00:58, Thomas Girke wrote: > >>> Hi Valerie, > >>> > >>> Based on my experience the position in the complete protein (rather > >>> than > >>> CDS) sequence would be the most important piece of information to > >>> record > >>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the > >>> ORF of a transcript to CAA (Gln) then you want to record it like this: > >>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a > >>> protein structure or inject it into the corresponding protein sequence > >>> without any additional remapping or coding. > >>> > >>> Another feature to consider are SNPs affecting splice sites (commonly > >>> first and last two nucleotides of an intron). > >>> > >>> If possible it would be also very useful to support users who want to > >>> work with their custom genomes and annotations files provided as FASTA > >>> and GFF/GTF files, respectively. > >>> > >>> Best, > >>> > >>> Thomas > >>> > >>> > >>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: > >>>> Slight change to this - > >>>> > >>>> I'm now returning the following new columns, > >>>> > >>>> \item{\code{seqTxLoc}}{ > >>>> Location in transcript-based coordinates of the first > >>>> nucleotide in > >>>> the codon sequence to be translated. This position > >>>> corresponds to the > >>>> first nucleotide in both the \code{refSeq} and \code{varSeq} > >>>> columns. > >>>> } > >>>> \item{\code{varTxLoc}}{ > >>>> Location in transcript-based coordinates of the first > >>>> nucleotide in > >>>> the variant. This value will be the same as \code{seqTxLoc} > >>>> when the > >>>> variant starts exactly at the beginning of a codon. > >>>> } > >>>> \item{\code{varCdsLoc}}{ > >>>> Location in cds-based coordinates of the first nucleotide in > >>>> the variant. This position is relative to the start of the > >>>> cds region > >>>> defined in the \code{subject} annotation. > >>>> } > >>>> \item{\code{subjStrand}}{ > >>>> The strand of the \code{subject} the variant matched. > >>>> \code{predictCoding} > >>>> determines which variants fall in a coding region by finding > >>>> the > >>>> overlaps > >>>> between the \code{query} and \code{subject}. The > >>>> \code{query} may be > >>>> un-stranded \sQuote{*} but the \code{subject} annotation will > >>>> have a strand. > >>>> } > >>>> > >>>> > >>>> You are interested in 'protein coordinates'. Does 'varCdsLoc' > >>>> described > >>>> above meet the need or are you looking for the actual codon number in > >>>> the coding sequence? I am interested in hearing more about what you > >>>> are > >>>> doing with the protein coordinates, how you are using them. It would > >>>> help us better design future functions. > >>>> > >>>> Thanks, > >>>> Valerie > >>>> > >>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: > >>>> > Thanks Valerie - much appreciated! > >>>> > > >>>> > On 01.03.2012 21:30, Valerie Obenchain wrote: > >>>> >> A 'txLoc' column has been added to the output of predictCoding. > >>>> >> Available in devel version 1.1.57. > >>>> >> > >>>> >> Valerie > >>>> >> > >>>> >> > >>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: > >>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll > >>>> >>> post back here when this has been added. > >>>> >>> > >>>> >>> Valerie > >>>> >>> > >>>> >>> > >>>> >>> > >>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: > >>>> >>>> Hi Valerie, > >>>> >>>> > >>>> >>>> Thanks everything works great now. One small feature request - > >>>> >>>> would it be hard to output the protein sequence position of the > >>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm > >>>> >>>> re-extracting the cds and working out the position of each coding > >>>> >>>> SNP so I can see where in the protein sequence it is, but it > >>>> seems > >>>> >>>> like this is probably just replicating what predictCoding must be > >>>> >>>> doing internally anyway? > >>>> >>>> > >>>> >>>> Alex Gutteridge > >>>> >>> > >>>> >>> > >>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: > >>>> >>>> Hi Alex, > >>>> >>>> > >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap > >>>> >>>> between the query and GRangesList of cds by transcripts. This > >>>> gave > >>>> >>>> the correct transcript number but (incorrectly) took the first > >>>> cds > >>>> >>>> number in the list by default. Now fixed in devel 1.1.55. > >>>> >>>> > >>>> >>>> I've also updated the man page. > >>>> >>>> > >>>> >>>> Valerie > >>>> >>>> > >>>> >>>> > >>>> >>>> > >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: > >>>> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: > >>>> >>>>>> Hi Alex, > >>>> >>>>>> > >>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: > >>>> >>>>> > >>>> >>>>> [...] > >>>> >>>>> > >>>> >>>>>>> But the predictCoding call gives this error: > >>>> >>>>>>> > >>>> >>>>>>> Error in .setSeqNames(x, value) : > >>>> >>>>>>> The replacement value for isActiveSeq must be a logical > >>>> vector, > >>>> >>>>>>> with > >>>> >>>>>>> names that match the seqlevels of the object > >>>> >>>>>> > >>>> >>>>>> The error message doesn't help much but I think the pb is > >>>> that you > >>>> >>>>>> didn't rename chMT properly. Try to do this: > >>>> >>>>>> > >>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>> >>>>>> > >>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again. > >>>> >>>>>> > >>>> >>>>>> Cheers, > >>>> >>>>>> H. > >>>> >>>>> > >>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty > >>>> joining up > >>>> >>>>> the output of predictCoding() with the query SNPs though. If > >>>> >>>>> someone could point out where the disconnect in my thinking is I > >>>> >>>>> would appreciate it! > >>>> >>>>> > >>>> >>>>> Here's my (now edited down) script: > >>>> >>>>> > >>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) > >>>> >>>>> library(VariantAnnotation) > >>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) > >>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >>>> >>>>> > >>>> >>>>> entrez.ids = c('6335') > >>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene > >>>> >>>>> > >>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) > >>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) > >>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>> >>>>> > >>>> >>>>> gene.list = cdsBy(txdb19, by="gene") > >>>> >>>>> vsd.list = gene.list[entrez.ids] > >>>> >>>>> cds.list = cdsBy(txdb19,by="tx") > >>>> >>>>> > >>>> >>>>> eg = entrez.ids[1] > >>>> >>>>> > >>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) > >>>> >>>>> eg.snps = snps[snp.idx] > >>>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] > >>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) > >>>> >>>>> variant.alleles = > >>>> >>>>> > >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""), "")[[1]]) > >>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> aa = > >>>> >>>>> > >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=v ariant.alleles) > >>>> >>>>> > >>>> >>>>> ##### > >>>> >>>>> > >>>> >>>>> Then if I query the predictCoding results in aa in an > >>>> interactive > >>>> >>>>> session I get the following (see inline comments for what I > >>>> think > >>>> >>>>> should be happening, but I must be misinterpreting what queryID > >>>> >>>>> means) > >>>> >>>>> > >>>> >>>>> The docs for predictCoding() contain a small typo > >>>> >>>>> (s/queryHits/queryID), but otherwise seem clear? > >>>> >>>>> > >>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, > >>>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. > >>>> >>>>> > >>>> >>>>> ?queryHits? The ?queryHits? column provides a map back > >>>> to the > >>>> >>>>> variants in the original ?query?. If the ?query? was a > >>>> >>>>> ?VCF? > >>>> >>>>> object this index corresponds to the row in the > >>>> >>>>> ?GRanges? in > >>>> >>>>> the ?rowData? slot. If ?query? was an expanded > >>>> ?GRanges?, > >>>> >>>>> ?RangedData? or ?RangesList? the index corresponds to > >>>> >>>>> the row > >>>> >>>>> in the expanded object. > >>>> >>>>> > >>>> >>>>> ##### > >>>> >>>>> > >>>> >>>>>> aa[1,] > >>>> >>>>> DataFrame with 1 row and 9 columns > >>>> >>>>> queryID consequence refSeq > >>>> varSeq refAA > >>>> >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> > >>>> >>>>> 1 1 nonsynonymous CTC > >>>> ATC L > >>>> >>>>> varAA txID geneID cdsID > >>>> >>>>> <aastringset> <character> <factor> <integer> > >>>> >>>>> 1 I 10921 6335 33668 > >>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx > >>>> >>>>>> '10921' and cds '33668'. > >>>> >>>>>> #If I look at the first query SNP I get this: > >>>> >>>>>> eg.snps.exp[aa[1,'queryID'],] > >>>> >>>>> GRanges with 1 range and 2 elementMetadata values: > >>>> >>>>> seqnames ranges strand | RefSNP_id > >>>> >>>>> alleles_as_ambig > >>>> >>>>> <rle> <iranges> <rle> | <character> <character> > >>>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 > >>>> >>>>> R > >>>> >>>>> --- > >>>> >>>>> seqlengths: > >>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 > >>>> chrX > >>>> >>>>> chrY chrM > >>>> >>>>> NA NA NA NA NA NA ... NA NA > >>>> NA NA > >>>> >>>>> NA NA > >>>> >>>>>> #So SNP 1 is at 167055370 on chr2 > >>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping > >>>> >>>>>> 167055370 is actually '33651' > >>>> >>>>>> #And cds '33668' is at the other end of the tx: > >>>> >>>>>> cds.list[[aa[1,'txID']]] > >>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values: > >>>> >>>>> seqnames ranges strand | cds_id > >>>> >>>>> cds_name > >>>> >>>>> <rle> <iranges> <rle> | <integer> <character> > >>>> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> > >>>> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> > >>>> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> > >>>> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> > >>>> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> > >>>> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> > >>>> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> > >>>> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> > >>>> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> > >>>> >>>>> ... ... ... ... ... ... > >>>> >>>>> ... > >>>> >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> > >>>> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> > >>>> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> > >>>> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> > >>>> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> > >>>> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> > >>>> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> > >>>> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> > >>>> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> > >>>> >>>>> exon_rank > >>>> >>>>> <integer> > >>>> >>>>> [1] 2 > >>>> >>>>> [2] 3 > >>>> >>>>> [3] 4 > >>>> >>>>> [4] 5 > >>>> >>>>> [5] 6 > >>>> >>>>> [6] 7 > >>>> >>>>> [7] 8 > >>>> >>>>> [8] 9 > >>>> >>>>> [9] 10 > >>>> >>>>> ... ... > >>>> >>>>> [18] 19 > >>>> >>>>> [19] 20 > >>>> >>>>> [20] 21 > >>>> >>>>> [21] 22 > >>>> >>>>> [22] 23 > >>>> >>>>> [23] 24 > >>>> >>>>> [24] 25 > >>>> >>>>> [25] 26 > >>>> >>>>> [26] 27 > >>>> >>>>> --- > >>>> >>>>> seqlengths: > >>>> >>>>> chr1 chr2 ... > >>>> >>>>> chr18_gl000207_random > >>>> >>>>> 249250621 243199373 ... > >>>> >>>>> 4262 > >>>> >>>>> > >>>> >>>>> > >>>> >>>> > >>>> >>>> _______________________________________________ > >>>> >>>> Bioconductor mailing list > >>>> >>>> Bioconductor at r-project.org > >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> >>>> Search the archives: > >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> >>> > >>>> >>> _______________________________________________ > >>>> >>> Bioconductor mailing list > >>>> >>> Bioconductor at r-project.org > >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> >>> Search the archives: > >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >

ADD COMMENT • link 11.9 years ago Thomas Girke ★ 1.7k

Login before adding your answer.