Finding coding SNPs with predictCoding
1
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi Thomas, Alex, Marc has been working on a makeTranscriptDbFromGFF() function which is now in GenomicFeatures v 1.9.11 in devel. This addresses the last of Thomas' points below wrt the user providing annotations in GFF or GTF format. The function creates a TxDb which can be given to locateVariants() and predictCoding(). Let us know how it goes. Valerie On 04/05/2012 01:47 PM, Valerie Obenchain wrote: > Hi Thomas, Alex, > > Changes below have been made in release 1.2.4 and devel 1.3.4. > > predictCoding : > > - Now returns a proteinID that is the triplet number in the protein > > - The refAA and varAA are the 1-letter amino acid code but not the > 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a > function that returns the 3-letter code. Is this of interest/use? > > - Over the next dev cycle I will implement methods for annotations > (subject argument) to be gff/gtf. Currently genomes (seqSource > argument) can be a BSgenome or fasta file. > > > locateVariants: > > To add flexibility for finding variants in a particular region the > function now dispatches on a 'region' argument (e.g., > CodingVariants(), IntronVariants(), etc.). I've included a > SpliceSiteVariants() that counts ranges that overlap with any portion > of the first 2 or last 2 nucleotides of an intron. Called as, > > locateVariants(query, subject, range=SpliceSiteVariants()) > > Thanks again for the suggestions. > > Valerie > > > On 03/05/2012 01:25 AM, Alex Gutteridge wrote: >> Hi Valerie, >> >> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use >> case is mapping coding SNPs to PDB protein structures, so something >> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need. >> >> Alex Gutteridge. >> >> On 03.03.2012 00:58, Thomas Girke wrote: >>> Hi Valerie, >>> >>> Based on my experience the position in the complete protein (rather >>> than >>> CDS) sequence would be the most important piece of information to >>> record >>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the >>> ORF of a transcript to CAA (Gln) then you want to record it like this: >>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a >>> protein structure or inject it into the corresponding protein sequence >>> without any additional remapping or coding. >>> >>> Another feature to consider are SNPs affecting splice sites (commonly >>> first and last two nucleotides of an intron). >>> >>> If possible it would be also very useful to support users who want to >>> work with their custom genomes and annotations files provided as FASTA >>> and GFF/GTF files, respectively. >>> >>> Best, >>> >>> Thomas >>> >>> >>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: >>>> Slight change to this - >>>> >>>> I'm now returning the following new columns, >>>> >>>> \item{\code{seqTxLoc}}{ >>>> Location in transcript-based coordinates of the first >>>> nucleotide in >>>> the codon sequence to be translated. This position >>>> corresponds to the >>>> first nucleotide in both the \code{refSeq} and \code{varSeq} >>>> columns. >>>> } >>>> \item{\code{varTxLoc}}{ >>>> Location in transcript-based coordinates of the first >>>> nucleotide in >>>> the variant. This value will be the same as \code{seqTxLoc} >>>> when the >>>> variant starts exactly at the beginning of a codon. >>>> } >>>> \item{\code{varCdsLoc}}{ >>>> Location in cds-based coordinates of the first nucleotide in >>>> the variant. This position is relative to the start of the >>>> cds region >>>> defined in the \code{subject} annotation. >>>> } >>>> \item{\code{subjStrand}}{ >>>> The strand of the \code{subject} the variant matched. >>>> \code{predictCoding} >>>> determines which variants fall in a coding region by finding >>>> the >>>> overlaps >>>> between the \code{query} and \code{subject}. The >>>> \code{query} may be >>>> un-stranded \sQuote{*} but the \code{subject} annotation will >>>> have a strand. >>>> } >>>> >>>> >>>> You are interested in 'protein coordinates'. Does 'varCdsLoc' >>>> described >>>> above meet the need or are you looking for the actual codon number in >>>> the coding sequence? I am interested in hearing more about what you >>>> are >>>> doing with the protein coordinates, how you are using them. It would >>>> help us better design future functions. >>>> >>>> Thanks, >>>> Valerie >>>> >>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: >>>> > Thanks Valerie - much appreciated! >>>> > >>>> > On 01.03.2012 21:30, Valerie Obenchain wrote: >>>> >> A 'txLoc' column has been added to the output of predictCoding. >>>> >> Available in devel version 1.1.57. >>>> >> >>>> >> Valerie >>>> >> >>>> >> >>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: >>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll >>>> >>> post back here when this has been added. >>>> >>> >>>> >>> Valerie >>>> >>> >>>> >>> >>>> >>> >>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: >>>> >>>> Hi Valerie, >>>> >>>> >>>> >>>> Thanks everything works great now. One small feature request - >>>> >>>> would it be hard to output the protein sequence position of the >>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm >>>> >>>> re-extracting the cds and working out the position of each coding >>>> >>>> SNP so I can see where in the protein sequence it is, but it >>>> seems >>>> >>>> like this is probably just replicating what predictCoding must be >>>> >>>> doing internally anyway? >>>> >>>> >>>> >>>> Alex Gutteridge >>>> >>> >>>> >>> >>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: >>>> >>>> Hi Alex, >>>> >>>> >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap >>>> >>>> between the query and GRangesList of cds by transcripts. This >>>> gave >>>> >>>> the correct transcript number but (incorrectly) took the first >>>> cds >>>> >>>> number in the list by default. Now fixed in devel 1.1.55. >>>> >>>> >>>> >>>> I've also updated the man page. >>>> >>>> >>>> >>>> Valerie >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: >>>> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: >>>> >>>>>> Hi Alex, >>>> >>>>>> >>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: >>>> >>>>> >>>> >>>>> [...] >>>> >>>>> >>>> >>>>>>> But the predictCoding call gives this error: >>>> >>>>>>> >>>> >>>>>>> Error in .setSeqNames(x, value) : >>>> >>>>>>> The replacement value for isActiveSeq must be a logical >>>> vector, >>>> >>>>>>> with >>>> >>>>>>> names that match the seqlevels of the object >>>> >>>>>> >>>> >>>>>> The error message doesn't help much but I think the pb is >>>> that you >>>> >>>>>> didn't rename chMT properly. Try to do this: >>>> >>>>>> >>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>>>> >>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again. >>>> >>>>>> >>>> >>>>>> Cheers, >>>> >>>>>> H. >>>> >>>>> >>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty >>>> joining up >>>> >>>>> the output of predictCoding() with the query SNPs though. If >>>> >>>>> someone could point out where the disconnect in my thinking is I >>>> >>>>> would appreciate it! >>>> >>>>> >>>> >>>>> Here's my (now edited down) script: >>>> >>>>> >>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>> >>>>> library(VariantAnnotation) >>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) >>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> >>>>> >>>> >>>>> entrez.ids = c('6335') >>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene >>>> >>>>> >>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) >>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) >>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) >>>> >>>>> >>>> >>>>> gene.list = cdsBy(txdb19, by="gene") >>>> >>>>> vsd.list = gene.list[entrez.ids] >>>> >>>>> cds.list = cdsBy(txdb19,by="tx") >>>> >>>>> >>>> >>>>> eg = entrez.ids[1] >>>> >>>>> >>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) >>>> >>>>> eg.snps = snps[snp.idx] >>>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] >>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) >>>> >>>>> variant.alleles = >>>> >>>>> >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"" )[[1]]) >>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> aa = >>>> >>>>> >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=var iant.alleles) >>>> >>>>> >>>> >>>>> ##### >>>> >>>>> >>>> >>>>> Then if I query the predictCoding results in aa in an >>>> interactive >>>> >>>>> session I get the following (see inline comments for what I >>>> think >>>> >>>>> should be happening, but I must be misinterpreting what queryID >>>> >>>>> means) >>>> >>>>> >>>> >>>>> The docs for predictCoding() contain a small typo >>>> >>>>> (s/queryHits/queryID), but otherwise seem clear? >>>> >>>>> >>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, >>>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. >>>> >>>>> >>>> >>>>> ?queryHits? The ?queryHits? column provides a map back >>>> to the >>>> >>>>> variants in the original ?query?. If the ?query? was a >>>> >>>>> ?VCF? >>>> >>>>> object this index corresponds to the row in the >>>> >>>>> ?GRanges? in >>>> >>>>> the ?rowData? slot. If ?query? was an expanded >>>> ?GRanges?, >>>> >>>>> ?RangedData? or ?RangesList? the index corresponds to >>>> >>>>> the row >>>> >>>>> in the expanded object. >>>> >>>>> >>>> >>>>> ##### >>>> >>>>> >>>> >>>>>> aa[1,] >>>> >>>>> DataFrame with 1 row and 9 columns >>>> >>>>> queryID consequence refSeq >>>> varSeq refAA >>>> >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> >>>> >>>>> 1 1 nonsynonymous CTC >>>> ATC L >>>> >>>>> varAA txID geneID cdsID >>>> >>>>> <aastringset> <character> <factor> <integer> >>>> >>>>> 1 I 10921 6335 33668 >>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx >>>> >>>>>> '10921' and cds '33668'. >>>> >>>>>> #If I look at the first query SNP I get this: >>>> >>>>>> eg.snps.exp[aa[1,'queryID'],] >>>> >>>>> GRanges with 1 range and 2 elementMetadata values: >>>> >>>>> seqnames ranges strand | RefSNP_id >>>> >>>>> alleles_as_ambig >>>> >>>>> <rle> <iranges> <rle> | <character> <character> >>>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 >>>> >>>>> R >>>> >>>>> --- >>>> >>>>> seqlengths: >>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 >>>> chrX >>>> >>>>> chrY chrM >>>> >>>>> NA NA NA NA NA NA ... NA NA >>>> NA NA >>>> >>>>> NA NA >>>> >>>>>> #So SNP 1 is at 167055370 on chr2 >>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping >>>> >>>>>> 167055370 is actually '33651' >>>> >>>>>> #And cds '33668' is at the other end of the tx: >>>> >>>>>> cds.list[[aa[1,'txID']]] >>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values: >>>> >>>>> seqnames ranges strand | cds_id >>>> >>>>> cds_name >>>> >>>>> <rle> <iranges> <rle> | <integer> <character> >>>> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> >>>> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> >>>> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> >>>> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> >>>> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> >>>> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> >>>> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> >>>> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> >>>> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> >>>> >>>>> ... ... ... ... ... ... >>>> >>>>> ... >>>> >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> >>>> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> >>>> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> >>>> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> >>>> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> >>>> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> >>>> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> >>>> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> >>>> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> >>>> >>>>> exon_rank >>>> >>>>> <integer> >>>> >>>>> [1] 2 >>>> >>>>> [2] 3 >>>> >>>>> [3] 4 >>>> >>>>> [4] 5 >>>> >>>>> [5] 6 >>>> >>>>> [6] 7 >>>> >>>>> [7] 8 >>>> >>>>> [8] 9 >>>> >>>>> [9] 10 >>>> >>>>> ... ... >>>> >>>>> [18] 19 >>>> >>>>> [19] 20 >>>> >>>>> [20] 21 >>>> >>>>> [21] 22 >>>> >>>>> [22] 23 >>>> >>>>> [23] 24 >>>> >>>>> [24] 25 >>>> >>>>> [25] 26 >>>> >>>>> [26] 27 >>>> >>>>> --- >>>> >>>>> seqlengths: >>>> >>>>> chr1 chr2 ... >>>> >>>>> chr18_gl000207_random >>>> >>>>> 249250621 243199373 ... >>>> >>>>> 4262 >>>> >>>>> >>>> >>>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> >>>> Bioconductor mailing list >>>> >>>> Bioconductor at r-project.org >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >>>> Search the archives: >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>>> >>> _______________________________________________ >>>> >>> Bioconductor mailing list >>>> >>> Bioconductor at r-project.org >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >>> Search the archives: >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> > >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
SNP Annotation BSgenome BSgenome GenomicFeatures cycle genomes SNP Annotation BSgenome • 915 views
ADD COMMENT
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 4 weeks ago
United States
Great, and thanks for doing this! Creating TxDb objects directly from gff/gtf files has been for a long time on my wish list. If this works now for most of the GFF/GTF files provided by ncbi (ftp://ftp.ncbi.nih.gov/genomes/), ensembl (http://useast.ensembl.org/info/data/ftp/index.html) or other genome databases, then this will save many of us a lot of time and headache. Thomas On Tue, May 15, 2012 at 08:17:52PM +0000, Valerie Obenchain wrote: > Hi Thomas, Alex, > > Marc has been working on a makeTranscriptDbFromGFF() function which is > now in GenomicFeatures v 1.9.11 in devel. This addresses the last of > Thomas' points below wrt the user providing annotations in GFF or GTF > format. The function creates a TxDb which can be given to > locateVariants() and predictCoding(). Let us know how it goes. > > Valerie > > > > On 04/05/2012 01:47 PM, Valerie Obenchain wrote: > > Hi Thomas, Alex, > > > > Changes below have been made in release 1.2.4 and devel 1.3.4. > > > > predictCoding : > > > > - Now returns a proteinID that is the triplet number in the protein > > > > - The refAA and varAA are the 1-letter amino acid code but not the > > 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a > > function that returns the 3-letter code. Is this of interest/use? > > > > - Over the next dev cycle I will implement methods for annotations > > (subject argument) to be gff/gtf. Currently genomes (seqSource > > argument) can be a BSgenome or fasta file. > > > > > > locateVariants: > > > > To add flexibility for finding variants in a particular region the > > function now dispatches on a 'region' argument (e.g., > > CodingVariants(), IntronVariants(), etc.). I've included a > > SpliceSiteVariants() that counts ranges that overlap with any portion > > of the first 2 or last 2 nucleotides of an intron. Called as, > > > > locateVariants(query, subject, range=SpliceSiteVariants()) > > > > Thanks again for the suggestions. > > > > Valerie > > > > > > On 03/05/2012 01:25 AM, Alex Gutteridge wrote: > >> Hi Valerie, > >> > >> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use > >> case is mapping coding SNPs to PDB protein structures, so something > >> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need. > >> > >> Alex Gutteridge. > >> > >> On 03.03.2012 00:58, Thomas Girke wrote: > >>> Hi Valerie, > >>> > >>> Based on my experience the position in the complete protein (rather > >>> than > >>> CDS) sequence would be the most important piece of information to > >>> record > >>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the > >>> ORF of a transcript to CAA (Gln) then you want to record it like this: > >>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a > >>> protein structure or inject it into the corresponding protein sequence > >>> without any additional remapping or coding. > >>> > >>> Another feature to consider are SNPs affecting splice sites (commonly > >>> first and last two nucleotides of an intron). > >>> > >>> If possible it would be also very useful to support users who want to > >>> work with their custom genomes and annotations files provided as FASTA > >>> and GFF/GTF files, respectively. > >>> > >>> Best, > >>> > >>> Thomas > >>> > >>> > >>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote: > >>>> Slight change to this - > >>>> > >>>> I'm now returning the following new columns, > >>>> > >>>> \item{\code{seqTxLoc}}{ > >>>> Location in transcript-based coordinates of the first > >>>> nucleotide in > >>>> the codon sequence to be translated. This position > >>>> corresponds to the > >>>> first nucleotide in both the \code{refSeq} and \code{varSeq} > >>>> columns. > >>>> } > >>>> \item{\code{varTxLoc}}{ > >>>> Location in transcript-based coordinates of the first > >>>> nucleotide in > >>>> the variant. This value will be the same as \code{seqTxLoc} > >>>> when the > >>>> variant starts exactly at the beginning of a codon. > >>>> } > >>>> \item{\code{varCdsLoc}}{ > >>>> Location in cds-based coordinates of the first nucleotide in > >>>> the variant. This position is relative to the start of the > >>>> cds region > >>>> defined in the \code{subject} annotation. > >>>> } > >>>> \item{\code{subjStrand}}{ > >>>> The strand of the \code{subject} the variant matched. > >>>> \code{predictCoding} > >>>> determines which variants fall in a coding region by finding > >>>> the > >>>> overlaps > >>>> between the \code{query} and \code{subject}. The > >>>> \code{query} may be > >>>> un-stranded \sQuote{*} but the \code{subject} annotation will > >>>> have a strand. > >>>> } > >>>> > >>>> > >>>> You are interested in 'protein coordinates'. Does 'varCdsLoc' > >>>> described > >>>> above meet the need or are you looking for the actual codon number in > >>>> the coding sequence? I am interested in hearing more about what you > >>>> are > >>>> doing with the protein coordinates, how you are using them. It would > >>>> help us better design future functions. > >>>> > >>>> Thanks, > >>>> Valerie > >>>> > >>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote: > >>>> > Thanks Valerie - much appreciated! > >>>> > > >>>> > On 01.03.2012 21:30, Valerie Obenchain wrote: > >>>> >> A 'txLoc' column has been added to the output of predictCoding. > >>>> >> Available in devel version 1.1.57. > >>>> >> > >>>> >> Valerie > >>>> >> > >>>> >> > >>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote: > >>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll > >>>> >>> post back here when this has been added. > >>>> >>> > >>>> >>> Valerie > >>>> >>> > >>>> >>> > >>>> >>> > >>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote: > >>>> >>>> Hi Valerie, > >>>> >>>> > >>>> >>>> Thanks everything works great now. One small feature request - > >>>> >>>> would it be hard to output the protein sequence position of the > >>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm > >>>> >>>> re-extracting the cds and working out the position of each coding > >>>> >>>> SNP so I can see where in the protein sequence it is, but it > >>>> seems > >>>> >>>> like this is probably just replicating what predictCoding must be > >>>> >>>> doing internally anyway? > >>>> >>>> > >>>> >>>> Alex Gutteridge > >>>> >>> > >>>> >>> > >>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote: > >>>> >>>> Hi Alex, > >>>> >>>> > >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap > >>>> >>>> between the query and GRangesList of cds by transcripts. This > >>>> gave > >>>> >>>> the correct transcript number but (incorrectly) took the first > >>>> cds > >>>> >>>> number in the list by default. Now fixed in devel 1.1.55. > >>>> >>>> > >>>> >>>> I've also updated the man page. > >>>> >>>> > >>>> >>>> Valerie > >>>> >>>> > >>>> >>>> > >>>> >>>> > >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote: > >>>> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote: > >>>> >>>>>> Hi Alex, > >>>> >>>>>> > >>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote: > >>>> >>>>> > >>>> >>>>> [...] > >>>> >>>>> > >>>> >>>>>>> But the predictCoding call gives this error: > >>>> >>>>>>> > >>>> >>>>>>> Error in .setSeqNames(x, value) : > >>>> >>>>>>> The replacement value for isActiveSeq must be a logical > >>>> vector, > >>>> >>>>>>> with > >>>> >>>>>>> names that match the seqlevels of the object > >>>> >>>>>> > >>>> >>>>>> The error message doesn't help much but I think the pb is > >>>> that you > >>>> >>>>>> didn't rename chMT properly. Try to do this: > >>>> >>>>>> > >>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>> >>>>>> > >>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again. > >>>> >>>>>> > >>>> >>>>>> Cheers, > >>>> >>>>>> H. > >>>> >>>>> > >>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty > >>>> joining up > >>>> >>>>> the output of predictCoding() with the query SNPs though. If > >>>> >>>>> someone could point out where the disconnect in my thinking is I > >>>> >>>>> would appreciate it! > >>>> >>>>> > >>>> >>>>> Here's my (now edited down) script: > >>>> >>>>> > >>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) > >>>> >>>>> library(VariantAnnotation) > >>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815) > >>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >>>> >>>>> > >>>> >>>>> entrez.ids = c('6335') > >>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene > >>>> >>>>> > >>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T) > >>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps)) > >>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps)) > >>>> >>>>> > >>>> >>>>> gene.list = cdsBy(txdb19, by="gene") > >>>> >>>>> vsd.list = gene.list[entrez.ids] > >>>> >>>>> cds.list = cdsBy(txdb19,by="tx") > >>>> >>>>> > >>>> >>>>> eg = entrez.ids[1] > >>>> >>>>> > >>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]]))) > >>>> >>>>> eg.snps = snps[snp.idx] > >>>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"] > >>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac])) > >>>> >>>>> variant.alleles = > >>>> >>>>> > >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""), "")[[1]]) > >>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> aa = > >>>> >>>>> > >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=v ariant.alleles) > >>>> >>>>> > >>>> >>>>> ##### > >>>> >>>>> > >>>> >>>>> Then if I query the predictCoding results in aa in an > >>>> interactive > >>>> >>>>> session I get the following (see inline comments for what I > >>>> think > >>>> >>>>> should be happening, but I must be misinterpreting what queryID > >>>> >>>>> means) > >>>> >>>>> > >>>> >>>>> The docs for predictCoding() contain a small typo > >>>> >>>>> (s/queryHits/queryID), but otherwise seem clear? > >>>> >>>>> > >>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?, > >>>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?. > >>>> >>>>> > >>>> >>>>> ?queryHits? The ?queryHits? column provides a map back > >>>> to the > >>>> >>>>> variants in the original ?query?. If the ?query? was a > >>>> >>>>> ?VCF? > >>>> >>>>> object this index corresponds to the row in the > >>>> >>>>> ?GRanges? in > >>>> >>>>> the ?rowData? slot. If ?query? was an expanded > >>>> ?GRanges?, > >>>> >>>>> ?RangedData? or ?RangesList? the index corresponds to > >>>> >>>>> the row > >>>> >>>>> in the expanded object. > >>>> >>>>> > >>>> >>>>> ##### > >>>> >>>>> > >>>> >>>>>> aa[1,] > >>>> >>>>> DataFrame with 1 row and 9 columns > >>>> >>>>> queryID consequence refSeq > >>>> varSeq refAA > >>>> >>>>> <integer> <factor> <dnastringset> <dnastringset> <aastringset> > >>>> >>>>> 1 1 nonsynonymous CTC > >>>> ATC L > >>>> >>>>> varAA txID geneID cdsID > >>>> >>>>> <aastringset> <character> <factor> <integer> > >>>> >>>>> 1 I 10921 6335 33668 > >>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx > >>>> >>>>>> '10921' and cds '33668'. > >>>> >>>>>> #If I look at the first query SNP I get this: > >>>> >>>>>> eg.snps.exp[aa[1,'queryID'],] > >>>> >>>>> GRanges with 1 range and 2 elementMetadata values: > >>>> >>>>> seqnames ranges strand | RefSNP_id > >>>> >>>>> alleles_as_ambig > >>>> >>>>> <rle> <iranges> <rle> | <character> <character> > >>>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968 > >>>> >>>>> R > >>>> >>>>> --- > >>>> >>>>> seqlengths: > >>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 > >>>> chrX > >>>> >>>>> chrY chrM > >>>> >>>>> NA NA NA NA NA NA ... NA NA > >>>> NA NA > >>>> >>>>> NA NA > >>>> >>>>>> #So SNP 1 is at 167055370 on chr2 > >>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping > >>>> >>>>>> 167055370 is actually '33651' > >>>> >>>>>> #And cds '33668' is at the other end of the tx: > >>>> >>>>>> cds.list[[aa[1,'txID']]] > >>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values: > >>>> >>>>> seqnames ranges strand | cds_id > >>>> >>>>> cds_name > >>>> >>>>> <rle> <iranges> <rle> | <integer> <character> > >>>> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <na> > >>>> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <na> > >>>> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <na> > >>>> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <na> > >>>> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <na> > >>>> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <na> > >>>> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <na> > >>>> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <na> > >>>> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <na> > >>>> >>>>> ... ... ... ... ... ... > >>>> >>>>> ... > >>>> >>>>> [18] chr2 [167099012, 167099166] - | 33659 <na> > >>>> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <na> > >>>> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <na> > >>>> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <na> > >>>> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <na> > >>>> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <na> > >>>> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <na> > >>>> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <na> > >>>> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <na> > >>>> >>>>> exon_rank > >>>> >>>>> <integer> > >>>> >>>>> [1] 2 > >>>> >>>>> [2] 3 > >>>> >>>>> [3] 4 > >>>> >>>>> [4] 5 > >>>> >>>>> [5] 6 > >>>> >>>>> [6] 7 > >>>> >>>>> [7] 8 > >>>> >>>>> [8] 9 > >>>> >>>>> [9] 10 > >>>> >>>>> ... ... > >>>> >>>>> [18] 19 > >>>> >>>>> [19] 20 > >>>> >>>>> [20] 21 > >>>> >>>>> [21] 22 > >>>> >>>>> [22] 23 > >>>> >>>>> [23] 24 > >>>> >>>>> [24] 25 > >>>> >>>>> [25] 26 > >>>> >>>>> [26] 27 > >>>> >>>>> --- > >>>> >>>>> seqlengths: > >>>> >>>>> chr1 chr2 ... > >>>> >>>>> chr18_gl000207_random > >>>> >>>>> 249250621 243199373 ... > >>>> >>>>> 4262 > >>>> >>>>> > >>>> >>>>> > >>>> >>>> > >>>> >>>> _______________________________________________ > >>>> >>>> Bioconductor mailing list > >>>> >>>> Bioconductor at r-project.org > >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> >>>> Search the archives: > >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> >>> > >>>> >>> _______________________________________________ > >>>> >>> Bioconductor mailing list > >>>> >>> Bioconductor at r-project.org > >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> >>> Search the archives: > >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >
ADD COMMENT

Login before adding your answer.

Traffic: 610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6