Entering edit mode
                    Hi Thomas, Alex,
Changes below have been made in release 1.2.4 and devel 1.3.4.
predictCoding :
- Now returns a proteinID that is the triplet number in the protein
- The refAA and varAA are the 1-letter amino acid code but not the
3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a
function that returns the 3-letter code. Is this of interest/use?
- Over the next dev cycle I will implement methods for annotations
(subject argument) to be gff/gtf. Currently genomes (seqSource
argument)
can be a BSgenome or fasta file.
locateVariants:
To add flexibility for finding variants in a particular region the
function now dispatches on a 'region' argument (e.g.,
CodingVariants(),
IntronVariants(), etc.). I've included a SpliceSiteVariants() that
counts ranges that overlap with any portion of the first 2 or last 2
nucleotides of an intron. Called as,
     locateVariants(query, subject, range=SpliceSiteVariants())
Thanks again for the suggestions.
Valerie
On 03/05/2012 01:25 AM, Alex Gutteridge wrote:
> Hi Valerie,
>
> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use
> case is mapping coding SNPs to PDB protein structures, so something
> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need.
>
> Alex Gutteridge.
>
> On 03.03.2012 00:58, Thomas Girke wrote:
>> Hi Valerie,
>>
>> Based on my experience the position in the complete protein (rather
than
>> CDS) sequence would be the most important piece of information to
record
>> here. For instance, if a SNP changes the 88th triplet CAT (His) in
the
>> ORF of a transcript to CAA (Gln) then you want to record it like
this:
>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change
to a
>> protein structure or inject it into the corresponding protein
sequence
>> without any additional remapping or coding.
>>
>> Another feature to consider are SNPs affecting splice sites
(commonly
>> first and last two nucleotides of an intron).
>>
>> If possible it would be also very useful to support users who want
to
>> work with their custom genomes and annotations files provided as
FASTA
>> and GFF/GTF files, respectively.
>>
>> Best,
>>
>> Thomas
>>
>>
>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
>>> Slight change to this -
>>>
>>> I'm now returning the following new columns,
>>>
>>>    \item{\code{seqTxLoc}}{
>>>        Location in transcript-based coordinates of the first
>>> nucleotide in
>>>        the codon sequence to be translated. This position
>>> corresponds to the
>>>        first nucleotide in both the \code{refSeq} and
\code{varSeq}
>>> columns.
>>>      }
>>>      \item{\code{varTxLoc}}{
>>>        Location in transcript-based coordinates of the first
>>> nucleotide in
>>>        the variant. This value will be the same as \code{seqTxLoc}
>>> when the
>>>        variant starts exactly at the beginning of a codon.
>>>      }
>>>      \item{\code{varCdsLoc}}{
>>>        Location in cds-based coordinates of the first nucleotide
in
>>>        the variant. This position is relative to the start of the
>>> cds region
>>>        defined in the \code{subject} annotation.
>>>      }
>>>      \item{\code{subjStrand}}{
>>>        The strand of the \code{subject} the variant matched.
>>> \code{predictCoding}
>>>        determines which variants fall in a coding region by
finding the
>>> overlaps
>>>        between the \code{query} and \code{subject}. The
\code{query}
>>> may be
>>>        un-stranded \sQuote{*} but the \code{subject} annotation
will
>>> have a strand.
>>>      }
>>>
>>>
>>> You are interested in 'protein coordinates'. Does 'varCdsLoc'
described
>>> above meet the need or are you looking for the actual codon number
in
>>> the coding sequence? I am interested in hearing more about what
you are
>>> doing with the protein coordinates, how you are using them. It
would
>>> help us better design future functions.
>>>
>>> Thanks,
>>> Valerie
>>>
>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
>>> > Thanks Valerie - much appreciated!
>>> >
>>> > On 01.03.2012 21:30, Valerie Obenchain wrote:
>>> >> A 'txLoc' column has been added to the output of predictCoding.
>>> >> Available in devel version 1.1.57.
>>> >>
>>> >> Valerie
>>> >>
>>> >>
>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>>> >>> Good suggestion. Yes, predictCoding is does this internally.
I'll
>>> >>> post back here when this has been added.
>>> >>>
>>> >>> Valerie
>>> >>>
>>> >>>
>>> >>>
>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>>> >>>> Hi Valerie,
>>> >>>>
>>> >>>> Thanks everything works great now. One small feature request
-
>>> >>>> would it be hard to output the protein sequence position of
the
>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm
>>> >>>> re-extracting the cds and working out the position of each
coding
>>> >>>> SNP so I can see where in the protein sequence it is, but it
seems
>>> >>>> like this is probably just replicating what predictCoding
must be
>>> >>>> doing internally anyway?
>>> >>>>
>>> >>>> Alex Gutteridge
>>> >>>
>>> >>>
>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>>> >>>> Hi Alex,
>>> >>>>
>>> >>>> Thanks for the bug report. The cdsID was taken from an
overlap
>>> >>>> between the query and GRangesList of cds by transcripts. This
gave
>>> >>>> the correct transcript number but (incorrectly) took the
first cds
>>> >>>> number in the list by default. Now fixed in devel 1.1.55.
>>> >>>>
>>> >>>> I've also updated the man page.
>>> >>>>
>>> >>>> Valerie
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>>> >>>>> On 22.02.2012 18:58, Hervé Pagès wrote:
>>> >>>>>> Hi Alex,
>>> >>>>>>
>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>> >>>>>
>>> >>>>> [...]
>>> >>>>>
>>> >>>>>>> But the predictCoding call gives this error:
>>> >>>>>>>
>>> >>>>>>> Error in .setSeqNames(x, value) :
>>> >>>>>>> The replacement value for isActiveSeq must be a logical
vector,
>>> >>>>>>> with
>>> >>>>>>> names that match the seqlevels of the object
>>> >>>>>>
>>> >>>>>> The error message doesn't help much but I think the pb is
>>> that you
>>> >>>>>> didn't rename chMT properly. Try to do this:
>>> >>>>>>
>>> >>>>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>> >>>>>>
>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again.
>>> >>>>>>
>>> >>>>>> Cheers,
>>> >>>>>> H.
>>> >>>>>
>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty
>>> joining up
>>> >>>>> the output of predictCoding() with the query SNPs though. If
>>> >>>>> someone could point out where the disconnect in my thinking
is I
>>> >>>>> would appreciate it!
>>> >>>>>
>>> >>>>> Here's my (now edited down) script:
>>> >>>>>
>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>> >>>>> library(VariantAnnotation)
>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>> >>>>>
>>> >>>>> entrez.ids = c('6335')
>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>> >>>>>
>>> >>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>> >>>>>
>>> >>>>> gene.list = cdsBy(txdb19, by="gene")
>>> >>>>> vsd.list = gene.list[entrez.ids]
>>> >>>>> cds.list = cdsBy(txdb19,by="tx")
>>> >>>>>
>>> >>>>> eg = entrez.ids[1]
>>> >>>>>
>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps,
vsd.list[[eg]])))
>>> >>>>> eg.snps = snps[snp.idx]
>>> >>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>>> >>>>> variant.alleles =
>>> >>>>>
>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")
[[1]])
>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> aa =
>>> >>>>>
>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=vari
ant.alleles)
>>> >>>>>
>>> >>>>> #####
>>> >>>>>
>>> >>>>> Then if I query the predictCoding results in aa in an
interactive
>>> >>>>> session I get the following (see inline comments for what I
think
>>> >>>>> should be happening, but I must be misinterpreting what
queryID
>>> >>>>> means)
>>> >>>>>
>>> >>>>> The docs for predictCoding() contain a small typo
>>> >>>>> (s/queryHits/queryID), but otherwise seem clear?
>>> >>>>>
>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?,
?varSeq?,
>>> >>>>>      ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
>>> >>>>>
>>> >>>>>      ?queryHits? The ?queryHits? column provides a map back
to
>>> the
>>> >>>>>           variants in the original ?query?. If the ?query?
was a
>>> >>>>> ?VCF?
>>> >>>>>           object this index corresponds to the row in the
>>> >>>>> ?GRanges? in
>>> >>>>>           the ?rowData? slot. If ?query? was an expanded
>>> ?GRanges?,
>>> >>>>>           ?RangedData? or ?RangesList? the index corresponds
to
>>> >>>>> the row
>>> >>>>>           in the expanded object.
>>> >>>>>
>>> >>>>> #####
>>> >>>>>
>>> >>>>>> aa[1,]
>>> >>>>> DataFrame with 1 row and 9 columns
>>> >>>>>     queryID   consequence         refSeq         varSeq
>>> refAA
>>> >>>>> <integer> <factor> <dnastringset> <dnastringset>
<aastringset>
>>> >>>>> 1         1 nonsynonymous            CTC            ATC
>>>    L
>>> >>>>>           varAA        txID   geneID     cdsID
>>> >>>>> <aastringset> <character> <factor> <integer>
>>> >>>>> 1             I       10921     6335     33668
>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to
tx
>>> >>>>>> '10921' and cds '33668'.
>>> >>>>>> #If I look at the first query SNP I get this:
>>> >>>>>> eg.snps.exp[aa[1,'queryID'],]
>>> >>>>> GRanges with 1 range and 2 elementMetadata values:
>>> >>>>>       seqnames                 ranges strand |   RefSNP_id
>>> >>>>> alleles_as_ambig
>>> >>>>> <rle> <iranges> <rle> | <character> <character>
>>> >>>>>   [1]     chr2 [167055370, 167055370]      * |   111558968
>>> >>>>>      R
>>> >>>>>   ---
>>> >>>>>   seqlengths:
>>> >>>>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22
>>> chrX
>>> >>>>> chrY  chrM
>>> >>>>>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA
NA
>>> >>>>>   NA    NA
>>> >>>>>> #So SNP 1 is at 167055370 on chr2
>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping
>>> >>>>>> 167055370 is actually '33651'
>>> >>>>>> #And cds '33668' is at the other end of the tx:
>>> >>>>>> cds.list[[aa[1,'txID']]]
>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
>>> >>>>>        seqnames                 ranges strand   |    cds_id
>>> >>>>> cds_name
>>> >>>>> <rle> <iranges> <rle>   | <integer> <character>
>>> >>>>>    [1]     chr2 [167168009, 167168266]      -   |     33668
<na>
>>> >>>>>    [2]     chr2 [167163466, 167163584]      -   |     33667
<na>
>>> >>>>>    [3]     chr2 [167163020, 167163109]      -   |     33666
<na>
>>> >>>>>    [4]     chr2 [167162302, 167162430]      -   |     33647
<na>
>>> >>>>>    [5]     chr2 [167160748, 167160839]      -   |     33646
<na>
>>> >>>>>    [6]     chr2 [167159600, 167159812]      -   |     33645
<na>
>>> >>>>>    [7]     chr2 [167151109, 167151172]      -   |     33644
<na>
>>> >>>>>    [8]     chr2 [167149741, 167149882]      -   |     33643
<na>
>>> >>>>>    [9]     chr2 [167144947, 167145153]      -   |     33642
<na>
>>> >>>>>    ...      ...                    ...    ... ...       ...
>>> >>>>> ...
>>> >>>>>   [18]     chr2 [167099012, 167099166]      -   |     33659
<na>
>>> >>>>>   [19]     chr2 [167094604, 167094777]      -   |     33658
<na>
>>> >>>>>   [20]     chr2 [167089850, 167089972]      -   |     33657
<na>
>>> >>>>>   [21]     chr2 [167085201, 167085482]      -   |     33656
<na>
>>> >>>>>   [22]     chr2 [167084180, 167084233]      -   |     33655
<na>
>>> >>>>>   [23]     chr2 [167083077, 167083214]      -   |     33654
<na>
>>> >>>>>   [24]     chr2 [167060870, 167060974]      -   |     33653
<na>
>>> >>>>>   [25]     chr2 [167060465, 167060735]      -   |     33652
<na>
>>> >>>>>   [26]     chr2 [167055182, 167056374]      -   |     33651
<na>
>>> >>>>>        exon_rank
>>> >>>>> <integer>
>>> >>>>>    [1]         2
>>> >>>>>    [2]         3
>>> >>>>>    [3]         4
>>> >>>>>    [4]         5
>>> >>>>>    [5]         6
>>> >>>>>    [6]         7
>>> >>>>>    [7]         8
>>> >>>>>    [8]         9
>>> >>>>>    [9]        10
>>> >>>>>    ...       ...
>>> >>>>>   [18]        19
>>> >>>>>   [19]        20
>>> >>>>>   [20]        21
>>> >>>>>   [21]        22
>>> >>>>>   [22]        23
>>> >>>>>   [23]        24
>>> >>>>>   [24]        25
>>> >>>>>   [25]        26
>>> >>>>>   [26]        27
>>> >>>>>   ---
>>> >>>>>   seqlengths:
>>> >>>>>                     chr1                  chr2 ...
>>> >>>>> chr18_gl000207_random
>>> >>>>>                249250621             243199373 ...
>>> >>>>> 4262
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> Bioconductor mailing list
>>> >>>> Bioconductor at r-project.org
>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >>>> Search the archives:
>>> >>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>>
>>> >>> _______________________________________________
>>> >>> Bioconductor mailing list
>>> >>> Bioconductor at r-project.org
>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >>> Search the archives:
>>> >>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
                    
                
                