Extracting protein sequence associated to UCSC transcript id

0

Entering edit mode

rcaloger ▴ 500

@rcaloger-1888

Last seen 10.9 years ago

European Union

Hi, In order to validate fusion products I need to be sure that the peptides encoded by the the two fused proteins are in the same frame. I have now a function that allows to confirm the protein1 and protein2 have sequences located in the same frame. However, I got stack to retrieve those proteins sequences from UCSC. I did not found a quick way to retrieve the protein sequence associated to a UCSC ID. Indeed the protein sequence is present in the UCSC genome browser, but I do not know how to grab it. Any suggestion? Cheers Raffaele -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

• 3.5k views

ADD COMMENT • link updated 12.0 years ago by Michael Lawrence ★ 11k • written 12.0 years ago by rcaloger ▴ 500

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.1 years ago

United States

If you had the transcript coordinates (as GRangesList, perhaps from an exonsBy() on a TranscriptDb) you could use extractTranscriptsFromGenome() and translate, see the GenomicFeatures vignette for an example. Michael On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero@gmail.com>wrote: > Hi, > In order to validate fusion products I need to be sure that the peptides > encoded by the the two fused proteins are in the same frame. > I have now a function that allows to confirm the protein1 and protein2 > have sequences located in the same frame. > However, I got stack to retrieve those proteins sequences from UCSC. I did > not found a quick way to retrieve the protein sequence associated to a UCSC > ID. > Indeed the protein sequence is present in the UCSC genome browser, but I > do not know how to grab it. > Any suggestion? > Cheers > Raffaele > > -- > > ---------------------------------------- > Prof. Raffaele A. Calogero > Bioinformatics and Genomics Unit > MBC Centro di Biotecnologie Molecolari > Via Nizza 52, Torino 10126 > tel. ++39 0116706457 > Fax ++39 0112366457 > Mobile ++39 3333827080 > email: raffaele.calogero@unito.it > raffaele[dot]calogero[at]gmail[dot]com > www: http://www.bioinformatica.unito.it > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Dear Michael, thank for the kind suggestion but unfortunately it does not solve my problem because, using the approach you are suggesting, I do not have access to the position of the start codon for the different isoforms. Cheers Raf On 07/01/14 16:44, Michael Lawrence wrote: > If you had the transcript coordinates (as GRangesList, perhaps from an > exonsBy() on a TranscriptDb) you could use extractTranscriptsFromGenome() > and translate, see the GenomicFeatures vignette for an example. > > Michael > > > On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at="" gmail.com="">wrote: > >> Hi, >> In order to validate fusion products I need to be sure that the peptides >> encoded by the the two fused proteins are in the same frame. >> I have now a function that allows to confirm the protein1 and protein2 >> have sequences located in the same frame. >> However, I got stack to retrieve those proteins sequences from UCSC. I did >> not found a quick way to retrieve the protein sequence associated to a UCSC >> ID. >> Indeed the protein sequence is present in the UCSC genome browser, but I >> do not know how to grab it. >> Any suggestion? >> Cheers >> Raffaele >> >> -- >> >> ---------------------------------------- >> Prof. Raffaele A. Calogero >> Bioinformatics and Genomics Unit >> MBC Centro di Biotecnologie Molecolari >> Via Nizza 52, Torino 10126 >> tel. ++39 0116706457 >> Fax ++39 0112366457 >> Mobile ++39 3333827080 >> email: raffaele.calogero at unito.it >> raffaele[dot]calogero[at]gmail[dot]com >> www: http://www.bioinformatica.unito.it >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

ADD REPLY • link 12.0 years ago raffaele calogero ▴ 500

0

Entering edit mode

In theory, you should be able to get the cds regions using e.g. the Homo.sapiens package, but it seems kind of tough to retrieve those for UCSC Known Gene identifiers (assuming that is what you have). Marc Carlson could probably help more. Michael On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero@unito.it> wrote: > Dear Michael, > thank for the kind suggestion but unfortunately it does not solve my > problem because, using the approach you are suggesting, I do not have > access to the position of the start codon for the different isoforms. > Cheers > Raf > > > On 07/01/14 16:44, Michael Lawrence wrote: > >> If you had the transcript coordinates (as GRangesList, perhaps from an >> exonsBy() on a TranscriptDb) you could use extractTranscriptsFromGenome() >> and translate, see the GenomicFeatures vignette for an example. >> >> Michael >> >> >> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero@gmail.com> >> wrote: >> >> Hi, >>> In order to validate fusion products I need to be sure that the peptides >>> encoded by the the two fused proteins are in the same frame. >>> I have now a function that allows to confirm the protein1 and protein2 >>> have sequences located in the same frame. >>> However, I got stack to retrieve those proteins sequences from UCSC. I >>> did >>> not found a quick way to retrieve the protein sequence associated to a >>> UCSC >>> ID. >>> Indeed the protein sequence is present in the UCSC genome browser, but I >>> do not know how to grab it. >>> Any suggestion? >>> Cheers >>> Raffaele >>> >>> -- >>> >>> ---------------------------------------- >>> Prof. Raffaele A. Calogero >>> Bioinformatics and Genomics Unit >>> MBC Centro di Biotecnologie Molecolari >>> Via Nizza 52, Torino 10126 >>> tel. ++39 0116706457 >>> Fax ++39 0112366457 >>> Mobile ++39 3333827080 >>> email: raffaele.calogero@unito.it >>> raffaele[dot]calogero[at]gmail[dot]com >>> www: http://www.bioinformatica.unito.it >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> > > -- > > ---------------------------------------- > Prof. Raffaele A. Calogero > Bioinformatics and Genomics Unit > MBC Centro di Biotecnologie Molecolari > Via Nizza 52, Torino 10126 > tel. ++39 0116706457 > Fax ++39 0112366457 > Mobile ++39 3333827080 > email: raffaele.calogero@unito.it > raffaele[dot]calogero[at]gmail[dot]com > www: http://www.bioinformatica.unito.it > > [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I found a way to extract the protein sequences querying the UCSC web page. However, there should be a more elegant way to do it. library(RCurl) trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") myquery<- list() for(i in 1:length(trs)){ myquery[[i]] <- getURL(paste("http://genome-euro.ucsc.edu/cgi- bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", trs[i],sep="")) Sys.sleep(30) } It is interesting that in bioconductor there are no databases linking transcripts to proteins Cheers Raf On 08/01/14 17:10, Michael Lawrence wrote: > In theory, you should be able to get the cds regions using e.g. the > Homo.sapiens package, but it seems kind of tough to retrieve those for UCSC > Known Gene identifiers (assuming that is what you have). Marc Carlson could > probably help more. > > Michael > > > > > > On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at="" unito.it=""> wrote: > >> Dear Michael, >> thank for the kind suggestion but unfortunately it does not solve my >> problem because, using the approach you are suggesting, I do not have >> access to the position of the start codon for the different isoforms. >> Cheers >> Raf >> >> >> On 07/01/14 16:44, Michael Lawrence wrote: >> >>> If you had the transcript coordinates (as GRangesList, perhaps from an >>> exonsBy() on a TranscriptDb) you could use extractTranscriptsFromGenome() >>> and translate, see the GenomicFeatures vignette for an example. >>> >>> Michael >>> >>> >>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at="" gmail.com=""> >>> wrote: >>> >>> Hi, >>>> In order to validate fusion products I need to be sure that the peptides >>>> encoded by the the two fused proteins are in the same frame. >>>> I have now a function that allows to confirm the protein1 and protein2 >>>> have sequences located in the same frame. >>>> However, I got stack to retrieve those proteins sequences from UCSC. I >>>> did >>>> not found a quick way to retrieve the protein sequence associated to a >>>> UCSC >>>> ID. >>>> Indeed the protein sequence is present in the UCSC genome browser, but I >>>> do not know how to grab it. >>>> Any suggestion? >>>> Cheers >>>> Raffaele >>>> >>>> -- >>>> >>>> ---------------------------------------- >>>> Prof. Raffaele A. Calogero >>>> Bioinformatics and Genomics Unit >>>> MBC Centro di Biotecnologie Molecolari >>>> Via Nizza 52, Torino 10126 >>>> tel. ++39 0116706457 >>>> Fax ++39 0112366457 >>>> Mobile ++39 3333827080 >>>> email: raffaele.calogero at unito.it >>>> raffaele[dot]calogero[at]gmail[dot]com >>>> www: http://www.bioinformatica.unito.it >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane. >>>> science.biology.informatics.conductor >>>> >>>> >> -- >> >> ---------------------------------------- >> Prof. Raffaele A. Calogero >> Bioinformatics and Genomics Unit >> MBC Centro di Biotecnologie Molecolari >> Via Nizza 52, Torino 10126 >> tel. ++39 0116706457 >> Fax ++39 0112366457 >> Mobile ++39 3333827080 >> email: raffaele.calogero at unito.it >> raffaele[dot]calogero[at]gmail[dot]com >> www: http://www.bioinformatica.unito.it >> >> -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

ADD REPLY • link 12.0 years ago raffaele calogero ▴ 500

0

Entering edit mode

Hi Raf, On 01/08/2014 10:25 AM, rcaloger wrote: > I found a way to extract the protein sequences querying the UCSC web page. > However, there should be a more elegant way to do it. > library(RCurl) > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") > myquery<- list() > for(i in 1:length(trs)){ > myquery[[i]] <- > getURL(paste("http://genome-euro.ucsc.edu/cgi- bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", > trs[i],sep="")) > Sys.sleep(30) > } Your 3rd transcript is non-coding hence no protein sequence for it. Otherwise you get exactly what you wanted using Michael's suggestion: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) See which of your transcripts are coding: > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") > trs %in% names(cds_by_tx) [1] TRUE TRUE FALSE Extract and translate the CDS sequences (note that here I choose to compute the full proteome but I don't have to): library(BSgenome.Hsapiens.UCSC.hg19) cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, cds_by_tx) proteome <- translate(cds_seqs) names(proteome) <- names(cds_seqs) Then: > proteome A AAStringSet instance of length 63691 width seq names [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1 [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1 [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2 [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3 [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1 ... ... ... [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1 [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1 [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1 [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2 [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2 > trs <- c("uc003mfv.3", "uc001ajb.1") > proteome[trs] A AAStringSet instance of length 2 width seq names [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3 [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1 You can drop the trailing "*" with subseq(proteome[trs], end=-2) Cheers, H. > > It is interesting that in bioconductor there are no databases linking > transcripts to proteins > Cheers > Raf > > On 08/01/14 17:10, Michael Lawrence wrote: >> In theory, you should be able to get the cds regions using e.g. the >> Homo.sapiens package, but it seems kind of tough to retrieve those for >> UCSC >> Known Gene identifiers (assuming that is what you have). Marc Carlson >> could >> probably help more. >> >> Michael >> >> >> >> >> >> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at="" unito.it=""> >> wrote: >> >>> Dear Michael, >>> thank for the kind suggestion but unfortunately it does not solve my >>> problem because, using the approach you are suggesting, I do not have >>> access to the position of the start codon for the different isoforms. >>> Cheers >>> Raf >>> >>> >>> On 07/01/14 16:44, Michael Lawrence wrote: >>> >>>> If you had the transcript coordinates (as GRangesList, perhaps from an >>>> exonsBy() on a TranscriptDb) you could use >>>> extractTranscriptsFromGenome() >>>> and translate, see the GenomicFeatures vignette for an example. >>>> >>>> Michael >>>> >>>> >>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at="" gmail.com=""> >>>> wrote: >>>> >>>> Hi, >>>>> In order to validate fusion products I need to be sure that the >>>>> peptides >>>>> encoded by the the two fused proteins are in the same frame. >>>>> I have now a function that allows to confirm the protein1 and protein2 >>>>> have sequences located in the same frame. >>>>> However, I got stack to retrieve those proteins sequences from UCSC. I >>>>> did >>>>> not found a quick way to retrieve the protein sequence associated to a >>>>> UCSC >>>>> ID. >>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>> but I >>>>> do not know how to grab it. >>>>> Any suggestion? >>>>> Cheers >>>>> Raffaele >>>>> >>>>> -- >>>>> >>>>> ---------------------------------------- >>>>> Prof. Raffaele A. Calogero >>>>> Bioinformatics and Genomics Unit >>>>> MBC Centro di Biotecnologie Molecolari >>>>> Via Nizza 52, Torino 10126 >>>>> tel. ++39 0116706457 >>>>> Fax ++39 0112366457 >>>>> Mobile ++39 3333827080 >>>>> email: raffaele.calogero at unito.it >>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>> www: http://www.bioinformatica.unito.it >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane. >>>>> science.biology.informatics.conductor >>>>> >>>>> >>> -- >>> >>> ---------------------------------------- >>> Prof. Raffaele A. Calogero >>> Bioinformatics and Genomics Unit >>> MBC Centro di Biotecnologie Molecolari >>> Via Nizza 52, Torino 10126 >>> tel. ++39 0116706457 >>> Fax ++39 0112366457 >>> Mobile ++39 3333827080 >>> email: raffaele.calogero at unito.it >>> raffaele[dot]calogero[at]gmail[dot]com >>> www: http://www.bioinformatica.unito.it >>> >>> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Raffaele, I was about to post the same thing as Herve just did, but he not only beat me to the punch, but he also suggested using the names field for Biotstrings (which is an elegant suggestion). But if in addition if you should want to get pre-packaged translations from somewhere else, you can get some of those via biomaRt too. Some can be accessed using the UniProt.ws package. Here is an example of how you can also get that data via the UniProt.ws package (which talks to the UniProt web services): libraryUniProt.ws) seqs <- selectUniProt.ws, trs, "SEQUENCE", "UCSC") Which solution you should choose may depend on your use case. For example if you seek data where post-translational modifications are recorded, you would definitely not want to be translating them yourself. But translating them yourself gives you the advantage of being able to control which genome you pull sequences from. Hope this helps you, Marc On 01/08/2014 11:59 AM, Hervé Pagès wrote: > Hi Raf, > > On 01/08/2014 10:25 AM, rcaloger wrote: >> I found a way to extract the protein sequences querying the UCSC web >> page. >> However, there should be a more elegant way to do it. >> library(RCurl) >> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >> myquery<- list() >> for(i in 1:length(trs)){ >> myquery[[i]] <- >> getURL(paste("http://genome-euro.ucsc.edu/cgi- bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >> >> trs[i],sep="")) >> Sys.sleep(30) >> } > > Your 3rd transcript is non-coding hence no protein sequence for it. > > Otherwise you get exactly what you wanted using Michael's suggestion: > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) > > See which of your transcripts are coding: > > > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") > > trs %in% names(cds_by_tx) > [1] TRUE TRUE FALSE > > Extract and translate the CDS sequences (note that here I choose to > compute the full proteome but I don't have to): > > library(BSgenome.Hsapiens.UCSC.hg19) > cds_seqs <- > extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, cds_by_tx) > proteome <- translate(cds_seqs) > names(proteome) <- names(cds_seqs) > > Then: > > > proteome > A AAStringSet instance of length 63691 > width seq names > [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1 > [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1 > [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2 > [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3 > [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1 > ... ... ... > [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1 > [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1 > [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1 > [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2 > [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2 > > > trs <- c("uc003mfv.3", "uc001ajb.1") > > > proteome[trs] > A AAStringSet instance of length 2 > width seq names > [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3 > [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1 > > You can drop the trailing "*" with > > subseq(proteome[trs], end=-2) > > Cheers, > H. > >> >> It is interesting that in bioconductor there are no databases linking >> transcripts to proteins >> Cheers >> Raf >> >> On 08/01/14 17:10, Michael Lawrence wrote: >>> In theory, you should be able to get the cds regions using e.g. the >>> Homo.sapiens package, but it seems kind of tough to retrieve those for >>> UCSC >>> Known Gene identifiers (assuming that is what you have). Marc Carlson >>> could >>> probably help more. >>> >>> Michael >>> >>> >>> >>> >>> >>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at="" unito.it=""> >>> wrote: >>> >>>> Dear Michael, >>>> thank for the kind suggestion but unfortunately it does not solve my >>>> problem because, using the approach you are suggesting, I do not have >>>> access to the position of the start codon for the different isoforms. >>>> Cheers >>>> Raf >>>> >>>> >>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>> >>>>> If you had the transcript coordinates (as GRangesList, perhaps >>>>> from an >>>>> exonsBy() on a TranscriptDb) you could use >>>>> extractTranscriptsFromGenome() >>>>> and translate, see the GenomicFeatures vignette for an example. >>>>> >>>>> Michael >>>>> >>>>> >>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger >>>>> <raffaele.calogero at="" gmail.com=""> >>>>> wrote: >>>>> >>>>> Hi, >>>>>> In order to validate fusion products I need to be sure that the >>>>>> peptides >>>>>> encoded by the the two fused proteins are in the same frame. >>>>>> I have now a function that allows to confirm the protein1 and >>>>>> protein2 >>>>>> have sequences located in the same frame. >>>>>> However, I got stack to retrieve those proteins sequences from >>>>>> UCSC. I >>>>>> did >>>>>> not found a quick way to retrieve the protein sequence associated >>>>>> to a >>>>>> UCSC >>>>>> ID. >>>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>>> but I >>>>>> do not know how to grab it. >>>>>> Any suggestion? >>>>>> Cheers >>>>>> Raffaele >>>>>> >>>>>> -- >>>>>> >>>>>> ---------------------------------------- >>>>>> Prof. Raffaele A. Calogero >>>>>> Bioinformatics and Genomics Unit >>>>>> MBC Centro di Biotecnologie Molecolari >>>>>> Via Nizza 52, Torino 10126 >>>>>> tel. ++39 0116706457 >>>>>> Fax ++39 0112366457 >>>>>> Mobile ++39 3333827080 >>>>>> email: raffaele.calogero at unito.it >>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>> www: http://www.bioinformatica.unito.it >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>> science.biology.informatics.conductor >>>>>> >>>>>> >>>> -- >>>> >>>> ---------------------------------------- >>>> Prof. Raffaele A. Calogero >>>> Bioinformatics and Genomics Unit >>>> MBC Centro di Biotecnologie Molecolari >>>> Via Nizza 52, Torino 10126 >>>> tel. ++39 0116706457 >>>> Fax ++39 0112366457 >>>> Mobile ++39 3333827080 >>>> email: raffaele.calogero at unito.it >>>> raffaele[dot]calogero[at]gmail[dot]com >>>> www: http://www.bioinformatica.unito.it >>>> >>>> >> >> >

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

I had issues doing this with Homo.sapiens: First I tried just: > cdsBy(Homo.sapiens) Error in .select(x, keys, columns, keytype, ...) : You must provide cols argument Not really sure why? And the argument is "columns"... Then: > cdsBy(Homo.sapiens, columns=character()) Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : incorrect number of dimensions > cdsBy(Homo.sapiens, columns="UCSCKG") Warning messages: 1: In .generateExtraRows(tab, keys, jointype) : 'select' and duplicate query keys resulted in 1:many mapping between keys and return rows 2: In .generateExtraRows(tab, keys, jointype) : 'select' resulted in 1:many mapping between keys and return rows 3: In .generateExtraRows(tab, keys, jointype) : 'select' and duplicate query keys resulted in 1:many mapping between keys and return rows 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : duplicated levels in factors are deprecated Worked, but a bunch of warnings (not sure if all of them are helpful?) Finally, I was not aware of the use.names argument. Seems very useful. But apparently not supported yet for the OrganismDb objects: > cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) Error in .local(x, by, ...) : unused argument (use.names = TRUE) I think the OrganismDb objects are in theory a great step forward, because they let us easily integrate other identifiers. But I've had little luck getting them to work so far. Michael On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Raf, > > > On 01/08/2014 10:25 AM, rcaloger wrote: > >> I found a way to extract the protein sequences querying the UCSC web page. >> However, there should be a more elegant way to do it. >> library(RCurl) >> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >> myquery<- list() >> for(i in 1:length(trs)){ >> myquery[[i]] <- >> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >> trs[i],sep="")) >> Sys.sleep(30) >> } >> > > Your 3rd transcript is non-coding hence no protein sequence for it. > > Otherwise you get exactly what you wanted using Michael's suggestion: > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) > > See which of your transcripts are coding: > > > > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") > > trs %in% names(cds_by_tx) > [1] TRUE TRUE FALSE > > Extract and translate the CDS sequences (note that here I choose to > compute the full proteome but I don't have to): > > library(BSgenome.Hsapiens.UCSC.hg19) > cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, > cds_by_tx) > proteome <- translate(cds_seqs) > names(proteome) <- names(cds_seqs) > > Then: > > > proteome > A AAStringSet instance of length 63691 > width seq names > [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1 > [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1 > [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2 > [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3 > [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1 > ... ... ... > [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1 > [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1 > [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1 > [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2 > [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2 > > > trs <- c("uc003mfv.3", "uc001ajb.1") > > > proteome[trs] > A AAStringSet instance of length 2 > width seq names > [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3 > [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1 > > You can drop the trailing "*" with > > subseq(proteome[trs], end=-2) > > Cheers, > H. > > > >> It is interesting that in bioconductor there are no databases linking >> transcripts to proteins >> Cheers >> Raf >> >> On 08/01/14 17:10, Michael Lawrence wrote: >> >>> In theory, you should be able to get the cds regions using e.g. the >>> Homo.sapiens package, but it seems kind of tough to retrieve those for >>> UCSC >>> Known Gene identifiers (assuming that is what you have). Marc Carlson >>> could >>> probably help more. >>> >>> Michael >>> >>> >>> >>> >>> >>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero@unito.it> >>> wrote: >>> >>> Dear Michael, >>>> thank for the kind suggestion but unfortunately it does not solve my >>>> problem because, using the approach you are suggesting, I do not have >>>> access to the position of the start codon for the different isoforms. >>>> Cheers >>>> Raf >>>> >>>> >>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>> >>>> If you had the transcript coordinates (as GRangesList, perhaps from an >>>>> exonsBy() on a TranscriptDb) you could use >>>>> extractTranscriptsFromGenome() >>>>> and translate, see the GenomicFeatures vignette for an example. >>>>> >>>>> Michael >>>>> >>>>> >>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero@gmail.com> >>>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>>> In order to validate fusion products I need to be sure that the >>>>>> peptides >>>>>> encoded by the the two fused proteins are in the same frame. >>>>>> I have now a function that allows to confirm the protein1 and protein2 >>>>>> have sequences located in the same frame. >>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I >>>>>> did >>>>>> not found a quick way to retrieve the protein sequence associated to a >>>>>> UCSC >>>>>> ID. >>>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>>> but I >>>>>> do not know how to grab it. >>>>>> Any suggestion? >>>>>> Cheers >>>>>> Raffaele >>>>>> >>>>>> -- >>>>>> >>>>>> ---------------------------------------- >>>>>> Prof. Raffaele A. Calogero >>>>>> Bioinformatics and Genomics Unit >>>>>> MBC Centro di Biotecnologie Molecolari >>>>>> Via Nizza 52, Torino 10126 >>>>>> tel. ++39 0116706457 >>>>>> Fax ++39 0112366457 >>>>>> Mobile ++39 3333827080 >>>>>> email: raffaele.calogero@unito.it >>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>> www: http://www.bioinformatica.unito.it >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>> science.biology.informatics.conductor >>>>>> >>>>>> >>>>>> -- >>>> >>>> ---------------------------------------- >>>> Prof. Raffaele A. Calogero >>>> Bioinformatics and Genomics Unit >>>> MBC Centro di Biotecnologie Molecolari >>>> Via Nizza 52, Torino 10126 >>>> tel. ++39 0116706457 >>>> Fax ++39 0112366457 >>>> Mobile ++39 3333827080 >>>> email: raffaele.calogero@unito.it >>>> raffaele[dot]calogero[at]gmail[dot]com >>>> www: http://www.bioinformatica.unito.it >>>> >>>> >>>> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Dear Michael and Herv?, many thanks for the great help I followed the suggestion of Herv? Since I need to retrieve only two sequences I did the following ... txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) trs <- c(id1,id2) if(length(which(trs %in% names(cds_by_tx)))==1){ cat("\nOnly names(cds_by_tx)[which(trs %in% names(cds_by_tx))] is coding\n" return() }else if(length(which(trs %in% names(cds_by_tx)))==0){ cat("\nNon of the two transcripts is coding\n" return() } cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, cds_by_tx[which(names(cds_by_tx) %in% trs)]) proteome <- translate(cds_seqs) names(proteome) <- names(cds_seqs) p1 <- proteome[which(names(proteome)==id1)] p2 <- proteome[which(names(proteome)==id2)] ... Many thanks again! One point that would be of general interest. TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, there is no vignette associated to them, and documentation, in my opinion does not allow an easy understanding of the db. Do you know if some tutorial like that available for BSgenome is also present for TxDb.XX.YY.ZZ.knownGene? Cheers Raf On 09/01/14 00:49, Michael Lawrence wrote: > I had issues doing this with Homo.sapiens: > > First I tried just: >> cdsBy(Homo.sapiens) > Error in .select(x, keys, columns, keytype, ...) : > You must provide cols argument > > Not really sure why? And the argument is "columns"... > > Then: >> cdsBy(Homo.sapiens, columns=character()) > Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : > incorrect number of dimensions > > >> cdsBy(Homo.sapiens, columns="UCSCKG") > Warning messages: > 1: In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > 2: In .generateExtraRows(tab, keys, jointype) : > 'select' resulted in 1:many mapping between keys and return rows > 3: In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else > paste0(labels, : > duplicated levels in factors are deprecated > > Worked, but a bunch of warnings (not sure if all of them are helpful?) > > Finally, I was not aware of the use.names argument. Seems very useful. But > apparently not supported yet for the OrganismDb objects: > >> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) > Error in .local(x, by, ...) : unused argument (use.names = TRUE) > > I think the OrganismDb objects are in theory a great step forward, because > they let us easily integrate other identifiers. But I've had little luck > getting them to work so far. > > Michael > > > On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > >> Hi Raf, >> >> >> On 01/08/2014 10:25 AM, rcaloger wrote: >> >>> I found a way to extract the protein sequences querying the UCSC web page. >>> However, there should be a more elegant way to do it. >>> library(RCurl) >>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>> myquery<- list() >>> for(i in 1:length(trs)){ >>> myquery[[i]] <- >>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>> trs[i],sep="")) >>> Sys.sleep(30) >>> } >>> >> Your 3rd transcript is non-coding hence no protein sequence for it. >> >> Otherwise you get exactly what you wanted using Michael's suggestion: >> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >> >> See which of your transcripts are coding: >> >> >> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >> > trs %in% names(cds_by_tx) >> [1] TRUE TRUE FALSE >> >> Extract and translate the CDS sequences (note that here I choose to >> compute the full proteome but I don't have to): >> >> library(BSgenome.Hsapiens.UCSC.hg19) >> cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >> cds_by_tx) >> proteome <- translate(cds_seqs) >> names(proteome) <- names(cds_seqs) >> >> Then: >> >> > proteome >> A AAStringSet instance of length 63691 >> width seq names >> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1 >> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1 >> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2 >> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3 >> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1 >> ... ... ... >> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1 >> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1 >> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1 >> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2 >> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2 >> >> > trs <- c("uc003mfv.3", "uc001ajb.1") >> >> > proteome[trs] >> A AAStringSet instance of length 2 >> width seq names >> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3 >> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1 >> >> You can drop the trailing "*" with >> >> subseq(proteome[trs], end=-2) >> >> Cheers, >> H. >> >> >> >>> It is interesting that in bioconductor there are no databases linking >>> transcripts to proteins >>> Cheers >>> Raf >>> >>> On 08/01/14 17:10, Michael Lawrence wrote: >>> >>>> In theory, you should be able to get the cds regions using e.g. the >>>> Homo.sapiens package, but it seems kind of tough to retrieve those for >>>> UCSC >>>> Known Gene identifiers (assuming that is what you have). Marc Carlson >>>> could >>>> probably help more. >>>> >>>> Michael >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at="" unito.it=""> >>>> wrote: >>>> >>>> Dear Michael, >>>>> thank for the kind suggestion but unfortunately it does not solve my >>>>> problem because, using the approach you are suggesting, I do not have >>>>> access to the position of the start codon for the different isoforms. >>>>> Cheers >>>>> Raf >>>>> >>>>> >>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>> >>>>> If you had the transcript coordinates (as GRangesList, perhaps from an >>>>>> exonsBy() on a TranscriptDb) you could use >>>>>> extractTranscriptsFromGenome() >>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at="" gmail.com=""> >>>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>>> In order to validate fusion products I need to be sure that the >>>>>>> peptides >>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>> I have now a function that allows to confirm the protein1 and protein2 >>>>>>> have sequences located in the same frame. >>>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I >>>>>>> did >>>>>>> not found a quick way to retrieve the protein sequence associated to a >>>>>>> UCSC >>>>>>> ID. >>>>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>>>> but I >>>>>>> do not know how to grab it. >>>>>>> Any suggestion? >>>>>>> Cheers >>>>>>> Raffaele >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> ---------------------------------------- >>>>>>> Prof. Raffaele A. Calogero >>>>>>> Bioinformatics and Genomics Unit >>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>> Via Nizza 52, Torino 10126 >>>>>>> tel. ++39 0116706457 >>>>>>> Fax ++39 0112366457 >>>>>>> Mobile ++39 3333827080 >>>>>>> email: raffaele.calogero at unito.it >>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>> www: http://www.bioinformatica.unito.it >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>> science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>>> -- >>>>> ---------------------------------------- >>>>> Prof. Raffaele A. Calogero >>>>> Bioinformatics and Genomics Unit >>>>> MBC Centro di Biotecnologie Molecolari >>>>> Via Nizza 52, Torino 10126 >>>>> tel. ++39 0116706457 >>>>> Fax ++39 0112366457 >>>>> Mobile ++39 3333827080 >>>>> email: raffaele.calogero at unito.it >>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>> www: http://www.bioinformatica.unito.it >>>>> >>>>> >>>>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

ADD REPLY • link 12.0 years ago raffaele calogero ▴ 500

0

Entering edit mode

Hi Raf, The GenomicFeatures vignette covers these packages: http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFea tures/inst/doc/GenomicFeatures.pdf Best, Jim On 1/9/2014 5:24 AM, rcaloger wrote: > Dear Michael and Herv?, > many thanks for the great help > I followed the suggestion of Herv? > Since I need to retrieve only two sequences I did the following > ... > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) > trs <- c(id1,id2) > if(length(which(trs %in% names(cds_by_tx)))==1){ > cat("\nOnly names(cds_by_tx)[which(trs %in% names(cds_by_tx))] > is coding\n" > return() > }else if(length(which(trs %in% names(cds_by_tx)))==0){ > cat("\nNon of the two transcripts is coding\n" > return() > } > cds_seqs <- > extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, > cds_by_tx[which(names(cds_by_tx) %in% trs)]) > proteome <- translate(cds_seqs) > names(proteome) <- names(cds_seqs) > p1 <- proteome[which(names(proteome)==id1)] > p2 <- proteome[which(names(proteome)==id2)] > ... > > Many thanks again! > > One point that would be of general interest. > TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, > there is no vignette associated to them, and documentation, in my > opinion does not allow an easy understanding of the db. > Do you know if some tutorial like that available for BSgenome is also > present for TxDb.XX.YY.ZZ.knownGene? > Cheers > Raf > > On 09/01/14 00:49, Michael Lawrence wrote: >> I had issues doing this with Homo.sapiens: >> >> First I tried just: >>> cdsBy(Homo.sapiens) >> Error in .select(x, keys, columns, keytype, ...) : >> You must provide cols argument >> >> Not really sure why? And the argument is "columns"... >> >> Then: >>> cdsBy(Homo.sapiens, columns=character()) >> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : >> incorrect number of dimensions >> >> >>> cdsBy(Homo.sapiens, columns="UCSCKG") >> Warning messages: >> 1: In .generateExtraRows(tab, keys, jointype) : >> 'select' and duplicate query keys resulted in 1:many mapping between >> keys and return rows >> 2: In .generateExtraRows(tab, keys, jointype) : >> 'select' resulted in 1:many mapping between keys and return rows >> 3: In .generateExtraRows(tab, keys, jointype) : >> 'select' and duplicate query keys resulted in 1:many mapping between >> keys and return rows >> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) >> else >> paste0(labels, : >> duplicated levels in factors are deprecated >> >> Worked, but a bunch of warnings (not sure if all of them are helpful?) >> >> Finally, I was not aware of the use.names argument. Seems very >> useful. But >> apparently not supported yet for the OrganismDb objects: >> >>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) >> Error in .local(x, by, ...) : unused argument (use.names = TRUE) >> >> I think the OrganismDb objects are in theory a great step forward, >> because >> they let us easily integrate other identifiers. But I've had little luck >> getting them to work so far. >> >> Michael >> >> >> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> >>> Hi Raf, >>> >>> >>> On 01/08/2014 10:25 AM, rcaloger wrote: >>> >>>> I found a way to extract the protein sequences querying the UCSC >>>> web page. >>>> However, there should be a more elegant way to do it. >>>> library(RCurl) >>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>> myquery<- list() >>>> for(i in 1:length(trs)){ >>>> myquery[[i]] <- >>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>>> trs[i],sep="")) >>>> Sys.sleep(30) >>>> } >>>> >>> Your 3rd transcript is non-coding hence no protein sequence for it. >>> >>> Otherwise you get exactly what you wanted using Michael's suggestion: >>> >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>> >>> See which of your transcripts are coding: >>> >>> >>> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>> > trs %in% names(cds_by_tx) >>> [1] TRUE TRUE FALSE >>> >>> Extract and translate the CDS sequences (note that here I choose to >>> compute the full proteome but I don't have to): >>> >>> library(BSgenome.Hsapiens.UCSC.hg19) >>> cds_seqs <- >>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>> cds_by_tx) >>> proteome <- translate(cds_seqs) >>> names(proteome) <- names(cds_seqs) >>> >>> Then: >>> >>> > proteome >>> A AAStringSet instance of length 63691 >>> width seq names >>> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* >>> uc010nxq.1 >>> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* >>> uc001aal.1 >>> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>> uc009vjk.2 >>> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>> uc001aau.3 >>> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* >>> uc021oeh.1 >>> ... ... ... >>> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>> uc031tgq.1 >>> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>> uc031tgr.1 >>> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>> uc031tgs.1 >>> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>> uc011mgh.2 >>> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>> uc011mgi.2 >>> >>> > trs <- c("uc003mfv.3", "uc001ajb.1") >>> >>> > proteome[trs] >>> A AAStringSet instance of length 2 >>> width seq names >>> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* >>> uc003mfv.3 >>> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* >>> uc001ajb.1 >>> >>> You can drop the trailing "*" with >>> >>> subseq(proteome[trs], end=-2) >>> >>> Cheers, >>> H. >>> >>> >>> >>>> It is interesting that in bioconductor there are no databases linking >>>> transcripts to proteins >>>> Cheers >>>> Raf >>>> >>>> On 08/01/14 17:10, Michael Lawrence wrote: >>>> >>>>> In theory, you should be able to get the cds regions using e.g. the >>>>> Homo.sapiens package, but it seems kind of tough to retrieve those >>>>> for >>>>> UCSC >>>>> Known Gene identifiers (assuming that is what you have). Marc Carlson >>>>> could >>>>> probably help more. >>>>> >>>>> Michael >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at="" unito.it=""> >>>>> wrote: >>>>> >>>>> Dear Michael, >>>>>> thank for the kind suggestion but unfortunately it does not solve my >>>>>> problem because, using the approach you are suggesting, I do not >>>>>> have >>>>>> access to the position of the start codon for the different >>>>>> isoforms. >>>>>> Cheers >>>>>> Raf >>>>>> >>>>>> >>>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>>> >>>>>> If you had the transcript coordinates (as GRangesList, perhaps >>>>>> from an >>>>>>> exonsBy() on a TranscriptDb) you could use >>>>>>> extractTranscriptsFromGenome() >>>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> >>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger >>>>>>> <raffaele.calogero at="" gmail.com=""> >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>>> In order to validate fusion products I need to be sure that the >>>>>>>> peptides >>>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>>> I have now a function that allows to confirm the protein1 and >>>>>>>> protein2 >>>>>>>> have sequences located in the same frame. >>>>>>>> However, I got stack to retrieve those proteins sequences from >>>>>>>> UCSC. I >>>>>>>> did >>>>>>>> not found a quick way to retrieve the protein sequence >>>>>>>> associated to a >>>>>>>> UCSC >>>>>>>> ID. >>>>>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>>>>> but I >>>>>>>> do not know how to grab it. >>>>>>>> Any suggestion? >>>>>>>> Cheers >>>>>>>> Raffaele >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> ---------------------------------------- >>>>>>>> Prof. Raffaele A. Calogero >>>>>>>> Bioinformatics and Genomics Unit >>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>> tel. ++39 0116706457 >>>>>>>> Fax ++39 0112366457 >>>>>>>> Mobile ++39 3333827080 >>>>>>>> email: raffaele.calogero at unito.it >>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>>> science.biology.informatics.conductor >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>> ---------------------------------------- >>>>>> Prof. Raffaele A. Calogero >>>>>> Bioinformatics and Genomics Unit >>>>>> MBC Centro di Biotecnologie Molecolari >>>>>> Via Nizza 52, Torino 10126 >>>>>> tel. ++39 0116706457 >>>>>> Fax ++39 0112366457 >>>>>> Mobile ++39 3333827080 >>>>>> email: raffaele.calogero at unito.it >>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>> www: http://www.bioinformatica.unito.it >>>>>> >>>>>> >>>>>> >>>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 12.0 years ago James W. MacDonald 68k

0

Entering edit mode

Great! Thanks Raf On 09/01/14 15:19, James W. MacDonald wrote: > Hi Raf, > > The GenomicFeatures vignette covers these packages: > > http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicF eatures/inst/doc/GenomicFeatures.pdf > > > Best, > > Jim > > > On 1/9/2014 5:24 AM, rcaloger wrote: >> Dear Michael and Herv?, >> many thanks for the great help >> I followed the suggestion of Herv? >> Since I need to retrieve only two sequences I did the following >> ... >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >> trs <- c(id1,id2) >> if(length(which(trs %in% names(cds_by_tx)))==1){ >> cat("\nOnly names(cds_by_tx)[which(trs %in% >> names(cds_by_tx))] is coding\n" >> return() >> }else if(length(which(trs %in% names(cds_by_tx)))==0){ >> cat("\nNon of the two transcripts is coding\n" >> return() >> } >> cds_seqs <- >> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >> cds_by_tx[which(names(cds_by_tx) %in% trs)]) >> proteome <- translate(cds_seqs) >> names(proteome) <- names(cds_seqs) >> p1 <- proteome[which(names(proteome)==id1)] >> p2 <- proteome[which(names(proteome)==id2)] >> ... >> >> Many thanks again! >> >> One point that would be of general interest. >> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, >> there is no vignette associated to them, and documentation, in my >> opinion does not allow an easy understanding of the db. >> Do you know if some tutorial like that available for BSgenome is also >> present for TxDb.XX.YY.ZZ.knownGene? >> Cheers >> Raf >> >> On 09/01/14 00:49, Michael Lawrence wrote: >>> I had issues doing this with Homo.sapiens: >>> >>> First I tried just: >>>> cdsBy(Homo.sapiens) >>> Error in .select(x, keys, columns, keytype, ...) : >>> You must provide cols argument >>> >>> Not really sure why? And the argument is "columns"... >>> >>> Then: >>>> cdsBy(Homo.sapiens, columns=character()) >>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : >>> incorrect number of dimensions >>> >>> >>>> cdsBy(Homo.sapiens, columns="UCSCKG") >>> Warning messages: >>> 1: In .generateExtraRows(tab, keys, jointype) : >>> 'select' and duplicate query keys resulted in 1:many mapping between >>> keys and return rows >>> 2: In .generateExtraRows(tab, keys, jointype) : >>> 'select' resulted in 1:many mapping between keys and return rows >>> 3: In .generateExtraRows(tab, keys, jointype) : >>> 'select' and duplicate query keys resulted in 1:many mapping between >>> keys and return rows >>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) >>> else >>> paste0(labels, : >>> duplicated levels in factors are deprecated >>> >>> Worked, but a bunch of warnings (not sure if all of them are helpful?) >>> >>> Finally, I was not aware of the use.names argument. Seems very >>> useful. But >>> apparently not supported yet for the OrganismDb objects: >>> >>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) >>> Error in .local(x, by, ...) : unused argument (use.names = TRUE) >>> >>> I think the OrganismDb objects are in theory a great step forward, >>> because >>> they let us easily integrate other identifiers. But I've had little >>> luck >>> getting them to work so far. >>> >>> Michael >>> >>> >>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>> >>>> Hi Raf, >>>> >>>> >>>> On 01/08/2014 10:25 AM, rcaloger wrote: >>>> >>>>> I found a way to extract the protein sequences querying the UCSC >>>>> web page. >>>>> However, there should be a more elegant way to do it. >>>>> library(RCurl) >>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>>> myquery<- list() >>>>> for(i in 1:length(trs)){ >>>>> myquery[[i]] <- >>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>>>> trs[i],sep="")) >>>>> Sys.sleep(30) >>>>> } >>>>> >>>> Your 3rd transcript is non-coding hence no protein sequence for it. >>>> >>>> Otherwise you get exactly what you wanted using Michael's suggestion: >>>> >>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>>> >>>> See which of your transcripts are coding: >>>> >>>> >>>> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>> > trs %in% names(cds_by_tx) >>>> [1] TRUE TRUE FALSE >>>> >>>> Extract and translate the CDS sequences (note that here I choose to >>>> compute the full proteome but I don't have to): >>>> >>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>> cds_seqs <- >>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>>> cds_by_tx) >>>> proteome <- translate(cds_seqs) >>>> names(proteome) <- names(cds_seqs) >>>> >>>> Then: >>>> >>>> > proteome >>>> A AAStringSet instance of length 63691 >>>> width seq names >>>> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* >>>> uc010nxq.1 >>>> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* >>>> uc001aal.1 >>>> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>> uc009vjk.2 >>>> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>> uc001aau.3 >>>> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* >>>> uc021oeh.1 >>>> ... ... ... >>>> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>> uc031tgq.1 >>>> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>> uc031tgr.1 >>>> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>> uc031tgs.1 >>>> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>> uc011mgh.2 >>>> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>> uc011mgi.2 >>>> >>>> > trs <- c("uc003mfv.3", "uc001ajb.1") >>>> >>>> > proteome[trs] >>>> A AAStringSet instance of length 2 >>>> width seq names >>>> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* >>>> uc003mfv.3 >>>> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* >>>> uc001ajb.1 >>>> >>>> You can drop the trailing "*" with >>>> >>>> subseq(proteome[trs], end=-2) >>>> >>>> Cheers, >>>> H. >>>> >>>> >>>> >>>>> It is interesting that in bioconductor there are no databases linking >>>>> transcripts to proteins >>>>> Cheers >>>>> Raf >>>>> >>>>> On 08/01/14 17:10, Michael Lawrence wrote: >>>>> >>>>>> In theory, you should be able to get the cds regions using e.g. the >>>>>> Homo.sapiens package, but it seems kind of tough to retrieve >>>>>> those for >>>>>> UCSC >>>>>> Known Gene identifiers (assuming that is what you have). Marc >>>>>> Carlson >>>>>> could >>>>>> probably help more. >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger >>>>>> <raffaele.calogero at="" unito.it=""> >>>>>> wrote: >>>>>> >>>>>> Dear Michael, >>>>>>> thank for the kind suggestion but unfortunately it does not >>>>>>> solve my >>>>>>> problem because, using the approach you are suggesting, I do not >>>>>>> have >>>>>>> access to the position of the start codon for the different >>>>>>> isoforms. >>>>>>> Cheers >>>>>>> Raf >>>>>>> >>>>>>> >>>>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>>>> >>>>>>> If you had the transcript coordinates (as GRangesList, perhaps >>>>>>> from an >>>>>>>> exonsBy() on a TranscriptDb) you could use >>>>>>>> extractTranscriptsFromGenome() >>>>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger >>>>>>>> <raffaele.calogero at="" gmail.com=""> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>>> In order to validate fusion products I need to be sure that the >>>>>>>>> peptides >>>>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>>>> I have now a function that allows to confirm the protein1 and >>>>>>>>> protein2 >>>>>>>>> have sequences located in the same frame. >>>>>>>>> However, I got stack to retrieve those proteins sequences from >>>>>>>>> UCSC. I >>>>>>>>> did >>>>>>>>> not found a quick way to retrieve the protein sequence >>>>>>>>> associated to a >>>>>>>>> UCSC >>>>>>>>> ID. >>>>>>>>> Indeed the protein sequence is present in the UCSC genome >>>>>>>>> browser, >>>>>>>>> but I >>>>>>>>> do not know how to grab it. >>>>>>>>> Any suggestion? >>>>>>>>> Cheers >>>>>>>>> Raffaele >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> ---------------------------------------- >>>>>>>>> Prof. Raffaele A. Calogero >>>>>>>>> Bioinformatics and Genomics Unit >>>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>>> tel. ++39 0116706457 >>>>>>>>> Fax ++39 0112366457 >>>>>>>>> Mobile ++39 3333827080 >>>>>>>>> email: raffaele.calogero at unito.it >>>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>>>> science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>> ---------------------------------------- >>>>>>> Prof. Raffaele A. Calogero >>>>>>> Bioinformatics and Genomics Unit >>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>> Via Nizza 52, Torino 10126 >>>>>>> tel. ++39 0116706457 >>>>>>> Fax ++39 0112366457 >>>>>>> Mobile ++39 3333827080 >>>>>>> email: raffaele.calogero at unito.it >>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>> www: http://www.bioinformatica.unito.it >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >> >> > -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it

ADD REPLY • link 12.0 years ago rcaloger ▴ 500

0

Entering edit mode

Hi Raffaele, There is also workflow that describes all this stuff in a high level overview here: http://www.bioconductor.org/help/workflows/annotation/annotation/ This is a higher level view of things. So it may help you know about the different kinds of resources available and how to make use of each kind. Marc On 01/09/2014 07:01 AM, rcaloger wrote: > Great! > Thanks > Raf > > On 09/01/14 15:19, James W. MacDonald wrote: >> Hi Raf, >> >> The GenomicFeatures vignette covers these packages: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/Genomic Features/inst/doc/GenomicFeatures.pdf >> >> >> Best, >> >> Jim >> >> >> On 1/9/2014 5:24 AM, rcaloger wrote: >>> Dear Michael and Herv?, >>> many thanks for the great help >>> I followed the suggestion of Herv? >>> Since I need to retrieve only two sequences I did the following >>> ... >>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>> trs <- c(id1,id2) >>> if(length(which(trs %in% names(cds_by_tx)))==1){ >>> cat("\nOnly names(cds_by_tx)[which(trs %in% >>> names(cds_by_tx))] is coding\n" >>> return() >>> }else if(length(which(trs %in% names(cds_by_tx)))==0){ >>> cat("\nNon of the two transcripts is coding\n" >>> return() >>> } >>> cds_seqs <- >>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>> cds_by_tx[which(names(cds_by_tx) %in% trs)]) >>> proteome <- translate(cds_seqs) >>> names(proteome) <- names(cds_seqs) >>> p1 <- proteome[which(names(proteome)==id1)] >>> p2 <- proteome[which(names(proteome)==id2)] >>> ... >>> >>> Many thanks again! >>> >>> One point that would be of general interest. >>> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, >>> there is no vignette associated to them, and documentation, in my >>> opinion does not allow an easy understanding of the db. >>> Do you know if some tutorial like that available for BSgenome is >>> also present for TxDb.XX.YY.ZZ.knownGene? >>> Cheers >>> Raf >>> >>> On 09/01/14 00:49, Michael Lawrence wrote: >>>> I had issues doing this with Homo.sapiens: >>>> >>>> First I tried just: >>>>> cdsBy(Homo.sapiens) >>>> Error in .select(x, keys, columns, keytype, ...) : >>>> You must provide cols argument >>>> >>>> Not really sure why? And the argument is "columns"... >>>> >>>> Then: >>>>> cdsBy(Homo.sapiens, columns=character()) >>>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : >>>> incorrect number of dimensions >>>> >>>> >>>>> cdsBy(Homo.sapiens, columns="UCSCKG") >>>> Warning messages: >>>> 1: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' and duplicate query keys resulted in 1:many mapping >>>> between >>>> keys and return rows >>>> 2: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' resulted in 1:many mapping between keys and return rows >>>> 3: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' and duplicate query keys resulted in 1:many mapping >>>> between >>>> keys and return rows >>>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) >>>> as.character(labels) else >>>> paste0(labels, : >>>> duplicated levels in factors are deprecated >>>> >>>> Worked, but a bunch of warnings (not sure if all of them are helpful?) >>>> >>>> Finally, I was not aware of the use.names argument. Seems very >>>> useful. But >>>> apparently not supported yet for the OrganismDb objects: >>>> >>>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) >>>> Error in .local(x, by, ...) : unused argument (use.names = TRUE) >>>> >>>> I think the OrganismDb objects are in theory a great step forward, >>>> because >>>> they let us easily integrate other identifiers. But I've had little >>>> luck >>>> getting them to work so far. >>>> >>>> Michael >>>> >>>> >>>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>> >>>>> Hi Raf, >>>>> >>>>> >>>>> On 01/08/2014 10:25 AM, rcaloger wrote: >>>>> >>>>>> I found a way to extract the protein sequences querying the UCSC >>>>>> web page. >>>>>> However, there should be a more elegant way to do it. >>>>>> library(RCurl) >>>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>>>> myquery<- list() >>>>>> for(i in 1:length(trs)){ >>>>>> myquery[[i]] <- >>>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>>>>> trs[i],sep="")) >>>>>> Sys.sleep(30) >>>>>> } >>>>>> >>>>> Your 3rd transcript is non-coding hence no protein sequence for it. >>>>> >>>>> Otherwise you get exactly what you wanted using Michael's suggestion: >>>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>>>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>>>> >>>>> See which of your transcripts are coding: >>>>> >>>>> >>>>> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>>> > trs %in% names(cds_by_tx) >>>>> [1] TRUE TRUE FALSE >>>>> >>>>> Extract and translate the CDS sequences (note that here I choose to >>>>> compute the full proteome but I don't have to): >>>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>>> cds_seqs <- >>>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>>>> cds_by_tx) >>>>> proteome <- translate(cds_seqs) >>>>> names(proteome) <- names(cds_seqs) >>>>> >>>>> Then: >>>>> >>>>> > proteome >>>>> A AAStringSet instance of length 63691 >>>>> width seq names >>>>> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* >>>>> uc010nxq.1 >>>>> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* >>>>> uc001aal.1 >>>>> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>>> uc009vjk.2 >>>>> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>>> uc001aau.3 >>>>> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* >>>>> uc021oeh.1 >>>>> ... ... ... >>>>> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgq.1 >>>>> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgr.1 >>>>> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgs.1 >>>>> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>>> uc011mgh.2 >>>>> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>>> uc011mgi.2 >>>>> >>>>> > trs <- c("uc003mfv.3", "uc001ajb.1") >>>>> >>>>> > proteome[trs] >>>>> A AAStringSet instance of length 2 >>>>> width seq names >>>>> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* >>>>> uc003mfv.3 >>>>> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* >>>>> uc001ajb.1 >>>>> >>>>> You can drop the trailing "*" with >>>>> >>>>> subseq(proteome[trs], end=-2) >>>>> >>>>> Cheers, >>>>> H. >>>>> >>>>> >>>>> >>>>>> It is interesting that in bioconductor there are no databases >>>>>> linking >>>>>> transcripts to proteins >>>>>> Cheers >>>>>> Raf >>>>>> >>>>>> On 08/01/14 17:10, Michael Lawrence wrote: >>>>>> >>>>>>> In theory, you should be able to get the cds regions using e.g. the >>>>>>> Homo.sapiens package, but it seems kind of tough to retrieve >>>>>>> those for >>>>>>> UCSC >>>>>>> Known Gene identifiers (assuming that is what you have). Marc >>>>>>> Carlson >>>>>>> could >>>>>>> probably help more. >>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger >>>>>>> <raffaele.calogero at="" unito.it=""> >>>>>>> wrote: >>>>>>> >>>>>>> Dear Michael, >>>>>>>> thank for the kind suggestion but unfortunately it does not >>>>>>>> solve my >>>>>>>> problem because, using the approach you are suggesting, I do >>>>>>>> not have >>>>>>>> access to the position of the start codon for the different >>>>>>>> isoforms. >>>>>>>> Cheers >>>>>>>> Raf >>>>>>>> >>>>>>>> >>>>>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>>>>> >>>>>>>> If you had the transcript coordinates (as GRangesList, >>>>>>>> perhaps from an >>>>>>>>> exonsBy() on a TranscriptDb) you could use >>>>>>>>> extractTranscriptsFromGenome() >>>>>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>>>>> >>>>>>>>> Michael >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger >>>>>>>>> <raffaele.calogero at="" gmail.com=""> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>>> In order to validate fusion products I need to be sure that the >>>>>>>>>> peptides >>>>>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>>>>> I have now a function that allows to confirm the protein1 and >>>>>>>>>> protein2 >>>>>>>>>> have sequences located in the same frame. >>>>>>>>>> However, I got stack to retrieve those proteins sequences >>>>>>>>>> from UCSC. I >>>>>>>>>> did >>>>>>>>>> not found a quick way to retrieve the protein sequence >>>>>>>>>> associated to a >>>>>>>>>> UCSC >>>>>>>>>> ID. >>>>>>>>>> Indeed the protein sequence is present in the UCSC genome >>>>>>>>>> browser, >>>>>>>>>> but I >>>>>>>>>> do not know how to grab it. >>>>>>>>>> Any suggestion? >>>>>>>>>> Cheers >>>>>>>>>> Raffaele >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> ---------------------------------------- >>>>>>>>>> Prof. Raffaele A. Calogero >>>>>>>>>> Bioinformatics and Genomics Unit >>>>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>>>> tel. ++39 0116706457 >>>>>>>>>> Fax ++39 0112366457 >>>>>>>>>> Mobile ++39 3333827080 >>>>>>>>>> email: raffaele.calogero at unito.it >>>>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>>>>> science.biology.informatics.conductor >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>> ---------------------------------------- >>>>>>>> Prof. Raffaele A. Calogero >>>>>>>> Bioinformatics and Genomics Unit >>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>> tel. ++39 0116706457 >>>>>>>> Fax ++39 0112366457 >>>>>>>> Mobile ++39 3333827080 >>>>>>>> email: raffaele.calogero at unito.it >>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>>> >>> >>> >> > >

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Raffaele, There is also workflow that describes all this stuff in a high level overview here: http://www.bioconductor.org/help/workflows/annotation/annotation/ This is a higher level view of things. So it may help you know about the different kinds of resources available and how to make use of each kind. Marc On 01/09/2014 07:01 AM, rcaloger wrote: > Great! > Thanks > Raf > > On 09/01/14 15:19, James W. MacDonald wrote: >> Hi Raf, >> >> The GenomicFeatures vignette covers these packages: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/Genomic Features/inst/doc/GenomicFeatures.pdf >> >> >> Best, >> >> Jim >> >> >> On 1/9/2014 5:24 AM, rcaloger wrote: >>> Dear Michael and Herv?, >>> many thanks for the great help >>> I followed the suggestion of Herv? >>> Since I need to retrieve only two sequences I did the following >>> ... >>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>> trs <- c(id1,id2) >>> if(length(which(trs %in% names(cds_by_tx)))==1){ >>> cat("\nOnly names(cds_by_tx)[which(trs %in% >>> names(cds_by_tx))] is coding\n" >>> return() >>> }else if(length(which(trs %in% names(cds_by_tx)))==0){ >>> cat("\nNon of the two transcripts is coding\n" >>> return() >>> } >>> cds_seqs <- >>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>> cds_by_tx[which(names(cds_by_tx) %in% trs)]) >>> proteome <- translate(cds_seqs) >>> names(proteome) <- names(cds_seqs) >>> p1 <- proteome[which(names(proteome)==id1)] >>> p2 <- proteome[which(names(proteome)==id2)] >>> ... >>> >>> Many thanks again! >>> >>> One point that would be of general interest. >>> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, >>> there is no vignette associated to them, and documentation, in my >>> opinion does not allow an easy understanding of the db. >>> Do you know if some tutorial like that available for BSgenome is >>> also present for TxDb.XX.YY.ZZ.knownGene? >>> Cheers >>> Raf >>> >>> On 09/01/14 00:49, Michael Lawrence wrote: >>>> I had issues doing this with Homo.sapiens: >>>> >>>> First I tried just: >>>>> cdsBy(Homo.sapiens) >>>> Error in .select(x, keys, columns, keytype, ...) : >>>> You must provide cols argument >>>> >>>> Not really sure why? And the argument is "columns"... >>>> >>>> Then: >>>>> cdsBy(Homo.sapiens, columns=character()) >>>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : >>>> incorrect number of dimensions >>>> >>>> >>>>> cdsBy(Homo.sapiens, columns="UCSCKG") >>>> Warning messages: >>>> 1: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' and duplicate query keys resulted in 1:many mapping >>>> between >>>> keys and return rows >>>> 2: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' resulted in 1:many mapping between keys and return rows >>>> 3: In .generateExtraRows(tab, keys, jointype) : >>>> 'select' and duplicate query keys resulted in 1:many mapping >>>> between >>>> keys and return rows >>>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) >>>> as.character(labels) else >>>> paste0(labels, : >>>> duplicated levels in factors are deprecated >>>> >>>> Worked, but a bunch of warnings (not sure if all of them are helpful?) >>>> >>>> Finally, I was not aware of the use.names argument. Seems very >>>> useful. But >>>> apparently not supported yet for the OrganismDb objects: >>>> >>>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) >>>> Error in .local(x, by, ...) : unused argument (use.names = TRUE) >>>> >>>> I think the OrganismDb objects are in theory a great step forward, >>>> because >>>> they let us easily integrate other identifiers. But I've had little >>>> luck >>>> getting them to work so far. >>>> >>>> Michael >>>> >>>> >>>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>> >>>>> Hi Raf, >>>>> >>>>> >>>>> On 01/08/2014 10:25 AM, rcaloger wrote: >>>>> >>>>>> I found a way to extract the protein sequences querying the UCSC >>>>>> web page. >>>>>> However, there should be a more elegant way to do it. >>>>>> library(RCurl) >>>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>>>> myquery<- list() >>>>>> for(i in 1:length(trs)){ >>>>>> myquery[[i]] <- >>>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>>>>> trs[i],sep="")) >>>>>> Sys.sleep(30) >>>>>> } >>>>>> >>>>> Your 3rd transcript is non-coding hence no protein sequence for it. >>>>> >>>>> Otherwise you get exactly what you wanted using Michael's suggestion: >>>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>>>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >>>>> >>>>> See which of your transcripts are coding: >>>>> >>>>> >>>>> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>>>> > trs %in% names(cds_by_tx) >>>>> [1] TRUE TRUE FALSE >>>>> >>>>> Extract and translate the CDS sequences (note that here I choose to >>>>> compute the full proteome but I don't have to): >>>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>>> cds_seqs <- >>>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >>>>> cds_by_tx) >>>>> proteome <- translate(cds_seqs) >>>>> names(proteome) <- names(cds_seqs) >>>>> >>>>> Then: >>>>> >>>>> > proteome >>>>> A AAStringSet instance of length 63691 >>>>> width seq names >>>>> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* >>>>> uc010nxq.1 >>>>> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* >>>>> uc001aal.1 >>>>> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>>> uc009vjk.2 >>>>> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* >>>>> uc001aau.3 >>>>> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* >>>>> uc021oeh.1 >>>>> ... ... ... >>>>> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgq.1 >>>>> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgr.1 >>>>> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* >>>>> uc031tgs.1 >>>>> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>>> uc011mgh.2 >>>>> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* >>>>> uc011mgi.2 >>>>> >>>>> > trs <- c("uc003mfv.3", "uc001ajb.1") >>>>> >>>>> > proteome[trs] >>>>> A AAStringSet instance of length 2 >>>>> width seq names >>>>> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* >>>>> uc003mfv.3 >>>>> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* >>>>> uc001ajb.1 >>>>> >>>>> You can drop the trailing "*" with >>>>> >>>>> subseq(proteome[trs], end=-2) >>>>> >>>>> Cheers, >>>>> H. >>>>> >>>>> >>>>> >>>>>> It is interesting that in bioconductor there are no databases >>>>>> linking >>>>>> transcripts to proteins >>>>>> Cheers >>>>>> Raf >>>>>> >>>>>> On 08/01/14 17:10, Michael Lawrence wrote: >>>>>> >>>>>>> In theory, you should be able to get the cds regions using e.g. the >>>>>>> Homo.sapiens package, but it seems kind of tough to retrieve >>>>>>> those for >>>>>>> UCSC >>>>>>> Known Gene identifiers (assuming that is what you have). Marc >>>>>>> Carlson >>>>>>> could >>>>>>> probably help more. >>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger >>>>>>> <raffaele.calogero at="" unito.it=""> >>>>>>> wrote: >>>>>>> >>>>>>> Dear Michael, >>>>>>>> thank for the kind suggestion but unfortunately it does not >>>>>>>> solve my >>>>>>>> problem because, using the approach you are suggesting, I do >>>>>>>> not have >>>>>>>> access to the position of the start codon for the different >>>>>>>> isoforms. >>>>>>>> Cheers >>>>>>>> Raf >>>>>>>> >>>>>>>> >>>>>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>>>>> >>>>>>>> If you had the transcript coordinates (as GRangesList, >>>>>>>> perhaps from an >>>>>>>>> exonsBy() on a TranscriptDb) you could use >>>>>>>>> extractTranscriptsFromGenome() >>>>>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>>>>> >>>>>>>>> Michael >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger >>>>>>>>> <raffaele.calogero at="" gmail.com=""> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>>> In order to validate fusion products I need to be sure that the >>>>>>>>>> peptides >>>>>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>>>>> I have now a function that allows to confirm the protein1 and >>>>>>>>>> protein2 >>>>>>>>>> have sequences located in the same frame. >>>>>>>>>> However, I got stack to retrieve those proteins sequences >>>>>>>>>> from UCSC. I >>>>>>>>>> did >>>>>>>>>> not found a quick way to retrieve the protein sequence >>>>>>>>>> associated to a >>>>>>>>>> UCSC >>>>>>>>>> ID. >>>>>>>>>> Indeed the protein sequence is present in the UCSC genome >>>>>>>>>> browser, >>>>>>>>>> but I >>>>>>>>>> do not know how to grab it. >>>>>>>>>> Any suggestion? >>>>>>>>>> Cheers >>>>>>>>>> Raffaele >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> ---------------------------------------- >>>>>>>>>> Prof. Raffaele A. Calogero >>>>>>>>>> Bioinformatics and Genomics Unit >>>>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>>>> tel. ++39 0116706457 >>>>>>>>>> Fax ++39 0112366457 >>>>>>>>>> Mobile ++39 3333827080 >>>>>>>>>> email: raffaele.calogero at unito.it >>>>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>>>>> science.biology.informatics.conductor >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>> ---------------------------------------- >>>>>>>> Prof. Raffaele A. Calogero >>>>>>>> Bioinformatics and Genomics Unit >>>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>>> Via Nizza 52, Torino 10126 >>>>>>>> tel. ++39 0116706457 >>>>>>>> Fax ++39 0112366457 >>>>>>>> Mobile ++39 3333827080 >>>>>>>> email: raffaele.calogero at unito.it >>>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>>> www: http://www.bioinformatica.unito.it >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>>> >>> >>> >> > >

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Michael, I am sorry to hear that you have had problems. I just made some bug fixes to trunk based on your feedback and these methods do work properly if you call them using all the required arguments, but I clearly can do some more work here to make things more user friendly. But in order to make use.names=TRUE work, I am going to have to make some more major changes to how the code works internally so that will take a little more time. And while the warnings you mentioned are all technically correct, I agree that at this point they are a little removed from the context that the user is probably expecting. I can probably do something to summarize the warning results that come about after calling several different select methods. I will put a revision of this into my to do list. Thanks for the feedback, Marc On 01/08/2014 03:49 PM, Michael Lawrence wrote: > I had issues doing this with Homo.sapiens: > > First I tried just: >> cdsBy(Homo.sapiens) > Error in .select(x, keys, columns, keytype, ...) : > You must provide cols argument > > Not really sure why? And the argument is "columns"... > > Then: >> cdsBy(Homo.sapiens, columns=character()) > Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] : > incorrect number of dimensions > > >> cdsBy(Homo.sapiens, columns="UCSCKG") > Warning messages: > 1: In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > 2: In .generateExtraRows(tab, keys, jointype) : > 'select' resulted in 1:many mapping between keys and return rows > 3: In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else > paste0(labels, : > duplicated levels in factors are deprecated > > Worked, but a bunch of warnings (not sure if all of them are helpful?) > > Finally, I was not aware of the use.names argument. Seems very useful. But > apparently not supported yet for the OrganismDb objects: > >> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE) > Error in .local(x, by, ...) : unused argument (use.names = TRUE) > > I think the OrganismDb objects are in theory a great step forward, because > they let us easily integrate other identifiers. But I've had little luck > getting them to work so far. > > Michael > > > On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > >> Hi Raf, >> >> >> On 01/08/2014 10:25 AM, rcaloger wrote: >> >>> I found a way to extract the protein sequences querying the UCSC web page. >>> However, there should be a more elegant way to do it. >>> library(RCurl) >>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >>> myquery<- list() >>> for(i in 1:length(trs)){ >>> myquery[[i]] <- >>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene? >>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", >>> trs[i],sep="")) >>> Sys.sleep(30) >>> } >>> >> Your 3rd transcript is non-coding hence no protein sequence for it. >> >> Otherwise you get exactly what you wanted using Michael's suggestion: >> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE) >> >> See which of your transcripts are coding: >> >> >> > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2") >> > trs %in% names(cds_by_tx) >> [1] TRUE TRUE FALSE >> >> Extract and translate the CDS sequences (note that here I choose to >> compute the full proteome but I don't have to): >> >> library(BSgenome.Hsapiens.UCSC.hg19) >> cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, >> cds_by_tx) >> proteome <- translate(cds_seqs) >> names(proteome) <- names(cds_seqs) >> >> Then: >> >> > proteome >> A AAStringSet instance of length 63691 >> width seq names >> [1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1 >> [2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1 >> [3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2 >> [4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3 >> [5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1 >> ... ... ... >> [63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1 >> [63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1 >> [63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1 >> [63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2 >> [63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2 >> >> > trs <- c("uc003mfv.3", "uc001ajb.1") >> >> > proteome[trs] >> A AAStringSet instance of length 2 >> width seq names >> [1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3 >> [2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1 >> >> You can drop the trailing "*" with >> >> subseq(proteome[trs], end=-2) >> >> Cheers, >> H. >> >> >> >>> It is interesting that in bioconductor there are no databases linking >>> transcripts to proteins >>> Cheers >>> Raf >>> >>> On 08/01/14 17:10, Michael Lawrence wrote: >>> >>>> In theory, you should be able to get the cds regions using e.g. the >>>> Homo.sapiens package, but it seems kind of tough to retrieve those for >>>> UCSC >>>> Known Gene identifiers (assuming that is what you have). Marc Carlson >>>> could >>>> probably help more. >>>> >>>> Michael >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero@unito.it> >>>> wrote: >>>> >>>> Dear Michael, >>>>> thank for the kind suggestion but unfortunately it does not solve my >>>>> problem because, using the approach you are suggesting, I do not have >>>>> access to the position of the start codon for the different isoforms. >>>>> Cheers >>>>> Raf >>>>> >>>>> >>>>> On 07/01/14 16:44, Michael Lawrence wrote: >>>>> >>>>> If you had the transcript coordinates (as GRangesList, perhaps from an >>>>>> exonsBy() on a TranscriptDb) you could use >>>>>> extractTranscriptsFromGenome() >>>>>> and translate, see the GenomicFeatures vignette for an example. >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero@gmail.com> >>>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>>> In order to validate fusion products I need to be sure that the >>>>>>> peptides >>>>>>> encoded by the the two fused proteins are in the same frame. >>>>>>> I have now a function that allows to confirm the protein1 and protein2 >>>>>>> have sequences located in the same frame. >>>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I >>>>>>> did >>>>>>> not found a quick way to retrieve the protein sequence associated to a >>>>>>> UCSC >>>>>>> ID. >>>>>>> Indeed the protein sequence is present in the UCSC genome browser, >>>>>>> but I >>>>>>> do not know how to grab it. >>>>>>> Any suggestion? >>>>>>> Cheers >>>>>>> Raffaele >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> ---------------------------------------- >>>>>>> Prof. Raffaele A. Calogero >>>>>>> Bioinformatics and Genomics Unit >>>>>>> MBC Centro di Biotecnologie Molecolari >>>>>>> Via Nizza 52, Torino 10126 >>>>>>> tel. ++39 0116706457 >>>>>>> Fax ++39 0112366457 >>>>>>> Mobile ++39 3333827080 >>>>>>> email: raffaele.calogero@unito.it >>>>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>>>> www: http://www.bioinformatica.unito.it >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor@r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/gmane. >>>>>>> science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>>> -- >>>>> ---------------------------------------- >>>>> Prof. Raffaele A. Calogero >>>>> Bioinformatics and Genomics Unit >>>>> MBC Centro di Biotecnologie Molecolari >>>>> Via Nizza 52, Torino 10126 >>>>> tel. ++39 0116706457 >>>>> Fax ++39 0112366457 >>>>> Mobile ++39 3333827080 >>>>> email: raffaele.calogero@unito.it >>>>> raffaele[dot]calogero[at]gmail[dot]com >>>>> www: http://www.bioinformatica.unito.it >>>>> >>>>> >>>>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

Login before adding your answer.