merging GRange objects
1
1
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.7 years ago
Hi, I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation. txdb = TxDb.Hsapiens.UCSC.hg19.knownGene mate.range= GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[i sSameCzome])-offset,start(mate1[isSameCzome])+offset)) codingRegions = refLocsToLocalLocs(mate.range, txdb) trans.info=select(txdb, key=values(codingRegions)$TXID, cols=c("GENEID","TXNAME"), keytype="TXID") trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL")) mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") Thanks ../Murli -- output of sessionInfo(): > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] biomaRt_2.16.0 [2] org.Hs.eg.db_2.9.0 [3] RSQLite_0.11.4 [4] DBI_0.2-7 [5] VariantAnnotation_1.6.6 [6] Rsamtools_1.12.3 [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19 [8] BSgenome_1.28.0 [9] Biostrings_2.28.0 [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 [11] GenomicFeatures_1.12.2 [12] AnnotationDbi_1.22.6 [13] Biobase_2.20.0 [14] GenomicRanges_1.12.4 [15] IRanges_1.18.1 [16] BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1 [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 -- Sent via the guest posting facility at bioconductor.org.
Annotation BSgenome BSgenome Annotation BSgenome BSgenome • 7.4k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 4 hours ago
Seattle, WA, United States
Hi Murli, On 06/24/2013 04:55 PM, Murli [guest] wrote: > > Hi, > > I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation. > > txdb = TxDb.Hsapiens.UCSC.hg19.knownGene > > mate.range= GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1 [isSameCzome])-offset,start(mate1[isSameCzome])+offset)) > > codingRegions = refLocsToLocalLocs(mate.range, txdb) > > trans.info=select(txdb, key=values(codingRegions)$TXID, cols=c("GENEID","TXNAME"), keytype="TXID") > > trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL")) > > mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") > > mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code. The exact meaning of "merging" depends on the objects involved. Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind(). Note that since GRanges and GappedAlignments objects are conceptually vector-like objects of dimension 1, only c() works on them. That is, rbind(), cbind(), and merge() (which are typically operating on 2-D objects) are not supported on those objects. Cheers, H. > > Thanks ../Murli > > > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] biomaRt_2.16.0 > [2] org.Hs.eg.db_2.9.0 > [3] RSQLite_0.11.4 > [4] DBI_0.2-7 > [5] VariantAnnotation_1.6.6 > [6] Rsamtools_1.12.3 > [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19 > [8] BSgenome_1.28.0 > [9] Biostrings_2.28.0 > [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 > [11] GenomicFeatures_1.12.2 > [12] AnnotationDbi_1.22.6 > [13] Biobase_2.20.0 > [14] GenomicRanges_1.12.4 > [15] IRanges_1.18.1 > [16] BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1 > [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Hi Herv? , I am annotating paired end reads and code is for mate1, mate2 is the same. As you can see I am using the genomic coordinates and trying to annotate them using the UCSC known genes table. I need to ultimately make an association with the coordinates, a reason why I am trying to merge the outputs. I am converting them into data frames and merging them because select returns a data frame, so I have to convert the GRanges object to a data frame to merge them. I want to make sure is that I am not messing my data when I am merging them. Would the following lines correctly combine them? > mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") > mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") When I reviewed the first few lines and they seemed ok, but there could always be exceptions. If there is a better way please let me know. I am very new to Bioconductor. Thanks for your help. Cheers../Murli -----Original Message----- From: Hervé Pagès [mailto:hpages@fhcrc.org] Sent: Monday, June 24, 2013 8:29 PM To: Murli [guest] Cc: bioconductor at r-project.org; Nair, Murlidharan T Subject: Re: [BioC] merging GRange objects Hi Murli, On 06/24/2013 04:55 PM, Murli [guest] wrote: > > Hi, > > I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation. > > txdb = TxDb.Hsapiens.UCSC.hg19.knownGene > > mate.range= > GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[isSameCzome]) > -offset,start(mate1[isSameCzome])+offset)) > > codingRegions = refLocsToLocalLocs(mate.range, txdb) > > trans.info=select(txdb, key=values(codingRegions)$TXID, > cols=c("GENEID","TXNAME"), keytype="TXID") > > trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", > "SYMBOL")) > > mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), > by.x="ENTREZID", by.y="GENEID") > > mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", > by.y="TXID") It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code. The exact meaning of "merging" depends on the objects involved. Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind(). Note that since GRanges and GappedAlignments objects are conceptually vector-like objects of dimension 1, only c() works on them. That is, rbind(), cbind(), and merge() (which are typically operating on 2-D objects) are not supported on those objects. Cheers, H. > > Thanks ../Murli > > > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] biomaRt_2.16.0 > [2] org.Hs.eg.db_2.9.0 > [3] RSQLite_0.11.4 > [4] DBI_0.2-7 > [5] VariantAnnotation_1.6.6 > [6] Rsamtools_1.12.3 > [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19 > [8] BSgenome_1.28.0 > [9] Biostrings_2.28.0 > [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 > [11] GenomicFeatures_1.12.2 > [12] AnnotationDbi_1.22.6 > [13] Biobase_2.20.0 > [14] GenomicRanges_1.12.4 > [15] IRanges_1.18.1 > [16] BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1 > [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Murli, On 06/24/2013 06:00 PM, Nair, Murlidharan T wrote: > Hi Herv? , > I am annotating paired end reads and code is for mate1, mate2 is the same. As you can see I am using the genomic coordinates and trying to annotate them using the UCSC known genes table. I need to ultimately make an association with the coordinates, a reason why I am trying to merge the outputs. I am converting them into data frames and merging them because select returns a data frame, so I have to convert the GRanges object to a data frame to merge them. I want to make sure is that I am not messing my data when I am merging them. Would the following lines correctly combine them? > >> mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") > >> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > > When I reviewed the first few lines and they seemed ok, but there could always be exceptions. If there is a better way please let me know. I am very new to Bioconductor. You didn't send us code we can reproduce but here is code that tries to achieve something similar: library(GenomicRanges) mate.range <- GRanges("chr1", IRanges(c(39800000, 90400000, 39800066),width=500)) library(VariantAnnotation) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene codingRegions <- refLocsToLocalLocs(mate.range, txdb) trans.info <- select(txdb, key=mcols(codingRegions)$TXID, cols=c("GENEID","TXNAME"), keytype="TXID") library(org.Hs.eg.db) trans.names <- select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL")) mrg.data1 <- merge(trans.names, trans.info, by.x="ENTREZID", by.y="GENEID") mrg.data2 <- merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > mrg.data2 TXID ENTREZID GENENAME SYMBOL 1 963 23499 microtubule-actin crosslinking factor 1 MACF1 2 963 23499 microtubule-actin crosslinking factor 1 MACF1 3 963 23499 microtubule-actin crosslinking factor 1 MACF1 4 963 23499 microtubule-actin crosslinking factor 1 MACF1 5 963 23499 microtubule-actin crosslinking factor 1 MACF1 6 963 23499 microtubule-actin crosslinking factor 1 MACF1 7 963 23499 microtubule-actin crosslinking factor 1 MACF1 8 963 23499 microtubule-actin crosslinking factor 1 MACF1 9 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D 10 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D 11 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D 12 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D TXNAME seqnames start end width strand CDSLOC.start CDSLOC.end 1 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 2 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 3 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 4 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 5 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 6 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 7 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 8 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 9 uc001dnm.3 chr1 90400000 90400499 500 + 1373 1872 10 uc001dnm.3 chr1 90400000 90400499 500 + 1373 1872 11 uc001dnn.3 chr1 90400000 90400499 500 + 1373 1872 12 uc001dnn.3 chr1 90400000 90400499 500 + 1373 1872 CDSLOC.width PROTEINLOC QUERYID CDSID 1 500 1020, 1187 1 2948 2 500 1042, 1209 3 2948 3 500 1020, 1187 1 2948 4 500 1042, 1209 3 2948 5 500 1020, 1187 1 2948 6 500 1042, 1209 3 2948 7 500 1020, 1187 1 2948 8 500 1042, 1209 3 2948 9 500 458, 624 2 5132 10 500 458, 624 2 5132 11 500 458, 624 2 5132 12 500 458, 624 2 5132 At first glance it looks like the merging worked. But there are so many duplicated rows in the final data.frame that I don't find this approach very appealing: > duplicated(mrg.data2) [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE Querying the Homo.sapiens package allows you to retrieve all the annotations you want in one single call: > library(Homo.sapiens) > mrg.data1b <- select(Homo.sapiens, keys=mcols(codingRegions)$TXID, cols=c("GENEID", "TXNAME", "ENTREZID", "GENENAME", "SYMBOL"), keytype="TXID") > mrg.data2b <- merge(mrg.data1b, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > mrg.data2b TXID GENEID TXNAME SYMBOL 1 963 23499 uc021olw.1 MACF1 2 963 23499 uc021olw.1 MACF1 3 1687 55144 uc001dnm.3 LRRC8D 4 1688 55144 uc001dnn.3 LRRC8D 5 1689 <na> uc021opq.1 <na> GENENAME ENTREZID seqnames start 1 microtubule-actin crosslinking factor 1 23499 chr1 39800000 2 microtubule-actin crosslinking factor 1 23499 chr1 39800066 3 leucine rich repeat containing 8 family, member D 55144 chr1 90400000 4 leucine rich repeat containing 8 family, member D 55144 chr1 90400000 5 <na> <na> chr1 90400000 end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC QUERYID 1 39800499 500 + 3060 3559 500 1020, 1187 1 2 39800565 500 + 3126 3625 500 1042, 1209 3 3 90400499 500 + 1373 1872 500 458, 624 2 4 90400499 500 + 1373 1872 500 458, 624 2 5 90400499 500 + 1373 1872 500 458, 624 2 CDSID 1 2948 2 2948 3 5132 4 5132 5 5132 Note that this solution does not produce all the duplicated rows and it also preserves those rows that correspond to transcripts not linked to a gene id (e.g. transcript uc021opq.1). Personally, I prefer to merge the annotations to the metadata columns of the GRanges object: m <- match(mcols(codingRegions)$TXID, mrg.data1b$TXID) mcols(codingRegions) <- cbind(mcols(codingRegions), DataFrame(mrg.data1b)[m, -1]) > codingRegions GRanges with 5 ranges and 10 metadata columns: seqnames ranges strand | CDSLOC PROTEINLOC <rle> <iranges> <rle> | <iranges> <integerlist> [1] chr1 [39800000, 39800499] + | [3060, 3559] 1020,1187 [2] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [3] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [4] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [5] chr1 [39800066, 39800565] + | [3126, 3625] 1042,1209 QUERYID TXID CDSID GENEID TXNAME SYMBOL <integer> <character> <integer> <character> <character> <character> [1] 1 963 2948 23499 uc021olw.1 MACF1 [2] 2 1687 5132 55144 uc001dnm.3 LRRC8D [3] 2 1688 5132 55144 uc001dnn.3 LRRC8D [4] 2 1689 5132 <na> uc021opq.1 <na> [5] 3 963 2948 23499 uc021olw.1 MACF1 GENENAME ENTREZID <character> <factor> [1] microtubule-actin crosslinking factor 1 23499 [2] leucine rich repeat containing 8 family, member D 55144 [3] leucine rich repeat containing 8 family, member D 55144 [4] <na> <na> [5] microtubule-actin crosslinking factor 1 23499 --- seqlengths: chr1 NA Maybe this is what you wanted i.e. add the GENEID, TXNAME, SYMBOL, GENENAME, and ENTREZID metadata cols to the GRanges object 'codingRegions'? H. > > Thanks for your help. > > Cheers../Murli > > > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Monday, June 24, 2013 8:29 PM > To: Murli [guest] > Cc: bioconductor at r-project.org; Nair, Murlidharan T > Subject: Re: [BioC] merging GRange objects > > Hi Murli, > > On 06/24/2013 04:55 PM, Murli [guest] wrote: >> >> Hi, >> >> I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation. >> >> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene >> >> mate.range= >> GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[isSameCzome]) >> -offset,start(mate1[isSameCzome])+offset)) >> >> codingRegions = refLocsToLocalLocs(mate.range, txdb) >> >> trans.info=select(txdb, key=values(codingRegions)$TXID, >> cols=c("GENEID","TXNAME"), keytype="TXID") >> >> trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", >> "SYMBOL")) >> >> mrg.data1=merge(as.data.frame(trans.names), as.data.frametrans.info), >> by.x="ENTREZID", by.y="GENEID") >> >> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", >> by.y="TXID") > > It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code. > > The exact meaning of "merging" depends on the objects involved. > Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind(). > Note that since GRanges and GappedAlignments objects are conceptually vector-like objects of dimension 1, only c() works on them. That is, rbind(), cbind(), and merge() (which are typically operating on 2-D > objects) are not supported on those objects. > > Cheers, > H. > >> >> Thanks ../Murli >> >> >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] biomaRt_2.16.0 >> [2] org.Hs.eg.db_2.9.0 >> [3] RSQLite_0.11.4 >> [4] DBI_0.2-7 >> [5] VariantAnnotation_1.6.6 >> [6] Rsamtools_1.12.3 >> [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [8] BSgenome_1.28.0 >> [9] Biostrings_2.28.0 >> [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 >> [11] GenomicFeatures_1.12.2 >> [12] AnnotationDbi_1.22.6 >> [13] Biobase_2.20.0 >> [14] GenomicRanges_1.12.4 >> [15] IRanges_1.18.1 >> [16] BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1 >> [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Thanks Herv?, that helps a lot. Indeed I was getting a lot of duplicates as you pointed out. I shall use your approach. Cheers../Murli -----Original Message----- From: Hervé Pagès [mailto:hpages@fhcrc.org] Sent: Tuesday, June 25, 2013 2:50 AM To: Nair, Murlidharan T Cc: Murli [guest]; bioconductor at r-project.org Subject: Re: [BioC] merging GRange objects Murli, On 06/24/2013 06:00 PM, Nair, Murlidharan T wrote: > Hi Herv? , > I am annotating paired end reads and code is for mate1, mate2 is the same. As you can see I am using the genomic coordinates and trying to annotate them using the UCSC known genes table. I need to ultimately make an association with the coordinates, a reason why I am trying to merge the outputs. I am converting them into data frames and merging them because select returns a data frame, so I have to convert the GRanges object to a data frame to merge them. I want to make sure is that I am not messing my data when I am merging them. Would the following lines correctly combine them? > >> mrg.data1=merge(as.data.frame(trans.names), >> as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") > >> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > > When I reviewed the first few lines and they seemed ok, but there could always be exceptions. If there is a better way please let me know. I am very new to Bioconductor. You didn't send us code we can reproduce but here is code that tries to achieve something similar: library(GenomicRanges) mate.range <- GRanges("chr1", IRanges(c(39800000, 90400000, 39800066),width=500)) library(VariantAnnotation) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene codingRegions <- refLocsToLocalLocs(mate.range, txdb) trans.info <- select(txdb, key=mcols(codingRegions)$TXID, cols=c("GENEID","TXNAME"), keytype="TXID") library(org.Hs.eg.db) trans.names <- select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL")) mrg.data1 <- merge(trans.names, trans.info, by.x="ENTREZID", by.y="GENEID") mrg.data2 <- merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > mrg.data2 TXID ENTREZID GENENAME SYMBOL 1 963 23499 microtubule-actin crosslinking factor 1 MACF1 2 963 23499 microtubule-actin crosslinking factor 1 MACF1 3 963 23499 microtubule-actin crosslinking factor 1 MACF1 4 963 23499 microtubule-actin crosslinking factor 1 MACF1 5 963 23499 microtubule-actin crosslinking factor 1 MACF1 6 963 23499 microtubule-actin crosslinking factor 1 MACF1 7 963 23499 microtubule-actin crosslinking factor 1 MACF1 8 963 23499 microtubule-actin crosslinking factor 1 MACF1 9 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D 10 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D 11 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D 12 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D TXNAME seqnames start end width strand CDSLOC.start CDSLOC.end 1 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 2 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 3 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 4 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 5 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 6 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 7 uc021olw.1 chr1 39800000 39800499 500 + 3060 3559 8 uc021olw.1 chr1 39800066 39800565 500 + 3126 3625 9 uc001dnm.3 chr1 90400000 90400499 500 + 1373 1872 10 uc001dnm.3 chr1 90400000 90400499 500 + 1373 1872 11 uc001dnn.3 chr1 90400000 90400499 500 + 1373 1872 12 uc001dnn.3 chr1 90400000 90400499 500 + 1373 1872 CDSLOC.width PROTEINLOC QUERYID CDSID 1 500 1020, 1187 1 2948 2 500 1042, 1209 3 2948 3 500 1020, 1187 1 2948 4 500 1042, 1209 3 2948 5 500 1020, 1187 1 2948 6 500 1042, 1209 3 2948 7 500 1020, 1187 1 2948 8 500 1042, 1209 3 2948 9 500 458, 624 2 5132 10 500 458, 624 2 5132 11 500 458, 624 2 5132 12 500 458, 624 2 5132 At first glance it looks like the merging worked. But there are so many duplicated rows in the final data.frame that I don't find this approach very appealing: > duplicated(mrg.data2) [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE Querying the Homo.sapiens package allows you to retrieve all the annotations you want in one single call: > library(Homo.sapiens) > mrg.data1b <- select(Homo.sapiens, keys=mcols(codingRegions)$TXID, cols=c("GENEID", "TXNAME", "ENTREZID", "GENENAME", "SYMBOL"), keytype="TXID") > mrg.data2b <- merge(mrg.data1b, as.data.frame(codingRegions), by.x="TXID", by.y="TXID") > mrg.data2b TXID GENEID TXNAME SYMBOL 1 963 23499 uc021olw.1 MACF1 2 963 23499 uc021olw.1 MACF1 3 1687 55144 uc001dnm.3 LRRC8D 4 1688 55144 uc001dnn.3 LRRC8D 5 1689 <na> uc021opq.1 <na> GENENAME ENTREZID seqnames start 1 microtubule-actin crosslinking factor 1 23499 chr1 39800000 2 microtubule-actin crosslinking factor 1 23499 chr1 39800066 3 leucine rich repeat containing 8 family, member D 55144 chr1 90400000 4 leucine rich repeat containing 8 family, member D 55144 chr1 90400000 5 <na> <na> chr1 90400000 end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC QUERYID 1 39800499 500 + 3060 3559 500 1020, 1187 1 2 39800565 500 + 3126 3625 500 1042, 1209 3 3 90400499 500 + 1373 1872 500 458, 624 2 4 90400499 500 + 1373 1872 500 458, 624 2 5 90400499 500 + 1373 1872 500 458, 624 2 CDSID 1 2948 2 2948 3 5132 4 5132 5 5132 Note that this solution does not produce all the duplicated rows and it also preserves those rows that correspond to transcripts not linked to a gene id (e.g. transcript uc021opq.1). Personally, I prefer to merge the annotations to the metadata columns of the GRanges object: m <- match(mcols(codingRegions)$TXID, mrg.data1b$TXID) mcols(codingRegions) <- cbind(mcols(codingRegions), DataFrame(mrg.data1b)[m, -1]) > codingRegions GRanges with 5 ranges and 10 metadata columns: seqnames ranges strand | CDSLOC PROTEINLOC <rle> <iranges> <rle> | <iranges> <integerlist> [1] chr1 [39800000, 39800499] + | [3060, 3559] 1020,1187 [2] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [3] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [4] chr1 [90400000, 90400499] + | [1373, 1872] 458,624 [5] chr1 [39800066, 39800565] + | [3126, 3625] 1042,1209 QUERYID TXID CDSID GENEID TXNAME SYMBOL <integer> <character> <integer> <character> <character> <character> [1] 1 963 2948 23499 uc021olw.1 MACF1 [2] 2 1687 5132 55144 uc001dnm.3 LRRC8D [3] 2 1688 5132 55144 uc001dnn.3 LRRC8D [4] 2 1689 5132 <na> uc021opq.1 <na> [5] 3 963 2948 23499 uc021olw.1 MACF1 GENENAME ENTREZID <character> <factor> [1] microtubule-actin crosslinking factor 1 23499 [2] leucine rich repeat containing 8 family, member D 55144 [3] leucine rich repeat containing 8 family, member D 55144 [4] <na> <na> [5] microtubule-actin crosslinking factor 1 23499 --- seqlengths: chr1 NA Maybe this is what you wanted i.e. add the GENEID, TXNAME, SYMBOL, GENENAME, and ENTREZID metadata cols to the GRanges object 'codingRegions'? H. > > Thanks for your help. > > Cheers../Murli > > > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Monday, June 24, 2013 8:29 PM > To: Murli [guest] > Cc: bioconductor at r-project.org; Nair, Murlidharan T > Subject: Re: [BioC] merging GRange objects > > Hi Murli, > > On 06/24/2013 04:55 PM, Murli [guest] wrote: >> >> Hi, >> >> I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation. >> >> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene >> >> mate.range= >> GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[isSameCzome] >> ) >> -offset,start(mate1[isSameCzome])+offset)) >> >> codingRegions = refLocsToLocalLocs(mate.range, txdb) >> >> trans.info=select(txdb, key=values(codingRegions)$TXID, >> cols=c("GENEID","TXNAME"), keytype="TXID") >> >> trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", >> "SYMBOL")) >> >> mrg.data1=merge(as.data.frame(trans.names), >> as.data.frametrans.info), by.x="ENTREZID", by.y="GENEID") >> >> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", >> by.y="TXID") > > It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code. > > The exact meaning of "merging" depends on the objects involved. > Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind(). > Note that since GRanges and GappedAlignments objects are conceptually > vector-like objects of dimension 1, only c() works on them. That is, > rbind(), cbind(), and merge() (which are typically operating on 2-D > objects) are not supported on those objects. > > Cheers, > H. > >> >> Thanks ../Murli >> >> >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] biomaRt_2.16.0 >> [2] org.Hs.eg.db_2.9.0 >> [3] RSQLite_0.11.4 >> [4] DBI_0.2-7 >> [5] VariantAnnotation_1.6.6 >> [6] Rsamtools_1.12.3 >> [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [8] BSgenome_1.28.0 >> [9] Biostrings_2.28.0 >> [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 >> [11] GenomicFeatures_1.12.2 >> [12] AnnotationDbi_1.22.6 >> [13] Biobase_2.20.0 >> [14] GenomicRanges_1.12.4 >> [15] IRanges_1.18.1 >> [16] BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1 >> [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6