TXNAME mapping


Guest User ★ 13k

@guest-user-4897

Last seen 11.3 years ago

Hi,

I am annotating my reads using TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db. I am able to get everything to work and to merge the data, but when I reviewed the output I saw that the same TXNAME is mapped to different locations; see part of the output below. TXNAME uc003ytw.3 is associated with both chr8 13515402 13515702 301 and chr12 71612488 71612788 301. I thought it should be unique. I would appreciate it if you could correct me if I am missing something in my understanding of TXNAME.

Thanks ../Murli

    > mrg.data[1000:1100,]
          TXID GENEID     TXNAME seqnames    start      end width strand
    1000 32071   7038 uc003ytw.3     chr8 13515402 13515702   301      *
    1001 68728  63934 uc002qnd.3     chr8 14339379 14339679   301      *
    1002 68729  63934 uc002qne.3     chr8 14339379 14339679   301      *
    1003 68730  63934 uc010etm.3     chr8 14339379 14339679   301      *
    1004 32071   7038 uc003ytw.3     chr8 14339379 14339679   301      *
    1005 68728  63934 uc002qnd.3    chr12 71612488 71612788   301      *
    1006 68729  63934 uc002qne.3    chr12 71612488 71612788   301      *
    1007 68730  63934 uc010etm.3    chr12 71612488 71612788   301      *
    1008 32071   7038 uc003ytw.3    chr12 71612488 71612788   301      *
    1009 68728  63934 uc002qnd.3    chr14 24809972 24810272   301      *
    1010 68729  63934 uc002qne.3    chr14 24809972 24810272   301      *

-- output of sessionInfo():

    > sessionInfo()
    R version 3.0.1 (2013-05-16)
    Platform: x86_64-redhat-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
     [1] Homo.sapiens_1.1.1
     [2] GO.db_2.9.0
     [3] OrganismDbi_1.2.0
     [4] org.Hs.eg.db_2.9.0
     [5] RSQLite_0.11.4
     [6] DBI_0.2-7
     [7] VariantAnnotation_1.6.6
     [8] Rsamtools_1.12.3
     [9] BSgenome.Hsapiens.UCSC.hg19_1.3.19
    [10] BSgenome_1.28.0
    [11] Biostrings_2.28.0
    [12] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2
    [13] GenomicFeatures_1.12.2
    [14] AnnotationDbi_1.22.6
    [15] Biobase_2.20.0
    [16] GenomicRanges_1.12.4
    [17] IRanges_1.18.1
    [18] BiocGenerics_0.6.0

    loaded via a namespace (and not attached):
     [1] biomaRt_2.16.0     bitops_1.0-5       graph_1.38.2       RBGL_1.36.2
     [5] RCurl_1.95-4.1     rtracklayer_1.20.2 stats4_3.0.1       tools_3.0.1
     [9] XML_3.98-1.1       zlibbioc_1.6.0

--
Sent via the guest posting facility at bioconductor.org.

GO BSgenome • 2.0k views

Written 12.5 years ago by Guest User ★ 13k • updated 12.5 years ago by Murli ▴ 90


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 9.4 years ago

United States

Hi Murli,

I have no idea what you did since you didn't give me an example. In the future, you might find it helpful to look at the posting guide, which you can find on our web site here:

http://www.bioconductor.org/help/mailing-list/posting-guide/

But from what you did tell me, my guess is that you just wanted to extract the information you listed. Here is how I would do something like this:

    library(Homo.sapiens)
    select(Homo.sapiens, keys=c(63934,7038),
           cols=c("TXID","GENEID","TXNAME","TXSTART","TXEND","TXCHROM","TXSTRAND"),
           keytype="ENTREZID")

Hope that this helps you,

Marc

Written 12.5 years ago by Marc Carlson ★ 7.2k


Hi Marc/James,

Many thanks for your prompt reply. My apologies for not posting the code; here it is. I guess I messed up when I tried to merge. What I want to achieve is to determine what each read corresponds to, i.e. whether it is in a coding region, promoter region, or UTR, and to determine whether any transcription factors bind to the reads.

    bf.data = readGappedAlignments(bam_file, param=ScanBamParam(what=scanBamWhat()))
    mate.pairs = table(mcols(bf.data)$qname)
    onlyPairs = names(mate.pairs)[mate.pairs==2]
    mappedPairs = bf.data[mcols(bf.data)$qname %in% onlyPairs]
    mate1 = mappedPairs[c(T,F)]
    mate2 = mappedPairs[c(F,T)]
    isSameCzome = (seqnames(mate1)==seqnames(mate2))
    offset = 150
    txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
    mate.range = GRanges(seqnames(mate1[isSameCzome])[1:1000],
                         IRanges(start(mate1[isSameCzome])[1:1000]-offset,
                                 start(mate1[isSameCzome])[1:1000]+offset))
    codingRegions = refLocsToLocalLocs(mate.range, txdb)
    trans.info = select(txdb, key=values(codingRegions)$TXID,
                        cols=c("GENEID","TXNAME"), keytype="TXID")
    trans.names = select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL"))
    mate.range.df = as.data.frame(mate.range)
    trans.info.df = as.data.frame(trans.info)
    trans.names.df = as.data.frame(trans.names)
    mrg.data = merge(trans.info.df, mate.range.df)
    mrg.data = merge(mrg.data, trans.names.df)

Thanks for your help.

Cheers../murli
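A likely reason the first merge() call pairs every TXNAME with every range: base R's merge() joins on the column names the two data frames share, and when they share none it returns the Cartesian product. The sketch below uses small stand-in tables built from values posted in this thread; the column layout of the real trans.info.df and mate.range.df is assumed from the printed output, not verified against the actual objects.

```r
# Toy stand-ins for trans.info.df and mate.range.df (values copied from the thread;
# the real objects are assumed to have these column names).
trans.info.df <- data.frame(
  TXID   = c(32071, 68728),
  GENEID = c("7038", "63934"),
  TXNAME = c("uc003ytw.3", "uc002qnd.3"),
  stringsAsFactors = FALSE
)
mate.range.df <- data.frame(
  seqnames = c("chr8", "chr12", "chr14"),
  start    = c(14339379, 71612488, 24809972),
  end      = c(14339679, 71612788, 24810272),
  stringsAsFactors = FALSE
)

# merge() defaults to joining on the column names both tables have in common.
shared <- intersect(names(trans.info.df), names(mate.range.df))

# With no common columns, every transcript row is paired with every range row.
crossed <- merge(trans.info.df, mate.range.df)

print(shared)          # character(0): nothing to join on
print(nrow(crossed))   # 6 = 2 transcripts x 3 ranges
print(crossed[crossed$TXNAME == "uc003ytw.3", c("TXNAME", "seqnames", "start")])
```

In this toy case uc003ytw.3 ends up on chr8, chr12 and chr14 purely as a side effect of the join, which matches the pattern in the posted mrg.data.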

Reply written 12.5 years ago by Murli ▴ 90


James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 23 hours ago

United States

Hi Murli,

I think you will need to show a small example script that gives this result. I see only one region that corresponds to that TXNAME:

    > x <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names=T)
    > x["uc003ytw.3"]
    GRangesList of length 1:
    $uc003ytw.3
    GRanges with 48 ranges and 3 metadata columns:
           seqnames                 ranges strand |   exon_id   exon_name exon_rank
              <Rle>              <IRanges>  <Rle> | <integer> <character> <integer>
       [1]     chr8 [133879205, 133879312]      + |    116041        <NA>         1
       [2]     chr8 [133880360, 133880468]      + |    116042        <NA>         2
       [3]     chr8 [133881974, 133882071]      + |    116043        <NA>         3
       [4]     chr8 [133883593, 133883796]      + |    116044        <NA>         4
       [5]     chr8 [133885307, 133885466]      + |    116045        <NA>         5
       ...      ...                    ...    ... |       ...         ...       ...
      [44]     chr8 [134125666, 134125847]      + |    116085        <NA>        44
      [45]     chr8 [134128853, 134128960]      + |    116086        <NA>        45
      [46]     chr8 [134144056, 134144190]      + |    116087        <NA>        46
      [47]     chr8 [134145714, 134145904]      + |    116088        <NA>        47
      [48]     chr8 [134146920, 134147143]      + |    116089        <NA>        48

    > select(Homo.sapiens, "uc003ytw.3", c("TXID","GENEID","CHR", "CHRLOC","CHRLOCEND"), "TXNAME")
          TXNAME GENEID  TXID CHR    CHRLOC CHRLOCCHR CHRLOCEND
    1 uc003ytw.3   7038 32071   8 133879205         8 134147143

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
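One quick way to spot this kind of problem in a merged table is to count the distinct locations attached to each TXNAME. The following is a minimal base R sketch, assuming the merged result is an ordinary data frame with TXNAME, seqnames and start columns; the rows are copied from the mrg.data output posted in the question, and the names loc, n.loc and suspect are made up for illustration.

```r
# A few rows copied from the posted mrg.data output.
mrg <- data.frame(
  TXNAME   = c("uc003ytw.3", "uc002qnd.3", "uc003ytw.3", "uc002qnd.3", "uc003ytw.3"),
  seqnames = c("chr8", "chr8", "chr8", "chr12", "chr12"),
  start    = c(13515402, 14339379, 14339379, 71612488, 71612488),
  stringsAsFactors = FALSE
)

# Keep one row per (TXNAME, location) pair, then count locations per TXNAME.
loc   <- unique(mrg[, c("TXNAME", "seqnames", "start")])
n.loc <- table(loc$TXNAME)

# A transcript name that shows up at more than one location deserves a second look.
suspect <- names(n.loc)[as.vector(n.loc) > 1]
print(n.loc)
print(suspect)
```

A count above one does not say where the extra rows came from; it only flags that the table disagrees with the single region reported by exonsBy() and select() above.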

Written 12.5 years ago by James W. MacDonald 68k


Murli ▴ 90

@murli-5770

Last seen 8.0 years ago

I have tried to correct my merge, and I think I have it right. I would like your comments, please.

    mrg.data1=merge(trans.names.df, trans.info.df, by.x="ENTREZID", by.y="GENEID")
    mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID")

I have the following output. Thanks for your help.

Cheers../Murli

    > mrg.data1=merge(trans.names.df, trans.info.df, by.x="ENTREZID", by.y="GENEID")
    > mrg.data1
       ENTREZID                GENENAME SYMBOL  TXID     TXNAME
    1     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
    2     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
    3     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
    4     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
    5     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
    6     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
    7     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
    8     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
    9     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
    10     7038           thyroglobulin     TG 32071 uc003ytw.3

    > mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID")
    > mrg.data2
        TXID ENTREZID                GENENAME SYMBOL     TXNAME seqnames     start
    1  32071     7038           thyroglobulin     TG uc003ytw.3     chr8 133898989
    2  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
    3  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
    4  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
    5  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
    6  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
    7  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
    8  68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
    9  68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
    10 68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
             end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC
    1  133899289   301      +         1372       1672          301   458, 558
    2   56953974   301      -          390        690          301   130, 230
    3   56953974   301      -          390        690          301   130, 230
    4   56953974   301      -          390        690          301   130, 230
    5   56953974   301      -          390        690          301   130, 230
    6   56953974   301      -          390        690          301   130, 230
    7   56953974   301      -          390        690          301   130, 230
    8   56953974   301      -          219        519          301    73, 173
    9   56953974   301      -          219        519          301    73, 173
    10  56953974   301      -          219        519          301    73, 173
       QUERYID  CDSID
    1      693  97562
    2      528 204531
    3      528 204531
    4      528 204531
    5      528 204531
    6      528 204531
    7      528 204531
    8      528 204531
    9      528 204531
    10     528 204531
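The keyed merge above puts each transcript on a single location, but every ZNF667 transcript still appears three times. One plausible cause, assumed here from the ten-row mrg.data1 output rather than confirmed, is that trans.names.df holds one row per queried gene ID, so gene 63934 is present three times and the join on gene ID becomes many-to-many. A minimal base R sketch with stand-in tables (values copied from the thread) shows the effect and how unique() on the lookup table removes it:

```r
# Stand-in for trans.info.df: one row per transcript (values from the thread).
trans.info.df <- data.frame(
  TXID   = c(68728, 68729, 68730, 32071),
  GENEID = c("63934", "63934", "63934", "7038"),
  TXNAME = c("uc002qnd.3", "uc002qne.3", "uc010etm.3", "uc003ytw.3"),
  stringsAsFactors = FALSE
)

# Stand-in for trans.names.df, ASSUMED to contain one row per queried key,
# so the gene with three transcripts is repeated three times.
trans.names.df <- data.frame(
  ENTREZID = c("63934", "63934", "63934", "7038"),
  GENENAME = c(rep("zinc finger protein 667", 3), "thyroglobulin"),
  SYMBOL   = c(rep("ZNF667", 3), "TG"),
  stringsAsFactors = FALSE
)

# Many-to-many join: 3 name rows x 3 transcript rows for 63934, plus 1 for 7038.
many <- merge(trans.names.df, trans.info.df, by.x = "ENTREZID", by.y = "GENEID")

# Deduplicating the lookup table first gives one row per transcript.
once <- merge(unique(trans.names.df), trans.info.df, by.x = "ENTREZID", by.y = "GENEID")

print(nrow(many))  # 10
print(nrow(once))  # 4
```

Calling unique() on the merged result would also collapse the repeats; deduplicating before the join simply keeps the intermediate table small.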

Written 12.5 years ago by Murli ▴ 90


Murli ▴ 90

@murli-5770

Last seen 8.0 years ago

I hope the way I have used merge in my previous message is correct. I would appreciate it if you could let me know.

Cheers../Murli

Written 12.5 years ago by Murli ▴ 90
