DNAStringsSet - remove multiple entries with same name
2
0
Entering edit mode
deepti anand ▴ 50
@deepti-anand-6724
Last seen 8.4 years ago
Hi All, I am trying to extract promoter sequences for a few ENTREZ IDS. The problem I am having is that there exists multiple transcripts for same gene. So this gives me multiple promoter sequences for same gene. Can I filter out the redundant promoter sequences? Here is my code: ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates of transcript #####>grl <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> promoter.seqs A DNAStringSet instance of length 8 width seq names [1] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGA TTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCT C 67665.67665[2] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCC GTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[3] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGT CAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[4] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGT CAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[5] 1500 CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGC TTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA 13198.13198[6] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGA GATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[7] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTAT GAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[8] 1500 GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGT TGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC 15368.15368 In example above, there are four sequences for same gene '67665.67665'. How can I remove these entries? I would appreciate any help Dips [[alternative HTML version deleted]]
• 852 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Dips, On 09/03/2014 10:30 AM, deepti anand wrote: > Hi All, > > I am trying to extract promoter sequences for a few ENTREZ IDS. The problem I am having is that there exists multiple transcripts for same gene. So this gives me multiple promoter sequences for same gene. Can I filter out the redundant promoter sequences? > > Here is my code: > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates of transcript #####>grl <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> promoter.seqs A DNAStringSet instance of length 8 width seq names [1] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGA TTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCT C 67665.67665[2] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCC GTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[3] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGT CAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[4] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGT CAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC C! > TC 67665.67665[5] 1500 CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCC CAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA 13198.13198[6] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGA GATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[7] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTAT GAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[8] 1500 GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGT TGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC 15368.15368 > > In example above, there are four sequences for same gene '67665.67665'. How can I remove these entries? > I would appreciate any help In this particular example, the TSS (transcription starting site) is the same for all the transcripts in any of your genes. This explains why you get the same promoter sequence. However, this won't always be the case. So in order to handle the general situation, we need to choose between 2 behaviors: 1) Return 1 sequence per gene (even for genes with more than 1 unique TSS). 2) Return 1 sequence per unique TSS per gene. To do 1) we need to choose a particular TSS for each gene. A common choice is to pick-up the most upstream TSS. In this case, genes() can be used to extract the gene ranges. genes() will return a GRanges object with 1 genomic range per gene. This genomic range is the union of all the transcript ranges in the gene. library(TxDb.Mmusculus.UCSC.mm10.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene ids.ok <- c("67665", "13198", "110196", "15368") gn <- genes(txdb)[ids.ok] Then: > gn GRanges with 4 ranges and 1 metadata column: seqnames ranges strand | gene_id <rle> <iranges> <rle> | <characterlist> 67665 chr18 [ 60526221, 60558762] + | 67665 13198 chr10 [127290793, 127296288] + | 13198 110196 chr3 [ 89093588, 89101967] - | 110196 15368 chr8 [ 75093618, 75100593] + | 15368 --- seqlengths: chr1 chr2 ... chrUn_JH584304 195471971 182113224 ... 114452 Note that, by default, genes() will exclude genes that don't have all their transcripts on the same chromosome and strand (because in that case the gene cannot be represented with a unique genomic range). See ?genes for more information. Then use extractUpstreamSeqs() to extract the upstream sequences for those genes: library(BSgenome.Mmusculus.UCSC.mm10) genome <- BSgenome.Mmusculus.UCSC.mm10 up_seqs <- extractUpstreamSeqs(genome, gn, width=1500) Then: > gn_up_seqs A DNAStringSet instance of length 4 width seq names [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665 [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198 [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196 [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368 For doing 2) we can use transcriptsBy() to extract all transcript ranges grouped by gene: tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok] Then we can obtain the unique TSS for each gene by first resizing each list element in 'tx_by_gn' and then by passing the result thru unique(): TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1)) (Note that we use endoapply() here because we don't have a "resize" method for GRangesList objects. However we should definitely add one because doing endoapply() is not efficient.) Then we can get the upstream sequence for each TSS with: TSS_up_seqs <- relist(extractUpstreamSeqs( genome, unlist(TSS_by_gn, use.names=FALSE), width=1500), TSS_by_gn) This time the upstream sequences are returned in a DNAStringSetList object with 1 list element (DNAStringSet) per gene: > TSS_up_seqs DNAStringSetList of length 4 [["67665"]] CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT... [["13198"]] CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC... [["110196"]] CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT... [["15368"]] GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG... The number of sequences in each list element is the number of unique TSS in the gene. Of course, there is no benefit to this second approach in *your* case because none of your genes has more than 1 unique TSS. But some genes do have more than 1 unique TSS (e.g. 100017, 100019, 100036521, and many more...). Hope this helps, H. > Dips > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Dips, I'm cc'ing the mailing list so others can benefit. On 09/04/2014 11:34 AM, deepti anand wrote: > Hi H, > > Thank you for the code. I run the code but the extractUpstremSeqs() does > not seem to work. Here are the codes: > >> ids.ok > [1] "67665" "13198" "110196" "15368" > > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene > > gn<- genes(txdb)[ids.ok] > > gn > GRanges with 4 ranges and 1 metadata column: > seqnames ranges strand | gene_id > <rle> <iranges> <rle> | <characterlist> > 67665 chr18 [ 60526221, 60558762] + | 67665 > 13198 chr10 [127290793, 127296288] + | 13198 > 110196 chr3 [ 89093588, 89101967] - | 110196 > 15368 chr8 [ 75093618, 75100593] + | 15368 > --- > seqlengths: > chr1 chr2 chr3 ... > chrUn_GL456394 chrUn_GL456396 chrUn_JH584304 > 195471971 182113224 160039680 ... > 24323 21240 114452 > > genome <- BSgenome.Mmusculus.UCSC.mm10 > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500) > Error: could not find function "extractUpstreamSeqs" > > Could you suggest how should I proceed further? Oops, extractUpstreamSeqs() is only available in the current devel version of Bioconductor (BioC 3.0, which will be released in October). Sorry. With the current release version (BioC 2.14), you can obtain 'gn_up_seqs' with: gn_up_seqs <- getPromoterSeq(gn, genome, upstream=1500, downstream=0) And 'TSS_up_seqs' with: TSS_up_seqs <- relist(getPromoterSeq( unlist(TSS_by_gn, use.names=FALSE), genome, upstream=1500, downstream=0), TSS_by_gn) Please let me know if you run into other issues. Thanks, H. PS: Yesterday I added a "resize" method for GRangesList objects in BioC devel. It's in GenomicRanges 1.17.37 (which will become available via biocLite() in the next 24 hours or so). > > Best, > > -Dips- > > > > Date: Wed, 3 Sep 2014 12:14:49 -0700 > > From: hpages at fhcrc.org > > To: anand.deepti at outlook.com; bioconductor at r-project.org > > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same > name > > > > Hi Dips, > > > > On 09/03/2014 10:30 AM, deepti anand wrote: > > > Hi All, > > > > > > I am trying to extract promoter sequences for a few ENTREZ IDS. The > problem I am having is that there exists multiple transcripts for same > gene. So this gives me multiple promoter sequences for same gene. Can I > filter out the redundant promoter sequences? > > > > > > Here is my code: > > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates > of transcript #####>grl <- > transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") > [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, > upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> > promoter.seqs A DNAStringSet instance of length 8 width seq names [1] > 1500 > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > 67665.67665[2] 1500 > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > 67665.67665[3] 1500 > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > 67665.67665[4] 1500 > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC > > C! > > > TC 67665.67665[5] 1500 > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAG GAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA > 13198.13198[6] 1500 > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGC CGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG > 110196.110196[7] 1500 > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGC CGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG > 110196.110196[8] 1500 > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCAC CACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC > 15368.15368 > > > > > > In example above, there are four sequences for same gene > '67665.67665'. How can I remove these entries? > > > I would appreciate any help > > > > In this particular example, the TSS (transcription starting site) is > > the same for all the transcripts in any of your genes. This explains > > why you get the same promoter sequence. However, this won't always be > > the case. So in order to handle the general situation, we need to > > choose between 2 behaviors: > > > > 1) Return 1 sequence per gene (even for genes with more than 1 unique > > TSS). > > > > 2) Return 1 sequence per unique TSS per gene. > > > > To do 1) we need to choose a particular TSS for each gene. A common > > choice is to pick-up the most upstream TSS. In this case, genes() > > can be used to extract the gene ranges. genes() will return a GRanges > > object with 1 genomic range per gene. This genomic range is the union > > of all the transcript ranges in the gene. > > > > library(TxDb.Mmusculus.UCSC.mm10.knownGene) > > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene > > ids.ok <- c("67665", "13198", "110196", "15368") > > gn <- genes(txdb)[ids.ok] > > > > Then: > > > > > gn > > GRanges with 4 ranges and 1 metadata column: > > seqnames ranges strand | gene_id > > <rle> <iranges> <rle> | <characterlist> > > 67665 chr18 [ 60526221, 60558762] + | 67665 > > 13198 chr10 [127290793, 127296288] + | 13198 > > 110196 chr3 [ 89093588, 89101967] - | 110196 > > 15368 chr8 [ 75093618, 75100593] + | 15368 > > --- > > seqlengths: > > chr1 chr2 ... chrUn_JH584304 > > 195471971 182113224 ... 114452 > > > > Note that, by default, genes() will exclude genes that don't have all > > their transcripts on the same chromosome and strand (because in that > > case the gene cannot be represented with a unique genomic range). > > See ?genes for more information. > > > > Then use extractUpstreamSeqs() to extract the upstream sequences for > > those genes: > > > > library(BSgenome.Mmusculus.UCSC.mm10) > > genome <- BSgenome.Mmusculus.UCSC.mm10 > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500) > > > > Then: > > > > > gn_up_seqs > > A DNAStringSet instance of length 4 > > width seq names > > > > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665 > > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198 > > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196 > > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368 > > > > For doing 2) we can use transcriptsBy() to extract all transcript ranges > > grouped by gene: > > > > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok] > > > > Then we can obtain the unique TSS for each gene by first resizing each > > list element in 'tx_by_gn' and then by passing the result thru unique(): > > > > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1)) > > > > (Note that we use endoapply() here because we don't have a "resize" > > method for GRangesList objects. However we should definitely add > > one because doing endoapply() is not efficient.) > > > > Then we can get the upstream sequence for each TSS with: > > > > TSS_up_seqs <- relist(extractUpstreamSeqs( > > genome, > > unlist(TSS_by_gn, use.names=FALSE), > > width=1500), > > TSS_by_gn) > > > > This time the upstream sequences are returned in a DNAStringSetList > > object with 1 list element (DNAStringSet) per gene: > > > > > TSS_up_seqs > > DNAStringSetList of length 4 > > [["67665"]] > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT... > > [["13198"]] > > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC... > > [["110196"]] > > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT... > > [["15368"]] > > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG... > > > > The number of sequences in each list element is the number of unique > > TSS in the gene. Of course, there is no benefit to this second approach > > in *your* case because none of your genes has more than 1 unique TSS. > > But some genes do have more than 1 unique TSS (e.g. 100017, 100019, > > 100036521, and many more...). > > > > Hope this helps, > > H. > > > > > Dips > > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Thanks Herve, It worked. -Dips- > Date: Thu, 4 Sep 2014 12:15:43 -0700 > From: hpages at fhcrc.org > To: anand.deepti at outlook.com > CC: Bioconductor at r-project.org > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same name > > Hi Dips, > > I'm cc'ing the mailing list so others can benefit. > > On 09/04/2014 11:34 AM, deepti anand wrote: > > Hi H, > > > > Thank you for the code. I run the code but the extractUpstremSeqs() does > > not seem to work. Here are the codes: > > > >> ids.ok > > [1] "67665" "13198" "110196" "15368" > > > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene > > > gn<- genes(txdb)[ids.ok] > > > gn > > GRanges with 4 ranges and 1 metadata column: > > seqnames ranges strand | gene_id > > <rle> <iranges> <rle> | <characterlist> > > 67665 chr18 [ 60526221, 60558762] + | 67665 > > 13198 chr10 [127290793, 127296288] + | 13198 > > 110196 chr3 [ 89093588, 89101967] - | 110196 > > 15368 chr8 [ 75093618, 75100593] + | 15368 > > --- > > seqlengths: > > chr1 chr2 chr3 ... > > chrUn_GL456394 chrUn_GL456396 chrUn_JH584304 > > 195471971 182113224 160039680 ... > > 24323 21240 114452 > > > genome <- BSgenome.Mmusculus.UCSC.mm10 > > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500) > > Error: could not find function "extractUpstreamSeqs" > > > > Could you suggest how should I proceed further? > > Oops, extractUpstreamSeqs() is only available in the current devel > version of Bioconductor (BioC 3.0, which will be released in October). > Sorry. With the current release version (BioC 2.14), you can obtain > 'gn_up_seqs' with: > > gn_up_seqs <- getPromoterSeq(gn, genome, upstream=1500, downstream=0) > > And 'TSS_up_seqs' with: > > TSS_up_seqs <- relist(getPromoterSeq( > unlist(TSS_by_gn, use.names=FALSE), > genome, > upstream=1500, downstream=0), > TSS_by_gn) > > Please let me know if you run into other issues. > > Thanks, > H. > > PS: Yesterday I added a "resize" method for GRangesList objects in > BioC devel. It's in GenomicRanges 1.17.37 (which will become available > via biocLite() in the next 24 hours or so). > > > > > Best, > > > > -Dips- > > > > > > > Date: Wed, 3 Sep 2014 12:14:49 -0700 > > > From: hpages at fhcrc.org > > > To: anand.deepti at outlook.com; bioconductor at r-project.org > > > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same > > name > > > > > > Hi Dips, > > > > > > On 09/03/2014 10:30 AM, deepti anand wrote: > > > > Hi All, > > > > > > > > I am trying to extract promoter sequences for a few ENTREZ IDS. The > > problem I am having is that there exists multiple transcripts for same > > gene. So this gives me multiple promoter sequences for same gene. Can I > > filter out the redundant promoter sequences? > > > > > > > > Here is my code: > > > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates > > of transcript #####>grl <- > > transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") > > [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, > > upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> > > promoter.seqs A DNAStringSet instance of length 8 width seq names [1] > > 1500 > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAG CGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > > 67665.67665[2] 1500 > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAG CGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > > 67665.67665[3] 1500 > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAG CGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC > > 67665.67665[4] 1500 > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAG CGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC > > > C! > > > > TC 67665.67665[5] 1500 > > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCC AGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA > > 13198.13198[6] 1500 > > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCAT GCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG > > 110196.110196[7] 1500 > > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCAT GCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG > > 110196.110196[8] 1500 > > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCC ACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC > > 15368.15368 > > > > > > > > In example above, there are four sequences for same gene > > '67665.67665'. How can I remove these entries? > > > > I would appreciate any help > > > > > > In this particular example, the TSS (transcription starting site) is > > > the same for all the transcripts in any of your genes. This explains > > > why you get the same promoter sequence. However, this won't always be > > > the case. So in order to handle the general situation, we need to > > > choose between 2 behaviors: > > > > > > 1) Return 1 sequence per gene (even for genes with more than 1 unique > > > TSS). > > > > > > 2) Return 1 sequence per unique TSS per gene. > > > > > > To do 1) we need to choose a particular TSS for each gene. A common > > > choice is to pick-up the most upstream TSS. In this case, genes() > > > can be used to extract the gene ranges. genes() will return a GRanges > > > object with 1 genomic range per gene. This genomic range is the union > > > of all the transcript ranges in the gene. > > > > > > library(TxDb.Mmusculus.UCSC.mm10.knownGene) > > > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene > > > ids.ok <- c("67665", "13198", "110196", "15368") > > > gn <- genes(txdb)[ids.ok] > > > > > > Then: > > > > > > > gn > > > GRanges with 4 ranges and 1 metadata column: > > > seqnames ranges strand | gene_id > > > <rle> <iranges> <rle> | <characterlist> > > > 67665 chr18 [ 60526221, 60558762] + | 67665 > > > 13198 chr10 [127290793, 127296288] + | 13198 > > > 110196 chr3 [ 89093588, 89101967] - | 110196 > > > 15368 chr8 [ 75093618, 75100593] + | 15368 > > > --- > > > seqlengths: > > > chr1 chr2 ... chrUn_JH584304 > > > 195471971 182113224 ... 114452 > > > > > > Note that, by default, genes() will exclude genes that don't have all > > > their transcripts on the same chromosome and strand (because in that > > > case the gene cannot be represented with a unique genomic range). > > > See ?genes for more information. > > > > > > Then use extractUpstreamSeqs() to extract the upstream sequences for > > > those genes: > > > > > > library(BSgenome.Mmusculus.UCSC.mm10) > > > genome <- BSgenome.Mmusculus.UCSC.mm10 > > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500) > > > > > > Then: > > > > > > > gn_up_seqs > > > A DNAStringSet instance of length 4 > > > width seq names > > > > > > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665 > > > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198 > > > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196 > > > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368 > > > > > > For doing 2) we can use transcriptsBy() to extract all transcript ranges > > > grouped by gene: > > > > > > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok] > > > > > > Then we can obtain the unique TSS for each gene by first resizing each > > > list element in 'tx_by_gn' and then by passing the result thru unique(): > > > > > > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1)) > > > > > > (Note that we use endoapply() here because we don't have a "resize" > > > method for GRangesList objects. However we should definitely add > > > one because doing endoapply() is not efficient.) > > > > > > Then we can get the upstream sequence for each TSS with: > > > > > > TSS_up_seqs <- relist(extractUpstreamSeqs( > > > genome, > > > unlist(TSS_by_gn, use.names=FALSE), > > > width=1500), > > > TSS_by_gn) > > > > > > This time the upstream sequences are returned in a DNAStringSetList > > > object with 1 list element (DNAStringSet) per gene: > > > > > > > TSS_up_seqs > > > DNAStringSetList of length 4 > > > [["67665"]] > > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT... > > > [["13198"]] > > > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC... > > > [["110196"]] > > > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT... > > > [["15368"]] > > > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG... > > > > > > The number of sequences in each list element is the number of unique > > > TSS in the gene. Of course, there is no benefit to this second approach > > > in *your* case because none of your genes has more than 1 unique TSS. > > > But some genes do have more than 1 unique TSS (e.g. 100017, 100019, > > > 100036521, and many more...). > > > > > > Hope this helps, > > > H. > > > > > > > Dips > > > > > > > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor at r-project.org > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > -- > > > Hervé Pagès > > > > > > Program in Computational Biology > > > Division of Public Health Sciences > > > Fred Hutchinson Cancer Research Center > > > 1100 Fairview Ave. N, M1-B514 > > > P.O. Box 19024 > > > Seattle, WA 98109-1024 > > > > > > E-mail: hpages at fhcrc.org > > > Phone: (206) 667-5791 > > > Fax: (206) 667-1319 > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6