extract introns
3
0
Entering edit mode
Yating Cheng ▴ 90
@yating-cheng-4953
Last seen 9.6 years ago
Dear Bioconductor Memebers, Now I have to extract intron sequences, I have already exon+intron, intron sequences. Someone told me that I can use Biostring. But I tried, it failed. Do you know how to use Biostring to solve this problem, or is there any other possibility to solve this problem? Thanks very much. Best Regards Yating Cheng
• 3.5k views
ADD COMMENT
0
Entering edit mode
@ivan-gregoretti-3975
Last seen 9.6 years ago
Canada
Hello Yating, For example, this is how you get a sequence of 50 nucleotides from murine chromosome 1: library(Biostrings) library(BSgenome.Mmusculus.UCSC.mm9) Views(Mmusculus[["chr1"]],IRanges(start=3000001,end=3000050)) Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAAGAATTTGGTATTAAACTTAAA ACTGGAATTC views: start end width [1] 3000001 3000050 50 [GAATTCTTTTCTATGATTTAGTTTAATATGTTTTCTGGGTGTTTCAGCTG] Warning message: In Views(Mmusculus[["chr1"]], IRanges(start = 3000001, end = 3000050)) : masks were dropped One more thing. Next time, before you ask a question about a package, please make sure you have read the package documentation entirely. I am responding to you this time because I once asked this same question before reading all the documentation! Ivan On Wed, Nov 9, 2011 at 6:22 AM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Dear Bioconductor Memebers, > > Now I have to extract intron sequences, I have already exon+intron, intron > sequences. Someone told me that I can use Biostring. But I tried, it > failed. > > Do you know how to use Biostring to solve this problem, or is there any > other possibility to solve this problem? > > Thanks very much. > > Best Regards > > Yating Cheng > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States
On 11/09/2011 03:22 AM, Yating Cheng wrote: > Dear Bioconductor Memebers, > > Now I have to extract intron sequences, I have already exon+intron, intron > sequences. Someone told me that I can use Biostring. But I tried, it > failed. > > Do you know how to use Biostring to solve this problem, or is there any > other possibility to solve this problem? Steve and Ivan mention GRanges + BSgenome. I wanted to mention Rsamtools::FaFile with indexFa, which allows input of DNA sequence from fasta files rather than BSgenome packages. You gain flexibility in terms of where the sequence comes from, but lose BSgenome features like masking and data management. Martin > > Thanks very much. > > Best Regards > > Yating Cheng > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
Hi Martin, On Thu, Nov 10, 2011 at 10:37 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 11/09/2011 03:22 AM, Yating Cheng wrote: >> >> Dear Bioconductor Memebers, >> >> Now I have to extract intron sequences, I have already exon+intron, intron >> sequences. Someone told me that I can use Biostring. But I tried, it >> failed. >> >> Do you know how to use Biostring to solve this problem, or is there any >> other possibility to solve this problem? > > Steve and Ivan mention GRanges + BSgenome. I wanted to mention > Rsamtools::FaFile with indexFa, which allows input of DNA sequence from > fasta files rather than BSgenome packages. You gain flexibility in terms of > where the sequence comes from, but lose BSgenome features like masking and > data management. That's super cool! Thanks for adding that This means that we have to keep those *.fa files uncompressed tho, right? I wonder if it's worth thinking about (for R 2.15 ;-) handling compressed *.fa files using BGZF along these lines, too: http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better- gzip.html -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
On Thu, Nov 10, 2011 at 8:33 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi Martin, > > On Thu, Nov 10, 2011 at 10:37 PM, Martin Morgan <mtmorgan@fhcrc.org> > wrote: > > On 11/09/2011 03:22 AM, Yating Cheng wrote: > >> > >> Dear Bioconductor Memebers, > >> > >> Now I have to extract intron sequences, I have already exon+intron, > intron > >> sequences. Someone told me that I can use Biostring. But I tried, it > >> failed. > >> > >> Do you know how to use Biostring to solve this problem, or is there any > >> other possibility to solve this problem? > > > > Steve and Ivan mention GRanges + BSgenome. I wanted to mention > > Rsamtools::FaFile with indexFa, which allows input of DNA sequence from > > fasta files rather than BSgenome packages. You gain flexibility in terms > of > > where the sequence comes from, but lose BSgenome features like masking > and > > data management. > > That's super cool! Thanks for adding that > > This means that we have to keep those *.fa files uncompressed tho, > right? I wonder if it's worth thinking about (for R 2.15 ;-) handling > compressed *.fa files using BGZF along these lines, too: > > http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better- gzip.html > > Actually, the FaFile stuff uses RAZF compression, another extension of gzip. Just to add another plug, the rtracklayer package now supports sequence from 2bit files. This was mostly motivated by the fact that the RAZF compression is not well supported in Java, so Java tools typically support 2bit. It also supports a single mask, although importing of the mask is not yet supported in rtracklayer. Not sure about how the two formats compare in terms of compactness, but 2bit is obviously highly compact (2 bits per base). While size does not matter so much these days, it may make a difference in query speed. There, 2bit has the advantage of random access, whereas the RAZF stuff sounds more complex. -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi Yating, Brief note: when replying to emails form the bioconductor list, make sure to use "reply all" so the list is cc'd, this way others can benefit from the help but also they can jump in to help in better ways. Ok, so: On Thu, Nov 10, 2011 at 5:16 PM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Hi Steve > > Thanks very much for your answer. I am really a new hand in bioinfo. > > Actually now I have problem to create the GRanges object. The genes are > from Apis mellifera.The data I have is like: > "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" > "strand" > "1" "GB17244" "Group5.28" 95590 95776 1 > "2" "GB17244" "Group5.28" 95865 96055 1 > "3" "GB17244" "Group5.28" 96189 96428 1 > "4" "GB17244" "Group5.28" 100525 100632 1 > "5" "GB17244" "Group5.28" 100689 100751 1 > "6" "GB17244" "Group5.28" 100835 101145 1 > "7" "GB17244" "Group5.28" 101265 101547 1 > "8" "GB17244" "Group5.28" 101632 101673 1 > "9" "GB12201" "Group13.1" 374539 374712 -1 > "10" "GB12201" "Group13.1" 366611 366778 -1 > "11" "GB12201" "Group13.1" 366400 366513 -1 > "12" "GB12201" "Group13.1" 366115 366201 -1 > "13" "GB16402" "GroupUn.2" 208099 208254 1 > Do you know how to convert it into GRanges object. Interesting names from chromosomes ... are they really "GBXXXX"? Anyway, assuming they are, this is how you can convert that table into a GRanges object. Let's assume the table you listed above is in a data.frame called `xcripts` R> gr.exons <- with(xcripts, { GRanges(chromosome_name, IRanges(exon_chrom_start, exon_chrom_end), ifelse(strand == 1, '+', '-')) }) HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Also, just to make sure the i's are dotted, and t's are crossed. This code I gave: R> gr.exons <- with(xcripts, { ?GRanges(chromosome_name, ? ?IRanges(exon_chrom_start, exon_chrom_end), ? ?ifelse(strand == 1, '+', '-')) }) Assumes you've already loaded the GenomicRanges library via a call like so: R> library(GenomicRanges) You mentioned you're still a bit green with bioconductor, so I just wanted to make sure you knew where to find "GRanges" ... sorry if that's too obvious. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi, everyone, Now I am facing new problem. The following file I got from biomart. So, GB*** is ensemble_gene_id.and Group*** is chromosome_name. The first problem is that, in the file is exon_chrom_start and end, but should I need intron_start and end ? since I need intron sequences. "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" "strand" "1" "GB17244" "Group5.28" 95590 95776 1 "2" "GB17244" "Group5.28" 95865 96055 1 "3" "GB17244" "Group5.28" 96189 96428 1 "4" "GB17244" "Group5.28" 100525 100632 1 "5" "GB17244" "Group5.28" 100689 100751 1 "6" "GB17244" "Group5.28" 100835 101145 1 "7" "GB17244" "Group5.28" 101265 101547 1 "8" "GB17244" "Group5.28" 101632 101673 1 "9" "GB12201" "Group13.1" 374539 374712 -1 "10" "GB12201" "Group13.1" 366611 366778 -1 And another prolem is in the script. library(Biostrings) library(GenomicRanges) source("http://ftp:/biocLite.R") biocLite("BSgenome.Amellifera.BeeBase.assembly4") library(BSgenome.Amellifera.BeeBase.assembly4) library(BSgenome.Amellifera.BeeBase.assembly4) #Loading required package: BSgenome #Warning message: #package 'BSgenome' was built under R version 2.13.2 intron_ranges<-read.table("intron_ranges.txt") gr.exons <- with(intron_ranges, { GRanges(chromosome_name, IRanges(exon_chrom_start, exon_chrom_end), ifelse(strand == 1, '+', '-'))}) intron.seqs <-getSeq(Amellifera, gr.exons, as.character=FALSE) #Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence Group5.28 not found My guess is that the chromosome names I input is not the same stored in the BS genome. But that is the only way I can get the IRanges. Or does anyone know how to get IRanges direct from BS genome. Thank you very much. Yating Molecular Medicine Master's ProgramCharit? Universit?tsmedizin Berlin > Also, just to make sure the i's are dotted, and t's are crossed. > > This code I gave: > R> gr.exons <- with(xcripts, { > ?GRanges(chromosome_name, > ? ?IRanges(exon_chrom_start, exon_chrom_end), > ? ?ifelse(strand == 1, '+', '-')) > }) > > Assumes you've already loaded the GenomicRanges library via a call like > so: > > R> library(GenomicRanges) > > You mentioned you're still a bit green with bioconductor, so I just > wanted to make sure you knew where to find "GRanges" ... sorry if > that's too obvious. > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
Hi, everyone, Now I am facing new problem. The following file I got from biomart. So, GB*** is ensemble_gene_id.and Group*** is chromosome_name. The first problem is that, in the file is exon_chrom_start and end, but should I need intron_start and end ? since I need intron sequences. "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" "strand" "1" "GB17244" "Group5.28" 95590 95776 1 "2" "GB17244" "Group5.28" 95865 96055 1 "3" "GB17244" "Group5.28" 96189 96428 1 "4" "GB17244" "Group5.28" 100525 100632 1 "5" "GB17244" "Group5.28" 100689 100751 1 "6" "GB17244" "Group5.28" 100835 101145 1 "7" "GB17244" "Group5.28" 101265 101547 1 "8" "GB17244" "Group5.28" 101632 101673 1 "9" "GB12201" "Group13.1" 374539 374712 -1 "10" "GB12201" "Group13.1" 366611 366778 -1 And another prolem is in the script. library(Biostrings) library(GenomicRanges) source("http://ftp:/biocLite.R") biocLite("BSgenome.Amellifera.BeeBase.assembly4") library(BSgenome.Amellifera.BeeBase.assembly4) library(BSgenome.Amellifera.BeeBase.assembly4) #Loading required package: BSgenome #Warning message: #package 'BSgenome' was built under R version 2.13.2 intron_ranges<-read.table("intron_ranges.txt") gr.exons <- with(intron_ranges, { GRanges(chromosome_name, IRanges(exon_chrom_start, exon_chrom_end), ifelse(strand == 1, '+', '-'))}) intron.seqs <-getSeq(Amellifera, gr.exons, as.character=FALSE) #Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence Group5.28 not found My guess is that the chromosome names I input is not the same stored in the BS genome. But that is the only way I can get the IRanges. Or does anyone know how to get IRanges direct from BS genome. Thank you very much. Yating Molecular Medicine Master's ProgramCharit? Universit?tsmedizin Berlin > Also, just to make sure the i's are dotted, and t's are crossed. > > This code I gave: > R> gr.exons <- with(xcripts, { > GRanges(chromosome_name, > IRanges(exon_chrom_start, exon_chrom_end), > ifelse(strand == 1, '+', '-')) > }) > > Hi Yating, > > Brief note: when replying to emails form the bioconductor list, make > sure to use "reply all" so the list is cc'd, this way others can > benefit from the help but also they can jump in to help in better > ways. > > Ok, so: > > On Thu, Nov 10, 2011 at 5:16 PM, Yating Cheng <yating.cheng at="" charite.de=""> > wrote: >> Hi Steve >> >> Thanks very much for your answer. I am really a new hand in bioinfo. >> >> Actually now I have problem to create the GRanges object. The genes are >> from Apis mellifera.The data I have is like: >> "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" >> "strand" >> "1" "GB17244" "Group5.28" 95590 95776 1 >> "2" "GB17244" "Group5.28" 95865 96055 1 >> "3" "GB17244" "Group5.28" 96189 96428 1 >> "4" "GB17244" "Group5.28" 100525 100632 1 >> "5" "GB17244" "Group5.28" 100689 100751 1 >> "6" "GB17244" "Group5.28" 100835 101145 1 >> "7" "GB17244" "Group5.28" 101265 101547 1 >> "8" "GB17244" "Group5.28" 101632 101673 1 >> "9" "GB12201" "Group13.1" 374539 374712 -1 >> "10" "GB12201" "Group13.1" 366611 366778 -1 >> "11" "GB12201" "Group13.1" 366400 366513 -1 >> "12" "GB12201" "Group13.1" 366115 366201 -1 >> "13" "GB16402" "GroupUn.2" 208099 208254 1 >> Do you know how to convert it into GRanges object. > > Interesting names from chromosomes ... are they really "GBXXXX"? > > Anyway, assuming they are, this is how you can convert that table into > a GRanges object. Let's assume the table you listed above is in a > data.frame called `xcripts` > > R> gr.exons <- with(xcripts, { > GRanges(chromosome_name, > IRanges(exon_chrom_start, exon_chrom_end), > ifelse(strand == 1, '+', '-')) > }) > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
Hi, On Sun, Nov 13, 2011 at 9:46 AM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Hi, everyone, > > Now I am facing new problem. > > The following file I got from biomart. So, GB*** is ensemble_gene_id.and > Group*** is chromosome_name. > > The first problem is that, in the file is exon_chrom_start and end, but > should I need intron_start and end ? since I need intron sequences. The `gaps` function will help with that ... however, if there are intron start/end columns you can fetch from biomart, feel free to go that way ... > "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" > "strand" > "1" "GB17244" "Group5.28" 95590 95776 1 > "2" "GB17244" "Group5.28" 95865 96055 1 > "3" "GB17244" "Group5.28" 96189 96428 1 > "4" "GB17244" "Group5.28" 100525 100632 1 > "5" "GB17244" "Group5.28" 100689 100751 1 > "6" "GB17244" "Group5.28" 100835 101145 1 > "7" "GB17244" "Group5.28" 101265 101547 1 > "8" "GB17244" "Group5.28" 101632 101673 1 > "9" "GB12201" "Group13.1" 374539 374712 -1 > "10" "GB12201" "Group13.1" 366611 366778 -1 > > And another prolem is in the script. > > library(Biostrings) > > library(GenomicRanges) > > ?source("http://ftp:/biocLite.R") > ? ?biocLite("BSgenome.Amellifera.BeeBase.assembly4") > > library(BSgenome.Amellifera.BeeBase.assembly4) > ?library(BSgenome.Amellifera.BeeBase.assembly4) > #Loading required package: BSgenome > #Warning message: > #package 'BSgenome' was built under R version 2.13.2 > > intron_ranges<-read.table("intron_ranges.txt") > > gr.exons <- with(intron_ranges, { > ?GRanges(chromosome_name, > ? IRanges(exon_chrom_start, exon_chrom_end), > ? ifelse(strand == 1, '+', '-'))}) > > intron.seqs <-getSeq(Amellifera, gr.exons, as.character=FALSE) > > #Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, > width, ?: ? sequence Group5.28 not found > > My guess is that the chromosome names I input is not the same stored in > the BS genome. You don't have to guess, what is the output of `seqlevels(Amellifera)`? You'll get a list of chromosome names there. It's your job to translate your "Group5.28" names to the ones listed in the Amellifera genome. > But that is the only way I can get the IRanges. Or does > anyone know how to get IRanges direct from BS genome. You can extract one chromosome at a time from the BSgenome object, then just query it w/ an IRanges object. For example, in human: R> library(BSgenome.Hsapiens.UCSC.hg19) R> chr1 <- unmasked(Hsapiens$chr1) R> some.seqs <- Views(chr1, IRanges(sample(249250600, 10), width=10)) R> some.seqs some.seqs Views on a 249250621-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNN views: start end width [1] 12274027 12274036 10 [AAAGAGCAAC] [2] 149890737 149890746 10 [AAAGCCATAA] [3] 49569412 49569421 10 [AAATTTGGAT] [4] 230817984 230817993 10 [TTAGATTTTG] [5] 23827107 23827116 10 [TTTCCTGTAT] [6] 126211483 126211492 10 [NNNNNNNNNN] [7] 230032024 230032033 10 [GTGAATAATT] [8] 182185491 182185500 10 [TGCATTTGTC] [9] 229816410 229816419 10 [TTCCTTAATG] [10] 99265270 99265279 10 [CACATTTTCT] HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi, I checked the function"gaps". But I am not sure how to use it. Is someone know how to use it to get the positions of introns? Thank you. Yating > Hi, > > On Sun, Nov 13, 2011 at 9:46 AM, Yating Cheng <yating.cheng at="" charite.de=""> > wrote: >> Hi, everyone, >> >> Now I am facing new problem. >> >> The following file I got from biomart. So, GB*** is ensemble_gene_id.and >> Group*** is chromosome_name. >> >> The first problem is that, in the file is exon_chrom_start and end, but >> should I need intron_start and end ? since I need intron sequences. > > The `gaps` function will help with that ... however, if there are > intron start/end columns you can fetch from biomart, feel free to go > that way ... > >> "ensembl_gene_id" "chromosome_name" "exon_chrom_start" "exon_chrom_end" >> "strand" >> "1" "GB17244" "Group5.28" 95590 95776 1 >> "2" "GB17244" "Group5.28" 95865 96055 1 >> "3" "GB17244" "Group5.28" 96189 96428 1 >> "4" "GB17244" "Group5.28" 100525 100632 1 >> "5" "GB17244" "Group5.28" 100689 100751 1 >> "6" "GB17244" "Group5.28" 100835 101145 1 >> "7" "GB17244" "Group5.28" 101265 101547 1 >> "8" "GB17244" "Group5.28" 101632 101673 1 >> "9" "GB12201" "Group13.1" 374539 374712 -1 >> "10" "GB12201" "Group13.1" 366611 366778 -1 >> >> And another prolem is in the script. >> >> library(Biostrings) >> >> library(GenomicRanges) >> >> ?source("http://ftp:/biocLite.R") >> ? ?biocLite("BSgenome.Amellifera.BeeBase.assembly4") >> >> library(BSgenome.Amellifera.BeeBase.assembly4) >> ?library(BSgenome.Amellifera.BeeBase.assembly4) >> #Loading required package: BSgenome >> #Warning message: >> #package 'BSgenome' was built under R version 2.13.2 >> >> intron_ranges<-read.table("intron_ranges.txt") >> >> gr.exons <- with(intron_ranges, { >> ?GRanges(chromosome_name, >> ? IRanges(exon_chrom_start, exon_chrom_end), >> ? ifelse(strand == 1, '+', '-'))}) >> >> intron.seqs <-getSeq(Amellifera, gr.exons, as.character=FALSE) >> >> #Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, >> width, ?: ? sequence Group5.28 not found >> >> My guess is that the chromosome names I input is not the same stored in >> the BS genome. > > You don't have to guess, what is the output of > `seqlevels(Amellifera)`? You'll get a list of chromosome names there. > > It's your job to translate your "Group5.28" names to the ones listed > in the Amellifera genome. > >> But that is the only way I can get the IRanges. Or does >> anyone know how to get IRanges direct from BS genome. > > You can extract one chromosome at a time from the BSgenome object, > then just query it w/ an IRanges object. > > For example, in human: > > R> library(BSgenome.Hsapiens.UCSC.hg19) > R> chr1 <- unmasked(Hsapiens$chr1) > R> some.seqs <- Views(chr1, IRanges(sample(249250600, 10), width=10)) > R> some.seqs > some.seqs > Views on a 249250621-letter DNAString subject > subject: > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNN > views: > start end width > [1] 12274027 12274036 10 [AAAGAGCAAC] > [2] 149890737 149890746 10 [AAAGCCATAA] > [3] 49569412 49569421 10 [AAATTTGGAT] > [4] 230817984 230817993 10 [TTAGATTTTG] > [5] 23827107 23827116 10 [TTTCCTGTAT] > [6] 126211483 126211492 10 [NNNNNNNNNN] > [7] 230032024 230032033 10 [GTGAATAATT] > [8] 182185491 182185500 10 [TGCATTTGTC] > [9] 229816410 229816419 10 [TTCCTTAATG] > [10] 99265270 99265279 10 [CACATTTTCT] > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
Hi, On Tue, Nov 15, 2011 at 12:19 PM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Hi, > > I checked the function"gaps". But I am not sure how to use it. Is someone > know how to use it to get the positions of introns? Imagine a simple gene with two exons defined by these regions: R> (exons <- GRanges('chr1', IRanges(c(10, 100), width=21), '+')) GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 10, 30] + [2] chr1 [100, 120] + Look for its introns here: R> gaps(exons) GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 1, 9] + [2] chr1 [31, 99] + In this case, you will have to remove the [1,9] range (maybe that's intergenic region), which you can do with over fun interval operations -- perhaps you can subsetByOverlaps the ranges returned from `gaps` with the transcription start and end of your gene. Or, if you're working w/ a GenomicFeatures TranscriptDb, there is always the `intronsByTranscript` method ... HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi, steve Thanks for your answer. About converting the names of chromosome, it is kind of difficult. I need the intron sequences for every gene, but the chromosome names of some genes I got from biomart are the same. I am not sure about how can I get specific genes from BS genome, and match them to gene-IDs. Thanks . Yating > Hi, > > On Tue, Nov 15, 2011 at 12:19 PM, Yating Cheng <yating.cheng at="" charite.de=""> > wrote: >> Hi, >> >> I checked the function"gaps". But I am not sure how to use it. Is >> someone >> know how to use it to get the positions of introns? > > Imagine a simple gene with two exons defined by these regions: > > R> (exons <- GRanges('chr1', IRanges(c(10, 100), width=21), '+')) > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 10, 30] + > [2] chr1 [100, 120] + > > Look for its introns here: > > R> gaps(exons) > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 1, 9] + > [2] chr1 [31, 99] + > > In this case, you will have to remove the [1,9] range (maybe that's > intergenic region), which you can do with over fun interval operations > -- perhaps you can subsetByOverlaps the ranges returned from `gaps` > with the transcription start and end of your gene. > > Or, if you're working w/ a GenomicFeatures TranscriptDb, there is > always the `intronsByTranscript` method ... > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
HI Yating, On Tue, Nov 15, 2011 at 4:48 PM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Hi, steve > > Thanks for your answer. About converting the names of chromosome, it is > kind of difficult. > > I need the intron sequences for every gene, but the chromosome names of > some genes I got from biomart are the same. Sorry, I'm confused. What organism are you working with? Can you construct a TranscriptDb for it using the GenomicFeatures package (you'll have to read through the GenomicFeatures vignette). > I am not sure about how can I get specific genes from BS genome, and match > them to gene-IDs. BSgenome packages do not have gene specific information, they only contain sequence information for different organisms. The fact that you can use a BSgenome.* package for your organism of interest suggests that you should be able to build a TranscriptDb for it, so that's good news ... -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi Steve, I am working with Apis mellifera. Actually I am also trying to use the alignment function to get the intron sequence.which is much easier for me to understand. But there are some problem with that also. rm(list=ls(all=TRUE)) # set working directory setwd("F:/Yating/introns") # load Biostrings library library(Biostrings) # load data genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE) exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE) # create object for storage of extracted introns introns<-data.frame() # loop over all genes for(i in 1:length(genebody[,1])){ tmp.id<-as.vector(genebody[i,1]) # get gene id tmp.subject<-as.vector(genebody[i,2]) # get gene sequence tmp.exons<-exons[which(exons[,1]==tmp.id),] # get exons of the selected genes tmp.pattern<-as.vector(tmp.exons[,3]) # define exons as patterns for alignment tmp.align<-pairwiseAlignment(pattern=tmp.pattern, subject=tmp.subject,type="local") # align all exons pairwise to gene sequence tmp.start<-start(subject(tmp.align)) # vector of all alignment starts tmp.end<-end(subject(tmp.align)) # vector of all alignment ends for(ex in 1:(length(tmp.end)-1)){ # extract introns tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1) introns<-rbind(introns,cbindtmp.id,tmp.end[ex]+1,tmp. start[ex+1]-1,tmp.intron)) } } } # define useful names for columns colnames(introns)<-c("gene.id","pos.start","pos.end","intron") # write output write.table(introns,"Introns.txt", sep="\t", quote=FALSE, row.names=FALSE) The problem is the unvalid arguments in the substr.... Best Wishes Yating > HI Yating, > > On Tue, Nov 15, 2011 at 4:48 PM, Yating Cheng <yating.cheng at="" charite.de=""> > wrote: >> Hi, steve >> >> Thanks for your answer. About converting the names of chromosome, it is >> kind of difficult. >> >> I need the intron sequences for every gene, but the chromosome names of >> some genes I got from biomart are the same. > > Sorry, I'm confused. > > What organism are you working with? > > Can you construct a TranscriptDb for it using the GenomicFeatures > package (you'll have to read through the GenomicFeatures vignette). > >> I am not sure about how can I get specific genes from BS genome, and >> match >> them to gene-IDs. > > BSgenome packages do not have gene specific information, they only > contain sequence information for different organisms. > > The fact that you can use a BSgenome.* package for your organism of > interest suggests that you should be able to build a TranscriptDb for > it, so that's good news ... > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
You should be able to use makeTranscriptDbFromUCSC to get the relevant annotations from UCSC, under the "genscan" table: txdb <- makeTranscriptDbFromUCSC("apiMel1", "genscan") Then you can simply run intronsByTranscript(txdb) to get your introns. Michael On Tue, Nov 15, 2011 at 2:57 PM, Yating Cheng <yating.cheng@charite.de>wrote: > Hi Steve, > > I am working with Apis mellifera. > > Actually I am also trying to use the alignment function to get the intron > sequence.which is much easier for me to understand. But there are some > problem with that also. > > rm(list=ls(all=TRUE)) > > # set working directory > setwd("F:/Yating/introns") > > # load Biostrings library > library(Biostrings) > > # load data > genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE) > exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE) > > # create object for storage of extracted introns > introns<-data.frame() > > # loop over all genes > for(i in 1:length(genebody[,1])){ > > tmp.id<-as.vector(genebody[i,1]) > # get gene id > tmp.subject<-as.vector(genebody[i,2]) > # get gene sequence > tmp.exons<-exons[which(exons[,1]==tmp.id),] > # get exons of the > selected genes > > tmp.pattern<-as.vector(tmp.exons[,3]) > # define exons as patterns > for alignment > tmp.align<-pairwiseAlignment(pattern=tmp.pattern, > subject=tmp.subject,type="local") # align all exons > pairwise to gene > sequence > tmp.start<-start(subject(tmp.align)) > # vector of all alignment > starts > tmp.end<-end(subject(tmp.align)) > # vector of all > alignment ends > > > > for(ex in 1:(length(tmp.end)-1)){ > # extract introns > > > tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1) > introns<-rbind(introns,cbindtmp.id > ,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron)) > > } > > } > > } > > # define useful names for columns > colnames(introns)<-c("gene.id","pos.start","pos.end","intron") > > # write output > write.table(introns,"Introns.txt", sep="\t", quote=FALSE, row.names=FALSE) > > The problem is the unvalid arguments in the substr.... > > Best Wishes > > Yating > > > > > HI Yating, > > > > On Tue, Nov 15, 2011 at 4:48 PM, Yating Cheng <yating.cheng@charite.de> > > wrote: > >> Hi, steve > >> > >> Thanks for your answer. About converting the names of chromosome, it is > >> kind of difficult. > >> > >> I need the intron sequences for every gene, but the chromosome names of > >> some genes I got from biomart are the same. > > > > Sorry, I'm confused. > > > > What organism are you working with? > > > > Can you construct a TranscriptDb for it using the GenomicFeatures > > package (you'll have to read through the GenomicFeatures vignette). > > > >> I am not sure about how can I get specific genes from BS genome, and > >> match > >> them to gene-IDs. > > > > BSgenome packages do not have gene specific information, they only > > contain sequence information for different organisms. > > > > The fact that you can use a BSgenome.* package for your organism of > > interest suggests that you should be able to build a TranscriptDb for > > it, so that's good news ... > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi, On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng <yating.cheng at="" charite.de=""> wrote: > Hi Steve, > > I am working with Apis mellifera. > > Actually I am also trying to use the alignment function to get the intron > sequence.which is much easier for me to understand. ?But there are some > problem with that also. I'm not sure why you want to align sequences to find them on the chromosome, when you already know where they are on the chromosome? > rm(list=ls(all=TRUE)) > > # set working directory > setwd("F:/Yating/introns") > > # load Biostrings library > library(Biostrings) > > # load data > genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE) > exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE) > > # create object for storage of extracted introns > introns<-data.frame() > > # loop over all genes > for(i in 1:length(genebody[,1])){ > > ? ? ? ?tmp.id<-as.vector(genebody[i,1]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# get gene id > ? ? ? ?tmp.subject<-as.vector(genebody[i,2]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # get gene sequence > ? ? ? ?tmp.exons<-exons[which(exons[,1]==tmp.id),] ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # get exons of the > selected genes > > ? ? ? ?tmp.pattern<-as.vector(tmp.exons[,3]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # define exons as patterns > for alignment > ? ? ? ?tmp.align<-pairwiseAlignment(pattern=tmp.pattern, > subject=tmp.subject,type="local") ? ? ? ? ? ? ? ? ? ? ? # align all exons pairwise to gene > sequence > ? ? ? ?tmp.start<-start(subject(tmp.align)) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# vector of all alignment > starts > ? ? ? ?tmp.end<-end(subject(tmp.align)) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# vector of all alignment ends > > > > ? ? ? ?for(ex in 1:(length(tmp.end)-1)){ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # extract introns > > ? ? ? ? ? ? ? ?tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1) > ? ? ? ? ? ? ? ?introns<-rbind(introns,cbindtmp.id,tmp.end[ex]+1,tmp .start[ex+1]-1,tmp.intron)) > > ? ? ? ? ? ? ? ?} > > ? ? ? ? ? ? ? ?} > > ? ? ? ?} > > # define useful names for columns > colnames(introns)<-c("gene.id","pos.start","pos.end","intron") > > # write output > write.table(introns,"Introns.txt", sep="\t", quote=FALSE, row.names=FALSE) > > The problem is the unvalid arguments in the substr.... Well I can't tell for sure as you haven't told us what errors you are getting, but the names of your variables suggest that you are using substr with a number in its 2nd arg that is larger than the number in its 3rd arg. I still think your life will be easier if you use GenomicFeatures to download your gene annotations and then play with them that way. You will have to invest sometime to learn a bit more about the functionality provided by the IRanges, GRanges, GenomicFeatures packages, but it will be time well spent since you will likely finding yourself reinventing the functionality it provides anyway. For instance: R> library(GenomicFeatures) R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2", tablename="ensGene") R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE) R> head(introns) GRangesList of length 6: $ENSAPMT00000020635 GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] Group1 [745, 821] + $ENSAPMT00000003962 GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand [1] Group1 [2946, 3009] - [2] Group1 [3155, 3331] - And to get the intronic sequence from "ENSAPMT00000020635" you can do: R> library(BSgenome.Amellifera.UCSC.apiMel2) R> getSeq(Amellifera, introns[[1]]) You can even get the sequences of all introns at once: R> all.introns <- unlist(introns) R> all.seqs <- getSeq(Amellifera, all.introns) HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi, It is lucky that I finally I got the sequences. But it is strange that I got two different results with two scripts. And another problem is that I can only get the sequences but how can I match them to the geneID. Because I am not sure whether these sequences are from the same gene. I have to get the intron sequences from a whole gene. The first one I tried R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2", tablename="ensGene") R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE) R> library(BSgenome.Amellifera.UCSC.apiMel2) R> all.introns <- unlist(introns) R> all.seqs <- getSeq(Amellifera, all.introns) With this script I got 148077 sequeces The second one was library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC("apiMel2", "genscan") introns<-intronsByTranscript(txdb) introns <- unlist(introns) intron.seqs <-(getSeq(Amellifera,introns,as.character=FALSE)) intron_seqs<as.character(intron.seqs) with="" this="" script="" i="" got="" 80053="" sequences.="" i="" think="" the="" first="" one="" makes="" sense,="" they="" should="" be="" the="" seperated="" introns,="" but="" i="" need="" the="" attached="" gene="" names,="" so="" that="" i="" can="" do="" further="" calculation.="" regarding="" the="" second="" one,="" i="" think="" i="" may="" did="" something="" wrong,="" but="" i="" could="" not="" figure="" it="" out.="" thanks="" yating=""> Hi, > > On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng <yating.cheng at="" charite.de=""> > wrote: >> Hi Steve, >> >> I am working with Apis mellifera. >> >> Actually I am also trying to use the alignment function to get the >> intron >> sequence.which is much easier for me to understand. ?But there are some >> problem with that also. > > I'm not sure why you want to align sequences to find them on the > chromosome, when you already know where they are on the chromosome? > >> rm(list=ls(all=TRUE)) >> >> # set working directory >> setwd("F:/Yating/introns") >> >> # load Biostrings library >> library(Biostrings) >> >> # load data >> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE) >> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE) >> >> # create object for storage of extracted introns >> introns<-data.frame() >> >> # loop over all genes >> for(i in 1:length(genebody[,1])){ >> >> ? ? ? ?tmp.id<-as.vector(genebody[i,1]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# get gene id >> ? ? ? ?tmp.subject<-as.vector(genebody[i,2]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # get gene sequence >> ? ? ? ?tmp.exons<-exons[which(exons[,1]==tmp.id),] ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # get exons of the >> selected genes >> >> ? ? ? ?tmp.pattern<-as.vector(tmp.exons[,3]) ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # define exons as patterns >> for alignment >> ? ? ? ?tmp.align<-pairwiseAlignment(pattern=tmp.pattern, >> subject=tmp.subject,type="local") ? ? ? ? ? ? ? ? ? ? ? # align all >> exons pairwise to gene >> sequence >> ? ? ? ?tmp.start<-start(subject(tmp.align)) ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# vector of all alignment >> starts >> ? ? ? ?tmp.end<-end(subject(tmp.align)) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# vector of all >> alignment ends >> >> >> >> ? ? ? ?for(ex in 1:(length(tmp.end)-1)){ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? # extract introns >> >> ? ? ? ? ? ? ? >> ?tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1) >> ? ? ? ? ? ? ? >> ?introns<-rbind(introns,cbindtmp.id,tmp.end[ex]+1,tmp.start[ex+1]- 1,tmp.intron)) >> >> ? ? ? ? ? ? ? ?} >> >> ? ? ? ? ? ? ? ?} >> >> ? ? ? ?} >> >> # define useful names for columns >> colnames(introns)<-c("gene.id","pos.start","pos.end","intron") >> >> # write output >> write.table(introns,"Introns.txt", sep="\t", quote=FALSE, >> row.names=FALSE) >> >> The problem is the unvalid arguments in the substr.... > > Well I can't tell for sure as you haven't told us what errors you are > getting, but the names of your variables suggest that you are using > substr with a number in its 2nd arg that is larger than the number in > its 3rd arg. > > I still think your life will be easier if you use GenomicFeatures to > download your gene annotations and then play with them that way. You > will have to invest sometime to learn a bit more about the > functionality provided by the IRanges, GRanges, GenomicFeatures > packages, but it will be time well spent since you will likely finding > yourself reinventing the functionality it provides anyway. > > For instance: > > R> library(GenomicFeatures) > R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2", > tablename="ensGene") > R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE) > R> head(introns) > GRangesList of length 6: > $ENSAPMT00000020635 > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] Group1 [745, 821] + > > $ENSAPMT00000003962 > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > [1] Group1 [2946, 3009] - > [2] Group1 [3155, 3331] - > > And to get the intronic sequence from "ENSAPMT00000020635" you can do: > > R> library(BSgenome.Amellifera.UCSC.apiMel2) > R> getSeq(Amellifera, introns[[1]]) > > You can even get the sequences of all introns at once: > > R> all.introns <- unlist(introns) > R> all.seqs <- getSeq(Amellifera, all.introns) > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
ADD REPLY
0
Entering edit mode
Hi Yating, On 11-11-16 09:26 AM, Yating Cheng wrote: > Hi, It is lucky that I finally I got the sequences. But it is strange that > I got two different results with two scripts. And another problem is that > I can only get the sequences but how can I match them to the geneID. > Because I am not sure whether these sequences are from the same gene. I > have to get the intron sequences from a whole gene. > > The first one I tried > > R> bee.txdb<- makeTranscriptDbFromUCSC(genome="apiMel2", > tablename="ensGene") > R> introns<- intronsByTranscript(bee.txdb, use.names=TRUE) > R> library(BSgenome.Amellifera.UCSC.apiMel2) > R> all.introns<- unlist(introns) > R> all.seqs<- getSeq(Amellifera, all.introns) > > With this script I got 148077 sequeces > > The second one was > library(GenomicFeatures) > > > txdb<- makeTranscriptDbFromUCSC("apiMel2", "genscan") > introns<-intronsByTranscript(txdb) > introns<- unlist(introns) > > intron.seqs<-(getSeq(Amellifera,introns,as.character=FALSE)) > > intron_seqs<as.character(intron.seqs)> > With this script I got 80053 sequences. > > I think the first one makes sense, They should be the seperated introns, > but I need the attached gene Names, so that I can do further calculation. 1. First extract the mapping between transcript ids and gene ids: tx <- transcripts(bee.txdb, columns=c("tx_id", "tx_name", "gene_id")) > elementMetadata(tx)[1:3, ] DataFrame with 3 rows and 3 columns tx_id tx_name gene_id <integer> <character> <compressedcharacterlist> 1 1 ENSAPMT00000020635 ENSAPMG00000012500 2 5 ENSAPMT00000025701 ENSAPMG00000016403 3 3 ENSAPMT00000028256 ENSAPMG00000016403 Let's make sure that every transcript is mapped to exactly 1 gene: > anyDuplicated(elementMetadata(tx)$tx_name) [1] 0 > table(elementLengths(elementMetadata(tx)$gene)) 1 27755 Good! So: tx2gene <- unlist(elementMetadata(tx)$gene_id) names(tx2gene) <- elementMetadata(tx)$tx_name The tx-to-gene mapping is a named character vector: > tx2gene[1:3] ENSAPMT00000020635 ENSAPMT00000025701 ENSAPMT00000028256 "ENSAPMG00000012500" "ENSAPMG00000016403" "ENSAPMG00000016403" 2. Use this mapping to replace the names of your 'introns' object: introns <- intronsByTranscript(bee.txdb, use.names=TRUE) names(introns) <- tx2gene[names(introns)] 3. Unlist 'introns" and propagate the names "by hand": all.introns <- unlist(introns, use.names=FALSE) names(all.introns) <- rep(names(introns), elementLengths(introns)) 4. When you extract the introns sequences, propagate the names again: all.seqs <- getSeq(Amellifera, all.introns) names(all.seqs) <- names(all.introns) > > Regarding the second one, I think I may did something wrong, but I could > not figure it out. I don't think you are doing anything wrong. You are using 2 different UCSC tracks i.e. 2 different sets of annotations. Each of them is coming from a different organization, that is, Ensembl for the "ensGene" track, and the Genscan program (written by Chris Burge) for the "genscan" track. UCSC provides details about those tracks: http://genome.ucsc.edu/cgi- bin/hgTrackUi?hgsid=223923545&c=GroupUn&g=ensGene http://genome.ucsc.edu/cgi- bin/hgTrackUi?hgsid=223923545&c=GroupUn&g=genscan That's life with annotations, I suppose. If there was 1 authoritative set of annotations for a given organism that everybody agrees on, Bioinformatics would be a very different world... Another thing to keep in mind if you're doing sequence extraction from a genome based on a set of annotations is that the reference genome and the annotations must *match*, which was probably not the case in your first attempts (when you started this thread) i.e. it seems that you were using annotations you got from Ensembl (thru BiomaRt) without checking which reference genome they were based on and then using a BSgenome package for Bee that was not necessarily the correct one. As Steve mentioned, by using makeTranscriptDbFromUCSC() and the corresponding BSgenome package, you're safe. Cheers, H. > > Thanks > > Yating > > >> Hi, >> >> On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng<yating.cheng at="" charite.de=""> >> wrote: >>> Hi Steve, >>> >>> I am working with Apis mellifera. >>> >>> Actually I am also trying to use the alignment function to get the >>> intron >>> sequence.which is much easier for me to understand. But there are some >>> problem with that also. >> >> I'm not sure why you want to align sequences to find them on the >> chromosome, when you already know where they are on the chromosome? >> >>> rm(list=ls(all=TRUE)) >>> >>> # set working directory >>> setwd("F:/Yating/introns") >>> >>> # load Biostrings library >>> library(Biostrings) >>> >>> # load data >>> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE) >>> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE) >>> >>> # create object for storage of extracted introns >>> introns<-data.frame() >>> >>> # loop over all genes >>> for(i in 1:length(genebody[,1])){ >>> >>> tmp.id<-as.vector(genebody[i,1]) >>> # get gene id >>> tmp.subject<-as.vector(genebody[i,2]) >>> # get gene sequence >>> tmp.exons<-exons[which(exons[,1]==tmp.id),] >>> # get exons of the >>> selected genes >>> >>> tmp.pattern<-as.vector(tmp.exons[,3]) >>> # define exons as patterns >>> for alignment >>> tmp.align<-pairwiseAlignment(pattern=tmp.pattern, >>> subject=tmp.subject,type="local") # align all >>> exons pairwise to gene >>> sequence >>> tmp.start<-start(subject(tmp.align)) >>> # vector of all alignment >>> starts >>> tmp.end<-end(subject(tmp.align)) >>> # vector of all >>> alignment ends >>> >>> >>> >>> for(ex in 1:(length(tmp.end)-1)){ >>> # extract introns >>> >>> >>> tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1) >>> >>> introns<-rbind(introns,cbindtmp.id,tmp.end[ex]+1,tmp.start[ex+1 ]-1,tmp.intron)) >>> >>> } >>> >>> } >>> >>> } >>> >>> # define useful names for columns >>> colnames(introns)<-c("gene.id","pos.start","pos.end","intron") >>> >>> # write output >>> write.table(introns,"Introns.txt", sep="\t", quote=FALSE, >>> row.names=FALSE) >>> >>> The problem is the unvalid arguments in the substr.... >> >> Well I can't tell for sure as you haven't told us what errors you are >> getting, but the names of your variables suggest that you are using >> substr with a number in its 2nd arg that is larger than the number in >> its 3rd arg. >> >> I still think your life will be easier if you use GenomicFeatures to >> download your gene annotations and then play with them that way. You >> will have to invest sometime to learn a bit more about the >> functionality provided by the IRanges, GRanges, GenomicFeatures >> packages, but it will be time well spent since you will likely finding >> yourself reinventing the functionality it provides anyway. >> >> For instance: >> >> R> library(GenomicFeatures) >> R> bee.txdb<- makeTranscriptDbFromUCSC(genome="apiMel2", >> tablename="ensGene") >> R> introns<- intronsByTranscript(bee.txdb, use.names=TRUE) >> R> head(introns) >> GRangesList of length 6: >> $ENSAPMT00000020635 >> GRanges with 1 range and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] Group1 [745, 821] + >> >> $ENSAPMT00000003962 >> GRanges with 2 ranges and 0 elementMetadata values: >> seqnames ranges strand >> [1] Group1 [2946, 3009] - >> [2] Group1 [3155, 3331] - >> >> And to get the intronic sequence from "ENSAPMT00000020635" you can do: >> >> R> library(BSgenome.Amellifera.UCSC.apiMel2) >> R> getSeq(Amellifera, introns[[1]]) >> >> You can even get the sequences of all introns at once: >> >> R> all.introns<- unlist(introns) >> R> all.seqs<- getSeq(Amellifera, all.introns) >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6