Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.
1
0
Entering edit mode
Paul Geeleher ★ 1.3k
@paul-geeleher-2679
Last seen 10.3 years ago
Hi, I've been going through this RNA-seq use case (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) with some data I have and I'm wondering about section 2.4 where they calculate gene expression by counting the number of reads that alight to within the boundaries of a genes, then normalize these based on the length of the gene. Some of the code is as follows: dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% levels(seqnames(alnRanges))] head(dmGeneBounds, 3) dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) My question is, is this actually correct, could you publish using this method or is this just meant as a simple example? I'm interested in the ranks of the genes in the samples for a subsequent analysis, but I would have assumed that you'd have to count the number of reads that map to the EXONS of each gene and normalize by the length of the EXONS, rather then the gene itself? If this is the case I wonder if there a tutorial that shows how to do that... -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
• 1.4k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 22 months ago
United States
Hi, On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: > Hi, > > I've been going through this RNA-seq use case > (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) > with some data I have and I'm wondering about section 2.4 where they > calculate gene expression by counting the number of reads that alight > to within the boundaries of a genes, then normalize these based on the > length of the gene. Some of the code is as follows: > > dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) > dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% > levels(seqnames(alnRanges))] > head(dmGeneBounds, 3) > dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) > dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) > > My question is, is this actually correct, could you publish using this > method or is this just meant as a simple example? > > I'm interested in the ranks of the genes in the samples for a > subsequent analysis, but I would have assumed that you'd have to count > the number of reads that map to the EXONS of each gene and normalize > by the length of the EXONS, rather then the gene itself? > > If this is the case I wonder if there a tutorial that shows how to do that... If you get a bit comfortable with the GenomicFeatures package, you can do it pretty easily. Assume you have your reads stored in some GRanges object: txdb <- loadFeatures('/path/to/your/txdb') e <- exonsBy(txdb, 'gene') normedCounts <- lapply(e, function(x) { countOverlaps(reads, x) / sum(width(x)) }) normedCounts now has the number of reads falling in each exon per gene divided by the total exon-length of the gene. You can run with that example to calculate a "proper" RPKM ... -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Okay great, thanks Steve, this works nicely. It does seem to take a long (2 days!) time though but I guess this is to be expected! (I'm assuming this is why they used the simple version in the tutorial... Paul On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi, > > On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >> Hi, >> >> I've been going through this RNA-seq use case >> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >> with some data I have and I'm wondering about section 2.4 where they >> calculate gene expression by counting the number of reads that alight >> to within the boundaries of a genes, then normalize these based on the >> length of the gene. Some of the code is as follows: >> >> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >> levels(seqnames(alnRanges))] >> head(dmGeneBounds, 3) >> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >> >> My question is, is this actually correct, could you publish using this >> method or is this just meant as a simple example? >> >> I'm interested in the ranks of the genes in the samples for a >> subsequent analysis, but I would have assumed that you'd have to count >> the number of reads that map to the EXONS of each gene and normalize >> by the length of the EXONS, rather then the gene itself? >> >> If this is the case I wonder if there a tutorial that shows how to do that... > > If you get a bit comfortable with the GenomicFeatures package, you can > do it pretty easily. Assume you have your reads stored in some GRanges > object: > > txdb <- loadFeatures('/path/to/your/txdb') > e <- exonsBy(txdb, 'gene') > normedCounts <- lapply(e, function(x) { > ?countOverlaps(reads, x) / sum(width(x)) > }) > > normedCounts now has the number of reads falling in each exon per gene > divided by the total exon-length of the gene. > > You can run with that example to calculate a "proper" RPKM ... > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
ADD REPLY
0
Entering edit mode
Hi Paul, On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: > Okay great, thanks Steve, this works nicely. It does seem to take a > long (2 days!) time though but I guess this is to be expected! (I'm > assuming this is why they used the simple version in the tutorial... Wow ... that's ... a long time. What type of machine are you running this on, by the by? You know, I suspect it's so slow due to the slow speed involved in lapply-ing over GRangesList objects, which has been brought up before: http://thread.gmane.org/gmane.science.biology.informatics.conductor/30 564 If you do this sort of thing often, you might consider converting your GRangesList into a "normal" list of GRanges (or an appropriate IRangesList -- but you'd have to handle chromosome/strand info by yourself). I'll leave that as an exercise for the reader :-) -steve > > Paul > > On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> Hi, >> >> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>> Hi, >>> >>> I've been going through this RNA-seq use case >>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>> with some data I have and I'm wondering about section 2.4 where they >>> calculate gene expression by counting the number of reads that alight >>> to within the boundaries of a genes, then normalize these based on the >>> length of the gene. Some of the code is as follows: >>> >>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>> levels(seqnames(alnRanges))] >>> head(dmGeneBounds, 3) >>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>> >>> My question is, is this actually correct, could you publish using this >>> method or is this just meant as a simple example? >>> >>> I'm interested in the ranks of the genes in the samples for a >>> subsequent analysis, but I would have assumed that you'd have to count >>> the number of reads that map to the EXONS of each gene and normalize >>> by the length of the EXONS, rather then the gene itself? >>> >>> If this is the case I wonder if there a tutorial that shows how to do that... >> >> If you get a bit comfortable with the GenomicFeatures package, you can >> do it pretty easily. Assume you have your reads stored in some GRanges >> object: >> >> txdb <- loadFeatures('/path/to/your/txdb') >> e <- exonsBy(txdb, 'gene') >> normedCounts <- lapply(e, function(x) { >> ?countOverlaps(reads, x) / sum(width(x)) >> }) >> >> normedCounts now has the number of reads falling in each exon per gene >> divided by the total exon-length of the gene. >> >> You can run with that example to calculate a "proper" RPKM ... >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > -- > www.bioinformaticstutorials.com > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Running it on a high end cluster, I estimated 41 hours based on how long it took to do the first 10 genes. It's not a majorly big deal as I can presumably run it in parallel for the different samples. P On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi Paul, > > On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >> Okay great, thanks Steve, this works nicely. It does seem to take a >> long (2 days!) time though but I guess this is to be expected! (I'm >> assuming this is why they used the simple version in the tutorial... > > Wow ... that's ... a long time. > > What type of machine are you running this on, by the by? > > You know, I suspect it's so slow due to the slow speed involved in > lapply-ing over GRangesList objects, which has been brought up before: > http://thread.gmane.org/gmane.science.biology.informatics.conductor/ 30564 > > If you do this sort of thing often, you might consider converting your > GRangesList into a "normal" list of GRanges (or an appropriate > IRangesList -- but you'd have to handle chromosome/strand info by > yourself). I'll leave that as an exercise for the reader :-) > > -steve > > >> >> Paul >> >> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou >> <mailinglist.honeypot at="" gmail.com=""> wrote: >>> Hi, >>> >>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>> Hi, >>>> >>>> I've been going through this RNA-seq use case >>>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>>> with some data I have and I'm wondering about section 2.4 where they >>>> calculate gene expression by counting the number of reads that alight >>>> to within the boundaries of a genes, then normalize these based on the >>>> length of the gene. Some of the code is as follows: >>>> >>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>>> levels(seqnames(alnRanges))] >>>> head(dmGeneBounds, 3) >>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>>> >>>> My question is, is this actually correct, could you publish using this >>>> method or is this just meant as a simple example? >>>> >>>> I'm interested in the ranks of the genes in the samples for a >>>> subsequent analysis, but I would have assumed that you'd have to count >>>> the number of reads that map to the EXONS of each gene and normalize >>>> by the length of the EXONS, rather then the gene itself? >>>> >>>> If this is the case I wonder if there a tutorial that shows how to do that... >>> >>> If you get a bit comfortable with the GenomicFeatures package, you can >>> do it pretty easily. Assume you have your reads stored in some GRanges >>> object: >>> >>> txdb <- loadFeatures('/path/to/your/txdb') >>> e <- exonsBy(txdb, 'gene') >>> normedCounts <- lapply(e, function(x) { >>> ?countOverlaps(reads, x) / sum(width(x)) >>> }) >>> >>> normedCounts now has the number of reads falling in each exon per gene >>> divided by the total exon-length of the gene. >>> >>> You can run with that example to calculate a "proper" RPKM ... >>> >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> ?| Memorial Sloan-Kettering Cancer Center >>> ?| Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >> >> >> >> -- >> Paul Geeleher >> School of Mathematics, Statistics and Applied Mathematics >> National University of Ireland >> Galway >> Ireland >> -- >> www.bioinformaticstutorials.com >> > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
ADD REPLY
0
Entering edit mode
On 09/27/2010 07:28 AM, Paul Geeleher wrote: > Running it on a high end cluster, I estimated 41 hours based on how > long it took to do the first 10 genes. It's not a majorly big deal as > I can presumably run it in parallel for the different samples. > > P > > On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> Hi Paul, >> >> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>> Okay great, thanks Steve, this works nicely. It does seem to take a >>> long (2 days!) time though but I guess this is to be expected! (I'm >>> assuming this is why they used the simple version in the tutorial... >> >> Wow ... that's ... a long time. >> >> What type of machine are you running this on, by the by? >> >> You know, I suspect it's so slow due to the slow speed involved in >> lapply-ing over GRangesList objects, which has been brought up before: >> http://thread.gmane.org/gmane.science.biology.informatics.conductor /30564 'Slow' might be measured on the minutes rather than days time frame, unless you're running out of memory. Clusters often have relatively small nodes and so the approach is to work with smaller chunks on each node. If you've got lots of memory and it's still taking days, then posting (simplified but still slow) code might suggest some insights. We'd like to address the GRangesList performance issue in time for the forthcoming release. Martin >> >> If you do this sort of thing often, you might consider converting your >> GRangesList into a "normal" list of GRanges (or an appropriate >> IRangesList -- but you'd have to handle chromosome/strand info by >> yourself). I'll leave that as an exercise for the reader :-) >> >> -steve >> >> >>> >>> Paul >>> >>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou >>> <mailinglist.honeypot at="" gmail.com=""> wrote: >>>> Hi, >>>> >>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>>> Hi, >>>>> >>>>> I've been going through this RNA-seq use case >>>>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>>>> with some data I have and I'm wondering about section 2.4 where they >>>>> calculate gene expression by counting the number of reads that alight >>>>> to within the boundaries of a genes, then normalize these based on the >>>>> length of the gene. Some of the code is as follows: >>>>> >>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>>>> levels(seqnames(alnRanges))] >>>>> head(dmGeneBounds, 3) >>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>>>> >>>>> My question is, is this actually correct, could you publish using this >>>>> method or is this just meant as a simple example? >>>>> >>>>> I'm interested in the ranks of the genes in the samples for a >>>>> subsequent analysis, but I would have assumed that you'd have to count >>>>> the number of reads that map to the EXONS of each gene and normalize >>>>> by the length of the EXONS, rather then the gene itself? >>>>> >>>>> If this is the case I wonder if there a tutorial that shows how to do that... >>>> >>>> If you get a bit comfortable with the GenomicFeatures package, you can >>>> do it pretty easily. Assume you have your reads stored in some GRanges >>>> object: >>>> >>>> txdb <- loadFeatures('/path/to/your/txdb') >>>> e <- exonsBy(txdb, 'gene') >>>> normedCounts <- lapply(e, function(x) { >>>> countOverlaps(reads, x) / sum(width(x)) >>>> }) >>>> >>>> normedCounts now has the number of reads falling in each exon per gene >>>> divided by the total exon-length of the gene. >>>> >>>> You can run with that example to calculate a "proper" RPKM ... >>>> >>>> -steve >>>> >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> | Memorial Sloan-Kettering Cancer Center >>>> | Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>> >>> >>> >>> -- >>> Paul Geeleher >>> School of Mathematics, Statistics and Applied Mathematics >>> National University of Ireland >>> Galway >>> Ireland >>> -- >>> www.bioinformaticstutorials.com >>> >> >> >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > >
ADD REPLY
0
Entering edit mode
You can also do all of this using Genominator, including dealing with overlapping genes. There should be ample documentation in the vignettes, but you want to focus on summaryByAnnotation with the groupBy argument for aggregating from exon to gene level makeGeneRepresentation in order to filter out the part of the gene that overlaps another one Kasper On Mon, Sep 27, 2010 at 10:28 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: > Running it on a high end cluster, I estimated 41 hours based on how > long it took to do the first 10 genes. It's not a majorly big deal as > I can presumably run it in parallel for the different samples. > > P > > On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> Hi Paul, >> >> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>> Okay great, thanks Steve, this works nicely. It does seem to take a >>> long (2 days!) time though but I guess this is to be expected! (I'm >>> assuming this is why they used the simple version in the tutorial... >> >> Wow ... that's ... a long time. >> >> What type of machine are you running this on, by the by? >> >> You know, I suspect it's so slow due to the slow speed involved in >> lapply-ing over GRangesList objects, which has been brought up before: >> http://thread.gmane.org/gmane.science.biology.informatics.conductor /30564 >> >> If you do this sort of thing often, you might consider converting your >> GRangesList into a "normal" list of GRanges (or an appropriate >> IRangesList -- but you'd have to handle chromosome/strand info by >> yourself). I'll leave that as an exercise for the reader :-) >> >> -steve >> >> >>> >>> Paul >>> >>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou >>> <mailinglist.honeypot at="" gmail.com=""> wrote: >>>> Hi, >>>> >>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>>> Hi, >>>>> >>>>> I've been going through this RNA-seq use case >>>>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>>>> with some data I have and I'm wondering about section 2.4 where they >>>>> calculate gene expression by counting the number of reads that alight >>>>> to within the boundaries of a genes, then normalize these based on the >>>>> length of the gene. Some of the code is as follows: >>>>> >>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>>>> levels(seqnames(alnRanges))] >>>>> head(dmGeneBounds, 3) >>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>>>> >>>>> My question is, is this actually correct, could you publish using this >>>>> method or is this just meant as a simple example? >>>>> >>>>> I'm interested in the ranks of the genes in the samples for a >>>>> subsequent analysis, but I would have assumed that you'd have to count >>>>> the number of reads that map to the EXONS of each gene and normalize >>>>> by the length of the EXONS, rather then the gene itself? >>>>> >>>>> If this is the case I wonder if there a tutorial that shows how to do that... >>>> >>>> If you get a bit comfortable with the GenomicFeatures package, you can >>>> do it pretty easily. Assume you have your reads stored in some GRanges >>>> object: >>>> >>>> txdb <- loadFeatures('/path/to/your/txdb') >>>> e <- exonsBy(txdb, 'gene') >>>> normedCounts <- lapply(e, function(x) { >>>> ?countOverlaps(reads, x) / sum(width(x)) >>>> }) >>>> >>>> normedCounts now has the number of reads falling in each exon per gene >>>> divided by the total exon-length of the gene. >>>> >>>> You can run with that example to calculate a "proper" RPKM ... >>>> >>>> -steve >>>> >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> ?| Memorial Sloan-Kettering Cancer Center >>>> ?| Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>> >>> >>> >>> -- >>> Paul Geeleher >>> School of Mathematics, Statistics and Applied Mathematics >>> National University of Ireland >>> Galway >>> Ireland >>> -- >>> www.bioinformaticstutorials.com >>> >> >> >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > -- > www.bioinformaticstutorials.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
I guess my next question is if there is a convenient way of filtering out the part of the gene that overlaps another one using GenomicRanges? It'd be a shame to have to rewrite all my code using Python or Genominator. Paul. On Mon, Sep 27, 2010 at 6:59 PM, Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com=""> wrote: > You can also do all of this using Genominator, including dealing with > overlapping genes. ?There should be ample documentation in the > vignettes, but you want to focus on > ?summaryByAnnotation with the groupBy argument for aggregating from > exon to gene level > ?makeGeneRepresentation in order to filter out the part of the gene > that overlaps another one > > Kasper > > On Mon, Sep 27, 2010 at 10:28 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >> Running it on a high end cluster, I estimated 41 hours based on how >> long it took to do the first 10 genes. It's not a majorly big deal as >> I can presumably run it in parallel for the different samples. >> >> P >> >> On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou >> <mailinglist.honeypot at="" gmail.com=""> wrote: >>> Hi Paul, >>> >>> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>> Okay great, thanks Steve, this works nicely. It does seem to take a >>>> long (2 days!) time though but I guess this is to be expected! (I'm >>>> assuming this is why they used the simple version in the tutorial... >>> >>> Wow ... that's ... a long time. >>> >>> What type of machine are you running this on, by the by? >>> >>> You know, I suspect it's so slow due to the slow speed involved in >>> lapply-ing over GRangesList objects, which has been brought up before: >>> http://thread.gmane.org/gmane.science.biology.informatics.conducto r/30564 >>> >>> If you do this sort of thing often, you might consider converting your >>> GRangesList into a "normal" list of GRanges (or an appropriate >>> IRangesList -- but you'd have to handle chromosome/strand info by >>> yourself). I'll leave that as an exercise for the reader :-) >>> >>> -steve >>> >>> >>>> >>>> Paul >>>> >>>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou >>>> <mailinglist.honeypot at="" gmail.com=""> wrote: >>>>> Hi, >>>>> >>>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>>>> Hi, >>>>>> >>>>>> I've been going through this RNA-seq use case >>>>>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>>>>> with some data I have and I'm wondering about section 2.4 where they >>>>>> calculate gene expression by counting the number of reads that alight >>>>>> to within the boundaries of a genes, then normalize these based on the >>>>>> length of the gene. Some of the code is as follows: >>>>>> >>>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>>>>> levels(seqnames(alnRanges))] >>>>>> head(dmGeneBounds, 3) >>>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>>>>> >>>>>> My question is, is this actually correct, could you publish using this >>>>>> method or is this just meant as a simple example? >>>>>> >>>>>> I'm interested in the ranks of the genes in the samples for a >>>>>> subsequent analysis, but I would have assumed that you'd have to count >>>>>> the number of reads that map to the EXONS of each gene and normalize >>>>>> by the length of the EXONS, rather then the gene itself? >>>>>> >>>>>> If this is the case I wonder if there a tutorial that shows how to do that... >>>>> >>>>> If you get a bit comfortable with the GenomicFeatures package, you can >>>>> do it pretty easily. Assume you have your reads stored in some GRanges >>>>> object: >>>>> >>>>> txdb <- loadFeatures('/path/to/your/txdb') >>>>> e <- exonsBy(txdb, 'gene') >>>>> normedCounts <- lapply(e, function(x) { >>>>> ?countOverlaps(reads, x) / sum(width(x)) >>>>> }) >>>>> >>>>> normedCounts now has the number of reads falling in each exon per gene >>>>> divided by the total exon-length of the gene. >>>>> >>>>> You can run with that example to calculate a "proper" RPKM ... >>>>> >>>>> -steve >>>>> >>>>> -- >>>>> Steve Lianoglou >>>>> Graduate Student: Computational Systems Biology >>>>> ?| Memorial Sloan-Kettering Cancer Center >>>>> ?| Weill Medical College of Cornell University >>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>>> >>>> >>>> >>>> >>>> -- >>>> Paul Geeleher >>>> School of Mathematics, Statistics and Applied Mathematics >>>> National University of Ireland >>>> Galway >>>> Ireland >>>> -- >>>> www.bioinformaticstutorials.com >>>> >>> >>> >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> ?| Memorial Sloan-Kettering Cancer Center >>> ?| Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >> >> >> >> -- >> Paul Geeleher >> School of Mathematics, Statistics and Applied Mathematics >> National University of Ireland >> Galway >> Ireland >> -- >> www.bioinformaticstutorials.com >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
ADD REPLY
0
Entering edit mode
On 09/28/2010 08:09 AM, Paul Geeleher wrote: > I guess my next question is if there is a convenient way of filtering > out the part of the gene > that overlaps another one using GenomicRanges? The ingredients are the 'type' argument to countOverlaps and the ranged based operations, nicely summarized on slide 14 and later of http://bioconductor.org/help/course- materials/2009/SeattleNov09/IRanges/ Two examples (I would not call myself an IRanges expert) that do not directly address your question but might be helpful anyway: library(GenomicRanges) ## ## Example 1 ## ## two 40mer reads, at 1070 and 2070 query <- GRanges("chr1", IRanges(c(1070, 2070) , width=40)) ## three genes spanning 280 nt at 1000 (2 exons), 2000, 3000 subject <- GRangesList( GRanges("chr1", IRanges(c(1000, 1200), width=80)), GRanges("chr1", IRanges(2000, 2279)), GRanges("chr1", IRanges(3000, 3279))) ## simple count o <- findOverlaps(query, subject) (n0 <- tabulate(subjectHits(o), length(subject))) ## reads aligning to gaps don't count gap <- endoapply(subject, function(x) gaps(x, min(start(x)))) o <- findOverlaps(query, gap) (n1 <- n0 - tabulate(subjectHits(o), length(subject))) ## ## Example 2 ## ## two 40mer reads, at 1070 and 2070 query <- GRanges("chr1", IRanges(c(1070, 2070) , width=40)) ## two overlapping genes, spanning 280 nt at 1000 (2 exons), 1080 subject <- GRangesList( GRanges("chr1", IRanges(c(1000, 1200), width=80), "+"), GRanges("chr1", IRanges(1080, width=280), "-")) ## simple count -- read counted twice (probably not a good idea) o <- findOverlaps(query, subject) (n2 <- tabulate(subjectHits(o), length(subject))) ## simple count -- read divided (equally) between matches o <- findOverlaps(query, subject) nqhits <- tabulate(queryHits(o), length(query)) wt <- queryHits(o) / nqhits[queryHits(o)] wtHits <- lapply(split(wt, subjectHits(o)), sum) (n3 <- local({ x <- numeric(length(subject)) x[as.integer(names(wtHits))] <- as.numeric(wtHits) x })) Martin > > It'd be a shame to have to rewrite all my code using Python or Genominator. > > Paul. > > On Mon, Sep 27, 2010 at 6:59 PM, Kasper Daniel Hansen > <kasperdanielhansen at="" gmail.com=""> wrote: >> You can also do all of this using Genominator, including dealing with >> overlapping genes. There should be ample documentation in the >> vignettes, but you want to focus on >> summaryByAnnotation with the groupBy argument for aggregating from >> exon to gene level >> makeGeneRepresentation in order to filter out the part of the gene >> that overlaps another one >> >> Kasper >> >> On Mon, Sep 27, 2010 at 10:28 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>> Running it on a high end cluster, I estimated 41 hours based on how >>> long it took to do the first 10 genes. It's not a majorly big deal as >>> I can presumably run it in parallel for the different samples. >>> >>> P >>> >>> On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou >>> <mailinglist.honeypot at="" gmail.com=""> wrote: >>>> Hi Paul, >>>> >>>> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>>> Okay great, thanks Steve, this works nicely. It does seem to take a >>>>> long (2 days!) time though but I guess this is to be expected! (I'm >>>>> assuming this is why they used the simple version in the tutorial... >>>> >>>> Wow ... that's ... a long time. >>>> >>>> What type of machine are you running this on, by the by? >>>> >>>> You know, I suspect it's so slow due to the slow speed involved in >>>> lapply-ing over GRangesList objects, which has been brought up before: >>>> http://thread.gmane.org/gmane.science.biology.informatics.conduct or/30564 >>>> >>>> If you do this sort of thing often, you might consider converting your >>>> GRangesList into a "normal" list of GRanges (or an appropriate >>>> IRangesList -- but you'd have to handle chromosome/strand info by >>>> yourself). I'll leave that as an exercise for the reader :-) >>>> >>>> -steve >>>> >>>> >>>>> >>>>> Paul >>>>> >>>>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou >>>>> <mailinglist.honeypot at="" gmail.com=""> wrote: >>>>>> Hi, >>>>>> >>>>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I've been going through this RNA-seq use case >>>>>>> (http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf) >>>>>>> with some data I have and I'm wondering about section 2.4 where they >>>>>>> calculate gene expression by counting the number of reads that alight >>>>>>> to within the boundaries of a genes, then normalize these based on the >>>>>>> length of the gene. Some of the code is as follows: >>>>>>> >>>>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb) >>>>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in% >>>>>>> levels(seqnames(alnRanges))] >>>>>>> head(dmGeneBounds, 3) >>>>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges) >>>>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds) >>>>>>> >>>>>>> My question is, is this actually correct, could you publish using this >>>>>>> method or is this just meant as a simple example? >>>>>>> >>>>>>> I'm interested in the ranks of the genes in the samples for a >>>>>>> subsequent analysis, but I would have assumed that you'd have to count >>>>>>> the number of reads that map to the EXONS of each gene and normalize >>>>>>> by the length of the EXONS, rather then the gene itself? >>>>>>> >>>>>>> If this is the case I wonder if there a tutorial that shows how to do that... >>>>>> >>>>>> If you get a bit comfortable with the GenomicFeatures package, you can >>>>>> do it pretty easily. Assume you have your reads stored in some GRanges >>>>>> object: >>>>>> >>>>>> txdb <- loadFeatures('/path/to/your/txdb') >>>>>> e <- exonsBy(txdb, 'gene') >>>>>> normedCounts <- lapply(e, function(x) { >>>>>> countOverlaps(reads, x) / sum(width(x)) >>>>>> }) >>>>>> >>>>>> normedCounts now has the number of reads falling in each exon per gene >>>>>> divided by the total exon-length of the gene. >>>>>> >>>>>> You can run with that example to calculate a "proper" RPKM ... >>>>>> >>>>>> -steve >>>>>> >>>>>> -- >>>>>> Steve Lianoglou >>>>>> Graduate Student: Computational Systems Biology >>>>>> | Memorial Sloan-Kettering Cancer Center >>>>>> | Weill Medical College of Cornell University >>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Paul Geeleher >>>>> School of Mathematics, Statistics and Applied Mathematics >>>>> National University of Ireland >>>>> Galway >>>>> Ireland >>>>> -- >>>>> www.bioinformaticstutorials.com >>>>> >>>> >>>> >>>> >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> | Memorial Sloan-Kettering Cancer Center >>>> | Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>> >>> >>> >>> -- >>> Paul Geeleher >>> School of Mathematics, Statistics and Applied Mathematics >>> National University of Ireland >>> Galway >>> Ireland >>> -- >>> www.bioinformaticstutorials.com >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > >
ADD REPLY
0
Entering edit mode
Hi I was never too happy about what we wrote there in Lab 8. Paul has pointed out one major issue. The other is overlapping genes: Standard RNA-Seq does not recover information about which strand a transcript is from, and in more crowded genomes, it does happen not that rarely that exons of two different genes on opposite strands overlap. The code in the lab does not address this. If genes A and B overlap, then every read that maps onto this overlap will be counted for both genes. If now gene A is differentially expressed and gene B is not, then the extra counts from gene A that get counted for gene B as well might cause gene B to be called differentially expressed, too. All this is not likely to have large effects on results, but it would be nicer to do it properly. As you have already noticed, it is not exactly trivial to code something like this in a correct and efficient manner in R. At least I think so. I'm sure the IRanges gurus on the list will now jump on me with examples on how easy it would have been, but I found it much easier to code this in Python. The script I made for this purpose is available at http://www-huber.embl.de/users/anders/HTSeq/doc/count.html It is actually part of a larger framework to make coding such stuff in Python easy. Have a look: http://www-huber.embl.de/users/anders/HTSeq Simon +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sanders at fs.tum.de
ADD REPLY

Login before adding your answer.

Traffic: 550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6