Rsubread vs. BWA, Bowtie, etc. and RPKM vs. normalized counts

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 3.6 years ago

United States

A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I want to align them, and I couldn't really make heads or tails of the options, so I listened to what Phil Green told me at a conference and looked around for a sensible word-nucleated aligner like he described. It seems that Rsubread works this way? http://sourceforge.net/projects/subread/ I would like BAM files as intermediate output, but my real interest is differential exon usage in differentiating cells. Given that the reads I have to align are relatively short (36bp, SE), is there an advantage or disadvantage in using subread compared to other options? And when I'm done trimming and aligning, I could choose raw counts, conditional quantile normalized counts, or something like RPKM to summarize how often a given exon seems to have been transcribed. I read this: http://seqanswers.com/forums/showthread.php?t=586 and I see that packages using a Gamma prior for the dispersion of a Poisson count model benefit from having raw counts. If I am after correlated changes in exon usage depending on other sequence features, is it reasonable to use (say) 'cqn' on the raw counts, then log-transform and work with those normalized counts? Thanks for any suggestions, -- Tim Triche, Jr. USC Biostatistics [[alternative HTML version deleted]]

• 3.3k views

ADD COMMENT • link updated 12.5 years ago by Wei Shi ★ 3.6k • written 12.6 years ago by Tim Triche ★ 4.2k

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

Hi, Tim. For aligning RNA-seq reads, the major questions include: 1) Paired-end or not? 2) Interested in junction reads? 3) Allow indels or not? 4) Interested in known transcripts/junctions or also interested in novel transcripts? If you answer those questions, you can usually limit your choices significantly. As for stats, consider following the DEseq or edgeR workflows. Both rely on raw counts and can perform exon usage statistics as well as potentially using junction reads for similar calculations. Sean On Sun, Oct 9, 2011 at 8:07 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote: > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I want > to align them, and I couldn't really make heads or tails of the options, so > I listened to what Phil Green told me at a conference and looked around for > a sensible word-nucleated aligner like he described. ?It seems that Rsubread > works this way? > > http://sourceforge.net/projects/subread/ > > I would like BAM files as intermediate output, but my real interest is > differential exon usage in differentiating cells. ?Given that the reads I > have to align are relatively short (36bp, SE), is there an advantage or > disadvantage in using subread compared to other options? ?And when I'm done > trimming and aligning, I could choose raw counts, conditional quantile > normalized counts, or something like RPKM to summarize how often a given > exon seems to have been transcribed. ?I read this: > > http://seqanswers.com/forums/showthread.php?t=586 > > and I see that packages using a Gamma prior for the dispersion of a Poisson > count model benefit from having raw counts. ?If I am after correlated > changes in exon usage depending on other sequence features, is it reasonable > to use (say) 'cqn' on the raw counts, then log-transform and work with those > normalized counts? > > Thanks for any suggestions, > > -- > Tim Triche, Jr. > USC Biostatistics > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 12.6 years ago Sean Davis 21k

0

Entering edit mode

1) single end sequencing, 36 cycles on an Illumina Genome Analyzer (not sure what kind) 2) Probably will be interested in junction reads later but don't know what to do with them yet. I want exons to fall into nice little mail-slots so that I can model their counts and the within/between correlations with other genomic features, even though I am becoming aware that this is not how it actually works. 3) These are pooled samples so I think I have to allow for indels. Unpooled data is coming, but I won't have to align that. 4) Mostly interested in known transcripts, but I realized that I don't see any poly-A selection here, so maybe ncRNA too? Although I imagine that 36bp reads are too short to effectively do that? (Triplicates of similar data are being generated; really I just want a starting point for comparing some affected samples to, at high enough resolution to match their data) On Sun, Oct 9, 2011 at 5:33 PM, Sean Davis <sdavis2@mail.nih.gov> wrote: > Hi, Tim. > > For aligning RNA-seq reads, the major questions include: > > 1) Paired-end or not? > 2) Interested in junction reads? > 3) Allow indels or not? > 4) Interested in known transcripts/junctions or also interested in > novel transcripts? > > If you answer those questions, you can usually limit your choices > significantly. > > As for stats, consider following the DEseq or edgeR workflows. Both > rely on raw counts and can perform exon usage statistics as well as > potentially using junction reads for similar calculations. > > Sean > > On Sun, Oct 9, 2011 at 8:07 PM, Tim Triche, Jr. <tim.triche@gmail.com> > wrote: > > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I > want > > to align them, and I couldn't really make heads or tails of the options, > so > > I listened to what Phil Green told me at a conference and looked around > for > > a sensible word-nucleated aligner like he described. It seems that > Rsubread > > works this way? > > > > http://sourceforge.net/projects/subread/ > > > > I would like BAM files as intermediate output, but my real interest is > > differential exon usage in differentiating cells. Given that the reads I > > have to align are relatively short (36bp, SE), is there an advantage or > > disadvantage in using subread compared to other options? And when I'm > done > > trimming and aligning, I could choose raw counts, conditional quantile > > normalized counts, or something like RPKM to summarize how often a given > > exon seems to have been transcribed. I read this: > > > > http://seqanswers.com/forums/showthread.php?t=586 > > > > and I see that packages using a Gamma prior for the dispersion of a > Poisson > > count model benefit from having raw counts. If I am after correlated > > changes in exon usage depending on other sequence features, is it > reasonable > > to use (say) 'cqn' on the raw counts, then log-transform and work with > those > > normalized counts? > > > > Thanks for any suggestions, > > > > -- > > Tim Triche, Jr. > > USC Biostatistics > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is. John von Neumann<http: www-groups.dcs.st-="" and.ac.uk="" ~history="" biographies="" von_neumann.html=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.6 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 16 days ago

Australia/Melbourne/Olivia Newton-John …

Dear Tim, Yes, Subread performs read mapping by mapping a set of 16 mers extracted from each read to the genome, counting the number of mapped 16 mers at each candidate location and then choosing the one which is mapped to by the majority of the 16 mers as the mapping location of the read. The fundamental difference between Subread and other aligners is that it uses a voting method rather than a extension method to determine the mapping locations of the reads, which makes it a lot faster and more sensitive. Subread has a both a C version and an R version. The C version is freely available from sourceforge and the R version is included in the Rsubread package in Bioc. The Rsubread package also includes a function called featureCounts, which can be used to count the number of reads for each exon or gene. So this function will be useful for you to look at the differential expression at both gene level and exon level. Another function which might be useful for your data analysis is the subjunc funtion, which is designed to discover exon junctions. Subjunc uses an idea similar to that of Subread. Our preliminary results showed that subjunc outperformed competing junction detectors in terms of speed, sensitivity and accuracy. The devel version of Rsubread package includes a lot of our recent development for both Subread aligner and Subjunc junction detector, so I would recommend using the devel version if you want to try the Rsubread package. Hope this helps. Cheers, Wei > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I > want > to align them, and I couldn't really make heads or tails of the options, > so > I listened to what Phil Green told me at a conference and looked around > for > a sensible word-nucleated aligner like he described. It seems that > Rsubread > works this way? > > http://sourceforge.net/projects/subread/ > > I would like BAM files as intermediate output, but my real interest is > differential exon usage in differentiating cells. Given that the reads I > have to align are relatively short (36bp, SE), is there an advantage or > disadvantage in using subread compared to other options? And when I'm > done > trimming and aligning, I could choose raw counts, conditional quantile > normalized counts, or something like RPKM to summarize how often a given > exon seems to have been transcribed. I read this: > > http://seqanswers.com/forums/showthread.php?t=586 > > and I see that packages using a Gamma prior for the dispersion of a > Poisson > count model benefit from having raw counts. If I am after correlated > changes in exon usage depending on other sequence features, is it > reasonable > to use (say) 'cqn' on the raw counts, then log-transform and work with > those > normalized counts? > > Thanks for any suggestions, > > -- > Tim Triche, Jr. > USC Biostatistics > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Oh, cool. So I pulled the devel version, indexed hg19 with 7400MB of RAM, and went to align my reads... 29813949 reads were processed in 716.9 seconds. Percentage of mapped reads is 41.04%. That was with indels=1. I am told that this is fairly low, as far as mapped counts go. These are 36-base single-end reads. One piece of advice was "run TopHat on it and see if it aligns any more of the reads", so I'm doing that, but as you pointed out, a very nice feature of Rsubread for me is featureCounts. It's also really fast, so if I can get away with using it, I am thinking that I will do so from now on... Thanks for a fast and easy to use alignment package, --Tim On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi@wehi.edu.au> wrote: > Dear Tim, > > Yes, Subread performs read mapping by mapping a set of 16 mers extracted > from each read to the genome, counting the number of mapped 16 mers at > each candidate location and then choosing the one which is mapped to by > the majority of the 16 mers as the mapping location of the read. The > fundamental difference between Subread and other aligners is that it uses > a voting method rather than a extension method to determine the mapping > locations of the reads, which makes it a lot faster and more sensitive. > > Subread has a both a C version and an R version. The C version is freely > available from sourceforge and the R version is included in the Rsubread > package in Bioc. > > The Rsubread package also includes a function called featureCounts, which > can be used to count the number of reads for each exon or gene. So this > function will be useful for you to look at the differential expression at > both gene level and exon level. > > Another function which might be useful for your data analysis is the > subjunc funtion, which is designed to discover exon junctions. Subjunc > uses an idea similar to that of Subread. Our preliminary results showed > that subjunc outperformed competing junction detectors in terms of speed, > sensitivity and accuracy. > > The devel version of Rsubread package includes a lot of our recent > development for both Subread aligner and Subjunc junction detector, so I > would recommend using the devel version if you want to try the Rsubread > package. > > > Hope this helps. > > Cheers, > Wei > > > > > > > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I > > want > > to align them, and I couldn't really make heads or tails of the options, > > so > > I listened to what Phil Green told me at a conference and looked around > > for > > a sensible word-nucleated aligner like he described. It seems that > > Rsubread > > works this way? > > > > http://sourceforge.net/projects/subread/ > > > > I would like BAM files as intermediate output, but my real interest is > > differential exon usage in differentiating cells. Given that the reads I > > have to align are relatively short (36bp, SE), is there an advantage or > > disadvantage in using subread compared to other options? And when I'm > > done > > trimming and aligning, I could choose raw counts, conditional quantile > > normalized counts, or something like RPKM to summarize how often a given > > exon seems to have been transcribed. I read this: > > > > http://seqanswers.com/forums/showthread.php?t=586 > > > > and I see that packages using a Gamma prior for the dispersion of a > > Poisson > > count model benefit from having raw counts. If I am after correlated > > changes in exon usage depending on other sequence features, is it > > reasonable > > to use (say) 'cqn' on the raw counts, then log-transform and work with > > those > > normalized counts? > > > > Thanks for any suggestions, > > > > -- > > Tim Triche, Jr. > > USC Biostatistics > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:17}}

ADD REPLY • link 12.5 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim, For the reads of 36bp long, Subread extracts seven 16 mers from each read. The default cutoff for calling a hit is to require at least three consensus 16 mers (mapped to the same location). But this setting assumes that ten 16 mers are extracted from the read. For your data, you should use TH1=2 rather than the default setting, because the reads are very short and you can not extract ten 16 mers from each of them. On the other hand, you might check the sequencing quality of your data. If the quality is poor, it will be hard to achieve a high mapping percentage without losing mapping accuracy. The qualityScores function in Rsubread can help you to retrieve the quality scores from your data. Cheers, Wei > Oh, cool. So I pulled the devel version, indexed hg19 with 7400MB of RAM, > and went to align my reads... > > 29813949 reads were processed in 716.9 seconds. > Percentage of mapped reads is 41.04%. > > That was with indels=1. I am told that this is fairly low, as far as > mapped > counts go. These are 36-base single-end reads. One piece of advice was > "run TopHat on it and see if it aligns any more of the reads", so I'm > doing > that, but as you pointed out, a very nice feature of Rsubread for me is > featureCounts. It's also really fast, so if I can get away with using it, > I > am thinking that I will do so from now on... > > Thanks for a fast and easy to use alignment package, > > --Tim > > On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Dear Tim, >> >> Yes, Subread performs read mapping by mapping a set of 16 mers extracted >> from each read to the genome, counting the number of mapped 16 mers at >> each candidate location and then choosing the one which is mapped to by >> the majority of the 16 mers as the mapping location of the read. The >> fundamental difference between Subread and other aligners is that it >> uses >> a voting method rather than a extension method to determine the mapping >> locations of the reads, which makes it a lot faster and more sensitive. >> >> Subread has a both a C version and an R version. The C version is freely >> available from sourceforge and the R version is included in the Rsubread >> package in Bioc. >> >> The Rsubread package also includes a function called featureCounts, >> which >> can be used to count the number of reads for each exon or gene. So this >> function will be useful for you to look at the differential expression >> at >> both gene level and exon level. >> >> Another function which might be useful for your data analysis is the >> subjunc funtion, which is designed to discover exon junctions. Subjunc >> uses an idea similar to that of Subread. Our preliminary results showed >> that subjunc outperformed competing junction detectors in terms of >> speed, >> sensitivity and accuracy. >> >> The devel version of Rsubread package includes a lot of our recent >> development for both Subread aligner and Subjunc junction detector, so I >> would recommend using the devel version if you want to try the Rsubread >> package. >> >> >> Hope this helps. >> >> Cheers, >> Wei >> >> >> >> >> >> > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and >> I >> > want >> > to align them, and I couldn't really make heads or tails of the >> options, >> > so >> > I listened to what Phil Green told me at a conference and looked >> around >> > for >> > a sensible word-nucleated aligner like he described. It seems that >> > Rsubread >> > works this way? >> > >> > http://sourceforge.net/projects/subread/ >> > >> > I would like BAM files as intermediate output, but my real interest is >> > differential exon usage in differentiating cells. Given that the >> reads I >> > have to align are relatively short (36bp, SE), is there an advantage >> or >> > disadvantage in using subread compared to other options? And when I'm >> > done >> > trimming and aligning, I could choose raw counts, conditional quantile >> > normalized counts, or something like RPKM to summarize how often a >> given >> > exon seems to have been transcribed. I read this: >> > >> > http://seqanswers.com/forums/showthread.php?t=586 >> > >> > and I see that packages using a Gamma prior for the dispersion of a >> > Poisson >> > count model benefit from having raw counts. If I am after correlated >> > changes in exon usage depending on other sequence features, is it >> > reasonable >> > to use (say) 'cqn' on the raw counts, then log-transform and work with >> > those >> > normalized counts? >> > >> > Thanks for any suggestions, >> > >> > -- >> > Tim Triche, Jr. >> > USC Biostatistics >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the >> addressee. >> You must not disclose, forward, print or use it without the permission >> of >> the sender. >> ______________________________________________________________________ >> > > > > -- > If people do not believe that mathematics is simple, > it is only because they do not realize how complicated life is. John von > Neumann<http: www-groups.dcs.st-="" and.ac.uk="" ~history="" biographies="" von_neumann.html=""> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 12.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Tim, The featureCounts() function in Rsubread counts the number of reads mapped to genes or exons and returns a list which includes a read count matrix and annotation information. The returned list object can be easily transformed to a DGEList object which can then be fed into downstream expression analysis packages like edgeR or limma. The Exact tests or GLM likelihood ratio tests in edgeR, or voom() function in limma, can then be used to perform differential expression analysis for the RNA-seq data. We have established a pipeline to perform complete analysis for RNA- seq data using R packages including Rsubread, edgeR and limma. This pipeline starts with the FASTQ/FASTA files, which is the output of next-generation sequencers, through to read alignment, read count summarization for genes or exons, and to biological variation estimation and statistical testing of differential expression, and finally yields lists of differentially expressed genes. The repertoire of functions included in limma and edgeR can be readily used for further downstream analysis such as gene set tests and so on. This is a quick and easy-to-use pipeline and it is now being used in our routine RNA-seq data analysis. Hope you will find it useful. Cheers, Wei On Oct 11, 2011, at 9:32 AM, Tim Triche, Jr. wrote: > but as you pointed out, a very nice feature of Rsubread for me is featureCounts. It's also really fast, so if I can get away with using it, I am thinking that I will do so from now on... > > Thanks for a fast and easy to use alignment package, > > --Tim > > On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi@wehi.edu.au> wrote: > Dear Tim, > > Yes, Subread performs read mapping by mapping a set of 16 mers extracted > from each read to the genome, counting the number of mapped 16 mers at > each candidate location and then choosing the one which is mapped to by > the majority of the 16 mers as the mapping location of the read. The > fundamental difference between Subread and other aligners is that it uses > a voting method rather than a extension method to determine the mapping > locations of the reads, which makes it a lot faster and more sensitive. > > Subread has a both a C version and an R version. The C version is freely > available from sourceforge and the R version is included in the Rsubread > package in Bioc. > > The Rsubread package also includes a function called featureCounts, which > can be used to count the number of reads for each exon or gene. So this > function will be useful for you to look at the differential expression at > both gene level and exon level. > > Another function which might be useful for your data analysis is the > subjunc funtion, which is designed to discover exon junctions. Subjunc > uses an idea similar to that of Subread. Our preliminary results showed > that subjunc outperformed competing junction detectors in terms of speed, > sensitivity and accuracy. > > The devel version of Rsubread package includes a lot of our recent > development for both Subread aligner and Subjunc junction detector, so I > would recommend using the devel version if you want to try the Rsubread > package. > > > Hope this helps. > > Cheers, > Wei > > > > > > > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I > > want > > to align them, and I couldn't really make heads or tails of the options, > > so > > I listened to what Phil Green told me at a conference and looked around > > for > > a sensible word-nucleated aligner like he described. It seems that > > Rsubread > > works this way? > > > > http://sourceforge.net/projects/subread/ > > > > I would like BAM files as intermediate output, but my real interest is > > differential exon usage in differentiating cells. Given that the reads I > > have to align are relatively short (36bp, SE), is there an advantage or > > disadvantage in using subread compared to other options? And when I'm > > done > > trimming and aligning, I could choose raw counts, conditional quantile > > normalized counts, or something like RPKM to summarize how often a given > > exon seems to have been transcribed. I read this: > > > > http://seqanswers.com/forums/showthread.php?t=586 > > > > and I see that packages using a Gamma prior for the dispersion of a > > Poisson > > count model benefit from having raw counts. If I am after correlated > > changes in exon usage depending on other sequence features, is it > > reasonable > > to use (say) 'cqn' on the raw counts, then log-transform and work with > > those > > normalized counts? > > > > Thanks for any suggestions, > > > > -- > > Tim Triche, Jr. > > USC Biostatistics > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:25}}

ADD REPLY • link 12.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Cool! I started using featureCounts() by 'gene' and by 'exon' almost immediately. However, when we run the same RNA-seq data through our local Bowtie/Tophat pipeline, we find that one of the samples has extensive PCR contamination (lots and lots of duplicate alignments). Is there a simple way to detect the same in Rsubread? The latter is MUCH faster... Thanks again for a great package and a very fast aligner. --t On Wed, Oct 12, 2011 at 10:39 PM, Wei Shi <shi@wehi.edu.au> wrote: > Hi Tim, > > The featureCounts() function in Rsubread counts the number of reads mapped > to genes or exons and returns a list which includes a read count matrix and > annotation information. The returned list object can be easily transformed > to a DGEList object which can then be fed into downstream expression > analysis packages like edgeR or limma. The Exact tests or GLM likelihood > ratio tests in edgeR, or voom() function in limma, can then be used to > perform differential expression analysis for the RNA-seq data. > > We have established a pipeline to perform complete analysis for RNA- seq > data using R packages including Rsubread, edgeR and limma. This pipeline > starts with the FASTQ/FASTA files, which is the output of next- generation > sequencers, through to read alignment, read count summarization for genes or > exons, and to biological variation estimation and statistical testing of > differential expression, and finally yields lists of differentially > expressed genes. The repertoire of functions included in limma and edgeR can > be readily used for further downstream analysis such as gene set tests and > so on. > > This is a quick and easy-to-use pipeline and it is now being used in our > routine RNA-seq data analysis. Hope you will find it useful. > > Cheers, > Wei > > On Oct 11, 2011, at 9:32 AM, Tim Triche, Jr. wrote: > > but as you pointed out, a very nice feature of Rsubread for me is > featureCounts. It's also really fast, so if I can get away with using it, I > am thinking that I will do so from now on... > > Thanks for a fast and easy to use alignment package, > > --Tim > > On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi@wehi.edu.au> wrote: > >> Dear Tim, >> >> Yes, Subread performs read mapping by mapping a set of 16 mers extracted >> from each read to the genome, counting the number of mapped 16 mers at >> each candidate location and then choosing the one which is mapped to by >> the majority of the 16 mers as the mapping location of the read. The >> fundamental difference between Subread and other aligners is that it uses >> a voting method rather than a extension method to determine the mapping >> locations of the reads, which makes it a lot faster and more sensitive. >> >> Subread has a both a C version and an R version. The C version is freely >> available from sourceforge and the R version is included in the Rsubread >> package in Bioc. >> >> The Rsubread package also includes a function called featureCounts, which >> can be used to count the number of reads for each exon or gene. So this >> function will be useful for you to look at the differential expression at >> both gene level and exon level. >> >> Another function which might be useful for your data analysis is the >> subjunc funtion, which is designed to discover exon junctions. Subjunc >> uses an idea similar to that of Subread. Our preliminary results showed >> that subjunc outperformed competing junction detectors in terms of speed, >> sensitivity and accuracy. >> >> The devel version of Rsubread package includes a lot of our recent >> development for both Subread aligner and Subjunc junction detector, so I >> would recommend using the devel version if you want to try the Rsubread >> package. >> >> >> Hope this helps. >> >> Cheers, >> Wei >> >> >> >> >> >> > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I >> > want >> > to align them, and I couldn't really make heads or tails of the options, >> > so >> > I listened to what Phil Green told me at a conference and looked around >> > for >> > a sensible word-nucleated aligner like he described. It seems that >> > Rsubread >> > works this way? >> > >> > http://sourceforge.net/projects/subread/ >> > >> > I would like BAM files as intermediate output, but my real interest is >> > differential exon usage in differentiating cells. Given that the reads >> I >> > have to align are relatively short (36bp, SE), is there an advantage or >> > disadvantage in using subread compared to other options? And when I'm >> > done >> > trimming and aligning, I could choose raw counts, conditional quantile >> > normalized counts, or something like RPKM to summarize how often a given >> > exon seems to have been transcribed. I read this: >> > >> > http://seqanswers.com/forums/showthread.php?t=586 >> > >> > and I see that packages using a Gamma prior for the dispersion of a >> > Poisson >> > count model benefit from having raw counts. If I am after correlated >> > changes in exon usage depending on other sequence features, is it >> > reasonable >> > to use (say) 'cqn' on the raw counts, then log-transform and work with >> > those >> > normalized counts? >> > >> > Thanks for any suggestions, >> > >> > -- >> > Tim Triche, Jr. >> > USC Biostatistics >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > > > -- > If people do not believe that mathematics is simple, > it is only because they do not realize how complicated life is. John von > Neumann<http: www-groups.dcs.st-="" and.ac.uk="" ~history="" biographies="" von_neumann.html=""> > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:17}}

ADD REPLY • link 12.5 years ago Tim Triche ★ 4.2k

0

Entering edit mode

There is no such function in Rsubread currently, although we might add it in the future. But what I do for looking at duplicated reads was writing a simple C program to retrieve read sequences from the FASTQ file (you can do this using R as well and the running time will not be much different because the major computational cost here is input and output), and then run unix commands 'sort' and 'uniq -c' to get the number of occurrences of each read sequence. Similarly, you can write a simple program to retrieve the mapping locations (chr id + coordinate) of each read from the alignment results (SAM file), save them to a file and run 'sort' and 'uniq -c' commands to get the number of mapped reads for each chromosomal location. But you should be aware that not all duplicate alignments could come from PCR contamination, because RNA-seq data have much higher coverage than gDNA-seq data (with the same library sizes) since the transcriptome size is only about 2% of the genome size. An RNA-seq dataset which includes 10M 100 bp SE reads has a ~17X coverage. So you will expect to see quite a few duplicate alignments in your RNA-seq data. Hope this help. Wei On Oct 13, 2011, at 7:37 PM, Tim Triche, Jr. wrote: > Cool! I started using featureCounts() by 'gene' and by 'exon' almost immediately. However, when we run the same RNA-seq data through our local Bowtie/Tophat pipeline, we find that one of the samples has extensive PCR contamination (lots and lots of duplicate alignments). Is there a simple way to detect the same in Rsubread? The latter is MUCH faster... > > Thanks again for a great package and a very fast aligner. > > --t > > On Wed, Oct 12, 2011 at 10:39 PM, Wei Shi <shi@wehi.edu.au> wrote: > Hi Tim, > > The featureCounts() function in Rsubread counts the number of reads mapped to genes or exons and returns a list which includes a read count matrix and annotation information. The returned list object can be easily transformed to a DGEList object which can then be fed into downstream expression analysis packages like edgeR or limma. The Exact tests or GLM likelihood ratio tests in edgeR, or voom() function in limma, can then be used to perform differential expression analysis for the RNA-seq data. > > We have established a pipeline to perform complete analysis for RNA- seq data using R packages including Rsubread, edgeR and limma. This pipeline starts with the FASTQ/FASTA files, which is the output of next-generation sequencers, through to read alignment, read count summarization for genes or exons, and to biological variation estimation and statistical testing of differential expression, and finally yields lists of differentially expressed genes. The repertoire of functions included in limma and edgeR can be readily used for further downstream analysis such as gene set tests and so on. > > This is a quick and easy-to-use pipeline and it is now being used in our routine RNA-seq data analysis. Hope you will find it useful. > > Cheers, > Wei > > On Oct 11, 2011, at 9:32 AM, Tim Triche, Jr. wrote: >> but as you pointed out, a very nice feature of Rsubread for me is featureCounts. It's also really fast, so if I can get away with using it, I am thinking that I will do so from now on... >> >> Thanks for a fast and easy to use alignment package, >> >> --Tim >> >> On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi@wehi.edu.au> wrote: >> Dear Tim, >> >> Yes, Subread performs read mapping by mapping a set of 16 mers extracted >> from each read to the genome, counting the number of mapped 16 mers at >> each candidate location and then choosing the one which is mapped to by >> the majority of the 16 mers as the mapping location of the read. The >> fundamental difference between Subread and other aligners is that it uses >> a voting method rather than a extension method to determine the mapping >> locations of the reads, which makes it a lot faster and more sensitive. >> >> Subread has a both a C version and an R version. The C version is freely >> available from sourceforge and the R version is included in the Rsubread >> package in Bioc. >> >> The Rsubread package also includes a function called featureCounts, which >> can be used to count the number of reads for each exon or gene. So this >> function will be useful for you to look at the differential expression at >> both gene level and exon level. >> >> Another function which might be useful for your data analysis is the >> subjunc funtion, which is designed to discover exon junctions. Subjunc >> uses an idea similar to that of Subread. Our preliminary results showed >> that subjunc outperformed competing junction detectors in terms of speed, >> sensitivity and accuracy. >> >> The devel version of Rsubread package includes a lot of our recent >> development for both Subread aligner and Subjunc junction detector, so I >> would recommend using the devel version if you want to try the Rsubread >> package. >> >> >> Hope this helps. >> >> Cheers, >> Wei >> >> >> >> >> >> > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and I >> > want >> > to align them, and I couldn't really make heads or tails of the options, >> > so >> > I listened to what Phil Green told me at a conference and looked around >> > for >> > a sensible word-nucleated aligner like he described. It seems that >> > Rsubread >> > works this way? >> > >> > http://sourceforge.net/projects/subread/ >> > >> > I would like BAM files as intermediate output, but my real interest is >> > differential exon usage in differentiating cells. Given that the reads I >> > have to align are relatively short (36bp, SE), is there an advantage or >> > disadvantage in using subread compared to other options? And when I'm >> > done >> > trimming and aligning, I could choose raw counts, conditional quantile >> > normalized counts, or something like RPKM to summarize how often a given >> > exon seems to have been transcribed. I read this: >> > >> > http://seqanswers.com/forums/showthread.php?t=586 >> > >> > and I see that packages using a Gamma prior for the dispersion of a >> > Poisson >> > count model benefit from having raw counts. If I am after correlated >> > changes in exon usage depending on other sequence features, is it >> > reasonable >> > to use (say) 'cqn' on the raw counts, then log-transform and work with >> > those >> > normalized counts? >> > >> > Thanks for any suggestions, >> > >> > -- >> > Tim Triche, Jr. >> > USC Biostatistics >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ >> >> >> >> -- >> If people do not believe that mathematics is simple, >> it is only because they do not realize how complicated life is. >> John von Neumann >> >> > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:25}}

ADD REPLY • link 12.5 years ago Wei Shi ★ 3.6k

Login before adding your answer.