package to count read prior to DESeq?

0

Entering edit mode

nac ▴ 280

@nac-4545

Last seen 11.4 years ago

Hi, Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? thanks Nat -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

DESeq DESeq • 1.9k views

ADD COMMENT • link updated 14.0 years ago by Wei Shi ★ 3.6k • written 14.0 years ago by nac ▴ 280

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 1 day ago

United States

Hi Nathalie, On Fri, Feb 3, 2012 at 10:39 AM, nathalie <nac at="" sanger.ac.uk=""> wrote: > Hi, > Is there a package to create a count table from bam files, which gives an > output compatible with DESeq analysis? The summarizeOverlaps function in the GenomicRanges library is probably a good place to start. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 14.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

delhomme@embl.de ★ 1.2k

@delhommeemblde-3232

Last seen 11.4 years ago

Hi Nat, That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 3 Feb 2012, at 16:39, nathalie wrote: > Hi, > Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? > thanks > Nat > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.0 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Hi Nico With respect to easyRNAeq package I am wondering if I supply it a transcript level gtf file can it produce gene level counts. Basically the file is the output from cufflinks. the last(attributes) column in the file does have the gene_id attribute. Best, -Abhi On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > Hi Nat, > > That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help. > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 3 Feb 2012, at 16:39, nathalie wrote: > >> Hi, >> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? >> thanks >> Nat >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.0 years ago Abhishek Pratap ▴ 410

0

Entering edit mode

Dear Abhi, Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make. First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data. Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data. Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve. If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR. I hope this helps, Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 3 Feb 2012, at 19:51, Abhishek Pratap wrote: > Hi Nico > > With respect to easyRNAeq package I am wondering if I supply it a > transcript level gtf file can it produce gene level counts. Basically > the file is the output from cufflinks. > > the last(attributes) column in the file does have the gene_id attribute. > > > Best, > -Abhi > > > > On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: >> Hi Nat, >> >> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help. >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 3 Feb 2012, at 16:39, nathalie wrote: >> >>> Hi, >>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? >>> thanks >>> Nat >>> >>> >>> -- >>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.0 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Hi Nico Thanks for a quick and detailed reply. I am a tiny bit confused, may be my question was not clear. Here is how I understand it. Please correct me if I got it wrong. What I would like to do is study differential gene expression at both gene and transcript level. As an input what I have is transcriptome model in a gtf file and mapped bam files(unique hits only: I have removed reads mapping to multiple positions in the genome). The gtf file lists all the possible transcripts/exons for a gene. What I need to get from this file two types of counts. 1) counts of reads that fall in each gene bin (strand specific) 2) count of reads that fall in each transcript (again strand sp) What I am not sure is the counting method of easyRNASeq. Given an annotation of gene with 2 transcripts (A, B) how does it do the counting. If it is counting at gene level then at some point in the calculation it should be merging the exons from transcripts A and B to gene level. And if it is counting at transcript level then we should get two counts one for each transcript. I understand in the latter case there will be double counting due to overlapping exons between the transcripts. It will help a lot if you could clear this. PS: I also did try to use htseq-count for doing the same but the counts I got dont quite match the avg number of reads we expect to see / gene or transcript. I guess I should start another thread on this with Simon. May be I am doing something wrong. Thanks! -Abhi On Fri, Feb 3, 2012 at 11:48 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > Dear Abhi, > > Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make. > > First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data. > > Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data. > > Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve. > > If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR. > > I hope this helps, > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 3 Feb 2012, at 19:51, Abhishek Pratap wrote: > >> Hi Nico >> >> With respect to easyRNAeq package I am wondering if I supply it a >> transcript level gtf file can it produce gene level counts. Basically >> the file is the output from cufflinks. >> >> the last(attributes) column in the file does have the gene_id attribute. >> >> >> Best, >> -Abhi >> >> >> >> On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: >>> Hi Nat, >>> >>> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help. >>> >>> Cheers, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On 3 Feb 2012, at 16:39, nathalie wrote: >>> >>>> Hi, >>>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? >>>> thanks >>>> Nat >>>> >>>> >>>> -- >>>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.0 years ago Abhishek Pratap ▴ 410

0

Entering edit mode

Hi Abhi, > Hi Nico > > Thanks for a quick and detailed reply. I am a tiny bit confused, may > be my question was not clear. Here is how I understand it. Please > correct me if I got it wrong. > > What I would like to do is study differential gene expression at both > gene and transcript level. As an input what I have is transcriptome > model in a gtf file and mapped bam files(unique hits only: I have > removed reads mapping to multiple positions in the genome). The gtf > file lists all the possible transcripts/exons for a gene. Sorry, I misunderstood what you meant by your cufflink gtf file. Indeed, you should be able to use it with easyRNASeq. > > What I need to get from this file two types of counts. > 1) counts of reads that fall in each gene bin (strand specific) > 2) count of reads that fall in each transcript (again strand sp) > I haven't had my hands on strand specific data so far, so at yet it is not supported but I'd be glad to add that into easyRNASeq (was still on my TODO list, don't get enough time... why are the days so short ;-)). Could you provide me with a small excerpt of your data, i.e. a region where you have reads on both strands coming from different transcripts? I could certainly simulate those, but real data is always better. > What I am not sure is the counting method of easyRNASeq. Given an > annotation of gene with 2 transcripts (A, B) how does it do the > counting. If it is counting at gene level then at some point in the > calculation it should be merging the exons from transcripts A and B to > gene level. And if it is counting at transcript level then we should > get two counts one for each transcript. I understand in the latter > case there will be double counting due to overlapping exons between > the transcripts. You're correct. You can precise for which features you want to count using the "count" argument. It takes one of four possible values: features, exons, transcripts, genes. features: Generic features can be defined, i.e. simple interval on the genome, which for example are useful to quantify reads present on enhancers (eRNA). exons: counts are summarized per exon transcripts: counts are summarized per transcript. This indeed would results in double counting exons common to different transcripts genes: in that case, you need to precise an additional argument: summarization. It takes one of two values: geneModels and bestExons. The later one is a relic when we were looking at the coverage bias observed across transcript. The former one calculates synthetic exons, i.e. it separates the different exons of a gene so that they do not contain any overlap. > > It will help a lot if you could clear this. > As a consequence I would not use the transcripts summarization (as warned by the package and mentioned in the vignette). In your case, you could get the geneModels from easyRNASeq using your gtf and then use that information to deconvolute the effect of every transcript in a gene, using the synthetic exons. Again this would be far from trivial. I think the best approach to your problem is the DEXSeq package that look at the differential expression of exons. Using it you could identify which exons are differentially expressed between conditions and trace it back to the causative set of transcripts. To do so, you need to first obtain the synthetic exons from your gtf file. You can either 1) use easyRNASeq to first get the geneModels and then rerun easyRNASeq to get the exon expression, or 2) you directly transform your gtf file into a GRangesList with one GRanges per gene. Then per GRanges, you use the disjoin function to get the synthetic ranges. This object you can provide as annotation to easyRNASeq to get the exon counts. Finally once you've got your exon count, you apply DEXSeq. > PS: I also did try to use htseq-count for doing the same but the > counts I got dont quite match the avg number of reads we expect to see > / gene or transcript. I guess I should start another thread on this > with Simon. May be I am doing something wrong. Well, Simon might chime in there for more details, but from what he told me htseq-count does the same as the easyRNASeq geneModels; it creates synthetic exons and summarize the reads per synthetic exons. Using DEXSeq is probably how I would do in your situation (I'm adding to my TODO list the possibility to create a DEXSeq object from easyRNASeq). The sketch I've described above is far from complete, but if you find it makes sense, I'd be glad to help you with the bolts and nuts :-) Cheers, Nico > > Thanks! > -Abhi > > > > > On Fri, Feb 3, 2012 at 11:48 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: >> Dear Abhi, >> >> Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make. >> >> First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data. >> >> Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data. >> >> Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve. >> >> If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR. >> >> I hope this helps, >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 3 Feb 2012, at 19:51, Abhishek Pratap wrote: >> >>> Hi Nico >>> >>> With respect to easyRNAeq package I am wondering if I supply it a >>> transcript level gtf file can it produce gene level counts. Basically >>> the file is the output from cufflinks. >>> >>> the last(attributes) column in the file does have the gene_id attribute. >>> >>> >>> Best, >>> -Abhi >>> >>> >>> >>> On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: >>>> Hi Nat, >>>> >>>> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help. >>>> >>>> Cheers, >>>> >>>> Nico >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delhomme at embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> --------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>>> On 3 Feb 2012, at 16:39, nathalie wrote: >>>> >>>>> Hi, >>>>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? >>>>> thanks >>>>> Nat >>>>> >>>>> >>>>> -- >>>>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>

ADD REPLY • link 14.0 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 24 days ago

Australia/Melbourne

Dear Natalie, The featureCounts function in Rsubread package can summarize read counts for genes or exons (each read will be assigned to at most one exon if reads are summarized to exon level). It uses the NCBI annotation for read summarization, but you can provide your own annotation as well. This function is very quick. It only takes a couple of minutes to summarize 10 million reads. Hope you will find it useful. Cheers, Wei On Feb 4, 2012, at 2:39 AM, nathalie wrote: > Hi, > Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis? > thanks > Nat > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered ...{{dropped:18}}

ADD COMMENT • link 14.0 years ago Wei Shi ★ 3.6k

Login before adding your answer.