Question

Estimating Size Factors in DESeq Package

0

Entering edit mode

@andres-eduardo-rodriguez-cubillos-5486

Last seen 9.6 years ago

Good night Dr. Simon Anders, I'm a master degree student at Universidad de los Andes (Bogot?, Colombia) currently doing some research on differential gene expression between three conditions (2 treated and 1 untreated). I'm using DESeq on R to compare the FPKMs of the genes under these three conditions with one replicate per condition. The results were obtained by single-end RNA-seq data and the expressed genes were aligned to a reference genome using Bowtie. Cufflinks and, subsequently, CuffCompare were used to obtain the reads mapped in a given sample to a given gene. Overall, I organized my tables quite similar to the pasilla_gene_counts.tsv file you used in your guide "Analyzing RNA-seq Data with the DESeq Package" from 2012. At first I had two tables: one per replicate containing FPKMs for the three conditions analyzed. I then merged the FPKMs of both files to obtain three tables; each one now with the FPKMs for both replicates compared between two of the three conditions (TreatedA vs Untreated; TreatedB vs Untreated; TreatedA vs TreatedB). The final format for each of these tables was as follows: Gene Treated1(Replicate1) Treated2(Replicate 2) Untreated1(Replicate1) Untreated2(Replicate2) tag_id FPKM FPKM FPKM FPKM tag_id FPKM FPKM FPKM FPKM ... I understand that DESeq must first normalize the expression values of each treatment by dividing each column with it's own size factor... however, when I want to estimate the size factors for any of my three final tables I get a "NA" or "Not Applicable" value for each treatment. It only happens with these tables that include both replicates but not with the two previous tables that only contain information for one replicate (results are attached). We don't know what might be causing this problem because the tables that contain information for one replicate have the same format than the tables that have both replicates (and in the example you use a table that contains replicates). I was planning to make two separate analysis using DESeq without any replicates (Section 3.3 of your guide), in spite of this problem, but I read that one must assume gene expression levels are quite similar between treatments (this is not our case). The idea I had in mind was to perform one analysis for each replicate, separately, and then compare the results to pick only genes that show differential expression on both analysis. Is this right, even though we know the expression levels vary between conditions? What could be the cause of the "NA" output when trying to estimate the size factors for the tables that contain both replicates? Can I analyze three treatments at once (all in one table) using DESeq, or only two conditions per table (analysis)? We would appreciate your help on this matter a lot because we do want to continue using DESeq for our differential gene expression analysis. Sincerely, Andr?s E. Rodr?guez C. Graduate Assistant LAMFU - Universidad de los Andes (Bogot? D.C., Colombia)

DESeq DESeq • 3.2k views

ADD COMMENT • link updated 11.6 years ago by Simon Anders ★ 3.7k • written 11.6 years ago by Andres Eduardo Rodriguez Cubillos ▴ 20

score 0 · Answer 1 · 2012-09-07

0

Entering edit mode

Simon Anders ★ 3.7k

@simon-anders-3855

Last seen 3.7 years ago

Zentrum für Molekularbiologie, Universi…

Dear Andres On 09/07/2012 05:05 PM, Andres Eduardo Rodriguez Cubillos wrote: > Overall, I organized my tables quite similar to the > pasilla_gene_counts.tsv file you used in your guide "Analyzing > RNA-seq Data with the DESeq Package" from 2012. At first I had two > tables: one per replicate containing FPKMs for the three conditions > analyzed. I then merged the FPKMs of both files to obtain three FPKM values are not suitable as input data for DESeq. Please see the vignette, which states, in Section 1: The count values must be raw counts of sequencing reads. This is important for DESeq's statistical model to hold, as only raw reads allow to assess the measurement precision correctly. (Hence, do not supply rounded values of normalized counts, or counts of covered base pairs.) Simon

ADD COMMENT • link 11.6 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Simon, I have a distantly related question -- suppose we wanted to compute FPM, i.e. fragments per million, for paired-end RNAseq data. Like, say, if a person (me) was really really lazy and was less concerned about sensitivity, since they were just going to feed everything to voom() or fit robust linear models of log(RPM/FPM) against predictive factors. Is this (fragment-level computation) possible in an existing Bioconductor package at this point in time? Or using HTseq, for that matter? The nice thing about FPM or log1p(FPM) or the like is that it makes sense for the type of modeling that a lot of us do, a lot of the time. Thanks for any information and/or thoughts you might have on the topic, and for making DEXSeq available, --t On Fri, Sep 7, 2012 at 2:20 PM, Simon Anders <anders@embl.de> wrote: > Dear Andres > > > On 09/07/2012 05:05 PM, Andres Eduardo Rodriguez Cubillos wrote: > >> Overall, I organized my tables quite similar to the >> pasilla_gene_counts.tsv file you used in your guide "Analyzing >> RNA-seq Data with the DESeq Package" from 2012. At first I had two >> tables: one per replicate containing FPKMs for the three conditions >> analyzed. I then merged the FPKMs of both files to obtain three >> > > > FPKM values are not suitable as input data for DESeq. Please see the > vignette, which states, in Section 1: > > The count values must be raw counts of sequencing reads. This is > important for DESeq's statistical model to hold, as only raw reads allow > to assess the measurement precision correctly. (Hence, do not supply > rounded values of normalized counts, or counts of covered base pairs.) > > > Simon > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim, On Fri, Sep 7, 2012 at 5:45 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote: > Hi Simon, > > I have a distantly related question -- suppose we wanted to compute FPM, > i.e. fragments per million, for paired-end RNAseq data. Like, say, if a > person (me) was really really lazy and was less concerned about > sensitivity, since they were just going to feed everything to voom() or fit > robust linear models of log(RPM/FPM) against predictive factors. > > Is this (fragment-level computation) possible in an existing Bioconductor > package at this point in time? Or using HTseq, for that matter? Are you forgetting the K (eg. FP(K)M)? If not, I guess I'm missing something, because it's easy to compute from a count matrix you've jimmy'd together (cf. edgeR::cpm w/ normalized.lib.size=TRUE). Or, are you asking how to create such a count matrix with PE data? Or? Apologies if I'm totally in the woods, here. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 11.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Paired end -- fragments per million. The trick which I don't yet know how to do is the paired-end part. I can adjust for tx length later if need be. I usually work with RPM or log1p(RPM), since I tend to do my own alignments in Subread, but reading the Cufflinks guys' paper, it does seem like FP[K]M for paired-end data makes sense, more so with multiple isoforms. Since counting features in Subread is trivial, and with a splicegraph, I can go back and count against whatever subjunc or RSEM finds, the only other thing to fiddle with is fragment size. The rest I already implemented as coercions that take subread counts and return a SummarizedExperiment, which is how I get my RPMs from FASTQs. :-) Thanks! --t On Fri, Sep 7, 2012 at 4:09 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi Tim, > > On Fri, Sep 7, 2012 at 5:45 PM, Tim Triche, Jr. <tim.triche@gmail.com> > wrote: > > Hi Simon, > > > > I have a distantly related question -- suppose we wanted to compute FPM, > > i.e. fragments per million, for paired-end RNAseq data. Like, say, if a > > person (me) was really really lazy and was less concerned about > > sensitivity, since they were just going to feed everything to voom() or > fit > > robust linear models of log(RPM/FPM) against predictive factors. > > > > Is this (fragment-level computation) possible in an existing Bioconductor > > package at this point in time? Or using HTseq, for that matter? > > Are you forgetting the K (eg. FP(K)M)? > > If not, I guess I'm missing something, because it's easy to compute > from a count matrix you've jimmy'd together (cf. edgeR::cpm w/ > normalized.lib.size=TRUE). > > Or, are you asking how to create such a count matrix with PE data? > > Or? > > Apologies if I'm totally in the woods, here. > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago Tim Triche ★ 4.2k