mRNA-seq cross-species analysis, is it possible?
3
0
Entering edit mode
mali salmon ▴ 350
@mali-salmon-4532
Last seen 3.0 years ago
Israel
Dear list I would like to perform mRNA-seq cross-species comparison. In that case it would be necessary to account for the differences in gene length. I already got a reply from the author of DESeq (see below) that this is currently can't be done with DESeq. Is it possible to specify gene-specific normalization factor with edgeR? or to input read counts that have been normalized to gene length? Thanks Mali ---------- Forwarded message ---------- From: Simon Anders <anders@embl.de> Date: Wed, Aug 3, 2011 at 10:03 AM Subject: Re: DESeq between two plants with different gene length To: mali salmon <shalmom1@gmail.com> Hi Mali On 08/02/2011 09:54 PM, mali salmon wrote: > I have counts data of 2 plants, one is rice which have a genome, and the > other is non-model plant with no genome. In order to find the gene > counts for the unknown genome-plant I assembled the reads, and aligned > the contigs to the rice proteome. > Can I use DESeq to find DE genes between rice and the non-model plant? > The problem is that the genes length is different between the two > plants. Does the comparison still be valid? Would you suggest to > normalize to gene length before DESeq? > First the technical point: It might be appropriate to account for gene length, but with the current version of DESeq, you cannot specify gene-specific normalization factors, even though we'll add this feature at some point. In general, I'm hesitating to recommend using DESeq for a cross- species comparison, but I also wouldn't know of any other good method. Such comparisons are really difficult and proper interpretation is filled with methodological pitfalls. Differences in gene length and ambiguity in assigning orthologous genes are the main technical ones. Another one is the question what constitutes proper replication here. Should you grow both species under identical conditions in the lab? If so, which conditions, those good for rice (e.g., lot of water), or those good for the other species? Maybe, you should grow both species in both conditions, and consider the samples from the same species but different conditions as replicates, as this would capture as much of the environmental influence as possible. Otherwise, you could not say whether the differences in condition may be attributed to genetics (different species) or environment (different growth conditions) or an interaction of both (different level of adaption of the species to the chosen growth conditions). I know, there are papers that try to do such comparisons, but I haven't seen anything yet addressing these issues in a convincing manner. Simon [[alternative HTML version deleted]]
0
Entering edit mode
@davis-mccarthy-4138
Last seen 7.1 years ago
Hi Mali It _is_ possible and relatively straight-forward to specify a gene- specific normalization factor with edgeR. The GLM methods in edgeR take an argument called "offset", which can be a matrix (of same size as your matrix of counts) providing the appropriate offset for the negative binomial generalized linear model. This allows you to apply a gene-specific normalization factor. Say you have a regular DGEList object, d, which contains the matrix of raw read counts in d$counts, and a matrix of normalized read counts, x.norm, then the appropriate offset matrix is just as follows (we add 0.1 to the counts to deal with counts of zero): offset <- log(d$counts + 0.1) - log(x.norm + 0.1) This offset can be added to the DGEList object as the element "offset" (funnily enough): doffset <- offset >From there the software takes care of the rest, using this offset matrix as the default when estimating dispersion values and fitting NB models. Using normalization factors rather than normalized read counts is the appropriate approach for edgeR. The edgeR methods require the raw counts and any normalization should enter the model(s) as an offset. Normalized read counts should _not_ be substituted for the raw read counts in the "counts" element of a DGEList in edgeR. We have used this gene-specific normalization factor to try out things like quantile normalization on RNA-Seq data in house. To my knowledge, the new cqn package outputs gene-specific offsets that will plug in to edgeR to normalize data for (possibly among other things) gene length and GC bias. Best wishes Davis On Aug 3, 2011, at 11:48 PM, mali salmon wrote: > Dear list > I would like to perform mRNA-seq cross-species comparison. In that case it > would be necessary to account for the differences in gene length. > I already got a reply from the author of DESeq (see below) that this is > currently can't be done with DESeq. > Is it possible to specify gene-specific normalization factor with edgeR? or > to input read counts that have been normalized to gene length? > Thanks > Mali > > ---------- Forwarded message ---------- > From: Simon Anders <anders at="" embl.de=""> > Date: Wed, Aug 3, 2011 at 10:03 AM > Subject: Re: DESeq between two plants with different gene length > To: mali salmon <shalmom1 at="" gmail.com=""> > > > Hi Mali > > > On 08/02/2011 09:54 PM, mali salmon wrote: > >> I have counts data of 2 plants, one is rice which have a genome, and the >> other is non-model plant with no genome. In order to find the gene >> counts for the unknown genome-plant I assembled the reads, and aligned >> the contigs to the rice proteome. >> Can I use DESeq to find DE genes between rice and the non-model plant? >> The problem is that the genes length is different between the two >> plants. Does the comparison still be valid? Would you suggest to >> normalize to gene length before DESeq? >> > > First the technical point: It might be appropriate to account for gene > length, but with the current version of DESeq, you cannot specify > gene-specific normalization factors, even though we'll add this feature at > some point. > > In general, I'm hesitating to recommend using DESeq for a cross- species > comparison, but I also wouldn't know of any other good method. Such > comparisons are really difficult and proper interpretation is filled with > methodological pitfalls. > > Differences in gene length and ambiguity in assigning orthologous genes are > the main technical ones. Another one is the question what constitutes proper > replication here. Should you grow both species under identical conditions in > the lab? If so, which conditions, those good for rice (e.g., lot of water), > or those good for the other species? Maybe, you should grow both species in > both conditions, and consider the samples from the same species but > different conditions as replicates, as this would capture as much of the > environmental influence > as possible. Otherwise, you could not say whether the differences in > condition may be attributed to genetics (different species) or environment > (different growth conditions) or an interaction of both (different level of > adaption of the species to the chosen growth conditions). > > I know, there are papers that try to do such comparisons, but I haven't seen > anything yet addressing these issues in a convincing manner. > > > Simon > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------------------------------------------------------------------- -- Davis J McCarthy Research Technician Bioinformatics Division Walter and Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville, Vic 3052, Australia dmccarthy at wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} ADD COMMENT 0 Entering edit mode On Wed, Aug 3, 2011 at 7:09 PM, Davis McCarthy <dmccarthy at="" wehi.edu.au=""> wrote: > We have used this gene-specific normalization factor to try out things like quantile normalization on RNA-Seq data in house. To my knowledge, the new cqn package outputs gene-specific offsets that will plug in to edgeR to normalize data for (possibly among other things) gene length and GC bias. True, but the interface to the cqn normalization method assumes that the data is ordered in a "genes" by samples matrix and that all samples have the same length/gc content for a given gene. The data we are discussing does not fit into this framework. However, it might be possible to hack the function to deal with this, by running cqn(..., sqn = FALSE) on each species separately, combining the output and then do a custom sqn normalization of the combined residuals. Hmm, this clearly requires more than 60 secs. of thinking (and probably some careful looking at the output to get an idea of whether this gives useful output or makes things worse). Kasper ADD REPLY 0 Entering edit mode Thanks Davis and Kasper for your reply. So if I understand correctly, after I add the offset to the DGEList, I continue with estimating common dispersion (so skipping the calcNormFactors step). And if so, would you suggest to do quantile normalization after adjusting the read counts to gene length, and before estimating the common dispersion? Mali On Thu, Aug 4, 2011 at 2:40 AM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > On Wed, Aug 3, 2011 at 7:09 PM, Davis McCarthy <dmccarthy@wehi.edu.au> > wrote: > > > We have used this gene-specific normalization factor to try out things > like quantile normalization on RNA-Seq data in house. To my knowledge, the > new cqn package outputs gene-specific offsets that will plug in to edgeR to > normalize data for (possibly among other things) gene length and GC bias. > > True, but the interface to the cqn normalization method assumes that > the data is ordered in a "genes" by samples matrix and that all > samples have the same length/gc content for a given gene. The data we > are discussing does not fit into this framework. However, it might be > possible to hack the function to deal with this, by running > cqn(..., sqn = FALSE) > on each species separately, combining the output and then do a custom > sqn normalization of the combined residuals. Hmm, this clearly > requires more than 60 secs. of thinking (and probably some careful > looking at the output to get an idea of whether this gives useful > output or makes things worse). > > Kasper > [[alternative HTML version deleted]] ADD REPLY 0 Entering edit mode @thomas-j-hardcastle-3860 Last seen 4.0 years ago United Kingdom > ---------------------------------------------------------------------- > > Message: 5 > Date: Wed, 3 Aug 2011 14:48:28 +0100 > From: mali salmon <shalmom1 at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] mRNA-seq cross-species analysis, is it possible? > Message-ID: > <cah5dn6xgokwmjdjmpmmrbv4iykz4nzc7pyjrkufahgckqghmpa at="" mail.gmail.com=""> > Content-Type: text/plain > > Dear list > I would like to perform mRNA-seq cross-species comparison. In that case it > would be necessary to account for the differences in gene length. > I already got a reply from the author of DESeq (see below) that this is > currently can't be done with DESeq. > Is it possible to specify gene-specific normalization factor with edgeR? or > to input read counts that have been normalized to gene length? > Thanks > Mali Hi Mali, It's also possible to use baySeq to analyse samples with different gene lengths. If you look at section 7 of the vignette for this package, you will see the 'seglens' slot of a countData object being given as a vector. In this case, baySeq uses, for each gene, the same length across all samples. However, you can instead pass a matrix to the seglens slot, of the same dimensions as your 'data' vector, that will use different gene lengths for different samples. For example, if the first gene were of length 1000 in one species, and 2500 in the other species, and you have two replicates of each condition, the first row of the seglens slot should be: [,1] 1000, 1000, 2500, 2500 Analysis then proceeds as normal. Tom -- Dr. Thomas J. Hardcastle Department of Plant Sciences University of Cambridge Downing Street Cambridge, CB2 3EA United Kingdom ADD COMMENT 0 Entering edit mode @davis-mccarthy-4794 Last seen 7.1 years ago Hi Mali > So if I understand correctly, after I add the offset to the DGEList, I continue with estimating common dispersion (so skipping the calcNormFactors step). Yes, that is correct. If you add your own offset to the DGEList then the normalization factors computed by calcNormFactors are not used. (doffset is always used, when not NULL, ahead of d$samples$norm.factors for the offset in the software). Once you've added the offsets to the DGEList object your proceed with dispersion estimation and DE testing as usual. > And if so, would you suggest to do quantile normalization after adjusting the read counts to gene length, and before estimating the common dispersion? Quantile normalization (which could itself account for gene lengths) may or may not be a sensible thing to do depending on your particular experiment and data. I can't really advise sensibly on whether you should or shouldn't do it. You can always try and see if it makes any sense for your dataset. Whatever normalization you do, it must be done before estimating the common dispersion or carrying out any further downstream inference on DE. Best wishes Davis ---------- Forwarded message ---------- From: mali salmon <shalmom1@gmail.com> To: Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com=""> Date: Thu, 4 Aug 2011 09:00:45 +0300 Subject: Re: [BioC] mRNA-seq cross-species analysis, is it possible? Thanks Davis and Kasper for your reply. So if I understand correctly, after I add the offset to the DGEList, I continue with estimating common dispersion (so skipping the calcNormFactors step). And if so, would you suggest to do quantile normalization after adjusting the read counts to gene length, and before estimating the common dispersion? Mali On Thu, Aug 4, 2011 at 2:40 AM, Kasper Daniel Hansen < kasperdanielhansen at gmail.com> wrote: > On Wed, Aug 3, 2011 at 7:09 PM, Davis McCarthy <dmccarthy at="" wehi.edu.au=""> > wrote: > > > We have used this gene-specific normalization factor to try out things > like quantile normalization on RNA-Seq data in house. To my knowledge, the > new cqn package outputs gene-specific offsets that will plug in to edgeR to > normalize data for (possibly among other things) gene length and GC bias. > > True, but the interface to the cqn normalization method assumes that > the data is ordered in a "genes" by samples matrix and that all > samples have the same length/gc content for a given gene. The data we > are discussing does not fit into this framework. However, it might be > possible to hack the function to deal with this, by running > cqn(..., sqn = FALSE) > on each species separately, combining the output and then do a custom > sqn normalization of the combined residuals. Hmm, this clearly > requires more than 60 secs. of thinking (and probably some careful > looking at the output to get an idea of whether this gives useful > output or makes things worse). > > Kasper > [[alternative HTML version deleted]] ---------------------------------------------------------------------- ----- Davis J McCarthy Research Technician Bioinformatics Division Walter and Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville, Vic 3052, Australia dmccarthy at wehi.edu.au http://www.wehi.edu.au