Dear Edger developers and users,
I would like to compare transcription levels of orthologous genes
belonging
to different species, in order to find significant species-dependent
changes in transcription levels. I though of using Edger for such
analysis.
Specifically, I have the read-counts data for several RNA-Seq samples,
for
2 different species (e.g., read counts produced by Htseq-count, and
Rsem).
I would like to ask:
1) because Edger uses CPM values, which are not normalized by gene-
length,
and because the length of orthologous genes differ, it would lead to a
serious length-dependet bias, and I would ask how to normalize for
that.
2) if the above length-bias can be eliminated, and the compared genes
are
true orthologs, are you aware of any other major problems that should
be
considered in the above case ?
Thanks in advance,
Assaf
[[alternative HTML version deleted]]
Dear Assaf,
In principle, one would incorporate gene length into the edgeR offset
matrix, and this would eliminate any length-bias.
Best wishes
Gordon
> Date: Mon, 25 Aug 2014 12:50:36 +0300
> From: assaf www <assafwww at="" gmail.com="">
> To: bioconductor at r-project.org
> Subject: [BioC] Interspecies differential expression of orthologs
with
> Edger
>
> Dear Edger developers and users,
>
> I would like to compare transcription levels of orthologous genes
belonging
> to different species, in order to find significant species-dependent
> changes in transcription levels. I though of using Edger for such
analysis.
> Specifically, I have the read-counts data for several RNA-Seq
samples, for
> 2 different species (e.g., read counts produced by Htseq-count, and
Rsem).
>
> I would like to ask:
> 1) because Edger uses CPM values, which are not normalized by gene-
length,
> and because the length of orthologous genes differ, it would lead to
a
> serious length-dependet bias, and I would ask how to normalize for
that.
> 2) if the above length-bias can be eliminated, and the compared
genes are
> true orthologs, are you aware of any other major problems that
should be
> considered in the above case ?
>
>
> Thanks in advance,
> Assaf
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
That doesn't look helpful to me. I suggested that you incorporate
gene
lengths into the offsets, not do quantile normalization of cpms.
Sorry, I just don't have time to develop a code example for you. I
hope
someone else will help.
The whole topic of interspecies differential expression is a can of
worms.
Even if you adjust for gene length, there will still be differences in
gc
content and mappability between the species.
Gordon
On Wed, 27 Aug 2014, assaf www wrote:
> Dear Gordon thanks,
>
> Suppose I start with the following matrices:
>
> # 'counts' is the Rsem filtered counts
>> counts[1:4,]
> h0 h1 h2 n0 n1 n2
> ENSRNOG00000000021 36 17 20 10 25 38
> ENSRNOG00000000024 1283 615 731 644 807 991
> ENSRNOG00000000028 26 12 11 18 23 28
> ENSRNOG00000000029 22 13 12 16 17 15
>
> # 'geneLength' is the species-specific gene lengths, for species 'h'
and
> 'n':
>> geneLength[1:3,]
> h0.length h1.length h2.length n0.length n1.length
> n2.length
> ENSRNOG00000000021 1200 1200 1200 1303 1303
> 1303
> ENSRNOG00000000024 1050 1050 1050 3080 3080
> 3080
> ENSRNOG00000000028 1047 1047 1047 1121 1121
> 1121
>
>
> does the following code look correct, and may allow allows across
species
> analysis ?:
> (technically it works, I checked)
>
> # quantile normalization: (as in here:
> http://davetang.org/wiki/tiki-
index.php?page=Analysing+digital+gene+expression
> )
>
> x1 = counts*1000/geneLength
> library(limma)
> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
> offset = log(counts+0.1)-log(x2+0.1)
>
> ...
>
> y <- estimateGLMCommonDisp(y,design,offset=offset)
> y <- estimateGLMTrendedDisp(y,design,offset=offset)
> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
> fit <- glmFit(y,design,offset=offset)
>
> ...
>
>
> Thanks in advance for any help..,
> Assaf
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Probably wrong, but the reason I thought of using quantile
normalization is
the need to correct both for the species-length, and library size.
On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
> That doesn't look helpful to me. I suggested that you incorporate
gene
> lengths into the offsets, not do quantile normalization of cpms.
>
> Sorry, I just don't have time to develop a code example for you. I
hope
> someone else will help.
>
> The whole topic of interspecies differential expression is a can of
worms.
> Even if you adjust for gene length, there will still be differences
in gc
> content and mappability between the species.
>
> Gordon
>
>
> On Wed, 27 Aug 2014, assaf www wrote:
>
> Dear Gordon thanks,
>>
>> Suppose I start with the following matrices:
>>
>> # 'counts' is the Rsem filtered counts
>>
>>> counts[1:4,]
>>>
>> h0 h1 h2 n0 n1 n2
>> ENSRNOG00000000021 36 17 20 10 25 38
>> ENSRNOG00000000024 1283 615 731 644 807 991
>> ENSRNOG00000000028 26 12 11 18 23 28
>> ENSRNOG00000000029 22 13 12 16 17 15
>>
>> # 'geneLength' is the species-specific gene lengths, for species
'h' and
>> 'n':
>>
>>> geneLength[1:3,]
>>>
>> h0.length h1.length h2.length n0.length n1.length
>> n2.length
>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>> 1303
>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>> 3080
>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>> 1121
>>
>>
>> does the following code look correct, and may allow allows across
species
>> analysis ?:
>> (technically it works, I checked)
>>
>> # quantile normalization: (as in here:
>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>> digital+gene+expression
>> )
>>
>> x1 = counts*1000/geneLength
>> library(limma)
>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>> offset = log(counts+0.1)-log(x2+0.1)
>>
>> ...
>>
>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>> fit <- glmFit(y,design,offset=offset)
>>
>> ...
>>
>>
>> Thanks in advance for any help..,
>> Assaf
>>
>>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:10}}
It works something like this:
library(edgeR)
y <- DGEList(counts=counts)
y <- calcNormFactors(y)
# Column correct log gene lengths
# Columns of gl should add to zero
gl <- log(geneLength)
gl <- t(t(gl)-colMeans(gl))
# Combine library sizes, norm factors and gene lengths:
offset <- expandAsMatrix(getOffset(y))
offset <- offset + gl
Then
y$offset <- offset
y <- estimateGLMCommonDisp(y,design)
etc.
Note that I have not tried this myself. It should work in principle
from
a differential expression point of view.
On the other hand, there may be side effects regarding dispersion
trend
estimation -- I do not have time to explore this.
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Wed, 27 Aug 2014, assaf www wrote:
> Probably wrong, but the reason I thought of using quantile
normalization is
> the need to correct both for the species-length, and library size.
>
>
> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>
>> That doesn't look helpful to me. I suggested that you incorporate
gene
>> lengths into the offsets, not do quantile normalization of cpms.
>>
>> Sorry, I just don't have time to develop a code example for you. I
hope
>> someone else will help.
>>
>> The whole topic of interspecies differential expression is a can of
worms.
>> Even if you adjust for gene length, there will still be differences
in gc
>> content and mappability between the species.
>>
>> Gordon
>>
>>
>> On Wed, 27 Aug 2014, assaf www wrote:
>>
>> Dear Gordon thanks,
>>>
>>> Suppose I start with the following matrices:
>>>
>>> # 'counts' is the Rsem filtered counts
>>>
>>>> counts[1:4,]
>>>>
>>> h0 h1 h2 n0 n1 n2
>>> ENSRNOG00000000021 36 17 20 10 25 38
>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>> ENSRNOG00000000028 26 12 11 18 23 28
>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>
>>> # 'geneLength' is the species-specific gene lengths, for species
'h' and
>>> 'n':
>>>
>>>> geneLength[1:3,]
>>>>
>>> h0.length h1.length h2.length n0.length
n1.length
>>> n2.length
>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>> 1303
>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>> 3080
>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>> 1121
>>>
>>>
>>> does the following code look correct, and may allow allows across
species
>>> analysis ?:
>>> (technically it works, I checked)
>>>
>>> # quantile normalization: (as in here:
>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>> digital+gene+expression
>>> )
>>>
>>> x1 = counts*1000/geneLength
>>> library(limma)
>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>> offset = log(counts+0.1)-log(x2+0.1)
>>>
>>> ...
>>>
>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>> fit <- glmFit(y,design,offset=offset)
>>>
>>> ...
>>>
>>>
>>> Thanks in advance for any help..,
>>> Assaf
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
This is very helpful for me, thanks.
A slight change that I made in the code you sent, to avoid some R
erros, is
# to replace:
offset = offset + gl
# with:
offset = sweep(gl,2,offset,"+")
In addition to differential expression tests, I wanted also to extract
FPKMs values (and/or normalized CPM values), that would take into
account
all components of the offset (which if I'm not mistaken are:
library_size +
TMM + gene_size).
I assume rpkm()/cpm() should correct only for library_size + TMM.
Is there a possibly "decent" solution for that ?
all the best, and thanks,
Assaf
On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
> It works something like this:
>
> library(edgeR)
> y <- DGEList(counts=counts)
> y <- calcNormFactors(y)
>
> # Column correct log gene lengths
> # Columns of gl should add to zero
>
> gl <- log(geneLength)
> gl <- t(t(gl)-colMeans(gl))
>
> # Combine library sizes, norm factors and gene lengths:
>
> offset <- expandAsMatrix(getOffset(y))
> offset <- offset + gl
>
> Then
>
> y$offset <- offset
> y <- estimateGLMCommonDisp(y,design)
>
> etc.
>
> Note that I have not tried this myself. It should work in principle
from
> a differential expression point of view.
>
> On the other hand, there may be side effects regarding dispersion
trend
> estimation -- I do not have time to explore this.
>
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.statsci.org/smyth
>
> On Wed, 27 Aug 2014, assaf www wrote:
>
> Probably wrong, but the reason I thought of using quantile
normalization
>> is
>> the need to correct both for the species-length, and library size.
>>
>>
>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>> wrote:
>>
>> That doesn't look helpful to me. I suggested that you incorporate
gene
>>> lengths into the offsets, not do quantile normalization of cpms.
>>>
>>> Sorry, I just don't have time to develop a code example for you.
I hope
>>> someone else will help.
>>>
>>> The whole topic of interspecies differential expression is a can
of
>>> worms.
>>> Even if you adjust for gene length, there will still be
differences in gc
>>> content and mappability between the species.
>>>
>>> Gordon
>>>
>>>
>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>
>>> Dear Gordon thanks,
>>>
>>>>
>>>> Suppose I start with the following matrices:
>>>>
>>>> # 'counts' is the Rsem filtered counts
>>>>
>>>> counts[1:4,]
>>>>>
>>>>> h0 h1 h2 n0 n1 n2
>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>
>>>> # 'geneLength' is the species-specific gene lengths, for species
'h' and
>>>> 'n':
>>>>
>>>> geneLength[1:3,]
>>>>>
>>>>> h0.length h1.length h2.length n0.length
n1.length
>>>> n2.length
>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>> 1303
>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>> 3080
>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>> 1121
>>>>
>>>>
>>>> does the following code look correct, and may allow allows across
>>>> species
>>>> analysis ?:
>>>> (technically it works, I checked)
>>>>
>>>> # quantile normalization: (as in here:
>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>> digital+gene+expression
>>>> )
>>>>
>>>> x1 = counts*1000/geneLength
>>>> library(limma)
>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>
>>>> ...
>>>>
>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>> fit <- glmFit(y,design,offset=offset)
>>>>
>>>> ...
>>>>
>>>>
>>>> Thanks in advance for any help..,
>>>> Assaf
>>>>
>>>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:10}}
Sorry I'm not clear:
for getting the FPKMs, I guess I only need to:
rpkm(y, gene.length=geneLength, normalized.lib.sizes=T, log=F)
But gene.length should here be a vector, while in this case its a
matrix.
Assaf
On Wed, Aug 27, 2014 at 6:14 PM, assaf www <assafwww at="" gmail.com="">
wrote:
> This is very helpful for me, thanks.
>
> A slight change that I made in the code you sent, to avoid some R
erros,
> is
>
> # to replace:
> offset = offset + gl
> # with:
> offset = sweep(gl,2,offset,"+")
>
> In addition to differential expression tests, I wanted also to
extract
> FPKMs values (and/or normalized CPM values), that would take into
account
> all components of the offset (which if I'm not mistaken are:
library_size +
> TMM + gene_size).
> I assume rpkm()/cpm() should correct only for library_size + TMM.
> Is there a possibly "decent" solution for that ?
>
> all the best, and thanks,
> Assaf
>
>
>
> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>
>> It works something like this:
>>
>> library(edgeR)
>> y <- DGEList(counts=counts)
>> y <- calcNormFactors(y)
>>
>> # Column correct log gene lengths
>> # Columns of gl should add to zero
>>
>> gl <- log(geneLength)
>> gl <- t(t(gl)-colMeans(gl))
>>
>> # Combine library sizes, norm factors and gene lengths:
>>
>> offset <- expandAsMatrix(getOffset(y))
>> offset <- offset + gl
>>
>> Then
>>
>> y$offset <- offset
>> y <- estimateGLMCommonDisp(y,design)
>>
>> etc.
>>
>> Note that I have not tried this myself. It should work in
principle from
>> a differential expression point of view.
>>
>> On the other hand, there may be side effects regarding dispersion
trend
>> estimation -- I do not have time to explore this.
>>
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Wed, 27 Aug 2014, assaf www wrote:
>>
>> Probably wrong, but the reason I thought of using quantile
normalization
>>> is
>>> the need to correct both for the species-length, and library size.
>>>
>>>
>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>> wrote:
>>>
>>> That doesn't look helpful to me. I suggested that you
incorporate gene
>>>> lengths into the offsets, not do quantile normalization of cpms.
>>>>
>>>> Sorry, I just don't have time to develop a code example for you.
I hope
>>>> someone else will help.
>>>>
>>>> The whole topic of interspecies differential expression is a can
of
>>>> worms.
>>>> Even if you adjust for gene length, there will still be
differences in
>>>> gc
>>>> content and mappability between the species.
>>>>
>>>> Gordon
>>>>
>>>>
>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>
>>>> Dear Gordon thanks,
>>>>
>>>>>
>>>>> Suppose I start with the following matrices:
>>>>>
>>>>> # 'counts' is the Rsem filtered counts
>>>>>
>>>>> counts[1:4,]
>>>>>>
>>>>>> h0 h1 h2 n0 n1 n2
>>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>>
>>>>> # 'geneLength' is the species-specific gene lengths, for species
'h'
>>>>> and
>>>>> 'n':
>>>>>
>>>>> geneLength[1:3,]
>>>>>>
>>>>>> h0.length h1.length h2.length n0.length
n1.length
>>>>> n2.length
>>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>>> 1303
>>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>>> 3080
>>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>>> 1121
>>>>>
>>>>>
>>>>> does the following code look correct, and may allow allows
across
>>>>> species
>>>>> analysis ?:
>>>>> (technically it works, I checked)
>>>>>
>>>>> # quantile normalization: (as in here:
>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>>> digital+gene+expression
>>>>> )
>>>>>
>>>>> x1 = counts*1000/geneLength
>>>>> library(limma)
>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>>
>>>>> ...
>>>>>
>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>>> fit <- glmFit(y,design,offset=offset)
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>> Thanks in advance for any help..,
>>>>> Assaf
>>>>>
>>>>
>>
______________________________________________________________________
>> The information in this email is confidential and intended solely
for the
>> addressee.
>> You must not disclose, forward, print or use it without the
permission of
>> the sender.
>>
______________________________________________________________________
>>
>
>
[[alternative HTML version deleted]]
Looking at the code for rpkm, it looks like it should work just fine
for a matrix or a vector.
On Wed Aug 27 09:30:11 2014, assaf www wrote:
> Sorry I'm not clear:
> for getting the FPKMs, I guess I only need to:
> rpkm(y, gene.length=geneLength, normalized.lib.sizes=T, log=F)
>
> But gene.length should here be a vector, while in this case its a
matrix.
>
>
> Assaf
>
>
> On Wed, Aug 27, 2014 at 6:14 PM, assaf www <assafwww at="" gmail.com="">
wrote:
>
>> This is very helpful for me, thanks.
>>
>> A slight change that I made in the code you sent, to avoid some R
erros,
>> is
>>
>> # to replace:
>> offset = offset + gl
>> # with:
>> offset = sweep(gl,2,offset,"+")
>>
>> In addition to differential expression tests, I wanted also to
extract
>> FPKMs values (and/or normalized CPM values), that would take into
account
>> all components of the offset (which if I'm not mistaken are:
library_size +
>> TMM + gene_size).
>> I assume rpkm()/cpm() should correct only for library_size + TMM.
>> Is there a possibly "decent" solution for that ?
>>
>> all the best, and thanks,
>> Assaf
>>
>>
>>
>> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>>
>>> It works something like this:
>>>
>>> library(edgeR)
>>> y <- DGEList(counts=counts)
>>> y <- calcNormFactors(y)
>>>
>>> # Column correct log gene lengths
>>> # Columns of gl should add to zero
>>>
>>> gl <- log(geneLength)
>>> gl <- t(t(gl)-colMeans(gl))
>>>
>>> # Combine library sizes, norm factors and gene lengths:
>>>
>>> offset <- expandAsMatrix(getOffset(y))
>>> offset <- offset + gl
>>>
>>> Then
>>>
>>> y$offset <- offset
>>> y <- estimateGLMCommonDisp(y,design)
>>>
>>> etc.
>>>
>>> Note that I have not tried this myself. It should work in
principle from
>>> a differential expression point of view.
>>>
>>> On the other hand, there may be side effects regarding dispersion
trend
>>> estimation -- I do not have time to explore this.
>>>
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> http://www.statsci.org/smyth
>>>
>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>
>>> Probably wrong, but the reason I thought of using quantile
normalization
>>>> is
>>>> the need to correct both for the species-length, and library
size.
>>>>
>>>>
>>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>>> wrote:
>>>>
>>>> That doesn't look helpful to me. I suggested that you
incorporate gene
>>>>> lengths into the offsets, not do quantile normalization of cpms.
>>>>>
>>>>> Sorry, I just don't have time to develop a code example for you.
I hope
>>>>> someone else will help.
>>>>>
>>>>> The whole topic of interspecies differential expression is a can
of
>>>>> worms.
>>>>> Even if you adjust for gene length, there will still be
differences in
>>>>> gc
>>>>> content and mappability between the species.
>>>>>
>>>>> Gordon
>>>>>
>>>>>
>>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>>
>>>>> Dear Gordon thanks,
>>>>>
>>>>>>
>>>>>> Suppose I start with the following matrices:
>>>>>>
>>>>>> # 'counts' is the Rsem filtered counts
>>>>>>
>>>>>> counts[1:4,]
>>>>>>>
>>>>>>> h0 h1 h2 n0 n1 n2
>>>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>>>
>>>>>> # 'geneLength' is the species-specific gene lengths, for
species 'h'
>>>>>> and
>>>>>> 'n':
>>>>>>
>>>>>> geneLength[1:3,]
>>>>>>>
>>>>>>> h0.length h1.length h2.length n0.length
n1.length
>>>>>> n2.length
>>>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>>>> 1303
>>>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>>>> 3080
>>>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>>>> 1121
>>>>>>
>>>>>>
>>>>>> does the following code look correct, and may allow allows
across
>>>>>> species
>>>>>> analysis ?:
>>>>>> (technically it works, I checked)
>>>>>>
>>>>>> # quantile normalization: (as in here:
>>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>>>> digital+gene+expression
>>>>>> )
>>>>>>
>>>>>> x1 = counts*1000/geneLength
>>>>>> library(limma)
>>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>>>> fit <- glmFit(y,design,offset=offset)
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> Thanks in advance for any help..,
>>>>>> Assaf
>>>>>>
>>>>>
>>>
______________________________________________________________________
>>> The information in this email is confidential and intended solely
for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the
permission of
>>> the sender.
>>>
______________________________________________________________________
>>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Thanks Ryan, it seems you are correct
On Wed, Aug 27, 2014 at 7:39 PM, Ryan <rct at="" thompsonclan.org=""> wrote:
> Looking at the code for rpkm, it looks like it should work just fine
for a
> matrix or a vector.
>
> On Wed Aug 27 09:30:11 2014, assaf www wrote:
>
>> Sorry I'm not clear:
>> for getting the FPKMs, I guess I only need to:
>> rpkm(y, gene.length=geneLength, normalized.lib.sizes=T, log=F)
>>
>> But gene.length should here be a vector, while in this case its a
matrix.
>>
>>
>> Assaf
>>
>>
>> On Wed, Aug 27, 2014 at 6:14 PM, assaf www <assafwww at="" gmail.com="">
wrote:
>>
>> This is very helpful for me, thanks.
>>>
>>> A slight change that I made in the code you sent, to avoid some R
erros,
>>> is
>>>
>>> # to replace:
>>> offset = offset + gl
>>> # with:
>>> offset = sweep(gl,2,offset,"+")
>>>
>>> In addition to differential expression tests, I wanted also to
extract
>>> FPKMs values (and/or normalized CPM values), that would take into
account
>>> all components of the offset (which if I'm not mistaken are:
>>> library_size +
>>> TMM + gene_size).
>>> I assume rpkm()/cpm() should correct only for library_size + TMM.
>>> Is there a possibly "decent" solution for that ?
>>>
>>> all the best, and thanks,
>>> Assaf
>>>
>>>
>>>
>>> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>> wrote:
>>>
>>> It works something like this:
>>>>
>>>> library(edgeR)
>>>> y <- DGEList(counts=counts)
>>>> y <- calcNormFactors(y)
>>>>
>>>> # Column correct log gene lengths
>>>> # Columns of gl should add to zero
>>>>
>>>> gl <- log(geneLength)
>>>> gl <- t(t(gl)-colMeans(gl))
>>>>
>>>> # Combine library sizes, norm factors and gene lengths:
>>>>
>>>> offset <- expandAsMatrix(getOffset(y))
>>>> offset <- offset + gl
>>>>
>>>> Then
>>>>
>>>> y$offset <- offset
>>>> y <- estimateGLMCommonDisp(y,design)
>>>>
>>>> etc.
>>>>
>>>> Note that I have not tried this myself. It should work in
principle
>>>> from
>>>> a differential expression point of view.
>>>>
>>>> On the other hand, there may be side effects regarding dispersion
trend
>>>> estimation -- I do not have time to explore this.
>>>>
>>>> Gordon
>>>>
>>>> ---------------------------------------------
>>>> Professor Gordon K Smyth,
>>>> Bioinformatics Division,
>>>> Walter and Eliza Hall Institute of Medical Research,
>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> http://www.statsci.org/smyth
>>>>
>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>
>>>> Probably wrong, but the reason I thought of using quantile
>>>> normalization
>>>>
>>>>> is
>>>>> the need to correct both for the species-length, and library
size.
>>>>>
>>>>>
>>>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>>>> wrote:
>>>>>
>>>>> That doesn't look helpful to me. I suggested that you
incorporate
>>>>> gene
>>>>>
>>>>>> lengths into the offsets, not do quantile normalization of
cpms.
>>>>>>
>>>>>> Sorry, I just don't have time to develop a code example for
you. I
>>>>>> hope
>>>>>> someone else will help.
>>>>>>
>>>>>> The whole topic of interspecies differential expression is a
can of
>>>>>> worms.
>>>>>> Even if you adjust for gene length, there will still be
differences in
>>>>>> gc
>>>>>> content and mappability between the species.
>>>>>>
>>>>>> Gordon
>>>>>>
>>>>>>
>>>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>>>
>>>>>> Dear Gordon thanks,
>>>>>>
>>>>>>
>>>>>>> Suppose I start with the following matrices:
>>>>>>>
>>>>>>> # 'counts' is the Rsem filtered counts
>>>>>>>
>>>>>>> counts[1:4,]
>>>>>>>
>>>>>>>>
>>>>>>>> h0 h1 h2 n0 n1 n2
>>>>>>>>
>>>>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>>>>
>>>>>>> # 'geneLength' is the species-specific gene lengths, for
species 'h'
>>>>>>> and
>>>>>>> 'n':
>>>>>>>
>>>>>>> geneLength[1:3,]
>>>>>>>
>>>>>>>>
>>>>>>>> h0.length h1.length h2.length n0.length
>>>>>>>> n1.length
>>>>>>>>
>>>>>>> n2.length
>>>>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>>>>> 1303
>>>>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>>>>> 3080
>>>>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>>>>> 1121
>>>>>>>
>>>>>>>
>>>>>>> does the following code look correct, and may allow allows
across
>>>>>>> species
>>>>>>> analysis ?:
>>>>>>> (technically it works, I checked)
>>>>>>>
>>>>>>> # quantile normalization: (as in here:
>>>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>>>>> digital+gene+expression
>>>>>>> )
>>>>>>>
>>>>>>> x1 = counts*1000/geneLength
>>>>>>> library(limma)
>>>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>>>>> fit <- glmFit(y,design,offset=offset)
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance for any help..,
>>>>>>> Assaf
>>>>>>>
>>>>>>>
>>>>>> ____________________________________________________________
>>>> __________
>>>> The information in this email is confidential and intended solely
for
>>>> the
>>>> addressee.
>>>> You must not disclose, forward, print or use it without the
permission
>>>> of
>>>> the sender.
>>>>
______________________________________________________________________
>>>>
>>>>
>>>
>>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.
>> science.biology.informatics.conductor
>>
>
[[alternative HTML version deleted]]
The code should have been:
offset <- expandAsMatrix(getOffset(y),dim(y))
offset <- offset + gl
This should give same result as your code.
rpkm() corrects for gene length as well as library size -- that's the
whole purpose of RPKMs:
rpkm(y, gene.length=geneLength)
should work for you without any modification.
Gordon
On Wed, 27 Aug 2014, assaf www wrote:
> This is very helpful for me, thanks.
>
> A slight change that I made in the code you sent, to avoid some R
erros, is
>
> # to replace:
> offset = offset + gl
> # with:
> offset = sweep(gl,2,offset,"+")
>
> In addition to differential expression tests, I wanted also to
extract
> FPKMs values (and/or normalized CPM values), that would take into
account
> all components of the offset (which if I'm not mistaken are:
library_size +
> TMM + gene_size).
> I assume rpkm()/cpm() should correct only for library_size + TMM.
> Is there a possibly "decent" solution for that ?
>
> all the best, and thanks,
> Assaf
>
>
>
> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>
>> It works something like this:
>>
>> library(edgeR)
>> y <- DGEList(counts=counts)
>> y <- calcNormFactors(y)
>>
>> # Column correct log gene lengths
>> # Columns of gl should add to zero
>>
>> gl <- log(geneLength)
>> gl <- t(t(gl)-colMeans(gl))
>>
>> # Combine library sizes, norm factors and gene lengths:
>>
>> offset <- expandAsMatrix(getOffset(y))
>> offset <- offset + gl
>>
>> Then
>>
>> y$offset <- offset
>> y <- estimateGLMCommonDisp(y,design)
>>
>> etc.
>>
>> Note that I have not tried this myself. It should work in
principle from
>> a differential expression point of view.
>>
>> On the other hand, there may be side effects regarding dispersion
trend
>> estimation -- I do not have time to explore this.
>>
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Wed, 27 Aug 2014, assaf www wrote:
>>
>> Probably wrong, but the reason I thought of using quantile
normalization
>>> is
>>> the need to correct both for the species-length, and library size.
>>>
>>>
>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>> wrote:
>>>
>>> That doesn't look helpful to me. I suggested that you
incorporate gene
>>>> lengths into the offsets, not do quantile normalization of cpms.
>>>>
>>>> Sorry, I just don't have time to develop a code example for you.
I hope
>>>> someone else will help.
>>>>
>>>> The whole topic of interspecies differential expression is a can
of
>>>> worms.
>>>> Even if you adjust for gene length, there will still be
differences in gc
>>>> content and mappability between the species.
>>>>
>>>> Gordon
>>>>
>>>>
>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>
>>>> Dear Gordon thanks,
>>>>
>>>>>
>>>>> Suppose I start with the following matrices:
>>>>>
>>>>> # 'counts' is the Rsem filtered counts
>>>>>
>>>>> counts[1:4,]
>>>>>>
>>>>>> h0 h1 h2 n0 n1 n2
>>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>>
>>>>> # 'geneLength' is the species-specific gene lengths, for species
'h' and
>>>>> 'n':
>>>>>
>>>>> geneLength[1:3,]
>>>>>>
>>>>>> h0.length h1.length h2.length n0.length
n1.length
>>>>> n2.length
>>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>>> 1303
>>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>>> 3080
>>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>>> 1121
>>>>>
>>>>>
>>>>> does the following code look correct, and may allow allows
across
>>>>> species
>>>>> analysis ?:
>>>>> (technically it works, I checked)
>>>>>
>>>>> # quantile normalization: (as in here:
>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>>> digital+gene+expression
>>>>> )
>>>>>
>>>>> x1 = counts*1000/geneLength
>>>>> library(limma)
>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>>
>>>>> ...
>>>>>
>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>>> fit <- glmFit(y,design,offset=offset)
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>> Thanks in advance for any help..,
>>>>> Assaf
>>>>>
>>>>
>>
______________________________________________________________________
>> The information in this email is confidential and intended solely
for the
>> addressee.
>> You must not disclose, forward, print or use it without the
permission of
>> the sender.
>>
______________________________________________________________________
>>
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
I checked, it's true, the results look the same.
As for FPKM, of course you are right.
Thanks a lot
Assaf
On Thu, Aug 28, 2014 at 2:47 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
> The code should have been:
>
> offset <- expandAsMatrix(getOffset(y),dim(y))
> offset <- offset + gl
>
> This should give same result as your code.
>
> rpkm() corrects for gene length as well as library size -- that's
the
> whole purpose of RPKMs:
>
> rpkm(y, gene.length=geneLength)
>
> should work for you without any modification.
>
> Gordon
>
>
> On Wed, 27 Aug 2014, assaf www wrote:
>
> This is very helpful for me, thanks.
>>
>> A slight change that I made in the code you sent, to avoid some R
erros,
>> is
>>
>> # to replace:
>> offset = offset + gl
>> # with:
>> offset = sweep(gl,2,offset,"+")
>>
>> In addition to differential expression tests, I wanted also to
extract
>> FPKMs values (and/or normalized CPM values), that would take into
account
>> all components of the offset (which if I'm not mistaken are:
library_size
>> +
>> TMM + gene_size).
>> I assume rpkm()/cpm() should correct only for library_size + TMM.
>> Is there a possibly "decent" solution for that ?
>>
>> all the best, and thanks,
>> Assaf
>>
>>
>>
>> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>> wrote:
>>
>> It works something like this:
>>>
>>> library(edgeR)
>>> y <- DGEList(counts=counts)
>>> y <- calcNormFactors(y)
>>>
>>> # Column correct log gene lengths
>>> # Columns of gl should add to zero
>>>
>>> gl <- log(geneLength)
>>> gl <- t(t(gl)-colMeans(gl))
>>>
>>> # Combine library sizes, norm factors and gene lengths:
>>>
>>> offset <- expandAsMatrix(getOffset(y))
>>> offset <- offset + gl
>>>
>>> Then
>>>
>>> y$offset <- offset
>>> y <- estimateGLMCommonDisp(y,design)
>>>
>>> etc.
>>>
>>> Note that I have not tried this myself. It should work in
principle from
>>> a differential expression point of view.
>>>
>>> On the other hand, there may be side effects regarding dispersion
trend
>>> estimation -- I do not have time to explore this.
>>>
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> http://www.statsci.org/smyth
>>>
>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>
>>> Probably wrong, but the reason I thought of using quantile
normalization
>>>
>>>> is
>>>> the need to correct both for the species-length, and library
size.
>>>>
>>>>
>>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
>>>> wrote:
>>>>
>>>> That doesn't look helpful to me. I suggested that you
incorporate gene
>>>>
>>>>> lengths into the offsets, not do quantile normalization of cpms.
>>>>>
>>>>> Sorry, I just don't have time to develop a code example for you.
I
>>>>> hope
>>>>> someone else will help.
>>>>>
>>>>> The whole topic of interspecies differential expression is a can
of
>>>>> worms.
>>>>> Even if you adjust for gene length, there will still be
differences in
>>>>> gc
>>>>> content and mappability between the species.
>>>>>
>>>>> Gordon
>>>>>
>>>>>
>>>>> On Wed, 27 Aug 2014, assaf www wrote:
>>>>>
>>>>> Dear Gordon thanks,
>>>>>
>>>>>
>>>>>> Suppose I start with the following matrices:
>>>>>>
>>>>>> # 'counts' is the Rsem filtered counts
>>>>>>
>>>>>> counts[1:4,]
>>>>>>
>>>>>>>
>>>>>>> h0 h1 h2 n0 n1 n2
>>>>>>>
>>>>>> ENSRNOG00000000021 36 17 20 10 25 38
>>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>>>>>> ENSRNOG00000000028 26 12 11 18 23 28
>>>>>> ENSRNOG00000000029 22 13 12 16 17 15
>>>>>>
>>>>>> # 'geneLength' is the species-specific gene lengths, for
species 'h'
>>>>>> and
>>>>>> 'n':
>>>>>>
>>>>>> geneLength[1:3,]
>>>>>>
>>>>>>>
>>>>>>> h0.length h1.length h2.length n0.length
n1.length
>>>>>>>
>>>>>> n2.length
>>>>>> ENSRNOG00000000021 1200 1200 1200 1303
1303
>>>>>> 1303
>>>>>> ENSRNOG00000000024 1050 1050 1050 3080
3080
>>>>>> 3080
>>>>>> ENSRNOG00000000028 1047 1047 1047 1121
1121
>>>>>> 1121
>>>>>>
>>>>>>
>>>>>> does the following code look correct, and may allow allows
across
>>>>>> species
>>>>>> analysis ?:
>>>>>> (technically it works, I checked)
>>>>>>
>>>>>> # quantile normalization: (as in here:
>>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>>>>>> digital+gene+expression
>>>>>> )
>>>>>>
>>>>>> x1 = counts*1000/geneLength
>>>>>> library(limma)
>>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>>>>>> offset = log(counts+0.1)-log(x2+0.1)
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>>>>>> fit <- glmFit(y,design,offset=offset)
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> Thanks in advance for any help..,
>>>>>> Assaf
>>>>>>
>>>>>>
>>>>> ____________________________________________________________
>>> __________
>>> The information in this email is confidential and intended solely
for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the
permission of
>>> the sender.
>>>
______________________________________________________________________
>>>
>>>
>>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:10}}