Dear Shona,
These all give the same result as my suggestion. It doesn't matter
how
you express the three ages, as long as they are equally spaced, you
will
get the same differential expression results. Not just similar, but
exactly the same p-values. The coefficient logFC will change
depending on
how you scale the age. If you multiply the ages by 2, the logFC will
divide by 2. Etc.
Best wishes
Gordon
On Fri, 16 Dec 2011, Wood, Shona wrote:
Dear Gordon,
Thank you for your suggestions I think I may have a solution though I
would like to run it by you and anyone else who uses R. I was reading
a
thread on the mailing list called: "[R] Behavior of ordered factors in
glm" and it seems this person has different ages they want to do a
linear
analysis on (not in edgeR just R). Looking at there solution I have
come
up with the following:
> library(edgeR)
> library(limma)
> targets<-read.delim (file="coding_targets.txt")
> targets$age<-C(targets$age,poly,1)
> attr(targets$age,"contrasts")
.L
a-young -7.071068e-01
b-middle -7.850462e-17
c-old 7.071068e-01
> targets$sample<-factor(targets$sample)
> targets
X files age sample
1 p1 p1_coding_counts.txt a-young 1
2 p2 p2_coding_counts.txt a-young 2
3 p3 p3_coding_counts.txt a-young 3
4 p7 p7_coding_counts.txt b-middle 7
5 p8 p8_coding_counts.txt b-middle 8
6 p9 p9_coding_counts.txt b-middle 9
7 p4 p4_coding_counts.txt c-old 4
8 p5 p5_coding_counts.txt c-old 5
9 p6 p6_coding_counts.txt c-old 6
> d<-readDGE(targets, skip=1, comment.char='#')
> colnames(d)<-row.names(targets)
> d<- calcNormFactors(d)
> d
An object of class "DGEList"
$samples
X files age sample lib.size norm.factors
1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724
2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573
3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044
4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660
5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893
6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573
7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553
8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109
9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077
$counts
1 2 3 4 5 6 7 8 9
ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960
ENSRNOG00000000008 0 0 0 0 1 0 0 1 0
ENSRNOG00000000009 0 0 0 3 0 0 1 0 0
ENSRNOG00000000010 38 405 50 18 105 405 372 42 282
ENSRNOG00000000012 0 0 0 0 0 0 0 0 0
22932 more rows ...
> d<-d[rowSums(d$counts)>9,]
> design<- model.matrix(~ age, data = targets)
> design
(Intercept) age.L
1 1 -7.071068e-01
2 1 -7.071068e-01
3 1 -7.071068e-01
4 1 -7.850462e-17
5 1 -7.850462e-17
6 1 -7.850462e-17
7 1 7.071068e-01
8 1 7.071068e-01
9 1 7.071068e-01
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$age
.L
a-young -7.071068e-01
b-middle -7.850462e-17
c-old 7.071068e-01
> d<-estimateGLMCommonDisp(d, design)
> glmfit<- glmFit(d, design, dispersion=d$common.dispersion)
> results<- glmLRT(d, glmfit, coef=c(2))
> topTags(results)
Coefficient: age.L
logConc logFC LR
P.Value FDR
ENSRNOG000000xxxxx -11.271529 -1.9336775 55.45465 9.564007e-14
1.453825e-09
ENSRNOG000000xxxxx -10.359443 1.2661223 49.11802 2.410161e-12
1.831843e-08
ENSRNOG000000xxxxx -11.317494 -1.5480925 47.35359 5.926938e-12
3.003180e-08
Please correct me if I am wrong but I think that this is the change in
gene expression with increasing age taking into account expression at
middle age. I did also have another way of doing it which involved
changing age to a number as suggested by you:
targets$agenum<-as.numeric (targets$age)
targets$agenum<-targets$agenum-1
> design<- model.matrix(~ agenum, data = targets)
> design
(Intercept) agenum
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 1 1
7 1 2
8 1 2
9 1 2
attr(,"assign")
[1] 0 1
And this gives me similar results. Please let me know if you think
this is
doing what I hope it is i.e. giving my the linear change in gene
expression with increasing age whilst taking into account middle age.
Many thanks
Shona
________________________________________
From: Gordon K Smyth [smyth@wehi.EDU.AU]
Sent: 15 December 2011 22:43
To: Wood, Shona
Cc: Bioconductor mailing list
Subject: Re: EdgeR GLM
Dear Shona,
One last suggestion. Try typing
contrasts(age)
This tells you how the coefficients are determined.
I would actually suggest you simply put age in as a numericl column
(1=young,2=middle,3=old) into your design matrix, instead of defining
an
ordinal factor.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Fri, 16 Dec 2011, Gordon K Smyth wrote:
> Dear Shona,
>
> Of course linear regression would use all the data points if you had
50 ages,
> but you have in fact only three.
>
> If there are an odd number of equally spaced ages, the middle one
will not
> contribute to the coefficient. That's just how the math works.
>
> Polynomial regression isn't an invention of the edgeR package, and
how it
> works in R is determined by the base functions in R, not those in
edgeR. If
> you have further questions about linear or quadratic regression
itself, it
> might be best to read separately, or else to ask for help on a more
general
> list, probably the main R help list.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
>
http://www.wehi.edu.au
>
http://www.statsci.org/smyth
>
> On Thu, 15 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me and giving me those explanations.
>
> The only thing that is still confusing me is the linear coefficient.
If it
> compares young and old without regard to middle isn't it basically
the same
> as a pairwise comparison? Also it seems strange that it only uses
the two
> extremities - I thought linear regression should use all the data
points. For
> example if we had 50 different ages would 48 data points be excluded
to
> compare the first and last only?
>
> Thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 13 December 2011 23:18
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: RE: EdgeR GLM
>
> Dear Shona,
>
> I can only give you brief pointers.
>
> There are only three time points in your data. The linear
coefficient in
> the model simply compares old with young, without regard to middle.
The
> quadratic term compares middle with the average of young and old.
So if
> expression goes 2-4-2 for a particular gene over the three times
then you
> get a positive quadratic coefficient and zero linear coefficient.
>
> It is perfectly possible for the likelihood ratio test of both
linear
> and quadratic together:
>
> glmLRT(d, glmfit, coef=2:3)
>
> to detect some genes not in the separate linear or quadratic lists,
> because the linear or quadratic terms might be not quite significant
> separately, but might be so when considered together. For example,
the
> expression might go 2-4-3. Is this linear or quadratic? Not quite
sure.
> But we might be confident that the three times are not all equal.
>
> If you only want to report genes that have increasing expression
with age,
> then I would have thought you want to fit the linear term only,
since the
> quadratic effect can never be guaranteed to be increasing.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
>
http://www.wehi.edu.au
>
http://www.statsci.org/smyth
>
> On Tue, 13 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me. I am not a statistician so forgive
me if
> I am asking basic questions. I was just reading about quadratic
equations.
> Does a significant hit in the quadratic term mean that the rate of
change
> first increases and then decreases over time (or visa versa)?
Therefore
> hits which require the quadratic term are not a linear up or down
with age
> they are only "linear" after middle age?
>
> If this is the case what does the logFC for age.Q represent, is it
the
> fold change between middle age and old? And how is it calculated?
Does the
> gene expression at young age affect the quadratic calculation? For
example
> if the expression at young age is 2 and then 4 at middle and then
1.99 at
> old would the fact that old age is basically back to the young
expression
> level stop it being a significant quadratic result or would it be
> significant because of the change from middle age.
>
> I have done the quadratic term separately as suggested but I also
did the
> linear term separately (glmLRT(d, glmfit, coef=2)). When I compare
the
> results of both these tests to glmLRT(d, glmfit, coef=2, 3) there
are some
> significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in
> either of the separate quadratic or linear tests - why would this
be?
>
> Finally if i want to report changes with increasing age would it be
best
> to present the results from the quadratic and linear tests
separately or
> would it be better to present the combined test (glmLRT(d, glmfit,
coef=2,
> 3)) and point out which ones rely on the quadratic term?
>
> Many thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 10 December 2011 07:10
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: EdgeR GLM
>
> Dear Shona,
>
> Yes, age.L is the log2-fold-change for increasing age.
>
> However your topTags table is testing for the linear and quadratic
terms
> together, so it is testing the significance of any differences
between the
> three age groups. I'd suggest you test whether the quadratic term
is
> required using
>
> glmLRT(d, glmfit, coef=3)
>
> I'd also generally recommend you go on to estimate the trended and
tagwise
> dispersions, not just the common dispersion.
>
> Best wishes
> Gordon
>
>> Date: Thu, 8 Dec 2011 17:41:07 +0000
>> From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk="">
>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch="">
>> Subject: [BioC] EdgeR GLM
>>
>> Hi,
>>
>> I am trying to look at gene expression changes with increasing age.
I
>> have three time points young, middle age and old. I have been using
the
>> following, ordered factors and coefficient in order to get a fold
change
>> and FDR for a linear change in gene expression with increasing age.
Can
>> you look over my code and see if I have done this right? Is age.L
the
>> fold change in gene expression with increasing age?
>>
>>> library(edgeR)
>>> library(limma)
>>> targets<-read.delim (file="coding_targets.txt")
>>> targets$age<-factor(ordered (targets$age))
>>> targets$sample<-factor(targets$sample)
>>> targets
>> X files age sample
>> 1 p1 p1_coding_counts.txt a-young 1
>> 2 p2 p2_coding_counts.txt a-young 2
>> 3 p3 p3_coding_counts.txt a-young 3
>> 4 p7 p7_coding_counts.txt b-middle 7
>> 5 p8 p8_coding_counts.txt b-middle 8
>> 6 p9 p9_coding_counts.txt b-middle 9
>> 7 p4 p4_coding_counts.txt c-old 4
>> 8 p5 p5_coding_counts.txt c-old 5
>> 9 p6 p6_coding_counts.txt c-old 6
>>
>>> d<-readDGE(targets, skip=1, comment.char='#')
>>> colnames(d)<-row.names(targets)
>>> d<- calcNormFactors(d)
>>> d
>> An object of class "DGEList"
>> $samples
>> X files age sample lib.size norm.factors
>> 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724
>> 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573
>> 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044
>> 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660
>> 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893
>> 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573
>> 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553
>> 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109
>> 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077
>>
>> $counts
>> 1 2 3 4 5 6 7 8 9
>> ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960
>> ENSRNOG00000000008 0 0 0 0 1 0 0 1 0
>> ENSRNOG00000000009 0 0 0 3 0 0 1 0 0
>> ENSRNOG00000000010 38 405 50 18 105 405 372 42 282
>> ENSRNOG00000000012 0 0 0 0 0 0 0 0 0
>> 22932 more rows ...
>>
>>> d<-d[rowSums(d$counts)>9,]
>>> design<- model.matrix(~ age, data = targets)
>>> design
>> (Intercept) age.L age.Q
>> 1 1 -7.071068e-01 0.4082483
>> 2 1 -7.071068e-01 0.4082483
>> 3 1 -7.071068e-01 0.4082483
>> 4 1 -7.850462e-17 -0.8164966
>> 5 1 -7.850462e-17 -0.8164966
>> 6 1 -7.850462e-17 -0.8164966
>> 7 1 7.071068e-01 0.4082483
>> 8 1 7.071068e-01 0.4082483
>> 9 1 7.071068e-01 0.4082483
>> attr(,"assign")
>> [1] 0 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$age
>> [1] "contr.poly"
>>
>>> d<-estimateGLMCommonDisp(d, design)
>>> names(d)
>> [1] "samples" "counts" "common.dispersion"
>>> glmfit<- glmFit(d, design, dispersion=d$common.dispersion)
>>> results<- glmLRT(d, glmfit, coef=c(2,3))
>>> topTags(results)
>> Coefficient: age.L age.Q
>> logConc age.L age.Q LR
P.Value
>> ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641
1.638591e-24
>>
>> Many Thanks
>>
>> Shona
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:9}}