edgeR design matrix

0

Entering edit mode

lpascual ▴ 50

@lpascual-4906

Last seen 9.6 years ago

Hi all, I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype. I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix. To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar. To do that, it is possible to use one design matrix like these: Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8 1-rep1 1 0 0 0 0 0 0 0 1-rep2 1 0 0 0 0 0 0 0 2-rep1 0 1 0 0 0 0 0 0 2-rep2 0 1 0 0 0 0 0 0 3-rep1 0 0 1 0 0 0 0 0 3-rep2 0 0 1 0 0 0 0 0 4-rep1 0 0 0 1 0 0 0 0 4-rep2 0 0 0 1 0 0 0 0 5-rep1 0 0 0 0 1 0 0 0 5-rep2 0 0 0 0 1 0 0 0 6-rep1 0 0 0 0 0 1 0 0 6-rep2 0 0 0 0 0 1 0 0 7-rep1 0 0 0 0 0 0 1 0 7-rep2 0 0 0 0 0 0 1 0 8-rep1 0 0 0 0 0 0 0 1 8-rep2 0 0 0 0 0 0 0 1 I have try it, but to calculate the null hypothesis the gmlLRT seams just to remove the last column?? Besides I also want to test the differences between two parent lines and the F1 that I have derived from them I have try a design matrix as follows, Cultivar1 Cultivar2 F1 1-rep1 1 0 0 1-rep2 1 0 0 2-rep1 0 1 0 2-rep2 0 1 0 F1-rep1 0 0 1 F1-rep2 0 0 1 But I again face the above mentioned problem. Besides I was wondering if I can employ a design matrix where I will be modeling how many alleles of each cultivar have my sample, doing something like these: Cultivar1 Cultivar2 1-rep1 2 0 1-rep2 2 0 2-rep1 0 2 2-rep2 0 2 F1-rep1 0 1 F1-rep2 0 1 I will appreciate any advice about the proper way to face these analysis. An just an extra question, in the manual it is recommended to eliminate all the tags with low counts, as they are no going to be sadistically significance and they will slow the process. First it is said that the tags with less than 5 counts are not useful, but then all the tags with lees than one tag per million (TPM) are removed, however in my case having 50 million sequences that is all the tags with less than 50 counts. In fact when I do it my tags decrease a lot > dim(d) [1] 2029802 16 > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, dim(d)) >1) >= 2, ] > dim (d) [1] 73310 16 That is normal? Why did you choose to remove less than 1TPM? Thanks in advance Laura Pascual [[alternative HTML version deleted]]

PROcess edgeR PROcess edgeR • 1.6k views

ADD COMMENT • link updated 12.5 years ago by Gordon Smyth 50k • written 12.5 years ago by lpascual ▴ 50

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

Dear Laura, > Date: Wed, 12 Oct 2011 14:23:26 +0200 > From: lpascual <laura.pascual at="" avignon.inra.fr=""> > To: Bioconductor at r-project.org > Subject: [BioC] edgeR design matrix > > Hi all, > > I am new with RNA-seq data analysis. I have data from illumina DGE > experiments from 8 different cultivars (duplicates). I also have 4 extra > samples, the F1 hybrids (duplicates), obtained from crossing the 8 > cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE > illumina reactions, where all the samples came from the same > developmental stage and treatment. So the only different factor I have > is the genotype. > > I am trying to analyze the data with the edgeR package, however I am not > sure if I can use the GML function for multiple testings or how should I > made my design matrix. > > To find the differences among the 8 parental lines, I know I can made > two ways testing between the different samples, but I was wondering, if > I could be able to do a test were the null hypothesis will be there is > no difference between the cultivars, and the alternative hypothesis, at > least there is difference for one cultivar. > > To do that, it is possible to use one design matrix like these: > > Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8 > 1-rep1 1 0 0 0 0 0 0 0 > 1-rep2 1 0 0 0 0 0 0 0 > 2-rep1 0 1 0 0 0 0 0 0 > 2-rep2 0 1 0 0 0 0 0 0 > 3-rep1 0 0 1 0 0 0 0 0 > 3-rep2 0 0 1 0 0 0 0 0 > 4-rep1 0 0 0 1 0 0 0 0 > 4-rep2 0 0 0 1 0 0 0 0 > 5-rep1 0 0 0 0 1 0 0 0 > 5-rep2 0 0 0 0 1 0 0 0 > 6-rep1 0 0 0 0 0 1 0 0 > 6-rep2 0 0 0 0 0 1 0 0 > 7-rep1 0 0 0 0 0 0 1 0 > 7-rep2 0 0 0 0 0 0 1 0 > 8-rep1 0 0 0 0 0 0 0 1 > 8-rep2 0 0 0 0 0 0 0 1 > > I have try it, but to calculate the null hypothesis the gmlLRT seams > just to remove the last column?? You could use that design matrix, but it's inconvenient. If you want to find transcripts that are different between any of the cultivars, you would use: design <- model.matrix(~Cultivar) fit <- glmFit(y,design) lrt <- glmLRT(y,fit,design,coef=2:8) topTags(lrt) This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars. Of course you have to estimate the dispersions as well, but I assume you know how to do that. > Besides I also want to test the differences between two parent lines and > the F1 that I have derived from them I have try a design matrix as follows, > > > Cultivar1 Cultivar2 F1 > 1-rep1 1 0 0 > 1-rep2 1 0 0 > 2-rep1 0 1 0 > 2-rep2 0 1 0 > F1-rep1 0 0 1 > F1-rep2 0 0 1 > > But I again face the above mentioned problem. The above example should give you a clue as to how to do this. > Besides I was wondering if I can employ a design matrix where I will be > modeling how many alleles of each cultivar have my sample, doing > something like these: > > Cultivar1 Cultivar2 > 1-rep1 2 0 > 1-rep2 2 0 > 2-rep1 0 2 > 2-rep2 0 2 > F1-rep1 0 1 > F1-rep2 0 1 > > I will appreciate any advice about the proper way to face these analysis. You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices. > An just an extra question, in the manual it is recommended to eliminate > all the tags with low counts, as they are no going to be sadistically > significance and they will slow the process. First it is said that the > tags with less than 5 counts are not useful, but then all the tags with > lees than one tag per million (TPM) are removed, however in my case > having 50 million sequences that is all the tags with less than 50 counts. > > In fact when I do it my tags decrease a lot > > > dim(d) > [1] 2029802 16 > > > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, > dim(d)) >1) >= 2, ] You can just use cpm(d) to compute the counts-per-million. > > dim (d) > > [1] 73310 16 > > That is normal? Why did you choose to remove less than 1TPM? It depends on your library size. A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff. Best wishes Gordon > Thanks in advance > Laura Pascual ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.5 years ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon, I try it like you said, but I get a mistake, I do not know how to solve it?? Thanks in advance Laura > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) + ) > design <-model.matrix(~cultivar) > design (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7 1 1 0 0 0 0 0 0 2 1 0 0 0 0 0 0 3 1 1 0 0 0 0 0 4 1 1 0 0 0 0 0 5 1 0 1 0 0 0 0 6 1 0 1 0 0 0 0 7 1 0 0 1 0 0 0 8 1 0 0 1 0 0 0 9 1 0 0 0 1 0 0 10 1 0 0 0 1 0 0 11 1 0 0 0 0 1 0 12 1 0 0 0 0 1 0 13 1 0 0 0 0 0 1 14 1 0 0 0 0 0 1 15 1 0 0 0 0 0 0 16 1 0 0 0 0 0 0 cultivar8 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 1 16 1 attr(,"assign") [1] 0 1 1 1 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$cultivar [1] "contr.treatment" > d <- estimateGLMCommonDisp(d, design) > d An object of class "DGEList" $samples files group description lib.size norm.factors 1 s_1_CE_1_1counts.txt CER Cervil_CE 54428965 1 2 s_2_CE_1_2counts.txt CER Cervil_CE 47946147 1 3 s_6_CE_6_1counts.txt CRI Criollo_CE 49025258 1 4 s_7_CE_6_2counts.txt CRI Criollo_CE 43322654 1 5 s_1_CE_10_1counts.txt FER Ferum_CE 53574521 1 11 more rows ... $counts 1 2 3 4 5 6 7 8 9 10 11 12 13 14 GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273 97 208 256 345 GATCTCTGTCTCATCTTTCG 122 87 113 131 210 332 384 399 221 309 138 127 329 272 GATCTTCTTTTGAAGTAAAC 116 141 125 52 158 248 170 141 137 135 166 196 111 108 GATCAGCCAGGTCCAAGGCC 56 15 41 44 36 47 40 39 80 56 26 46 31 46 GATCCTCCGGAACCACAAGA 370 226 11 4 12 0 23 5 129 1 38 22 5 9 15 16 GATCAAGAAAATAAAATGAA 154 69 GATCTCTGTCTCATCTTTCG 112 94 GATCTTCTTTTGAAGTAAAC 134 74 GATCAGCCAGGTCCAAGGCC 81 66 GATCCTCCGGAACCACAAGA 151 7 73310 more rows ... $common.dispersion [1] 0.1259328 > fit <- glmFit(d,design,dispersion = d$common.dispersion) > lrt <- glmLRT(d,fit,design,coef=2:8) Error in glmfit$coefficients %*% contrast : non-conformable arguments On 10/14/2011 07:00 AM, Gordon K Smyth wrote: > Dear Laura, > >> Date: Wed, 12 Oct 2011 14:23:26 +0200 >> From: lpascual <laura.pascual at="" avignon.inra.fr=""> >> To: Bioconductor at r-project.org >> Subject: [BioC] edgeR design matrix >> >> Hi all, >> >> I am new with RNA-seq data analysis. I have data from illumina DGE >> experiments from 8 different cultivars (duplicates). I also have 4 >> extra samples, the F1 hybrids (duplicates), obtained from crossing >> the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 >> DGE illumina reactions, where all the samples came from the same >> developmental stage and treatment. So the only different factor I >> have is the genotype. >> >> I am trying to analyze the data with the edgeR package, however I am >> not sure if I can use the GML function for multiple testings or how >> should I made my design matrix. >> >> To find the differences among the 8 parental lines, I know I can made >> two ways testing between the different samples, but I was wondering, >> if I could be able to do a test were the null hypothesis will be >> there is no difference between the cultivars, and the alternative >> hypothesis, at least there is difference for one cultivar. >> >> To do that, it is possible to use one design matrix like these: >> >> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 >> Cultivar8 >> 1-rep1 1 0 0 0 0 0 0 0 >> 1-rep2 1 0 0 0 0 0 0 0 >> 2-rep1 0 1 0 0 0 0 0 0 >> 2-rep2 0 1 0 0 0 0 0 0 >> 3-rep1 0 0 1 0 0 0 0 0 >> 3-rep2 0 0 1 0 0 0 0 0 >> 4-rep1 0 0 0 1 0 0 0 0 >> 4-rep2 0 0 0 1 0 0 0 0 >> 5-rep1 0 0 0 0 1 0 0 0 >> 5-rep2 0 0 0 0 1 0 0 0 >> 6-rep1 0 0 0 0 0 1 0 0 >> 6-rep2 0 0 0 0 0 1 0 0 >> 7-rep1 0 0 0 0 0 0 1 0 >> 7-rep2 0 0 0 0 0 0 1 0 >> 8-rep1 0 0 0 0 0 0 0 1 >> 8-rep2 0 0 0 0 0 0 0 1 >> >> I have try it, but to calculate the null hypothesis the gmlLRT seams >> just to remove the last column?? > > You could use that design matrix, but it's inconvenient. If you want > to find transcripts that are different between any of the cultivars, > you would use: > > design <- model.matrix(~Cultivar) > fit <- glmFit(y,design) > lrt <- glmLRT(y,fit,design,coef=2:8) > topTags(lrt) > > This gives you a test on 7 degrees of freedom (one for each > coefficient being tested) for differences between the 8 cultivars. > > Of course you have to estimate the dispersions as well, but I assume > you know how to do that. > > >> Besides I also want to test the differences between two parent lines and >> the F1 that I have derived from them I have try a design matrix as >> follows, >> >> >> Cultivar1 Cultivar2 F1 >> 1-rep1 1 0 0 >> 1-rep2 1 0 0 >> 2-rep1 0 1 0 >> 2-rep2 0 1 0 >> F1-rep1 0 0 1 >> F1-rep2 0 0 1 >> >> But I again face the above mentioned problem. > > The above example should give you a clue as to how to do this. > > >> Besides I was wondering if I can employ a design matrix where I will be >> modeling how many alleles of each cultivar have my sample, doing >> something like these: >> >> Cultivar1 Cultivar2 >> 1-rep1 2 0 >> 1-rep2 2 0 >> 2-rep1 0 2 >> 2-rep2 0 2 >> F1-rep1 0 1 >> F1-rep2 0 1 >> >> I will appreciate any advice about the proper way to face these >> analysis. > > You can put a covariate for allele number as a column in your design > matrix, but you obviously need a column for the intercept term as > well. Basically you need to start using the model.matrix() function to > build your design matrices. > > >> An just an extra question, in the manual it is recommended to eliminate >> all the tags with low counts, as they are no going to be sadistically >> significance and they will slow the process. First it is said that the >> tags with less than 5 counts are not useful, but then all the tags with >> lees than one tag per million (TPM) are removed, however in my case >> having 50 million sequences that is all the tags with less than 50 >> counts. >> >> In fact when I do it my tags decrease a lot >> >> > dim(d) >> [1] 2029802 16 >> >> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, >> dim(d)) >1) >= 2, ] > > You can just use cpm(d) to compute the counts-per-million. > >> > dim (d) >> >> [1] 73310 16 >> >> That is normal? Why did you choose to remove less than 1TPM? > > It depends on your library size. A cutoff of 1 cpm was appropriate > for the case study, but if you library sizes are smaller, then you > will need to use a smaller cutoff. > > Best wishes > Gordon > >> Thanks in advance >> Laura Pascual > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}

ADD REPLY • link 12.5 years ago lpascual ▴ 50

0

Entering edit mode

Hi Laura, I think there was a minor typo there. The glmLRT() call should be: lrt <- glmLRT(d,fit,coef=2:8) Also, if you wish, you could create your factor with less keystrokes: cultivar <- factor(rep(1:8,each=2)) Hope that helps, Mark On 14.10.2011, at 17:13, lpascual wrote: > Hi Gordon, > > I try it like you said, but I get a mistake, I do not know how to solve it?? > > Thanks in advance > > Laura > > > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) > + ) > > design <-model.matrix(~cultivar) > > design > (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7 > 1 1 0 0 0 0 0 0 > 2 1 0 0 0 0 0 0 > 3 1 1 0 0 0 0 0 > 4 1 1 0 0 0 0 0 > 5 1 0 1 0 0 0 0 > 6 1 0 1 0 0 0 0 > 7 1 0 0 1 0 0 0 > 8 1 0 0 1 0 0 0 > 9 1 0 0 0 1 0 0 > 10 1 0 0 0 1 0 0 > 11 1 0 0 0 0 1 0 > 12 1 0 0 0 0 1 0 > 13 1 0 0 0 0 0 1 > 14 1 0 0 0 0 0 1 > 15 1 0 0 0 0 0 0 > 16 1 0 0 0 0 0 0 > cultivar8 > 1 0 > 2 0 > 3 0 > 4 0 > 5 0 > 6 0 > 7 0 > 8 0 > 9 0 > 10 0 > 11 0 > 12 0 > 13 0 > 14 0 > 15 1 > 16 1 > attr(,"assign") > [1] 0 1 1 1 1 1 1 1 > attr(,"contrasts") > attr(,"contrasts")$cultivar > [1] "contr.treatment" > > > d <- estimateGLMCommonDisp(d, design) > > d > An object of class "DGEList" > $samples > files group description lib.size norm.factors > 1 s_1_CE_1_1counts.txt CER Cervil_CE 54428965 1 > 2 s_2_CE_1_2counts.txt CER Cervil_CE 47946147 1 > 3 s_6_CE_6_1counts.txt CRI Criollo_CE 49025258 1 > 4 s_7_CE_6_2counts.txt CRI Criollo_CE 43322654 1 > 5 s_1_CE_10_1counts.txt FER Ferum_CE 53574521 1 > 11 more rows ... > > $counts > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 > GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273 97 208 256 345 > GATCTCTGTCTCATCTTTCG 122 87 113 131 210 332 384 399 221 309 138 127 329 272 > GATCTTCTTTTGAAGTAAAC 116 141 125 52 158 248 170 141 137 135 166 196 111 108 > GATCAGCCAGGTCCAAGGCC 56 15 41 44 36 47 40 39 80 56 26 46 31 46 > GATCCTCCGGAACCACAAGA 370 226 11 4 12 0 23 5 129 1 38 22 5 9 > 15 16 > GATCAAGAAAATAAAATGAA 154 69 > GATCTCTGTCTCATCTTTCG 112 94 > GATCTTCTTTTGAAGTAAAC 134 74 > GATCAGCCAGGTCCAAGGCC 81 66 > GATCCTCCGGAACCACAAGA 151 7 > 73310 more rows ... > > $common.dispersion > [1] 0.1259328 > > > fit <- glmFit(d,design,dispersion = d$common.dispersion) > > lrt <- glmLRT(d,fit,design,coef=2:8) > Error in glmfit$coefficients %*% contrast : non-conformable arguments > > > > > On 10/14/2011 07:00 AM, Gordon K Smyth wrote: >> Dear Laura, >> >>> Date: Wed, 12 Oct 2011 14:23:26 +0200 >>> From: lpascual <laura.pascual at="" avignon.inra.fr=""> >>> To: Bioconductor at r-project.org >>> Subject: [BioC] edgeR design matrix >>> >>> Hi all, >>> >>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype. >>> >>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix. >>> >>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar. >>> >>> To do that, it is possible to use one design matrix like these: >>> >>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8 >>> 1-rep1 1 0 0 0 0 0 0 0 >>> 1-rep2 1 0 0 0 0 0 0 0 >>> 2-rep1 0 1 0 0 0 0 0 0 >>> 2-rep2 0 1 0 0 0 0 0 0 >>> 3-rep1 0 0 1 0 0 0 0 0 >>> 3-rep2 0 0 1 0 0 0 0 0 >>> 4-rep1 0 0 0 1 0 0 0 0 >>> 4-rep2 0 0 0 1 0 0 0 0 >>> 5-rep1 0 0 0 0 1 0 0 0 >>> 5-rep2 0 0 0 0 1 0 0 0 >>> 6-rep1 0 0 0 0 0 1 0 0 >>> 6-rep2 0 0 0 0 0 1 0 0 >>> 7-rep1 0 0 0 0 0 0 1 0 >>> 7-rep2 0 0 0 0 0 0 1 0 >>> 8-rep1 0 0 0 0 0 0 0 1 >>> 8-rep2 0 0 0 0 0 0 0 1 >>> >>> I have try it, but to calculate the null hypothesis the gmlLRT seams >>> just to remove the last column?? >> >> You could use that design matrix, but it's inconvenient. If you want to find transcripts that are different between any of the cultivars, you would use: >> >> design <- model.matrix(~Cultivar) >> fit <- glmFit(y,design) >> lrt <- glmLRT(y,fit,design,coef=2:8) >> topTags(lrt) >> >> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars. >> >> Of course you have to estimate the dispersions as well, but I assume you know how to do that. >> >> >>> Besides I also want to test the differences between two parent lines and >>> the F1 that I have derived from them I have try a design matrix as follows, >>> >>> >>> Cultivar1 Cultivar2 F1 >>> 1-rep1 1 0 0 >>> 1-rep2 1 0 0 >>> 2-rep1 0 1 0 >>> 2-rep2 0 1 0 >>> F1-rep1 0 0 1 >>> F1-rep2 0 0 1 >>> >>> But I again face the above mentioned problem. >> >> The above example should give you a clue as to how to do this. >> >> >>> Besides I was wondering if I can employ a design matrix where I will be >>> modeling how many alleles of each cultivar have my sample, doing >>> something like these: >>> >>> Cultivar1 Cultivar2 >>> 1-rep1 2 0 >>> 1-rep2 2 0 >>> 2-rep1 0 2 >>> 2-rep2 0 2 >>> F1-rep1 0 1 >>> F1-rep2 0 1 >>> >>> I will appreciate any advice about the proper way to face these analysis. >> >> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices. >> >> >>> An just an extra question, in the manual it is recommended to eliminate >>> all the tags with low counts, as they are no going to be sadistically >>> significance and they will slow the process. First it is said that the >>> tags with less than 5 counts are not useful, but then all the tags with >>> lees than one tag per million (TPM) are removed, however in my case >>> having 50 million sequences that is all the tags with less than 50 counts. >>> >>> In fact when I do it my tags decrease a lot >>> >>> > dim(d) >>> [1] 2029802 16 >>> >>> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, >>> dim(d)) >1) >= 2, ] >> >> You can just use cpm(d) to compute the counts-per-million. >> >>> > dim (d) >>> >>> [1] 73310 16 >>> >>> That is normal? Why did you choose to remove less than 1TPM? >> >> It depends on your library size. A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff. >> >> Best wishes >> Gordon >> >>> Thanks in advance >>> Laura Pascual >> >> ______________________________________________________________________ >> The information in this email is confidential and inte...{{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.5 years ago Mark Robinson ▴ 880

0

Entering edit mode

That works perfect, thanks so much Gordon and Mark On 10/14/2011 09:27 PM, Mark Robinson wrote: > Hi Laura, > > I think there was a minor typo there. The glmLRT() call should be: > > lrt<- glmLRT(d,fit,coef=2:8) > > Also, if you wish, you could create your factor with less keystrokes: > > cultivar<- factor(rep(1:8,each=2)) > > Hope that helps, > Mark > > > On 14.10.2011, at 17:13, lpascual wrote: > >> Hi Gordon, >> >> I try it like you said, but I get a mistake, I do not know how to solve it?? >> >> Thanks in advance >> >> Laura >> >>> cultivar<- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) >> + ) >>> design<-model.matrix(~cultivar) >>> design >> (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7 >> 1 1 0 0 0 0 0 0 >> 2 1 0 0 0 0 0 0 >> 3 1 1 0 0 0 0 0 >> 4 1 1 0 0 0 0 0 >> 5 1 0 1 0 0 0 0 >> 6 1 0 1 0 0 0 0 >> 7 1 0 0 1 0 0 0 >> 8 1 0 0 1 0 0 0 >> 9 1 0 0 0 1 0 0 >> 10 1 0 0 0 1 0 0 >> 11 1 0 0 0 0 1 0 >> 12 1 0 0 0 0 1 0 >> 13 1 0 0 0 0 0 1 >> 14 1 0 0 0 0 0 1 >> 15 1 0 0 0 0 0 0 >> 16 1 0 0 0 0 0 0 >> cultivar8 >> 1 0 >> 2 0 >> 3 0 >> 4 0 >> 5 0 >> 6 0 >> 7 0 >> 8 0 >> 9 0 >> 10 0 >> 11 0 >> 12 0 >> 13 0 >> 14 0 >> 15 1 >> 16 1 >> attr(,"assign") >> [1] 0 1 1 1 1 1 1 1 >> attr(,"contrasts") >> attr(,"contrasts")$cultivar >> [1] "contr.treatment" >> >>> d<- estimateGLMCommonDisp(d, design) >>> d >> An object of class "DGEList" >> $samples >> files group description lib.size norm.factors >> 1 s_1_CE_1_1counts.txt CER Cervil_CE 54428965 1 >> 2 s_2_CE_1_2counts.txt CER Cervil_CE 47946147 1 >> 3 s_6_CE_6_1counts.txt CRI Criollo_CE 49025258 1 >> 4 s_7_CE_6_2counts.txt CRI Criollo_CE 43322654 1 >> 5 s_1_CE_10_1counts.txt FER Ferum_CE 53574521 1 >> 11 more rows ... >> >> $counts >> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 >> GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273 97 208 256 345 >> GATCTCTGTCTCATCTTTCG 122 87 113 131 210 332 384 399 221 309 138 127 329 272 >> GATCTTCTTTTGAAGTAAAC 116 141 125 52 158 248 170 141 137 135 166 196 111 108 >> GATCAGCCAGGTCCAAGGCC 56 15 41 44 36 47 40 39 80 56 26 46 31 46 >> GATCCTCCGGAACCACAAGA 370 226 11 4 12 0 23 5 129 1 38 22 5 9 >> 15 16 >> GATCAAGAAAATAAAATGAA 154 69 >> GATCTCTGTCTCATCTTTCG 112 94 >> GATCTTCTTTTGAAGTAAAC 134 74 >> GATCAGCCAGGTCCAAGGCC 81 66 >> GATCCTCCGGAACCACAAGA 151 7 >> 73310 more rows ... >> >> $common.dispersion >> [1] 0.1259328 >> >>> fit<- glmFit(d,design,dispersion = d$common.dispersion) >>> lrt<- glmLRT(d,fit,design,coef=2:8) >> Error in glmfit$coefficients %*% contrast : non-conformable arguments >> >> >> >> >> On 10/14/2011 07:00 AM, Gordon K Smyth wrote: >>> Dear Laura, >>> >>>> Date: Wed, 12 Oct 2011 14:23:26 +0200 >>>> From: lpascual<laura.pascual at="" avignon.inra.fr=""> >>>> To: Bioconductor at r-project.org >>>> Subject: [BioC] edgeR design matrix >>>> >>>> Hi all, >>>> >>>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype. >>>> >>>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix. >>>> >>>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar. >>>> >>>> To do that, it is possible to use one design matrix like these: >>>> >>>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8 >>>> 1-rep1 1 0 0 0 0 0 0 0 >>>> 1-rep2 1 0 0 0 0 0 0 0 >>>> 2-rep1 0 1 0 0 0 0 0 0 >>>> 2-rep2 0 1 0 0 0 0 0 0 >>>> 3-rep1 0 0 1 0 0 0 0 0 >>>> 3-rep2 0 0 1 0 0 0 0 0 >>>> 4-rep1 0 0 0 1 0 0 0 0 >>>> 4-rep2 0 0 0 1 0 0 0 0 >>>> 5-rep1 0 0 0 0 1 0 0 0 >>>> 5-rep2 0 0 0 0 1 0 0 0 >>>> 6-rep1 0 0 0 0 0 1 0 0 >>>> 6-rep2 0 0 0 0 0 1 0 0 >>>> 7-rep1 0 0 0 0 0 0 1 0 >>>> 7-rep2 0 0 0 0 0 0 1 0 >>>> 8-rep1 0 0 0 0 0 0 0 1 >>>> 8-rep2 0 0 0 0 0 0 0 1 >>>> >>>> I have try it, but to calculate the null hypothesis the gmlLRT seams >>>> just to remove the last column?? >>> You could use that design matrix, but it's inconvenient. If you want to find transcripts that are different between any of the cultivars, you would use: >>> >>> design<- model.matrix(~Cultivar) >>> fit<- glmFit(y,design) >>> lrt<- glmLRT(y,fit,design,coef=2:8) >>> topTags(lrt) >>> >>> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars. >>> >>> Of course you have to estimate the dispersions as well, but I assume you know how to do that. >>> >>> >>>> Besides I also want to test the differences between two parent lines and >>>> the F1 that I have derived from them I have try a design matrix as follows, >>>> >>>> >>>> Cultivar1 Cultivar2 F1 >>>> 1-rep1 1 0 0 >>>> 1-rep2 1 0 0 >>>> 2-rep1 0 1 0 >>>> 2-rep2 0 1 0 >>>> F1-rep1 0 0 1 >>>> F1-rep2 0 0 1 >>>> >>>> But I again face the above mentioned problem. >>> The above example should give you a clue as to how to do this. >>> >>> >>>> Besides I was wondering if I can employ a design matrix where I will be >>>> modeling how many alleles of each cultivar have my sample, doing >>>> something like these: >>>> >>>> Cultivar1 Cultivar2 >>>> 1-rep1 2 0 >>>> 1-rep2 2 0 >>>> 2-rep1 0 2 >>>> 2-rep2 0 2 >>>> F1-rep1 0 1 >>>> F1-rep2 0 1 >>>> >>>> I will appreciate any advice about the proper way to face these analysis. >>> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices. >>> >>> >>>> An just an extra question, in the manual it is recommended to eliminate >>>> all the tags with low counts, as they are no going to be sadistically >>>> significance and they will slow the process. First it is said that the >>>> tags with less than 5 counts are not useful, but then all the tags with >>>> lees than one tag per million (TPM) are removed, however in my case >>>> having 50 million sequences that is all the tags with less than 50 counts. >>>> >>>> In fact when I do it my tags decrease a lot >>>> >>>>> dim(d) >>>> [1] 2029802 16 >>>> >>>>> d<-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, >>>> dim(d))>1)>= 2, ] >>> You can just use cpm(d) to compute the counts-per-million. >>> >>>>> dim (d) >>>> [1] 73310 16 >>>> >>>> That is normal? Why did you choose to remove less than 1TPM? >>> It depends on your library size. A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff. >>> >>> Best wishes >>> Gordon >>> >>>> Thanks in advance >>>> Laura Pascual >>> ______________________________________________________________________ >>> The information in this email is confidential and inte...{{dropped:6}} >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.5 years ago lpascual ▴ 50

Login before adding your answer.