Odd contrast; does it make statistical sense?
@ryan-c-thompson-5618
Scripps Research, La Jolla, CA
Hi all,

I'm currently using edgeR to test a somewhat odd contrast. Basically, I have multiple groups, and I want to combine them into just 2 big groups and test whether the two big groups have significantly different averages. I'll give a toy example that demonstrates the same concept. In this example, there are 4 groups, A through D, each containing 3 samples, and I want to test whether the mean of all samples in A & B is different from the mean of all samples in C & D:

    group <- rep(LETTERS[1:4], 3)
    design <- model.matrix(~0+group)
    colnames(design) <- LETTERS[1:4]
    cont <- makeContrasts((A+B)/2 - (C+D)/2, levels=design)

My worry is that with this contrast, I'm effectively just testing two groups against each other, and by having 4 groups in the design I will be estimating dispersions that are not appropriate for the test that I'm doing, and hence I will overstate my confidence. Or, to put it another way, am I doing something equivalent to testing a main effect in a model where an interaction term is present?

Thank you,
-Ryan Thompson
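For reference, a minimal sketch of how this contrast would then be carried through the standard edgeR GLM pipeline, assuming a DGEList object y holding the counts for these 12 samples (y and the fitting calls below are standard edgeR usage, not from the original post):

    # estimate dispersions under the full four-group design, then
    # test the (A+B)/2 - (C+D)/2 contrast against zero
    y <- estimateDisp(y, design)
    fit <- glmFit(y, design)
    lrt <- glmLRT(fit, contrast = cont)
    topTags(lrt)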
@gordon-smyth
WEHI, Melbourne, Australia

Hi Ryan,

Your contrast doesn't seem so odd to me. We used a similar contrast, for example, to compare Basal breast cancer with the average of all other breast cancer subtypes:

http://nar.oxfordjournals.org/content/40/17/e133

My worry is that with this contrast, I'm effectively just testing two groups against each other, and by having 4 groups in the design I will be estimating dispersions that are not appropriate for the test that I'm doing, and hence I will overstate my confidence.

The dispersions remain unchanged regardless of the contrast you test. The dispersions have been estimated after removing all differences between the four groups, i.e., without bias.

edgeR is giving you a correct test of the contrast you have specified. You are testing whether an equal mix of the first two groups has the same average expression as an equal mix of the third and fourth groups.
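For reference, the contrast vector produced by the makeContrasts() call above is simply equal and opposite weights on the four group means:

    cont
    # gives weights 0.5, 0.5, -0.5, -0.5 on A, B, C, D respectively,
    # so the null hypothesis is 0.5*A + 0.5*B - 0.5*C - 0.5*D = 0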

Note that you are not testing whether the difference between the two big groups is large compared to variation within the big groups. The test does not care how large the differences are between A and B, or between C and D.

Or, to put it another way, am I doing something equivalent to testing a main effect in a model where an interaction term is present?

No, the test does not suffer from the same objection. However, you may need to be careful interpreting the test when there is lots of DE between A vs B or C vs D. It may be worthwhile first checking A vs B and C vs D.
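A sketch of those preliminary within-pair checks, reusing the design from the toy example (the fit object is assumed to come from glmFit as in the earlier sketch):

    # test A vs B and C vs D separately before interpreting the big-group contrast
    cont.within <- makeContrasts(AvsB = A - B, CvsD = C - D, levels = design)
    lrt.AB <- glmLRT(fit, contrast = cont.within[, "AvsB"])
    lrt.CD <- glmLRT(fit, contrast = cont.within[, "CvsD"])
    summary(decideTestsDGE(lrt.AB))
    summary(decideTestsDGE(lrt.CD))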

Best wishes
Gordon

On Thu, Jan 23, 2014 at 6:48 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote:

> > My worry is that with this contrast, I'm effectively just testing two groups against each other, and by having 4 groups in the design I will be estimating dispersions that are not appropriate for the test that I'm doing, and hence I will overstate my confidence.
>
> The dispersions remain unchanged regardless of the contrast you test. The dispersions have been estimated after removing all differences between the four groups, i.e., without bias.

But had he more simply coded the samples as belonging to two groups, AB and CD, then the dispersions could be larger, and the AB vs. CD mean differences could be less significant than in his four-subgroup design. I think that was the intent of Ryan's question. Is it fair to stratify along a priori expected subgroupings to minimize variance and then ask group-level questions? As you later say, if there is "lots of DE" between A vs. B and/or C vs. D (read: larger group-wise variance) then the test may need to be "carefully interpreted".

Thanks for your insight,
-Aaron
On Thu Jan 23 21:10:27 2014, Aaron Mackey wrote:

> Is it fair to stratify along a priori expected subgroupings to minimize variance and then ask group-level questions?

Yes, this is exactly what I am asking.

-Ryan

But had he more simply coded the samples as belonging to two groups, AB and CD, then the dispersions could be larger, and the AB vs. CD mean differences could be less significant than in his four-subgroup design. I think that was the intent of Ryan's question.

Coding that doesn't reflect the true experimental design is likely to perform badly, and give less significance. That doesn't make it more correct.

Is it fair to stratify along a priori expected subgroupings to minimize variance and then ask group-level questions?

You are not asking a well posed question. For one thing, "fair" and "unfair" are unhelpful concepts. The only consideration is whether the statistical test that is done answers the scientific question being asked. You haven't explained what scientific question you want to answer, so there is no basis for choosing a statistical test.

Fitting a model that matches the experimental conditions and then making comparisons between groups has been the anova method since anova was first invented. It answers what it answers, as I explained in my response to Ryan.

Gordon

As always, thanks Gordon for keeping the conversation focused. I guess I was responding to Ryan's statement that the only hypothesis to be tested was a difference between the two main groups, so the additional modeling of subgroups seems only to reduce the overall residual variance, possibly leading to inflated significance between groups.

re: "carefully interpreted", would you suggest in such situations that we F-test the full model (four subgroups) vs. the nested model (two groups), and only perform group-wise comparisons for cases where the nested model could not be rejected (i.e. no evidence of interaction effects)? That would seem to fit the ANOVA interpretation, and the usual concerns about testing for main effects when interactions may be present.

Thanks again,
-Aaron
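For concreteness, the full-vs-nested comparison Aaron describes can be expressed in edgeR as a joint 2-df test of A = B and C = D, since those two equalities are exactly what reduce the four-group model to the two-group model (object names follow the toy example above; whether to use this as a gatekeeping step is the point under discussion):

    # joint test of A - B = 0 and C - D = 0: the four-subgroup (full)
    # model against the two-big-group (nested) model
    cont.nested <- makeContrasts(A - B, C - D, levels = design)
    lrt.nested <- glmLRT(fit, contrast = cont.nested)
    topTags(lrt.nested)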

As always, thanks Gordon for keeping the conversation focused. I guess I was responding to Ryan's statement that the only hypothesis to be tested was a difference between the two main groups,

It is quite clear, from your and Ryan's comments, that this difference is not the only scientific question to be answered, and so it cannot be the only hypothesis to be tested.

so the additional modeling of subgroups seems only to reduce the overall residual variance, possibly leading to inflated significance between groups.

I strongly disagree, as I have already told you. Modeling of subgroups that have a strong effect on the outcome is always good science.

re: "carefully interpreted", would you suggest in such situations that we F-test the full model (four subgroups) vs. the nested model (two groups), and only perform group-wise comparisons for cases where the nested model could not be rejected (i.e. no evidence of interaction effects?) That would seem to fit the ANOVA interpretation, and the usual concerns about testing for main effects when interactions may be present.

Would I be willing to give a blunderbuss recommendation that you should apply in all situations, regardless of the nature of the groups or the scientific questions at issue? No I wouldn't.

I have been trying to prompt you to clarify what your scientific questions actually are. Once you do so, the appropriate statistical procedure will be readily apparent.

I seem to have answered the same question three times now, without getting the message across. I will make one more attempt, but I will reply to Ryan's original post, not to this email, because much of Ryan's original question and my response has been deleted from the thread below.

Gordon

Hi,

I have a few questions on the edgeR package. I used it a while ago. Recently, I started to use it again for an ongoing project and realized that it has been updated a lot with many new features.

The data I am working on has 29 pairs of tumor and matched normal tissue samples from the same patients that have been run through RNA-seq; the data has been processed by the popular RSEM method into raw counts for each sample (tumor or matched normal tissue of each patient). So I have a matrix of raw counts with 58 columns of data for all genes. We are looking for differentially expressed genes between tumors and normals. Here are the main commands I used, eliminating some basic commands for simplicity:

    > show(head(targets))
      SampleName Subject   Type
    1  T4626_01A 38_4626  Tumor
    2  N4626_11A 38_4626 Normal
    3  T2668_01A 44_2668  Tumor
    4  N2668_11A 44_2668 Normal
    ......
    > Patient <- factor(targets$Subject)
    > Type <- factor(targets$Type)
    ......
    > y <- DGEList(count=rawdata[,4:ncol(rawdata)], genes=rawdata[,1:3])
    > show(dim(y))
    [1] 20531    58
    > # filter low expressed genes
    > y <- y[rowSums(y$counts)>=50,]
    > # recompute the library sizes:
    > y$samples$lib.size <- colSums(y$counts)
    > # use Entrez Gene IDs as row names:
    > rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
    > ## apply TMM normalization
    > y <- calcNormFactors(y, method="TMM")
    > show(y$samples)
              group lib.size norm.factors
    T4626_01A     1 97435132    0.9834587
    N4626_11A     1 62583672    0.9154837
    T2668_01A     1 77034746    0.9687860
    N2668_11A     1 58631311    0.8644352
    ......
    > design <- model.matrix(~Patient+Type)
    > rownames(design) <- colnames(y)
    > y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
    Disp = 3.99994 , BCV = 2
    There were 50 or more warnings (use warnings() to see the first 50)
    > show(warnings())
    Warning messages:
    1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
      non-integer x = 4.520000
    2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
      non-integer x = 5.380000
    ....
    > y <- estimateGLMTrendedDisp(y, design)
    Loading required package: splines
    There were 50 or more warnings (use warnings() to see the first 50)
    > y <- estimateGLMTagwiseDisp(y, design)
    There were 50 or more warnings (use warnings() to see the first 50)
    > fit <- glmFit(y, design)
    > lrt <- glmLRT(fit)
    > show(topTags(lrt))
    Coefficient: TypeTumor
         symbols entrezID      gene_id     logFC   logCPM       LR PValue FDR
    6532  SLC6A4     6532  SLC6A4|6532 -7.941097 6.329882 2946.202      0   0
    1301 COL11A1     1301 COL11A1|1301  7.535040 6.150749 2105.109      0   0
    5047    PAEP     5047    PAEP|5047  7.083690 5.655034 1498.046      0   0
    .......
    > Rslt <- topTags(lrt, n=nrow(y), adjust.method="BH", sort.by="PValue")
    > cpmRslt <- cpm(y)
    > cpmRslt <- cpmRslt[rownames(Rslt),]
    > show(cpmRslt[1:10,1:6])
           T4626_01A    N4626_11A    T2668_01A    N2668_11A   T6145_01A
    6532   2.0558647 2.140002e+02   0.02679881 1.568574e+01   0.2147611
    1301   1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
    5047  13.4413898 1.221761e-01   7.62426085 3.946099e-02  60.7774032
    .....
    > show(Rslt[1:10,])
    Coefficient: TypeTumor
         symbols entrezID      gene_id     logFC   logCPM       LR PValue FDR
    6532  SLC6A4     6532  SLC6A4|6532 -7.941097 6.329882 2946.202      0   0
    1301 COL11A1     1301 COL11A1|1301  7.535040 6.150749 2105.109      0   0
    5047    PAEP     5047    PAEP|5047  7.083690 5.655034 1498.046      0   0
    .....
    > show(summary(de <- decideTestsDGE(lrt)))
       [,1]
    -1 6179
    0  4832
    1  7204

My questions are below:

1. I used the command y <- y[rowSums(y$counts)>=50,] to filter low-expressed genes, with 50 as a somewhat arbitrary cutoff. Is there a better way to choose the cutoff value? Can anyone suggest a good cutoff based on his or her experience in such a setting?

2. The command y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives Disp = 3.99994, BCV = 2, which is quite large. Is that a problem? Also, why did we get warnings (see the warning messages above)? Are the warnings a problem?

3. The top gene lists shown in cpmRslt and Rslt are quite consistent; for example, gene 6532 always has much lower CPM counts in the tumor samples than in the matched normal samples, consistent with its logFC in Rslt. So from the top genes, the result looks very promising. However, when I look at the full Rslt table, a great many genes appear to be differentially expressed (see the summary(de <- decideTestsDGE(lrt)) output): more than 13k of the 20531 genes in the dataset have FDR < 0.05, which makes me a bit nervous given the issues in my first two questions. Is this usually seen in tumor vs. normal comparisons? This is a paired test, of course, which may increase the power and find more differential genes. What if I tried pooled tumors vs. pooled normals, disregarding the matched tumor-normal pairs? Which is better for RNA-seq data, a paired test or a pooled test?

4. For a pooled test, I might use exactTest(); also, based on the guide, estimateGLMTrendedDisp is not needed there as it is when calling glmFit/glmLRT in the GLM approach. Is anything like that worth trying? Any advice?

5. I made an MDS plot, which looks nice in that the tumor and normal samples separate very well, one side vs. the other. In the plotSmear plot, I see some red spots inside the black-spot zone in the center (where logFC should be low); why were they called significant? Might this explain why we get so many DEGs (>13k)? Or is it large variation across samples?

Thanks so much in advance for your help!

Mike
NIH
Maryland, USA

    > show(sessionInfo())
    R version 3.0.1 (2013-05-16)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    ...
    other attached packages:
    [1] edgeR_3.4.2 limma_3.16.8
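On question 1, a common alternative (suggested in the edgeR user's guide) is to filter on counts-per-million rather than raw counts, so that the cutoff does not depend on library size. A sketch for this paired design with 29 samples per condition (the thresholds shown are illustrative, not prescriptive):

    # keep genes achieving at least 1 count per million in at least 29 samples
    # (the size of the smallest group)
    keep <- rowSums(cpm(y) > 1) >= 29
    y <- y[keep, ]
    y$samples$lib.size <- colSums(y$counts)  # recompute library sizes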

Dear Ming,

Something is seriously wrong -- you shouldn't get these warnings, you shouldn't get such a large dispersion estimate and, if you do, you shouldn't get such small p-values.

I suspect that the culprit is RSEM. edgeR is designed to accept raw read counts rather than estimated abundance levels as might be output from RSEM or Cufflinks.

There are a number of tools that produce genewise read counts from RNA-seq data. My lab routinely uses subread and featureCounts:

http://www.ncbi.nlm.nih.gov/pubmed/24227677

which are available from the Bioconductor package Rsubread.
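For illustration, a minimal counting step of the kind described (a sketch only: the bamfiles vector and the choice of the built-in hg19 annotation are hypothetical, not from the thread):

    library(Rsubread)
    # count reads per gene with the built-in hg19 annotation;
    # 'bamfiles' is a hypothetical character vector of aligned BAM file paths
    fc <- featureCounts(files = bamfiles, annot.inbuilt = "hg19")
    # the resulting integer matrix is suitable input for edgeR
    y <- DGEList(counts = fc$counts)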

Best wishes
Gordon

Hello,

I made the same mistake last week. I got the same common dispersion estimate for my dataset as you did for yours.

    > allExpectedCounts$common.dispersion
    [1] 3.999943

EdgeR is for differential gene expression, not for differential transcript expression.

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
Hi,

Thanks for the info. As in the email I just sent in reply to Dr. Smyth, where I show the data file I used, the raw counts I used are indeed based on genes, not transcripts, and we are looking for differential gene expression. Are we on the same page? Thanks for the info anyway!

best
Ming
Dear Dr Smyth:

Thanks for the quick response. The data is in fact TCGA RNA-seq data; one of the files looks like below:

file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.results"

its content looks like:

          gene_id raw_count scaled_estimate         transcript_id
    1 ?|100130426      0.00    0.000000e+00            uc011lsn.1
    2 ?|100133144      4.67    1.794813e-07 uc010unu.1,uc010uoa.1
    3 ?|100134869     15.33    4.271899e-07 uc002bgz.2,uc002bic.2
    4     ?|10357    218.79    1.933490e-05            uc010zzl.1
    5     ?|10431   1255.00    5.033411e-05 uc001jiu.2,uc010qhg.1
    6    ?|136542      0.00    0.000000e+00            uc011krn.1
    ......

Through contacting TCGA staff, we know the column "raw_count" is the raw count of reads after running through RSEM, and the normalized data is in a different file (I used the raw counts, not the normalized data, for edgeR). The basic description of the data is at: https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/luad/cgcc/unc.edu/illuminahiseq_rnaseqv2/rnaseqv2/unc.edu_LUAD.IlluminaHiSeq_RNASeqV2.mage-tab.1.13.0/DESCRIPTION.txt

step 4 of: V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1 ... The reads aligned to the reference genome sequences are translated to transcriptome coordinates prior to transcript level quantification. This step is performed using the UNC Bioinformatics Utilities (UBU) software (https://github.com/mozack/ubu). RSEM is used to estimate gene and transcript abundances (http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html). ....

Unfortunately, we do not have access to the raw RNA-seq data nor BAM files etc. The authors of these data files also suggest using the raw counts mentioned above for analysis with edgeR or DESeq. Unless there is something wrong with their procedure, we have no control over that.

Assuming the data is OK, is there any possibility that other steps in edgeR caused this? Such as the filtering step I asked about:

    # filter low expressed genes
    > y <- y[rowSums(y$counts)>=50,]

This step, using the arbitrary cutoff 50, filtered out quite a few genes (rows) whose row sums of read counts were less than 50. Or other steps? Any idea?

Thanks again!

Best
Ming
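As a sketch, one might load and inspect such a file in R like this (assuming a tab-delimited file with the header shown above; the path is the example file name from the post):

    rsem <- read.delim("unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.results",
                       stringsAsFactors = FALSE)
    head(rsem[, c("gene_id", "raw_count", "scaled_estimate")])
    # values such as 4.67 show these are not integer read counts
    all(rsem$raw_count == round(rsem$raw_count))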

Dear Ming,

Thanks for the further information.

It is obvious that numbers like 4.67 are not raw integer counts. I suspect that they are posterior expected counts from RSEM. The column heading "raw_count" does seem rather misleading.

If you want to analyse these expected counts from RSEM, then limma-voom would be a far preferable pipeline than edgeR (or DESeq or any other negative binomial based package).
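A minimal limma-voom sketch for this paired design, assuming the counts are already in the DGEList y with TMM factors computed, as in Ming's code above:

    library(limma)
    # voom converts (possibly non-integer) counts to log2-CPM with
    # observation-level precision weights, after which the standard
    # limma linear-model pipeline applies
    v <- voom(y, design, plot = TRUE)
    vfit <- eBayes(lmFit(v, design))
    topTable(vfit, coef = "TypeTumor")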

I understand that the RSEM authors did claim that their expected counts could be input to edgeR, but edgeR is designed to accept integers, and your results appear to confirm some of the problems that arise. The problems cannot be solved by changing the filtering.

The raw FastQ files for the TCGA samples should be publicly available from GEO and SRA. My lab has downloaded similar TCGA data as SRA files and converted them to FastQ.

Best wishes
Gordon

Dear Dr Smyth:

Thanks so much for the advice and help. We actually found files with raw counts in the TCGA RNA-seq V1 dataset (we were discussing the V2 RNA-seq data). Unfortunately, the V1 dataset only covers a portion of the samples in the V2 dataset, and V2 only has those RSEM estimated counts. We currently do not have controlled-access permission for FastQ and BAM files. I also searched SRA and could not find the TCGA RNA-seq data there (only many whole-genome or exome-seq data). We are kind of stuck.

Thanks again for your advice and help!

best
Ming
Dear Ming,

If the data truly are expected counts, then limma-voom will analyse them very well.

If the data are estimated expression levels instead of expected counts, or have been normalized by GC content or gene length, then the situation is more difficult again. In that case a good statistical analysis is difficult and requires knowledge of the gene lengths and GC contents. You could try limma-voom on such data, but it will not operate at full power without modification.

I advised you in my previous email that no package based on the negative binomial distribution will work satisfactorily with these estimated quantities. I see that you have written separately to Mike Love and that he has given you the same message.

I have checked with my lab, and I was incorrect to say that we have downloaded SRA files for TCGA -- it was actually RSEM counts that we downloaded. I think it would be a good idea for people like you to put pressure on TCGA to make unprocessed data available.

Best wishes
Gordon
Hi,

On Wed, Jan 29, 2014 at 4:52 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote:
[snip]

> I have checked with my lab, and I was incorrect to say that we have downloaded SRA files for TCGA -- it was actually RSEM counts that we downloaded. I think it would be a good idea for people like you to put pressure on TCGA to make unprocessed data available.

I am not an expert on accessing TCGA data, but: the raw data is available, although restricted due to privacy/confidentiality concerns for patients. One could get access to the truly raw data via an application:

https://cghub.ucsc.edu/access/get_access.html

Also, the following page suggests that some type of "raw expression" ("raw counts" and RPKM) should be accessible from their RNASeq v2 pipeline:

https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2

So perhaps the v2 data might be helpful there.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

I think the point is that Ming has already downloaded the v2 data, and the so-called "raw counts" turned out not to be counts.

If you want to dig to find out what the "raw counts" are exactly, that would be a great service, because I am just guessing. The TCGA documentation just says they are from RSEM.

Gordon

Hi,

On Thu, Jan 30, 2014 at 2:07 AM, Gordon K Smyth <smyth@wehi.edu.au> wrote:
[snip]

> I think the point is that Ming has already downloaded the v2 data, and the so-called "raw counts" turned out not to be counts.

I see ... I wasn't paying very close attention to this thread until my last reply and hadn't realized that the OP has pretty much hashed these things out.

> If you want to dig to find out what the "raw counts" are exactly, that would be a great service, because I am just guessing. The TCGA documentation just says they are from RSEM.

After some digging: there are several pages that show the output of the RNA-seq pipeline, such as this one:

https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2

These suggest (under "File field descriptions") that there is a column in the gene-level summaries file called "raw_counts" which is "The number of reads mapping to this gene". Digging around a bit more, you find that these are really RSEM expected counts. This page was helpful:

https://confluence.broadinstitute.org/download/attachments/29790363/DESCRIPTION.txt?version=1&modificationDate=1363806109000

At the bottom of the page, under the "Column Headers" section, is where you get the required detail. This thread on the RSEM-users mailing list also confirms it:

https://groups.google.com/forum/#!topic/rsem-users/H1cswrvvmPs

The raw_count columns are really the expected counts from RSEM (where the authors of RSEM suggest that these rounded numbers are suitable for edgeR / DESeq ;-)

Perhaps the OP would best use EBSeq for the TCGA "raw_counts" data, or, as you suggest, limma::voom, since these are actually the expected counts.

If you *really* want to use edgeR, the first link I pointed to suggests that the "raw_count" column in "exon_quantification.txt" contains the actual raw counts per exon (can the OP confirm that they are actually integers?). If so, perhaps you could sum these up per gene and then continue. I suspect you'd likely then be double-counting exon-spanning reads, which might be problematic. In principle, you could then subtract the tallies in junction_quantification.txt to accommodate for that -- which all seems like a lot of work if limma::voom and EBSeq will work just as well on the RSEM expected counts.

Hope that helps,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech
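A sketch of the exon-summing workaround outlined above (the file layout, column names, and the presence of a gene identifier per exon are all assumptions; the real exon_quantification.txt format would need checking):

    exons <- read.delim("exon_quantification.txt", stringsAsFactors = FALSE)
    # first, the sanity check: are the exon-level counts actually integers?
    all(exons$raw_count == round(exons$raw_count))
    # if so, sum exon counts within each gene (note this may double-count
    # reads that span exon-exon boundaries)
    gene.counts <- rowsum(exons$raw_count, group = exons$gene_id)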
ADD REPLY
0
Entering edit mode
Hi,

Dear Drs. Lianoglou and Smyth:

Yes, I had already found all of the links you referred to except for the forum link. The consensus is that those raw_counts are not true raw counts. In fact, the V1 RNA-seq data did have a field with true raw counts, but only for a small portion of samples compared to the V2 RNA-seq data, which is kind of a dilemma. I may try limma voom and EBSeq and see.

Thanks for your efforts!

best
Ming
ADD REPLY
0
Entering edit mode

Dear Ryan and Aaron,

Given Aaron's reactions to my previous responses, I will make one more attempt to answer in slightly more detail.

The first thing to appreciate is that every statistical test is an answer to a particular question. The contrast test that you mention certainly makes statistical sense, but this is not the issue. The issue is scientific rather than statistical. Whether or not this test is an appropriate answer to your scientific question depends on what your scientific question is. You have not yet laid this out in sufficient detail.

Here are some different scientific contexts that might or might not apply in your situation.

First, you might want to assert that C and D have higher expression than either A or B. If you want to claim that, then clearly you must do individual contrasts C vs A, C vs B, D vs A and D vs B. There is no shortcut. The contrast C+D vs A+B is not sufficient.

Or you might want to assert that the treatments cluster into two big groups, C and D vs A and B. To establish this, you need to show that the CD vs AB separation is large compared to the D vs C and B vs A separations. You could do all pairwise comparisons, but a slightly more efficient method would be to test three contrasts: B-A, D-C and (C+D)/2-(A+B)/2. You can make this assertion if the third contrast is far more significant than the first two. Even if B-A and D-C are statistically significant, you could still establish the claim by showing that the fold changes for (C+D)/2-(A+B)/2 are much larger than those for B-A or D-C.
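
For example, with the toy four-group design from the original post, the three contrasts can be set up in one makeContrasts() call (a sketch; the contrast names are just illustrative):

library(limma)

group <- factor(rep(LETTERS[1:4], 3))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

cont3 <- makeContrasts(
    BvsA    = B - A,                   # within-big-group difference
    DvsC    = D - C,                   # within-big-group difference
    BigDiff = (C + D)/2 - (A + B)/2,   # between the two big groups
    levels  = design
)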

Or you might want to assert that a population made up of equal parts C & D would have different expression to a population made of equal parts of A & B. To assert that, you only need to test (C+D)/2-(A+B)/2.

The four groups might arise from two original factors. Suppose that the groups A--D correspond to the factor codings Big = c(1,1,2,2) and Sub = c(1,2,1,2). You might want to assert that the high level of Big increases expression over the low level, regardless of the level of Sub. In that case you need to test the two contrasts C-A and D-B. If both are significantly up, then you can make the assertion.

Or you might want to assert that Big has the same effect on expression regardless of the Sub baseline. In that case you need to show that (C+D)/2-(A+B)/2 is significant but (D-B)-(C-A) is not.
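
In contrast form, reusing the design matrix from the sketch above (again, the names are illustrative):

cont4 <- makeContrasts(
    CvsA   = C - A,                   # Big effect at Sub level 1
    DvsB   = D - B,                   # Big effect at Sub level 2
    BigAvg = (C + D)/2 - (A + B)/2,   # average Big effect
    Inter  = (D - B) - (C - A),       # does the Big effect depend on Sub?
    levels = design
)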

Finally, if you were confident in advance that A and B were not different and C and D were not different, then you could simply pool the A and B samples together and the C and D samples together and do a two group test. This produces a statistically valid test only if there is no systematic differential expression between A and B or between C and D. But if you knew that in advance, why did you classify the samples into four groups in the first place?
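
A minimal edgeR sketch of that pooled fit, assuming y is a DGEList holding the counts for these twelve samples (the object name is assumed):

library(edgeR)

big     <- factor(ifelse(group %in% c("A", "B"), "AB", "CD"))
design2 <- model.matrix(~big)

# Dispersions estimated from this design absorb any systematic
# A vs B or C vs D differences.
y   <- estimateDisp(y, design2)
fit <- glmFit(y, design2)
lrt <- glmLRT(fit, coef = 2)   # AB vs CD two-group test
topTags(lrt)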

Best wishes
Gordon

ADD REPLY
0
Entering edit mode
Thank you for the comprehensive analysis, Gordon. I will think carefully about the precise question I want to answer and choose the appropriate example from your post.

-Ryan
ADD REPLY
