edgeR for data combined from different studies and/or platforms

Guest User ★ 13k

@guest-user-4897

Last seen 10.3 years ago

How suitable is edgeR for analyzing RNA sequencing data obtained from multiple studies, possibly using multiple platforms?

I am trying to compare mRNA sequencing data obtained for two different cancers by the Cancer Genome Atlas (TCGA) project. Different research teams are handling the work for the two different cancers, and TCGA regularly releases updated, 'level 3,' (within-cancer) RSEM-processed data for cancer-specific sub-projects (each with 200+ samples).

I am trying to use edgeR for differential expression analyses with Exact test, using 'raw count' values in the two cancer data-sets as the input for edgeR. I plan to use edgeR with its default settings, except for prior.df in estimateTagwiseDisp() -- intend to use 0.5 instead of 20 -- and, rowsum.filter in estimateCommonDisp() -- intend to use perhaps 500 instead of 5.

(1) Is it OK to use edgeR for such cross-study comparison when the two groups I want to compare have been exclusively examined by just one of the two studies?

(2) In my case, the sequencing platform is the same for the two studies. Had it been different, could I still use edgeR?

(3) Do answers to the above two questions also apply for microRNA sequencing studies (where library [total count] sizes are typically 10-20x smaller)?

Thank you.

Santos

-- output of sessionInfo():

R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] edgeR_3.0.8 limma_3.14.4 EBSeq_1.1.6
[4] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-9
[7] caTools_1.14 gdata_2.12.0 gtools_2.7.0
[10] blockmodeling_0.1.8 reshape2_1.2.2 plyr_1.8

loaded via a namespace (and not attached):
[1] bitops_1.0-4.2 stringr_0.6.2 tools_2.15.1

--
Sent via the guest posting facility at bioconductor.org.

Sequencing Cancer edgeR • 2.0k views

ADD COMMENT • link updated 11.8 years ago by Ryan C. Thompson ★ 7.9k • written 11.8 years ago by Guest User ★ 13k

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 10 weeks ago

Icahn School of Medicine at Mount Sinai…

It sounds like the two cancer projects are essentially independent. What do you hope to gain by combining them? As far as I can tell, the main advantage would be estimating dispersions from a larger pool of samples. Obviously, this only makes sense if the dispersions are similar in the two centers' datasets. You could try estimating dispersions separately for both centers and then plotting them against each other for each gene. If you get a reasonable clustering around the identity line, then you could probably justify combining the datasets for better dispersion estimation.

In any case, if you decide to combine the datasets from multiple centers, you would probably want to use edgeR's GLM methods, not exactTest, since you would want to use a design matrix that incorporates effects for differences between centers (and, if more than one center works on the same cancer, center-cancer interaction effects).

So, assuming that you believe that combining the datasets will improve your dispersion estimation, my answers to your questions would be:

(1) If the two studies have no groups in common, then comparisons between a group from one study and a group from the other study will probably not be meaningful, since the effects would be confounded with the inter-study effects. However, comparing two groups from the same study is perfectly valid.

(2) If different sequencing platforms were used, there are a few problems that could arise. First, they might produce very different library sizes, and while in theory edgeR should deal with this, in practice larger differences in library size will probably cause more problems, because there is more to correct for. Second, inter-platform differences will be confounded with inter-center differences. But that's OK, since you don't necessarily need to know either of those directly. I'm sure there are other issues that I haven't thought of, so anyone else please feel free to chime in.

(3) I see no reason that the above should not apply to microRNA libraries. The total library size shouldn't matter so much as the per-gene counts. If each miRNA has a maximum of 10 reads in every sample, then you're working from very little data and you should not expect very good results.

On Sun 17 Mar 2013 07:48:50 PM PDT, Santos [guest] wrote:
> [...]
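The dispersion check suggested in this answer could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a tested pipeline: `counts` (a genes-by-samples matrix of raw counts), `center` (the source study of each sample), `group` (tumour or normal status) and the helper `tagwise_by_center()` are hypothetical names introduced for illustration, and the counts are simulated so that the code runs end to end. It uses the `estimateCommonDisp()` and `estimateTagwiseDisp()` functions mentioned in the question, with default settings.

```r
library(edgeR)

# Hypothetical helper: estimate tagwise dispersions separately for each center.
# counts: genes x samples matrix of raw counts
# center: one entry per sample naming the study/center it came from
# group:  one entry per sample giving its biological group within the center
tagwise_by_center <- function(counts, center, group) {
  idx_by_center <- split(seq_len(ncol(counts)), center)
  sapply(idx_by_center, function(idx) {
    y <- DGEList(counts = counts[, idx, drop = FALSE],
                 group = factor(group[idx]))
    y <- calcNormFactors(y)
    y <- estimateCommonDisp(y)
    y <- estimateTagwiseDisp(y)
    y$tagwise.dispersion
  })
}

# Simulated stand-in data (an assumption, only so the sketch is runnable)
set.seed(1)
counts <- matrix(rnbinom(500 * 12, mu = 100, size = 5), nrow = 500)
center <- rep(c("centerA", "centerB"), each = 6)
group  <- rep(c("normal", "tumour"), times = 6)

disp <- tagwise_by_center(counts, center, group)

# Points scattered around the identity line suggest comparable dispersions
plot(log10(disp[, "centerA"]), log10(disp[, "centerB"]),
     xlab = "log10 tagwise dispersion, center A",
     ylab = "log10 tagwise dispersion, center B")
abline(0, 1, col = "red")
```

With real data one would replace the simulated `counts`, `center` and `group` with the actual count table and sample annotation; how tight the scatter needs to be before pooling is a judgment call.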

ADD COMMENT • link 11.8 years ago Ryan C. Thompson ★ 7.9k

Hi

On 18/03/13 18:07, Ryan C. Thompson wrote:
> It sounds like the two cancer projects are essentially independent. What
> do you hope to gain by combining them? As far as I can tell, the main
[...]

Ryan, I think you are too optimistic here. Santos wrote:

>> Different research teams are handling the work for the two different cancers

So, if I understand this right, then Santos has transcriptome data for one cancer type obtained from one team, and for the other cancer type from the other team. In this case, any comparison between the two cancer types is completely confounded with technical differences and hence invalid, and no statistical method will change this.

So, just to clarify: Ryan's answer is correct, of course, but he assumed that each of the two teams provided data for both cancer types, which, I think, is not what Santos meant. Or is it, and I was too pessimistic?

Simon
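The complete confounding described here can be seen directly in a design matrix. The following is a minimal sketch in base R with a made-up sample sheet (the `team` and `cancer` labels and the sample counts are assumptions for illustration): when each cancer type comes from exactly one team, the team column and the cancer column are identical, so the matrix is rank deficient and the two effects cannot be estimated separately.

```r
# Hypothetical sample sheet: each cancer type was processed by exactly one team
samples <- data.frame(
  team   = factor(c("team1", "team1", "team1", "team2", "team2", "team2")),
  cancer = factor(c("A", "A", "A", "B", "B", "B"))
)

design <- model.matrix(~ team + cancer, data = samples)

# Number of coefficients the model asks for vs. what the data can identify
n_coef      <- ncol(design)
design_rank <- qr(design)$rank

# The team and cancer indicator columns coincide sample by sample
identical_cols <- all(design[, "teamteam2"] == design[, "cancerB"])

print(c(coefficients = n_coef, rank = design_rank))
```

A rank smaller than the number of coefficients is the linear-algebra form of the statement that no statistical method can separate the two effects in this layout.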

ADD REPLY • link 11.8 years ago Simon Anders ★ 3.8k

I was hoping that maybe both studies included both cancer and control samples, so that the control samples would allow inter-study technical differences to be disentangled from experimental effects.

On Mon 18 Mar 2013 10:36:06 AM PDT, Simon Anders wrote:
> [...]

ADD REPLY • link 11.8 years ago Ryan C. Thompson ★ 7.9k

Thank you for your thoughts.

I want to compare the two cancers to study cancer-specific genes/pathways. Both are cancers of the same organ. There is no sample that is common to both data-sets. The sets do include cancer-adjacent normal tissue samples, and I can examine the combined data to see whether the normal samples of one set resemble those of the other, though one can question the underlying assumption that adjacent 'normal' tissue of one cancer is like that of another.

The Cancer Genome Atlas (TCGA) work for the two cancers is quite elaborate and vast. There are hundreds of samples for each cancer, and TCGA seems to have very standardized protocols for handling, processing and assaying the cancer specimens obtained from many institutions. The mRNA expression (sequencing) data for both cancers have been (and are being) obtained using the same platform. Presumably, the sequencing data collection has been going on for multiple years and has involved lots of different people. Knowing this, one would think that persistent systematic variability between the data from the two cancers will be minimal (especially, perhaps, for sequencing data) and one could rationally combine data for the two cancers for inter-cancer comparisons.

Santos

On Mon, Mar 18, 2013 at 1:07 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
> [...]

ADD REPLY • link 11.8 years ago Santosh Patnaik ▴ 30

Hi Santosh

Thanks for the additional information.

Combining data from different studies is always quite risky. When you see differences between your two cancer types, you will not be able to say for sure whether these are due to the fact that it's different cancer types or that it's different labs.

On the other hand, I agree with you that comparing the healthy samples produced by the two labs is a reasonable approach to show that the lab effects are small and well controlled. The only difficulty here is that in such a test, the lab effects have to be large compared to the differences between individuals within each data set in order to be detected. In your actual comparison, on the other hand, you compare tumour and healthy tissue from the same individual, which is a setting with much more inferential power. So, a batch or lab effect might be strong enough to be large compared to the noise in the paired comparisons but too small to appear significant in the un-paired comparisons of the healthy controls.

On the third hand, however, the usual method of correcting for lab or batch effects is to include a blocking factor in your linear model, and this assumes that the batch effect is additive. Once you accept this assumption (which may be questionable but is standard practice), you don't need to account for batch effects at all in a paired comparison between tumours and healthy tissue, as long as both the tumour and the control sample from each subject have always been processed in the same lab (because then, any additive lab effect cancels out when looking at tumour-control differences).

So, considering all this, I'd say, go ahead with your comparison but make sure that all your tests are paired.

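The cancellation argument above can be illustrated with a toy calculation. This is a minimal sketch with made-up numbers on a log-expression scale for a single gene; the baselines, the tumour effect and the lab shift are all assumptions chosen for illustration, and the lab effect is taken to be purely additive, as the argument requires.

```r
# Toy log-expression values for one gene (all numbers are made up)
baseline   <- c(5.0, 6.2, 4.8, 5.5)        # subject-specific baselines
tumour_eff <- 1.5                          # true tumour-vs-normal effect
lab        <- c("lab1", "lab1", "lab2", "lab2")
lab_shift  <- c(lab1 = 0, lab2 = 2)        # additive lab effect

# Both samples of a subject are processed in the same lab
normal <- baseline + lab_shift[lab]
tumour <- baseline + tumour_eff + lab_shift[lab]

# Paired differences: the lab shift and the subject baseline cancel out
paired_diff <- unname(tumour - normal)

# Unpaired comparison across labs: this mixes the tumour effect with the
# lab shift and with differences between subjects
unpaired_diff <- mean(tumour[lab == "lab2"]) - mean(normal[lab == "lab1"])

print(paired_diff)
print(unpaired_diff)
```

The paired differences recover the tumour effect in every subject regardless of lab, whereas the unpaired cross-lab difference does not. If the lab effect were not additive on this scale, the cancellation would only be approximate.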
So, you would make a design table with one row for each sample and three columns: subject (one level for each subject, IDs running over all subjects from both studies), disease state (two levels: healthy control or tumour tissue) and cancer type (two levels: cancer A or cancer B; this is the cancer of the subject, not the sample, so the healthy tissue samples get a cancer type assigned, too).

Now, you fit a reduced model

   count ~ subject + disease_state + cancer_type

and a full model

   count ~ subject + disease_state + cancer_type + disease_state:cancer_type

and compare them to test for significance of the interaction term (which indicates that the difference between tumour and control tissue differs between cancers for the tested gene).

(The formula notation is with DESeq in mind. In DESeq2, you only fit the second model and then do a Wald test for the interaction coefficient, as described in the vignette. For edgeR, IIRC, you also just fit the full model and then get a p value for the last coefficient, which should be the interaction coefficient.)

Simon
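As a rough illustration of the design table and full model described above, here is a minimal base R sketch. The sample sheet is invented (four subjects, two per cancer type, one healthy and one tumour sample each), and the column names simply follow the formula in the post. One thing worth checking with real data: because each subject has exactly one cancer type, the `cancer_type` main-effect column is a sum of subject columns, so the full matrix as written is not of full rank. Dropping that one column, as done below, is one possible workaround and leaves the interaction coefficient as the last column.

```r
# Hypothetical sample sheet: 4 subjects, 2 samples each (healthy + tumour)
samples <- data.frame(
  subject       = factor(rep(paste0("s", 1:4), each = 2)),
  disease_state = factor(rep(c("healthy", "tumour"), times = 4)),
  cancer_type   = factor(rep(c("A", "B"), each = 4))
)

# Full model from the post, written as a design matrix
full <- model.matrix(~ subject + disease_state + cancer_type +
                       disease_state:cancer_type, data = samples)

# cancer_type is constant within each subject, so its main-effect column
# is a linear combination of subject columns
rank_full <- qr(full)$rank

# One workaround (an assumption, not the only option): drop the aliased column
design      <- full[, colnames(full) != "cancer_typeB"]
rank_design <- qr(design)$rank

# The last coefficient is the tumour-by-cancer-type interaction
last_coef <- colnames(design)[ncol(design)]
print(last_coef)
```

With real counts, a matrix like `design` could then be passed to edgeR's `glmFit()` after dispersion estimation, and the interaction tested with `glmLRT(fit, coef = ncol(design))`. Treat that as a sketch of the workflow rather than a recipe.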

ADD REPLY • link 11.8 years ago Simon Anders ★ 3.8k
