DESEq | Batch effect | VST data and linear model adjustment

0

Entering edit mode

Pesce, Francesco ▴ 20

@pesce-francesco-6664

Last seen 11.3 years ago

Hi, We have two cohorts, cases and controls and a set of covariates for both of them ( center, library.prep.date, age, rna.rin.score, sex ). Center and library.prep.date are collinear with the status (all the cases were collected in London while the controls were collected in 4 different centers worldwide) so I used the first principal component of these two covariates and ran DESeq2 using this design: ~ PC1 + age + rna.rin.score + sex + status Unfortunately it looks like the batch effect is too strong and I have ~16K genes with adjP<0.05 One question: is the fold-change still reliable (So that I can use it as rank for GSEA analysie for example) ? Now, although the differential expression might be hampered by the study design and I dont know if I can use these results (what do you think?) the main problem is the following: I have based all the analyses for my PhD thesis and the manuscript I am preparing using DESeq. The pipeline is based on co-expression clustering (WGCNA), diffco-ex between cases and controls and GWAS hits enrichment in these clusters. For the pre-processing of these analyses Ive first obtained the VST data and then adjusted these for the covariates using a linear model. Then I used the residuals for the analyses: > vsd <- varianceStabilizingTransformation(dds, blind=TRUE) > vstMat <- assay(vsd) > lm=lm(vstMat ~ as.factor(info$library.prep.date) + as.numeric(info$age) + as.factor(info$sex) + as.numeric(info$rna.rin.score) + as.factor(info$center)) > data = residuals(lm) The main question is that we are not sure if this pre-processing is correct, does the linear model work here for this purpose on VST data ? (The doubt came by the fact that a fold-change for one gene of interest suggested it was strongly up-regulated in cases, but when boxplotting the residuals from the linear model adjusted for all the covariates you dont see any difference at all ) Thank you very much in advance Looking forward to you reply Best - Francesco Francesco Pesce MD Early Stage Researcher (Marie Curie Fellow) National Heart & Lung Institute Imperial College London Mobile: 0044 (0)7928 341136 Email: f.pesce@imperial.ac.uk<mailto:f.pesce@imperial.ac.uk> Skype: francpesce Please note: The content of this e-mail (including any attachment) contain confidential information and may be protected by law as a legally privileged document and copyright work. Its content should not be disclosed and it should not be given or copied to anyone other than the person(s) named or referenced above. If you have received this mail in error, please contact the sender immediately on the telephone or fax numbers above and then delete it (including any attachment) from your system. Thank you. [[alternative HTML version deleted]]

Clustering Lung DESeq DESeq2 Clustering Lung DESeq DESeq2 • 2.6k views

ADD COMMENT • link updated 11.4 years ago by Simon Anders ★ 3.8k • written 11.4 years ago by Pesce, Francesco ▴ 20

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 5.4 years ago

Zentrum für Molekularbiologie, Universi…

Hi Francesco On 23/07/14 13:39, Pesce, Francesco wrote: > We have two cohorts, cases and controls and a set of covariates for both of them ( center, library.prep.date, age, rna.rin.score, sex ). > Center and library.prep.date are collinear with the status (all the cases were collected in London while the controls > were collected in 4 different centers worldwide) so I used the first principal component of these two covariates and ran DESeq2 using this design: > ~ PC1 + age + rna.rin.score + sex + status > > Unfortunately it looks like the batch effect is too strong and I have ~16K genes with adjP<0.05 > One question: is the fold-change still reliable (So that I can use it as rank for GSEA analysie for example) ? > Now, although the differential expression might be hampered by the study design and I don?t know if I can use these results (what do you think?) That your status (control/disease) is aliased with the sequencing centre is a severe problem. If you find a gene to differ between control and disease samples, you have no way of knowing whether this difference is due to the status or due to specific peculiarities in the library handling of the centres. If there are any differences in the way how the London centre handles its samples, compared to the other centres, and these differences are substantial enough to cause genes to appear differentially expressed, then this will completely overwhelm any signal. Unless you have taken special precautions, this is quite likely to happen, and the fact that you got so much stronger differences than expected, indicated that it likely did happen. So, no, I don't think you can use any results from this study. I am very puzzled about your regression on PC1. Could you explain? Maybe I misunderstood you, but I don't see how PCA analysis can cure an aliasing issue. I'm afraid your main problem is that you seem to be under the wrong impression that it were possible, at least in principle, to tease apart the effect of status and center. > the main problem is the following: > > I have based all the analyses for my PhD thesis and the manuscript I am preparing using DESeq. > The pipeline is based on co-expression clustering (WGCNA), diffco-ex between cases and controls and GWAS hits enrichment in these clusters. > > For the pre-processing of these analyses I?ve first obtained the VST data and then adjusted these for the covariates using a linear model. Then I used the residuals for the analyses: > >> vsd <- varianceStabilizingTransformation(dds, blind=TRUE) >> vstMat <- assay(vsd) >> lm=lm(vstMat ~ as.factor(info$library.prep.date) + as.numeric(info$age) + as.factor(info$sex) + as.numeric(info$rna.rin.score) + as.factor(info$center)) >> data = residuals(lm) > > The main question is that we are not sure if this pre-processing is correct, does the linear model work here for this purpose on VST data ? > (The doubt came by the fact that a fold-change for one gene of interest suggested it was strongly up-regulated in cases, but when boxplotting the residuals from the linear model adjusted for all the covariates you don?t see any difference at all?) If you regress out the effect of center, you also regress out the effect of status, because it is aliased. Hence, your clustering on the residuals can pick up all kinds of effects, but they will be almost guaranteed to have nothing to do with the disease/control difference. So, of course, you don't see any differences between case and control after you regress out center. Apart from that: We now recommend the rlog transformation in DESeq2 instead of the VST from DESeq. It is more reliable in case of varying library sizes. I'm not familiar with WGCNA, so I cannot comment on how well it works with transformed RNA-Seq data. In principle, it should work, I suppose. But, of course, it won't magically cure your aliasing issue either. Simon

ADD COMMENT • link 11.4 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Thank you Simon for your informative reply Best - Francesco On 23 Jul 2014, at 13:03, Simon Anders <anders@embl.de<mailto:anders@embl.de>> wrote: Hi Francesco On 23/07/14 13:39, Pesce, Francesco wrote: We have two cohorts, cases and controls and a set of covariates for both of them ( center, library.prep.date, age, rna.rin.score, sex ). Center and library.prep.date are collinear with the status (all the cases were collected in London while the controls were collected in 4 different centers worldwide) so I used the first principal component of these two covariates and ran DESeq2 using this design: ~ PC1 + age + rna.rin.score + sex + status Unfortunately it looks like the batch effect is too strong and I have ~16K genes with adjP<0.05 One question: is the fold-change still reliable (So that I can use it as rank for GSEA analysie for example) ? Now, although the differential expression might be hampered by the study design and I dont know if I can use these results (what do you think?) That your status (control/disease) is aliased with the sequencing centre is a severe problem. If you find a gene to differ between control and disease samples, you have no way of knowing whether this difference is due to the status or due to specific peculiarities in the library handling of the centres. If there are any differences in the way how the London centre handles its samples, compared to the other centres, and these differences are substantial enough to cause genes to appear differentially expressed, then this will completely overwhelm any signal. Unless you have taken special precautions, this is quite likely to happen, and the fact that you got so much stronger differences than expected, indicated that it likely did happen. So, no, I don't think you can use any results from this study. I am very puzzled about your regression on PC1. Could you explain? Maybe I misunderstood you, but I don't see how PCA analysis can cure an aliasing issue. I'm afraid your main problem is that you seem to be under the wrong impression that it were possible, at least in principle, to tease apart the effect of status and center. the main problem is the following: I have based all the analyses for my PhD thesis and the manuscript I am preparing using DESeq. The pipeline is based on co-expression clustering (WGCNA), diffco-ex between cases and controls and GWAS hits enrichment in these clusters. For the pre-processing of these analyses Ive first obtained the VST data and then adjusted these for the covariates using a linear model. Then I used the residuals for the analyses: vsd <- varianceStabilizingTransformation(dds, blind=TRUE) vstMat <- assay(vsd) lm=lm(vstMat ~ as.factor(info$library.prep.date) + as.numeric(info$age) + as.factor(info$sex) + as.numeric(info$rna.rin.score) + as.factor(info$center)) data = residuals(lm) The main question is that we are not sure if this pre-processing is correct, does the linear model work here for this purpose on VST data ? (The doubt came by the fact that a fold-change for one gene of interest suggested it was strongly up-regulated in cases, but when boxplotting the residuals from the linear model adjusted for all the covariates you dont see any difference at all ) If you regress out the effect of center, you also regress out the effect of status, because it is aliased. Hence, your clustering on the residuals can pick up all kinds of effects, but they will be almost guaranteed to have nothing to do with the disease/control difference. So, of course, you don't see any differences between case and control after you regress out center. Apart from that: We now recommend the rlog transformation in DESeq2 instead of the VST from DESeq. It is more reliable in case of varying library sizes. I'm not familiar with WGCNA, so I cannot comment on how well it works with transformed RNA-Seq data. In principle, it should work, I suppose. But, of course, it won't magically cure your aliasing issue either. Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago Pesce, Francesco ▴ 20

Login before adding your answer.