Question

Combat Continuous

Michael Breen ▴ 380

@michael-breen-5999

Last seen 11.4 years ago

Hi all,

I have time-course gene expression data from blood at 4 different time points. During the last two time points we see an increase in the frequencies of several cell types. We are interested in correcting our gene expression matrix for several continuous variables, namely the estimated cell-type frequencies. However, I realize that ComBat does not correct for continuous batches, but rather handles continuous variables.

In this scenario I have no batch effects; however, I am interested in using the continuous variables (cell-type frequencies) to correct our gene expression matrix.

1. What does the function "numCovs" implement exactly, and how does it handle continuous variables? What is the result on the gene expression matrix?
2. If correcting the matrix is not feasible, we may consider simply using cell-type frequencies as continuous variables in an ANCOVA.

Yours,
Michael

--
M.S. Breen
PhD, Bioinformatics and Genomics
Clinical and Experimental Sciences
Univ. of Southampton

ADD COMMENT • link 11.8 years ago Michael Breen ▴ 380

score 0 · Answer 1 · 2014-05-09

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: https://stat.ethz.ch/pipermail/bioconductor/attachments/20140509/33bf604a/attachment-0001.pl

ADD COMMENT • link 11.8 years ago Michael Breen ▴ 380

Since the true experts are silent, here are my 2 cents...

On Fri, May 9, 2014 at 2:14 PM, Michael Breen <breenbioinformatics at gmail.com> wrote:

> Hi all,
>
> I have time course gene-expression from blood at 4 different time points. During the last two time points we see an increase in various different cell-type frequencies. We are interested in correcting our gene expression matrix with numerous continuous variables, that is estimated cell-type frequency. However, I realize that Combat does not correct for continuous batches, but rather continuous variables.
>
> In this scenario, I have no batch effects however I am interested in using the continuous variables (cell-type frequencies) to correct our gene expression matrix.

You may want to read previous work on cell-type specific expression analysis, for example:

A. Kuhn, D. Thu, H. J. Waldvogel, R. L. M. Faull and R. Luthi-Carter. Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain. Nature Methods, vol. 8, num. 11, p. 945-947, 2011.

Estimating gene expression within specific cell populations is more involved than simply using a linear model in which cell type frequencies are covariates.

> 1. What does the function "numCovs" implement exactly and how does it handle continuous variables? What is the result on the gene-expression matrix?

numCovs is not a function; it is an argument of the function ComBat. It lets you specify the columns of the model matrix that are to be treated as continuous variables rather than factors. Continuous variables and factors are treated differently in the underlying linear model: there is one term for each continuous variable, while there is one term for each level of a factor except the first one.

HTH,
Peter
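To make the difference between continuous variables and factors concrete, here is a minimal Python sketch of how a design matrix grows in each case. It is not ComBat's code: the helper `design_matrix`, the variable `nk_freq`, and the time-point labels are all hypothetical names chosen for illustration.

```python
import numpy as np

def design_matrix(continuous, factors):
    """Build an intercept + covariate design matrix (illustrative only).

    continuous: dict name -> list of floats (one column per variable)
    factors:    dict name -> list of labels (one 0/1 column per level
                except the first, which acts as the reference level)
    """
    n = len(next(iter({**continuous, **factors}.values())))
    columns = [np.ones(n)]
    names = ["(Intercept)"]
    for name, values in continuous.items():
        # A continuous variable contributes exactly one term.
        columns.append(np.asarray(values, dtype=float))
        names.append(name)
    for name, labels in factors.items():
        levels = sorted(set(labels))
        for level in levels[1:]:  # skip the reference level
            columns.append(np.array([1.0 if lab == level else 0.0 for lab in labels]))
            names.append(f"{name}{level}")
    return np.column_stack(columns), names

# Hypothetical example: one continuous covariate, one 4-level factor
X, names = design_matrix(
    continuous={"nk_freq": [0.05, 0.06, 0.12, 0.15]},
    factors={"time": ["baseline", "pre", "post", "post1h"]},
)
print(names)
print(X.shape)
```

The continuous covariate adds a single column, while the four-level factor adds three indicator columns (one per level other than the reference).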

ADD REPLY • link 11.8 years ago Peter Langfelder ★ 3.0k

Peter,

Thanks for the clarification of ComBat continuous variables vs. factors. That is the information I was missing.

As for deconvolution analysis, we are currently using the DSA method of Zhong et al. (2013), with HaemAtlas providing a signature matrix (or cell-type marker list). The method estimates cell proportions from mixed-sample expression data, given a set of markers (HaemAtlas), i.e. features that are known to be exclusively expressed by a single cell type in the mixture. These analyses are, however, completely dependent upon your marker lists.

Now that I have some reasonable cell-type frequencies, I would like to explore the potential of either:

A) correcting for these cell-type frequencies, as within ComBat
B) using these frequencies as continuous variables in a linear model.

I don't find option B) any more involved than that. Can you elaborate?

Michael

ADD REPLY • link 11.8 years ago Michael Breen ▴ 380

Hi Michael,

On Mon, May 12, 2014 at 1:23 PM, Michael Breen <breenbioinformatics at gmail.com> wrote:

> As far as deconvolution analysis, we currently are using Zhong et al. (2013) DSA method using HaemAtlas to provide a signature matrix (or cell-type marker list). The method estimates cell proportions from mixed sample expression data, given a set of markers (HaemAtlas), i.e. features that are known to be exclusively expressed by a single cell type in the mixture. Although, these analyses are completely dependent upon your marker lists.

This is straightforward.

> Now, that I have some reasonable cell-type frequencies I would like to explore the potential of either:
>
> A) correcting these cell-type frequencies, as within Combat
> B) using these frequencies as continuous variables in a linear model.
>
> I don't find option B) anymore involved than that. Can you elaborate?

Well, we should first define your analysis goal more precisely. What do you have in mind when you say "adjust gene expression for changes in cell type composition"?

If you want cell-type specific expression, here's how the authors of the paper I referred to earlier do it (and I agree): Gene expression of a particular gene is presumably different in each cell type. So the total expression of each gene is a sum of cell-type specific expressions multiplied by the cell type abundances. This sort of looks like a linear model, except that the coefficients multiplying the cell type abundances are not constant: the cell-type specific gene expressions presumably also change between time points, and it is these changes people are usually after (this is why we need to define your analysis goal more precisely).

To isolate the cell-type specific expression changes between time points, you would then have to write a separate linear model at each time point, figure out the cell-type specific expression as the regression coefficient at each time point, then compare them (i.e., the regression coefficients). This of course assumes that at each time point you have multiple samples, preferably many more than the number of cell types you suspect you have in your sample, and that the cell-type specific expression of each gene within each cell type at each time point can be considered constant.

If you simply adjust for cell type frequencies, it is not clear to me how to interpret the resulting number.

Peter
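The per-time-point approach described above can be sketched as follows for a single gene. This is only an illustration under simplifying assumptions (two cell types, noise-free made-up data, ordinary least squares without any of the model selection used in the PSEA paper); the function `cell_type_expression` and all numbers are hypothetical.

```python
import numpy as np

def cell_type_expression(expr, freqs):
    """Least-squares estimate of cell-type specific expression for one gene.

    expr:  (n_samples,) bulk expression of the gene at one time point
    freqs: (n_samples, n_cell_types) cell-type frequencies of those samples
    Model: expr = freqs @ beta, where beta[c] is the expression in cell type c.
    No intercept is used because the frequencies sum to 1 in each sample,
    so an intercept column would be collinear with them.
    """
    beta, *_ = np.linalg.lstsq(freqs, expr, rcond=None)
    return beta

# Hypothetical frequencies: 6 samples x 2 cell types at two time points
freqs_pre = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
                      [0.70, 0.30], [0.75, 0.25], [0.95, 0.05]])
freqs_post = np.array([[0.60, 0.40], [0.50, 0.50], [0.65, 0.35],
                       [0.55, 0.45], [0.70, 0.30], [0.45, 0.55]])

# Made-up "true" cell-type specific expression: only cell type 2 changes
true_pre = np.array([10.0, 2.0])
true_post = np.array([10.0, 6.0])
expr_pre = freqs_pre @ true_pre      # bulk expression, noise-free
expr_post = freqs_post @ true_post

# One model per time point, then compare the regression coefficients
beta_pre = cell_type_expression(expr_pre, freqs_pre)
beta_post = cell_type_expression(expr_post, freqs_post)
change = beta_post - beta_pre
print(change)
```

With real data the coefficients would carry sampling error, which is why having many more samples than cell types at each time point matters.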

ADD REPLY • link 11.8 years ago Peter Langfelder ★ 3.0k

Peter,

I considered re-titling this post; however, we can leave it as is.

I am familiar with this paper and I too agree with the methodology of writing separate linear models at each time point and then comparing the regression coefficients.

As it stands, this project has 4 time points (baseline, pre, post, 1 hour post), with the same 6 subjects at each time point. In this study we have profiled the leukocytes and, given the type of experiment, it would have been easy to assume prior to investigation that there would be large increases in the frequency of NK cells at our post and 1 hour post time points (which is the case after deconvolution). Because we are simply after differentially expressed genes, it is easy to see that most of the resulting genes are associated with the changes in NK cells. Therefore, we would like to remove this effect of NK-cell frequency (and that of all other cell types with significant variation across the time points), as we know that gene expression can vary substantially among cell types and the heterogeneity of our tissue may mask the identification of biologically important information within less abundant cell types.

In other words, what I am after is aggregating the regression coefficients from the various time points of our different cell types and using these as continuous variables to adjust for within our DE testing, in order to reveal differences that are not necessarily associated with increases of a particular cell type.

Does that make sense?

p.s. Like others, I have enjoyed what you have done with WGCNA.

ADD REPLY • link 11.8 years ago Michael Breen ▴ 380

Hey Guys,

I hesitate even responding here, because it seems that you are on the right track. It seems to me that ComBat is clearly not the right fit here. The numerical covariate designation only allows you to keep variation from a numerical covariate IN the data; it does not remove it. Also note that this just uses a simple linear regression, so it seems that your problem is much too complicated for it.

Good luck with your analysis. Sorry I couldn't be more help.

Evan
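The point about a numerical covariate being kept in the data can be illustrated with a deliberately simplified Python sketch. It is not ComBat's implementation (ComBat additionally standardizes the data and applies empirical Bayes shrinkage to the batch estimates); it only shows the plain linear-regression idea for one gene, with made-up, noise-free numbers and hypothetical variable names.

```python
import numpy as np

# Hypothetical single-gene example: 8 samples, two batches, one numeric covariate
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
covariate = np.array([0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45])
expr = 1.0 + 2.0 * covariate + 3.0 * batch   # noise-free for clarity

# Fit expr ~ intercept + covariate + batch by ordinary least squares
X = np.column_stack([np.ones_like(batch), covariate, batch])
coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
intercept, cov_slope, batch_shift = coef

# Remove ONLY the estimated batch term; the covariate term is left in place
adjusted = expr - batch_shift * batch

# The adjusted data still depend on the covariate with the original slope
slope_after = np.polyfit(covariate, adjusted, 1)[0]
print(batch_shift, slope_after)
```

Including the covariate in the model protects its effect from being absorbed into the batch estimate; the adjusted values still vary with the covariate exactly as before.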

ADD REPLY • link 11.8 years ago W. Evan Johnson ▴ 870

Thanks for the response and nice discussion.

To summarize, we have deconvolved blood gene expression from 4 time points into cell-type frequencies. We are now interested in taking differences in cell-type frequencies into account in our linear model for DE testing. In this example it is tricky to aggregate regression coefficients (cell-type frequencies) from different time points into one continuous variable. Nonetheless, using this as a continuous variable seems to have a nice effect, in our data at least, of reducing the effect of an over-represented cell type at one of our four time points.

Keep in mind that deconvolution of a heterogeneous tissue such as blood depends largely on the cell-type markers being used, and this could considerably change results.
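As a rough illustration of using a cell-type frequency as a continuous covariate in the DE model, the Python sketch below compares the estimated time effect for one gene with and without the covariate. Everything here is an assumption for illustration: the data are made up and noise-free, only two time points are shown, the repeated measurements on the same subjects are ignored, and `time_effect` is a hypothetical helper rather than part of any DE package.

```python
import numpy as np

def time_effect(expr, post, covariate=None):
    """Return the OLS coefficient of the time indicator `post`.

    If `covariate` is given, it is added to the model as a continuous term.
    """
    cols = [np.ones_like(expr), post]
    if covariate is not None:
        cols.append(covariate)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return coef[1]

# Hypothetical data: 6 samples before and 6 after, NK frequency rises afterwards
post = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
nk_freq = np.array([0.05, 0.06, 0.04, 0.07, 0.05, 0.06,
                    0.15, 0.18, 0.12, 0.20, 0.16, 0.14])

# A gene whose bulk expression is driven only by NK-cell frequency
expr = 5.0 + 30.0 * nk_freq

unadjusted = time_effect(expr, post)            # apparent time effect
adjusted = time_effect(expr, post, nk_freq)     # time effect given NK frequency
print(unadjusted, adjusted)
```

In this toy case the apparent time effect disappears once NK frequency is in the model, which is the behaviour described above. As noted earlier in the thread, the adjusted coefficient is not a cell-type specific expression estimate, so its interpretation needs care.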

ADD REPLY • link 11.8 years ago Michael Breen ▴ 380