Question: How to normalize chromatin and RNA-Seq data together?
12 months ago, g.atla wrote:

Dear All,

I have chromatin data from ATAC-Seq and several histone marks, plus RNA-Seq data for the same samples. Let's say I have 14 independent samples subjected to ATAC-Seq and RNA-Seq, but at different times and at different sequencing centres.

Now I want to compare the signal from the chromatin data to the gene expression data, for example by calculating the correlation between the chromatin signal at a peak and the expression level of nearby genes.

As these data are not directly comparable to each other, and sequencing depth normalisation is not enough, I would like to know how to make these data sets comparable so that the analysis is biologically valid and acceptable for publication. I could not find any methods online for this sort of analysis, though it has been shown in many papers.



Hi! It's not clear to me what you're doing. When you say:

"14 independent samples subjected <...> at different times" are you implying that:


1. The ATAC-Seq was generated at biological days 1 ... 2 ... 10 (for example) while the RNA-seq was at biological days 1 ... 3 ... 5?

2. You sent off 14 ATAC-seq libraries for sequencing, each one independently to the sequencing centre? And the 14 RNA-seq libraries were also independently sent?

3. The ATAC-Seq and RNA-seq were generated from the same cells, at the same biological day (i.e. you split your cell population into two, and then made libraries from the two different protocols). Then you sent off say 5 and then 5 and then another 4 libraries to the same or different sequencing centres? 


In the case of (1) and (2) there really is no truly "valid" way of doing the comparison. In the case of (3), you can analyse the RNA-seq and ATAC-seq independently of each other (as appropriate for each technique), making sure that you have a "batch" variable in your design formula that records which samples were run on the same sequencing machine in the same run (this is what actually matters, not which sequencing centre your samples went to). I'm also assuming that you have at least three replicates per assay for each of the biological conditions you're investigating (or two, at the non-ideal-and-really-invalid bare minimum).
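To illustrate the "batch" point: a batch term can only help if the sequencing runs are not perfectly aligned with the biological conditions. Below is a minimal Python sketch that checks this on a sample sheet before any model is fitted. The sample sheet and the function names are made up for illustration and are not part of any package mentioned in this thread.

```python
from collections import defaultdict

# Hypothetical sample sheet: (sample, condition, sequencing_run). All values are invented.
samples = [
    ("s1", "control", "run1"),
    ("s2", "control", "run2"),
    ("s3", "control", "run3"),
    ("s4", "treated", "run1"),
    ("s5", "treated", "run2"),
    ("s6", "treated", "run3"),
]


def conditions_per_run(sheet):
    """Map each sequencing run to the set of conditions sequenced in it."""
    runs = defaultdict(set)
    for _sample, condition, run in sheet:
        runs[run].add(condition)
    return dict(runs)


def batch_fully_confounded(sheet):
    """True if every run contains a single condition while the sheet has several.

    In that situation the condition is determined by the run, so a batch
    term cannot separate run effects from condition effects.
    """
    conditions = {condition for _sample, condition, _run in sheet}
    if len(conditions) < 2:
        return False
    return all(len(found) == 1 for found in conditions_per_run(sheet).values())


print(conditions_per_run(samples))
print(batch_fully_confounded(samples))
```

The same check would be run once for the RNA-seq sample sheet and once for the ATAC-seq sample sheet, since each assay is analysed on its own.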

Then you combine the outputs (say, differentially expressed genes with differentially accessible ATAC-seq peaks at the same time points). 14 samples isn't really enough to try things like WGCNA and other more complex, fun correlation tools.
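One possible way to do the "combine" step, sketched in Python under several assumptions: each assay has already been analysed separately, each peak is linked to the gene with the closest TSS within a fixed window, and the 50 kb window, the gene and peak names, the coordinates and the fold changes are all invented for illustration. Nearest-TSS assignment is only one of several linking rules in use.

```python
# Hypothetical results from two separate analyses; every value is invented.
genes = {  # gene -> (chromosome, TSS position, RNA-seq log2 fold change)
    "geneA": ("chr1", 10_000, 1.8),
    "geneB": ("chr1", 90_000, -1.2),
    "geneC": ("chr2", 5_000, 0.9),
}
peaks = [  # (peak id, chromosome, peak midpoint, ATAC-seq log2 fold change)
    ("peak1", "chr1", 12_500, 1.1),
    ("peak2", "chr1", 88_000, 0.7),
    ("peak3", "chr2", 400_000, -0.5),
]


def nearest_gene(chrom, pos, genes, max_dist=50_000):
    """Return the gene with the closest TSS on the same chromosome,
    or None if no TSS lies within max_dist bases of pos."""
    best, best_dist = None, max_dist + 1
    for name, (gene_chrom, tss, _lfc) in genes.items():
        dist = abs(tss - pos)
        if gene_chrom == chrom and dist < best_dist:
            best, best_dist = name, dist
    return best


def pair_peaks_with_genes(peaks, genes, max_dist=50_000):
    """List (peak, gene, same_direction) for every peak that has a nearby gene."""
    pairs = []
    for peak_id, chrom, pos, peak_lfc in peaks:
        gene = nearest_gene(chrom, pos, genes, max_dist)
        if gene is not None:
            gene_lfc = genes[gene][2]
            same_direction = (peak_lfc > 0) == (gene_lfc > 0)
            pairs.append((peak_id, gene, same_direction))
    return pairs


for row in pair_peaks_with_genes(peaks, genes):
    print(row)
```

Counting how many pairs change in the same direction gives a first, coarse view of whether accessibility and expression move together between conditions.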

written 12 months ago by Darya Vanichkina
12 months ago, Michael Love (United States) wrote:

DESeq2 won't help you correlate nearby chromatin to gene expression. It doesn't offer any built-in functionality for that.

People put e.g. RNA-seq and another type of assay (ChIP or ATAC) in the same dds object when they want to ask whether the fold change due to treatment in one assay differs from the fold change in another assay, using an interaction term. But that doesn't sound like what you want to do.
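To make the interaction idea concrete, here is the arithmetic such a term targets, written as a small Python sketch. This is not DESeq2 code and uses none of its API; it only shows that, for one feature, the assay-by-treatment interaction corresponds to the difference between the treatment log2 fold changes in the two assays. The mean values and function names are invented.

```python
from math import log2

# Hypothetical mean signal for one feature, already normalised within each assay.
means = {
    ("RNA", "control"): 100.0,
    ("RNA", "treated"): 400.0,
    ("ATAC", "control"): 50.0,
    ("ATAC", "treated"): 100.0,
}


def log2_fold_change(assay):
    """Treatment effect within a single assay (treated over control)."""
    return log2(means[(assay, "treated")] / means[(assay, "control")])


def interaction(assay_a, assay_b):
    """Difference in treatment log2 fold change between two assays."""
    return log2_fold_change(assay_a) - log2_fold_change(assay_b)


print(log2_fold_change("RNA"), log2_fold_change("ATAC"))
print(interaction("ATAC", "RNA"))
```

A value of zero would mean the treatment changes both assays by the same fold; a real analysis would also need a standard error for this difference, which the sketch does not provide.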


Indeed, even just trying to fit the same GLM to counts from different data types is a bit of a stretch. For starters, they will have different modes of variability, so trying to estimate a single dispersion value for each "gene" will be inappropriate, let alone trying to fit a mean-dispersion trend. The biases will also be different so you'd have to normalise the counts from each data type separately, precluding direct comparisons between data types. The best you could do would be as Mike suggests, where you compare the log-fold changes upon treatment between data types. Even that's a bit sketchy, because differences are inevitable - for example, why would you ever expect a 2-fold increase in chromatin accessibility to result in a 2-fold increase in transcription? The entire process depends on so many other things, it's not likely to be a linear relationship in general.
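As a rough illustration of "normalise each data type separately", the Python sketch below computes scaling factors per sample with a median-of-ratios style calculation and applies it to each count matrix on its own. The choice of this particular scaling method for the chromatin data is an assumption made for the example, not a recommendation from this thread, and the tiny count matrices are invented.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Per-sample scaling factors from a features-by-samples count matrix.

    Each factor is the median, over features, of the ratio between the
    sample's count and the feature's geometric mean across samples.
    Features containing a zero count are skipped.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        log_geo_mean = sum(log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(log(c) - log_geo_mean)
    return [exp(median(r)) for r in log_ratios]


def normalise(counts):
    """Divide every count by the scaling factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical matrices: rows are genes (RNA) or peaks (ATAC), columns are samples.
rna_counts = [[10, 20], [100, 200], [5, 10]]
atac_counts = [[30, 30], [8, 9], [50, 45]]

# Each assay gets its own factors; the two matrices are never pooled.
rna_norm = normalise(rna_counts)
atac_norm = normalise(atac_counts)
print(size_factors(rna_counts), size_factors(atac_counts))
```

Because the factors are estimated within each assay, the normalised values are comparable across samples of the same assay but still not on a common scale between assays, which is the limitation described above.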

Anyway, assuming your samples are separated into biological conditions, your best shot would be to look for condition-level correlations, i.e., changes in binding associated with changes in expression. This corresponds to scenario 3 in Darya's post. Here, the idea would be to analyse each data type separately to avoid the problems I've mentioned above.
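A condition-level comparison of this kind could look like the following Python sketch: take the log2 fold changes from the two separate analyses for matched peak-gene pairs and compute a rank (Spearman) correlation, which does not assume the linear relationship questioned above. The fold changes are invented, the pairing of peaks to genes is assumed to have been done already, and the helper functions are written out by hand so that the example has no dependencies.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log2 fold changes for five matched peak-gene pairs.
peak_lfc = [1.1, 0.2, -0.8, 2.0, -0.1]
gene_lfc = [0.9, 0.1, -1.5, 3.5, 0.3]
print(spearman(peak_lfc, gene_lfc))
```

With only a handful of conditions or pairs, such a coefficient is descriptive at best and should not be read as evidence of a regulatory link.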

modified 12 months ago • written 12 months ago by Aaron Lun

"Even that's a bit sketchy, because differences are inevitable - for example, why would you ever expect a 2-fold increase in chromatin accessibility to result in a 2-fold increase in transcription?" => agree

written 12 months ago by Michael Love