Question

Global reduction of epigenetic factors (regularized log transformation DESeq2) ~ Diffbind

0

Entering edit mode

JunLVI ▴ 40

@junlvi-8996

Last seen 5.7 years ago

Japan

Hi, I was trying to investigate the changes of H3K27ac upon conditional KO of gene X.

After mapping and counting the data using "featurecount" from "Rsubread" packages, I used the DEseq2 to perform differential analysis.

Below is the scatter plot (WT vs CKO), I could obtain genomic region with the significantly changed H3K27ac (as highlighted in "red" in the plot),

but I also noticed that there is global shift of histone marks.

One possibility I could think of is that this shift might just reflect the fact that the sequence depth is different between WT and CKO samples.

I checked the "regularized log transformation" part of the vigentte as well as the original paper of "DESeq2",

To be honest, I was not able to fully understand the underlying mathematics, but it seems me that the "rlog" transformation has factored in "size factor", or the sequence depth . Even though the main rationale of performing "rlog" transformation is to stabilize the variance across gene with difference means.

So does it mean the shift did reflect the global reduction of H3K27ac upon gene X KO, given that the wet experiments and other data process performed properly? ( I guess even it is a correct interpretation, other independent method might be needed to verify the conclusion.)

Or did I just misunderstand of "rlog" methodology?

Thanks in advance.

I asked the question at biostar, in a more general sense.

https://www.biostars.org/p/255052/

Devon Ryan kindly suggested to use DiffBind to "normalize only to non-peak regions", I checked "DiffBind", could not find the options he referred to. But maybe I missed some key points. So if "Diffbind" could help solve my issue, please kindly let me know.

deseq2 diffbind • 1.9k views

ADD COMMENT • link updated 6.9 years ago by Michael Love 41k • written 6.9 years ago by JunLVI ▴ 40

score 1 · Answer 1 · 2017-06-02

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 4 hours ago

United States

if there is global up or down changes you need to specify the controlGenes for the size factor estimation to make sense.

Yes there can be changes when you update versions. If you need the results not to change, best to stay within a release and not change the version of R/Bioconductor.

ADD COMMENT • link 6.9 years ago Michael Love 41k

0

Entering edit mode

@Michal Love, thanks very much for your suggestions. I will try a set of house keeping genes ( here those gene associated epigenomic loci).

it seems to me that some prior evidences /knowledges / assumptions are needed to decide which is the best way to normalize the datasets.

Devon recommended to use non-peak or randomly chosen regions, even this is not watertight, as he readily admitted

"since one could consider cases where the underlying histone modification level is affecting the background coverage rate.".

I guess that brings us back to the basic idea of normalization, "normalize to something stable/unchanged/comparable".

ADD REPLY • link 6.9 years ago JunLVI ▴ 40

score 0 · Answer 2 · 2017-05-29

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 4 hours ago

United States

I don't follow what is being plotted. Is that rlog-ed values? The rlog transformation does correct for sequencing depth, by using already calculated size factors or estimating them if they have not been calculated. This means you can provide whatever size factors you like (e.g. estimated from subsets of regions) before calling rlog().

I'm not sure about the normalization options within DiffBind, but I do know that you'll need to provide your code for someone to offer useful suggestions on that part.

ADD COMMENT • link 6.9 years ago Michael Love 41k

0

Entering edit mode

@Michael Love, Hi, Michael, thank you very much for your feedback. Yes, the rlog-ed values are plotted, sorry that I did not make that clear. I checked the "estimateSizeFactors" documentations, I guess the "control gene" argument is what you mean by "you can provide whatever size factors you like (e.g. estimated from subsets of regions)". I did not notice that. Since I have been using the default setting, I guess that means I having been using all the regions in the "DESeqDataSet" object for estimation of "size factor". maybe I should try to use a set of house keeping genes.

If I use all the regions to estimate the size factor, the setting I have been using, somehow my intuition is that after I apply rlog() to the read count, in scatter plot, the dot should be largely evenly distributed along the diagonal line, instead of skewing to one side. Does that make sense?

And if the scatterplot show this kind of skewness after rlog() with size factor estimated from all the regions, does it still legitimate to apply the differential analysis with Deseq2?

ADD REPLY • link 6.9 years ago JunLVI ▴ 40

0

Entering edit mode

I meant you can do:

sizeFactors(dds) <- ...

To set the size factors however you like. Or yes, if you want to pick certain rows to do size factor estimation on, you can use controlGenes.

Yes, I would expect that if you used all regions to estimate the size factors, then the points above would fall on the y=x line. Can you provide the actual code you used to make the plot? This helps me to see what might be going on.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

@Michael Love, Thanks to your suggestion, I went back to check my original code:

Here is code in R session:

H3K27ac_CellA_CellD <- dba(sampleSheet="sampleconfiguration.csv",)
H3K27ac_CellA_CellD_Count <- dba.count(H3K27ac_CellA_CellD, score= DBA_SCORE_TMM_READS_EFFECTIVE, mapQCth=0, bRemoveDuplicates=FALSE) 
H3K27ac_CellA_CellD_Diff <- dba.analyze(H3K27ac_CellA_CellD_Count,method=DBA_DESEQ2)
H3K27ac_CellA_CellD_Diff_All <- dba.report(H3K27ac_CellA_CellD_Diff,method=DBA_DESEQ2, th=1)
H3K27ac_All <- data.frame (Conc_WT=H3K27ac_CellA_CellD_Diff_All @elementMetadata$Conc_WT,
                           Conc_GeneXcKO= H3K27ac_CellA_CellD_Diff_All @elementMetadata$Conc_GeneXcKO)

p <- ggplot(H3K27ac_All, aes(Conc_WT, Conc_GeneXcKO))
p + geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype=2) +
    scale_x_continuous(name="Epifeatures\n in WT Cell ", limits = c(-1, 11)) +
    scale_y_continuous(name="Epifeatures \n in KO Cell ",limits = c(-1, 11)) 

H3K27ac_CellA_CellD_Diff_All  @elementMetadata
DataFrame with 6523 rows and 6 columns
          Conc Conc_GeneXcKO   Conc_WT      Fold   p-value       FDR
     <numeric>    <numeric> <numeric> <numeric> <numeric> <numeric>
1         4.77        -0.51      5.75     -6.26  6.51e-11  4.11e-07
2         4.41         0.27      5.37     -5.11  2.55e-08  8.06e-05
...
6523      3.88         4.06      3.67      0.39         1         1

Embarrassingly, I actually did not user rlog(), but the value was from the report the result from DiffBind differential analysis...

I used the "DESEQ2" argument when using Diffbind, perhaps I need to ask how "dba.report" decides...

Conc Conc_GeneXcKO   Conc_WT

ADD REPLY • link 6.9 years ago JunLVI ▴ 40

0

Entering edit mode

Yeah, you could either directly use rlog(), in which case I'd be surprised if I saw the above plot, or otherwise, I can't offer much advice on the internals of the DiffBind functions.

ADD REPLY • link 6.9 years ago Michael Love 41k

score 0 · Answer 3 · 2017-06-02

@Michael Love, I have posted the question related to DiffBind, has yet to receive any feedback. but I have found some related posts,

in replying " DiffBind dba.report() output: Is "concentration" normalized?"

Rory explained, "the counts are normalized. Concentrations are reported as log2 values."

in term of "normalization factor",

by default (bFullLibrarySize=TRUE),

"the factors are set to:

> DESeq2::sizeFactors(DESeqDataSeq) <- libsize/min(libsize)"

the sizeFactors ill set as library size (total reads numbers in the bam file)

otherwise(bFullLibrarySize=FALSE),

then "calls DESeq2::estimateSizeFactors() to calculate the normalization factors.

I changed my argument when apply DiffBind analysis, basically use the "size factor" estimated by DESeq2

for normalization.

H3K27ac_CellA_CellD <- dba(sampleSheet="sampleconfiguration.csv",)
H3K27ac_CellA_CellD_Count <- dba.count(H3K27ac_CellA_CellD, score= DBA_SCORE_TMM_READS_EFFECTIVE, mapQCth=0, bRemoveDuplicates=FALSE) 
H3K27ac_CellA_CellD_Diff <- dba.analyze(H3K27ac_CellA_CellD_Count,method=DBA_DESEQ2,　bFullLibrarySize=FALSE)
H3K27ac_CellA_CellD_Diff_All <- dba.report(H3K27ac_CellA_CellD_Diff,method=DBA_DESEQ2, th=1)

H3K27ac_All <- data.frame (Conc_WT=H3K27ac_CellA_CellD_Diff_All @elementMetadata$Conc_WT,
                           Conc_GeneXcKO= H3K27ac_CellA_CellD_Diff_All @elementMetadata$Conc_GeneXcKO)


p <- ggplot(H3K27ac_All, aes(Conc_WT, Conc_GeneXcKO))
p + geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype=2) +
    scale_x_continuous(name="Epifeatures\n in WT Cell ", limits = c(-1, 11)) +
    scale_y_continuous(name="Epifeatures \n in KO Cell ",limits = c(-1, 11))

The plot looks right, in the code above, raw read counts were normalized against size factor calculated by "DESeq2", then log transformed by the base of 2. It seems to me that the value is arithmatically equivalent to rlog() in DESeq2. ( I am not 100% sure, the original DEseq2 paper explained in statistic notation with the statistics model in mind, which I am not sure that I understood it completely.)

If it is "regularized log transformation" result, I wonder is there any possibility "rlog" might mask the global differences among the samples.

by that, I do not mean rlog() itself are flawed, but is there any occasion "rlog" (or normalize by "size factor" ) should not be applied?

in the original "DEseq" paper it explained why total read counts might not be a good measure of sequence depth: "a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.", it seems to me that it implicitly assumes that most or majority of genes, expression of them are largely comparable (if plotted, distributed alone the line with slope of 1) , does this means it should be cautious when there is possibility that all the genes are disregulated in one direction (up or down) ? I guess such scenario should be quite rare. I guess this might be beyond the scope of DESeq2, but if suspect such a scenario (global increase or decrease), which might be the recommended way to normalize?

PS: I forgot to mention that when I re ran my previous code, I got the different report of differential analysis, as far as I am aware, I used the same codes. The only thing I could think of is that I reinstalled DESeq2 this week, so the versions might be different, could that contribute to the differences here?

1 Contrast:
Group1 Members1 Group2 Members2 DB.DESeq2
1 KO 4 WT 4 133

1 Contrast:
Group1 Members1 Group2 Members2 DB.DESeq2
1 KO 4 WT 4 64