Question

normalization of ChIP-seq data by using the spike-ins or by using total library sizes

1

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 6 months ago

Palo Alto, CA, USA

Dear all,

This question may have been asked before, I have searched the mailing list and I can not find an answer. The question is about the correct way of setting the SizeFactors() in DESeq2 in 3 situations. I would like to double check with you. Although the R code that I do list below refers to DESeq2, the question applies equally to edgeR or to limma / voom. I would appreciate having your suggestions.

I am working on 4 ChIP-seq datasets (2 DMSO samples and 2 TREAT samples). The input is a list of PEAKS with raw counts in all these 4 conditions (2 DMSO, and 2 TREAT). I am doing the analysis in 3 ways :

<> normalization to # Spike-ins (The Spike-ins numerical values have been estimated). I am setting the SizeFactors as the ratio between each Spike-in # , divided by the minimum # of all Spike-ins. Is it the correct way of doing it ?

spikeins=c(68924, 80282, 52023, 47090) 

spikeins_norm = spikeins / min(spikeins)

sizeFactors(dds) = spikeins_norm
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced= ~1)
res <- results(dds, name = "group_TZO_vs_DMSO")

<> normalization to Total # or Aligned Reads (in BAM files). I am setting the Size Factors() in the same way as I do it for Spike-Ins (Certainly, considering the Total # of Aligned Reads is not a recommended way to set the SizeFactors(), I am just doing it for comparison purposes).

     counts_BAM=c(1489249, 1568028, 1616202, 1647090) 

     counts_BAM_norm = counts_BAM / min(counts_BAM)

     sizeFactors(dds) = counts_BAM_norm
     dds <- estimateDispersions(dds)
     dds <- nbinomLRT(dds, reduced= ~1)
     res <- results(dds, name = "group_TZO_vs_DMSO")

<> normalization to # Reads in the Peaks. In this case, I follow the standard R workflow :

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)

Please let me know what you think. Thanks,

Bogdan

DESeq2 edgeR limma • 2.3k views

ADD COMMENT • link updated 16 months ago by Michael Love 41k • written 17 months ago by Bogdan ▴ 670

0

Entering edit mode

Referring to limma and edgeR, I could add that the R code that has been posted on the web pages (listed below) works and gives results that make sense.

 https://support.bioconductor.org/p/74870/
 https://support.bioconductor.org/p/p132471/
 https://support.bioconductor.org/p/9135179/
 https://www.biostars.org/p/166556/

Any suggestions about the R code in limma/voom ?

ADD REPLY • link 17 months ago Bogdan ▴ 670

score 0 · Answer 1 · 2022-11-27

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 6 hours ago

United States

For spike-ins, just use this built in paradigm:

dds <- estimateSizeFactors(dds, controlGenes= <names or numeric index of control features> )
dds <- DESeq(dds, ...)
...

For the others, I recommend to set sizeFactors by centering them around 0 first, e.g. the mean of the log of size factors should be ~0. So you can do

sizeFactors <- sizeFactors / exp(mean(log(sizeFactors)))

ADD COMMENT • link 16 months ago Michael Love 41k

0

Entering edit mode

Thank you very much Mike . In the light of the article on spiker : https://spiker.readthedocs.io/en/latest/reference.html where the size factors are computed as : 1 mil reads / # spike-in

given the amount of spike-ins :

counts_spike_ins = c(1489249, 1568028, 1616202, 1647090)

is it legit to use the size factors as :

size_factors = 1 / counts_spike_ins

or :

 size_factors = 1000000 / counts_spike_ins

Thanks a lot !

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

It used to be a larger question, but I have solved the issue on controlGenes. I do keep the post just in case that somebody else is interested.

given the command :

dds <- estimateSizeFactors(dds, controlGenes= <names or numeric index of control features> )

In order to compute the SizeFactors :

I am starting with a data frame with 4 columns that contains the data of interest that is COUNTS_EXONS_dse , and I add a row with the spikes values :

spikeins = data.frame(DMSO_1 = 154815, DMSO_7  = 253491, TAZ_2 = 517331, TAZ_8  = 638848)

rownames(spikeins) = "spikeins"

head(rbind(spikeins, COUNTS_EXONS_dse))

COUNTS_EXONS_dse = rbind(spikeins, COUNTS_EXONS_dse)

head(COUNTS_EXONS_dse)
                    DMSO_1  DMSO_7  TAZ_2 TAZ_8
spikeins                154815     253491    517331    638848
1 : 858863 - 859082         25         17         5         4
1 : 860261 - 860998         82         51        24        21
1 : 862299 - 862760         37         26         7        11
1 : 867468 - 869833        281        179        79        76
1 : 874902 - 875224         25         15        11         5

dds <- estimateSizeFactors(dds, controlGenes=1)

sizeFactors(dds)
DMSO_1_S.0 DMSO_7_S.6  TAZ_2_S.1  TAZ_8_S.7 
 0.4587516  0.7511508  1.5329680  1.8930502

or :

isControl <- rownames(dds) %in% "spikeins"

 dds <- estimateSizeFactors(dds, controlGenes=isControl) # the row 1 contains the spike-in data

dds$sizeFactors

sizeFactors(dds)
DMSO_1_S.0 DMSO_7_S.6  TAZ_2_S.1  TAZ_8_S.7 
 0.4587516  0.7511508  1.5329680  1.8930502

There is smaller question though : how the instruction "controlGenes" work ?

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

Mike,

I would like to ask you, how is the normalization done when we compute the differential expression and set up ControlGenes = "spike-ins" (how does the algorithm work ?)

Thanks !

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

It uses only those rows to compute size factors. So it’s the same algorithm just focused on the features that should be stable aside from library size differences.

ADD REPLY • link 16 months ago Michael Love 41k

0

Entering edit mode

Thank you. At this moment I have included in the matrix of gene expression only 1 spike-in (let's call it S). Therefore, is it legit to compute the size factors based on only on only a spike-in ?

We do have a list of approx 30 spike-ins, however, only spike-in S has the highest intensity (approx 100x than other spike-ins intensity). Shall I include all of spike-ins in the analysis or only spike-in S suffices ?

Thanks again !

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

On another topic, talking about edgeR or limma, any thoughts regarding the computation of the normalization factors as Gordon Smith suggested in : Using edgeR and a spike-in to calculate absolute abundance

I have asked Gordon too, but I have not heard from him yet. The code that he suggested is :

 y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
norm.factors <- spike_in_factor / y$samples$lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))
y$samples$norm.factors <- norm.factors

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

This is edgeR specific, how norm.factors and lib.size are defined there.

ADD REPLY • link 16 months ago Michael Love 41k

0

Entering edit mode

I wouldn't recommend with a single spike in, that has too much variance. Include all spike-ins.

ADD REPLY • link 16 months ago Michael Love 41k

0

Entering edit mode

Thanks Mike.

If you do not mind, I would add another question. It is about limma (I have asked Gordon and Aaron Lu, and I have not heard back from them). Is it legit to set up the normalization factors in limma/voom as equal directly to the spike-in amounts ? You must have done the comparisons with edgeR and voom, when you did write the function controlGenes in DEseq2.

For example :

SPIKEINS = c(154815, 253491, 517331, 638848)
y$samples$norm.factors = SPIKEINS
v <- voom(y, design, plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contrast.matrix)
efit <- eBayes(vfit)

I would appreciate having your comments. Thanks !

ADD REPLY • link 16 months ago Bogdan ▴ 670

0

Entering edit mode

I don't know the difference, controlGenes is simply computing median ratio over the control genes.

ADD REPLY • link 16 months ago Michael Love 41k