normalization of ChIP-seq data by using the spike-ins or by using total library sizes
1
1
Entering edit mode
Bogdan ▴ 660
@bogdan-2367
Last seen 11 minutes ago
Palo Alto, CA, USA

Dear all,

This question may have been asked before, I have searched the mailing list and I can not find an answer. The question is about the correct way of setting the SizeFactors() in DESeq2 in 3 situations. I would like to double check with you. Although the R code that I do list below refers to DESeq2, the question applies equally to edgeR or to limma / voom. I would appreciate having your suggestions.

I am working on 4 ChIP-seq datasets (2 DMSO samples and 2 TREAT samples). The input is a list of PEAKS with raw counts in all these 4 conditions (2 DMSO, and 2 TREAT). I am doing the analysis in 3 ways :

<> normalization to # Spike-ins (The Spike-ins numerical values have been estimated). I am setting the SizeFactors as the ratio between each Spike-in # , divided by the minimum # of all Spike-ins. Is it the correct way of doing it ?

spikeins=c(68924, 80282, 52023, 47090)

spikeins_norm = spikeins / min(spikeins)

sizeFactors(dds) = spikeins_norm
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced= ~1)
res <- results(dds, name = "group_TZO_vs_DMSO")


<> normalization to Total # or Aligned Reads (in BAM files). I am setting the Size Factors() in the same way as I do it for Spike-Ins (Certainly, considering the Total # of Aligned Reads is not a recommended way to set the SizeFactors(), I am just doing it for comparison purposes).

     counts_BAM=c(1489249, 1568028, 1616202, 1647090)

counts_BAM_norm = counts_BAM / min(counts_BAM)

sizeFactors(dds) = counts_BAM_norm
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced= ~1)
res <- results(dds, name = "group_TZO_vs_DMSO")


<> normalization to # Reads in the Peaks. In this case, I follow the standard R workflow :

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)


Please let me know what you think. Thanks,

Bogdan

DESeq2 edgeR limma • 568 views
0
Entering edit mode

Referring to limma and edgeR, I could add that the R code that has been posted on the web pages (listed below) works and gives results that make sense.

 https://support.bioconductor.org/p/74870/
https://support.bioconductor.org/p/p132471/
https://support.bioconductor.org/p/9135179/
https://www.biostars.org/p/166556/


Any suggestions about the R code in limma/voom ?

0
Entering edit mode
@mikelove
Last seen 52 minutes ago
United States

For spike-ins, just use this built in paradigm:

dds <- estimateSizeFactors(dds, controlGenes= <names or numeric index of control features> )
dds <- DESeq(dds, ...)
...


For the others, I recommend to set sizeFactors by centering them around 0 first, e.g. the mean of the log of size factors should be ~0. So you can do

sizeFactors <- sizeFactors / exp(mean(log(sizeFactors)))

0
Entering edit mode

Thank you very much Mike . In the light of the article on spiker : https://spiker.readthedocs.io/en/latest/reference.html where the size factors are computed as : 1 mil reads / # spike-in

given the amount of spike-ins :

counts_spike_ins = c(1489249, 1568028, 1616202, 1647090)


is it legit to use the size factors as :

size_factors = 1 / counts_spike_ins


or :

 size_factors = 1000000 / counts_spike_ins


Thanks a lot !

0
Entering edit mode

It used to be a larger question, but I have solved the issue on controlGenes. I do keep the post just in case that somebody else is interested.

given the command :

dds <- estimateSizeFactors(dds, controlGenes= <names or numeric index of control features> )


In order to compute the SizeFactors :

I am starting with a data frame with 4 columns that contains the data of interest that is COUNTS_EXONS_dse , and I add a row with the spikes values :

spikeins = data.frame(DMSO_1 = 154815, DMSO_7  = 253491, TAZ_2 = 517331, TAZ_8  = 638848)

rownames(spikeins) = "spikeins"

COUNTS_EXONS_dse = rbind(spikeins, COUNTS_EXONS_dse)

DMSO_1  DMSO_7  TAZ_2 TAZ_8
spikeins                154815     253491    517331    638848
1 : 858863 - 859082         25         17         5         4
1 : 860261 - 860998         82         51        24        21
1 : 862299 - 862760         37         26         7        11
1 : 867468 - 869833        281        179        79        76
1 : 874902 - 875224         25         15        11         5

dds <- estimateSizeFactors(dds, controlGenes=1)

sizeFactors(dds)
DMSO_1_S.0 DMSO_7_S.6  TAZ_2_S.1  TAZ_8_S.7
0.4587516  0.7511508  1.5329680  1.8930502


or :

isControl <- rownames(dds) %in% "spikeins"

dds <- estimateSizeFactors(dds, controlGenes=isControl) # the row 1 contains the spike-in data

dds$sizeFactors sizeFactors(dds) DMSO_1_S.0 DMSO_7_S.6 TAZ_2_S.1 TAZ_8_S.7 0.4587516 0.7511508 1.5329680 1.8930502  There is smaller question though : how the instruction "controlGenes" work ? ADD REPLY 0 Entering edit mode Mike, I would like to ask you, how is the normalization done when we compute the differential expression and set up ControlGenes = "spike-ins" (how does the algorithm work ?) Thanks ! ADD REPLY 0 Entering edit mode It uses only those rows to compute size factors. So it’s the same algorithm just focused on the features that should be stable aside from library size differences. ADD REPLY 0 Entering edit mode Thank you. At this moment I have included in the matrix of gene expression only 1 spike-in (let's call it S). Therefore, is it legit to compute the size factors based on only on only a spike-in ? We do have a list of approx 30 spike-ins, however, only spike-in S has the highest intensity (approx 100x than other spike-ins intensity). Shall I include all of spike-ins in the analysis or only spike-in S suffices ? Thanks again ! ADD REPLY 0 Entering edit mode On another topic, talking about edgeR or limma, any thoughts regarding the computation of the normalization factors as Gordon Smith suggested in : Using edgeR and a spike-in to calculate absolute abundance I have asked Gordon too, but I have not heard from him yet. The code that he suggested is :  y <- DGEList(counts = counts_without_spike_in, samples = mapping_file) norm.factors <- spike_in_factor / y$samples$lib.size norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors)) y$samples$norm.factors <- norm.factors  ADD REPLY 0 Entering edit mode This is edgeR specific, how norm.factors and lib.size are defined there. ADD REPLY 0 Entering edit mode I wouldn't recommend with a single spike in, that has too much variance. Include all spike-ins. ADD REPLY 0 Entering edit mode Thanks Mike. If you do not mind, I would add another question. It is about limma (I have asked Gordon and Aaron Lu, and I have not heard back from them). Is it legit to set up the normalization factors in limma/voom as equal directly to the spike-in amounts ? You must have done the comparisons with edgeR and voom, when you did write the function controlGenes in DEseq2. For example : SPIKEINS = c(154815, 253491, 517331, 638848) y$samples\$norm.factors = SPIKEINS
v <- voom(y, design, plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contrast.matrix)
efit <- eBayes(vfit)


0
Entering edit mode

I don't know the difference, controlGenes is simply computing median ratio over the control genes.