Question

csaw spike-in normalization

Nicolas Servant ▴ 260

@nicolas-servant-1466

France

Hi all,

I have a question about ChIP-seq spike-in normalization as illustrated in the csaw vignette.

The idea of calculating a normalization factor from the spike-in data is great. But as mentioned in the vignette, using the normOffsets function on the spike-in data (...) assumes that the library sizes are the same between spike.data and endog.data (...).

However, in my case, I did two separate mappings (one to the foreign genome and one to the reference genome). This means that my lib.sizes differ between the two genomes ...

So, is it still correct to take the normalization factor calculated on the foreign genome (with its own lib.size) and apply it to the counts from my reference genome, for instance when computing CPMs?

Apart from that, I guess that with this mapping strategy the results can be very different under another normalization strategy such as DESeq, because the lib.size is directly included in the scaling factor ... meaning that it would use the lib.size from my foreign genome ...

Many thanks for your comments.

Best. Nicolas

ChIP-seq csaw • 1.4k views

updated 5.5 years ago by Aaron Lun ★ 28k • written 5.5 years ago by Nicolas Servant ▴ 260

score 0 · Answer 1 · 2018-10-30

Aaron Lun ★ 28k

@alun

The city by the bay

Just set the library sizes for the foreign genome to the library sizes for your reference genome before computing the normalization factors. This is allowable because the library sizes themselves are not really important; rather, the important thing is the product of the library sizes and normalization factors, i.e., the effective library sizes. The effective library sizes are the values responsible for (indirectly) scaling the counts when fitting a GLM. By using the reference library sizes during calculation of the spike-in normalization factors from the foreign genome, you ensure that the effective library sizes are the same between the foreign and reference genomes. That is, you can take the normalization factors computed from the foreign genome and use them for the reference genome, and the final scaling will be the same if the library sizes are the same between genomes.
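To make the arithmetic concrete, here is a minimal Python sketch of why the library sizes used when computing the factors must match the ones they are later multiplied by. It is an illustration only, not csaw or edgeR code: the per-sample spike-in signal is a simple total rather than a TMM estimate, and all numbers and function names (norm_factors, effective_lib_sizes) are hypothetical.

```python
import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def norm_factors(spike_signal, lib_sizes):
    """Simplified scaling factors (an assumption, NOT TMM): spike-in signal
    per unit of library size, centred so that the factors multiply to 1."""
    ratios = [s / l for s, l in zip(spike_signal, lib_sizes)]
    centre = geometric_mean(ratios)
    return [r / centre for r in ratios]

def effective_lib_sizes(lib_sizes, factors):
    """Effective library size = library size * normalization factor."""
    return [l * f for l, f in zip(lib_sizes, factors)]

def relative(values):
    """Express each value relative to the first sample."""
    return [v / values[0] for v in values]

# Hypothetical data for three samples.
spike_signal = [1000.0, 2000.0, 4000.0]   # spike-in coverage per sample
foreign_lib = [1.0e5, 1.5e5, 4.0e5]       # reads mapped to the foreign genome
reference_lib = [2.0e7, 1.0e7, 3.0e7]     # reads mapped to the reference genome

# Factors computed with the foreign library sizes, then paired with the
# reference library sizes: the product no longer tracks the spike-in signal.
nf_mismatched = norm_factors(spike_signal, foreign_lib)
eff_mismatched = effective_lib_sizes(reference_lib, nf_mismatched)

# Factors computed after setting the library sizes to the reference values:
# the product is proportional to the spike-in signal, as intended.
nf_matched = norm_factors(spike_signal, reference_lib)
eff_matched = effective_lib_sizes(reference_lib, nf_matched)

print("matched:   ", relative(eff_matched))
print("mismatched:", relative(eff_mismatched))
```

With these numbers, the matched effective library sizes stand in a ratio of roughly 1 : 2 : 4, the same as the spike-in signal, whereas the mismatched ones do not.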

And yes, if you had size factors from DESeq, you wouldn't have to worry about this. This is because size factors are conceptually the same as the effective library sizes in edgeR, so you can just transfer them directly between genomes (assuming, of course, that the biases are the same between genomes). Admittedly, we could also achieve this effect with edgeR by transferring the GLM offsets (log-effective library sizes) between genomes, which would override any supplied normalization factors or library sizes. I guess I could do this, though I haven't analyzed enough spike-in data to be bothered.
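The point about size factors can be sketched in the same way. The Python snippet below implements the median-of-ratios idea that DESeq's size factor estimation is based on, applied to a hypothetical matrix of spike-in counts. It is not DESeq code, and the function name and data are made up for illustration. Note that no library size enters the calculation, which is why the resulting factors (or their logs, used as offsets) can be carried over to the reference genome directly.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts[g][i] is the count for feature g in sample i. Features with a
    zero count in any sample are skipped, as their geometric mean is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log geometric mean of each usable feature across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for i in range(n_samples):
        log_ratios = [math.log(row[i]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical spike-in counts (rows: spike-in bins, columns: samples).
spike_counts = [
    [10, 20, 40],
    [25, 50, 100],
    [8, 16, 32],
    [30, 60, 120],
]

sf = size_factors(spike_counts)        # computed on the foreign genome only
offsets = [math.log(f) for f in sf]    # log-scale equivalent, usable as offsets

# Transfer to a hypothetical window on the reference genome whose raw counts
# differ only because of the scaling bias measured by the spike-ins.
reference_window = [100, 200, 400]
scaled_reference = [c / f for c, f in zip(reference_window, sf)]

print("size factors:", sf)
print("scaled reference counts:", scaled_reference)
```

Here the scaled reference counts come out equal across the three samples, because the only difference between them was the scaling bias captured by the spike-ins.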

Note: the user's guide assumes equal library sizes between the foreign and reference genomes because it assumes that you mapped to a combined foreign + reference genome. This would be my recommended approach to prevent cross-mapping in homologous regions, as such reads will be detected as multi-mapping and removed.

written 5.5 years ago by Aaron Lun ★ 28k

Thanks for your fast answer Aaron. And yes, I agree with you that redoing the mapping would be better. I'll think about it. N

written 5.5 years ago by Nicolas Servant ▴ 260

Sorry Aaron, I'm just coming back to what you previously said about DESeq.

If I'm using DESeq, my understanding is that it is even more dangerous, because I will directly get scaling factors that can be applied to my raw reference counts. So, in this specific context of separate mappings, applying these DESeq scaling factors (calculated from the foreign genome lib.size) to my reference genome is not correct, and they therefore cannot be transferred between genomes?

But I agree that with a single bam file and therefore a single lib.size, it should work.

Thanks. N

written 5.5 years ago by Nicolas Servant ▴ 260

The size factors are fine and can be transferred, as they directly represent the common scaling bias for the two genomes. The same logic underlies the transfer of normalization factors when the library sizes are the same. If the scaling bias is different between genomes, you shouldn't be using spike-ins at all.

written 5.5 years ago by Aaron Lun ★ 28k