Search
Question: GC content and length bias correction with DESeq2 using cqn normalization factor matrix
0
12 months ago by
glmazzo0
glmazzo0 wrote:
Hi everybody,

I am working with RNASeq data.

I am using DESeq2 and I would like to correct for the GC content and length bias using cqn.

When I tested the normalized counts with NOISeq to test if the biases were corrected, I noticed that when I divided out the normalization factors by the geometric mean (as suggested in DESeq2 manual), the cqn correction becomes ineffective.

Instead when I used the normalization factors without dividing out the geometric mean (different scale) the biases are significantly reduced

Could you help me with this? I cannot understand why this is not working.

Here is the code I used

#> head(uCovar_filt)

#                   length GC_content

#ENSBTAG00000000005   2310      43.65

#ENSBTAG00000000009   1543      65.02

#ENSBTAG00000000010   1348      48.38

#ENSBTAG00000000012   1227      38.70

#ENSBTAG00000000013   4478      37.35

#ENSBTAG00000000014   1044      50.95

#                      6    8   10   11   16   20   27   28   29   33   42  14   17

#ENSBTAG00000000005  222  160  252  230  448  380  306  259  333  420  240 218  514

#ENSBTAG00000000009  103  139  138   97   57  198  161   57  138  149   60  89  303

#ENSBTAG00000000010  290  593  458  423  653  718  608  402  527  686  459 260  641

#ENSBTAG00000000012  390  369  318  245  425  342  372  274  392  391  376 217  291

#ENSBTAG00000000013 1667 2637 1862 1537 2001 2292 1730 1445 2182 2125 1686 940 2225

#ENSBTAG00000000014  614 1814 1276  751  959 1867 1733  865 1325 1777  882 674 1177

#                     19   21   25   32   34   38

#ENSBTAG00000000005  316  336  236  315  269  334

#ENSBTAG00000000009  171  140  121   97   78  117

#ENSBTAG00000000010  601  498  416  377  335  408

#NSBTAG00000000012  285  213  286  295  264  291

#ENSBTAG00000000013 1858 1538 1606 1538 1456 1813

#ENSBTAG00000000014 1628 1465 1273  969  670  759

library(DESeq2)

# input data

design=~covariate1+covariate2+covariate3

dds <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design)

#estimate size factors to be used in cqn

dds=estimateSizeFactors(dds)

sizefactors=sizeFactors(dds)

#run cqn to generate the normalization factors matrix to account for GC and length bias

library(cqn)

cqnObject=cqn(counts = countData,x=uCovar_filt$GC_content, lengths = uCovar_filt$length, sizeFactors =sizefactors)

#extract normalization factors as suggested in DESeq2 manual

cqnOffset <- cqnObject\$glm.offset

cqnNormFactors <- exp(cqnOffset)

#generate the normalization factor matrices with and without dividing out the geometric mean

normFactors=cqnNormFactors

normFactors_sameScale <- cqnNormFactors / exp(rowMeans(log(cqnNormFactors)))

#create other two DESeq objects using the normalization facors with (sameScale)and without dividing out the geometric mean.

# create dds1 using normFactors_sameScale

dds1 <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design)

normalizationFactors(dds1)=normFactors_sameScale

# create dds2 using normFactors

dds2 <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design)

normalizationFactors(dds2)=normFactors

#extract the normalized counts

#dds contains the size factors generated by DESEq2

#dds1 contains the normalization factor matrix generated with cqn, dividing out the geometric means

#dds2 contains the normalization factor matrix generated wit cqn, WITHOUT dividing out the geometric means

counts_normalized=counts(dds,normalized=TRUE)    #(Fig A)

counts_nobias_sameScale=counts(dds1,normalized=TRUE)      #(Fig B)

counts_nobias=counts(dds2,normalized=TRUE)             #(Fig C)

#I tested these three count matrices with NOISeq. I attached the plots generated (Fig A, Fig B, Fig C)

Fig A (normalization with DESeq2)

Fig B (bias correction with cqn dividing out the geometric mean)

Fig C (bias correction with cqn WITHOUT dividing out the geometric mean)

Thank you.

Best regards,

Gianluca

modified 12 months ago by Michael Love17k • written 12 months ago by glmazzo0
0
12 months ago by
Michael Love17k
United States
Michael Love17k wrote:

hi Gianluca,

In the NOISeq plot, how are the different samples represented? I see mean expression per bin, but the different sample have variable dependence of counts ~ GC content.

Dear Mike,

Noiseq generates a plot for each sample for the GC content (plot showed in the post) and a plot for each sample for the length bias (plot not showed in my post)

The plots look similar across the sample (both for GC content and length bias).

I attached here the plots for other samples (only for GC content)

FIG A (6 samples, normalization with DESeq2)

Fig B (6 samples, bias correction with cqn dividing out the geometric mean)

Fig C (6 sample bias correction with cqn WITHOUT diving out the geometric mean)

Thank you. Best Regards,

Gianluca

Ok, thanks, I think I see what is going on. Using cqn in combination with DESeq2 via normalization factors, corrects for the differential GC dependence per gene. So if the counts are low because the gene has high GC content, this is fine for DE testing except if the degree to which the counts are low differs across samples for that gene. So the key plot would be if the log2 fold changes depended on the gene's GC content before and after using cqn. In short, my answer is that, it's ok if there is general overall dependence of counts on GC content -- this is not a problem for statistical analysis at all -- as long as there is not differential dependence per gene across samples, which should be correct by using cqn in combination with DESeq2.

When you keep in the geometric mean of the normalization factors, cqn is additionally correcting the row-wise average count so when you look across genes it has removed dependence of this average on the gene's GC content. But in doing so, it hurts DESeq2's methods, because it shifts the row-wise average count from its observed value to a new range. This hurts DESeq2's variance mean modeling, and gives no advantage for statistical testing (the shift is absorbed by the intercept either way).

Great! Thank you so much for your help.

If I understood correctly, you suggest to perform the DE analysis using cqn to correct for between samples (GC content and length) bias.

I know that the gene length and GC content bias within sample does not compromise DE results but that they affect the functional enrichment. It is expected that genes with more counts will be most likely to be DE than shorter gene (within sample bias). Am I right?

If possible, I would like to ask you a small suggestion.

If I want to perform functional enrichment of the DE genes,  do you suggest to account for these biases after  the DE (using for example GOseq) and not before?

Thank you for all your help!

Best Regards,

Gianluca

Yes, you could use goseq to account for these biases.

If you read the goseq manual, there is a section "6.8 Correcting for other biases", where they discuss using the mean or total count across a row instead of gene length as a way to deal with other biases than strictly gene length bias. Binning by mean count makes a lot of sense to me, as the sensitivity for gene DE depends on length and GC content only to the extent that these increase (or decrease) the mean count. So by modeling on the mean count, you account for length and GC bias.