Dropping genes with zero read counts in a dataset before TMM Normalization

I am a beginner in R and still learning how to use edgeR. I want to apply TMM normalization to my dataset, but I have the following questions:

1. The total number of genes expressed in the experiment is 3622. Since the samples have some genes with zero read counts, the log-fold changes for those genes cannot be calculated. I noticed that readDGE has included 3617 genes, but some of the samples have only 3567 expressed genes. The edgeR user's guide says that genes with zero read counts must be dropped prior to any analysis. How can I perform this task?

> files <- dir(pattern="*\\.csv$")
> group<- c(1,2,2,3,3,4,4,5,5,6,6,7,7)
> RG<- readDGE(files, group=group, labels=NULL)

> RG$samples
              files group lib.size norm.factors
ARef       ARef.csv     1  2911654            1
CO21       CO21.csv     2 11198927            1
CO224     CO224.csv     2 11294624            1
Light1   Light1.csv     3 12454641            1
Light24 Light24.csv     3  8668049            1
NaCl1     NaCl1.csv     4  6550245            1
NaCl24   NaCl24.csv     4 11475584            1
NaNO31   NaNO31.csv     5 10521157            1
NaNO324 NaNO324.csv     5  9045265            1
pH1         pH1.csv     6 11850679            1
pH24       pH24.csv     6  9275761            1
Temp1     Temp1.csv     7 11726524            1
Temp24   Temp24.csv     7  8120990            1
> keep <- rowSums(cpm(RG) > 1) >= 1
> RG <- RG[keep, , keep.lib.sizes=FALSE]
> dim(RG)
[1] 3617   13

2. It was my understanding that the TMM normalization factor for the reference sample should be 1; however, using calcNormFactors with the first column selected as the reference sample, the value is 1.1318411, as seen below. Is this right, or am I missing some parameter here?

> RG <- calcNormFactors(RG, method="TMM",
+                 refColumn=1, logratioTrim=.3, sumTrim=0.05, doWeighting=TRUE,
+                 Acutoff=-1e10, p=0.75)
> RG$samples
              files group lib.size norm.factors
ARef       ARef.csv     1  2911652    1.1318411
CO21       CO21.csv     2 11198918    1.0381266
CO224     CO224.csv     2 11294616    0.8066788
Light1   Light1.csv     3 12454627    0.7167523
Light24 Light24.csv     3  8668044    0.7112044
NaCl1     NaCl1.csv     4  6550232    0.7621304
NaCl24   NaCl24.csv     4 11475570    0.8852015
NaNO31   NaNO31.csv     5 10521142    1.3151826
NaNO324 NaNO324.csv     5  9045259    1.4467104
pH1         pH1.csv     6 11850666    1.0574913
pH24       pH24.csv     6  9275757    0.7936610
Temp1     Temp1.csv     7 11726477    1.2622590
Temp24   Temp24.csv     7  8120977    1.5219508

Any help will be appreciated.

Aaron Lun ★ 27k

Just because a gene has a zero count in one sample doesn't mean you have to filter it out. In fact, this would be rather silly, because you'd lose strongly DE genes that were expressed in one group and completely silent in another. Don't worry about the log-fold change; by default (in glmFit and related functions), we add a prior count which ensures that it can still be calculated for genes with zeroes in all samples of one group.
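To illustrate the prior-count behaviour, here is a minimal sketch using a toy count matrix (the object names are illustrative, not from the original post); with a prior count, log-CPM values remain finite even for a gene that is completely silent in one group:

```r
library(edgeR)

## One gene: zero counts in group 1, expressed in group 2.
counts <- matrix(c(0, 0, 50, 60), nrow = 1,
                 dimnames = list("geneA", paste0("s", 1:4)))
y <- DGEList(counts = counts, group = c(1, 1, 2, 2))

## prior.count avoids taking log(0), so the log-CPM (and any
## log-fold change derived from it) is finite for this gene.
logCPM <- cpm(y, log = TRUE, prior.count = 2)
logCPM
```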

What you should be doing, at the very least, is to remove genes that have zeroes in all samples, because such genes are completely uninformative with respect to differential expression. In practice, we also remove genes that have low counts in all samples, because such genes don't provide a lot of evidence to reject the null hypothesis. (Also, they use up computational time during processing, increase the severity of the multiple testing correction, distort the mean-dispersion trend fitting and compromise some of the statistical approximations in GLM fitting.) This is the motivation for the filtering strategy described in the user's guide.
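The filtering described above could be sketched as follows on a toy DGEList (gene names and thresholds here are illustrative; the user's guide of your edgeR version is the authoritative source for the recommended cutoffs):

```r
library(edgeR)

counts <- matrix(c(0,   0,  0,    # all-zero gene: uninformative, drop
                   5,   0,  12,   # zero in one sample only: keep
                   100, 80, 90),  # well expressed everywhere: keep
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2", "s3")))
y <- DGEList(counts = counts)

## Bare minimum: drop genes with zero counts in *all* samples.
keep <- rowSums(y$counts) > 0

## In practice, a stricter filter on expression level is used,
## e.g. requiring CPM above a cutoff in at least n samples.
## keep <- rowSums(cpm(y) > 1) >= 2

y <- y[keep, , keep.lib.sizes = FALSE]
```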

As for the TMM question; I don't know where you got the idea that the normalization factor for the reference should be 1, because that's not true. While all libraries are normalized against the reference, the final factors are scaled so that the geometric mean of factors across all libraries is equal to unity. (This is permissible as it does not change the relative sizes of the factors.) This is done so as to keep the effective library sizes on the same scale as the original library sizes, and to make the values of the normalization factors comparable across different choices of the reference. As a result, the reported factor for the reference library will not usually be equal to 1.

P.S. In future questions, add the edgeR tag to your post; otherwise the maintainers will not see it.


Aaron, thanks for your answers. They cleared up a lot of my concerns.

