Question

Dropping genes with zero read counts in a dataset before TMM Normalization

0

Entering edit mode

humberto_munoz • 0

@humberto_munoz-10903

Last seen 7.8 years ago

I am a beginner in R and still learning how to use edgeR. I want to do a TMM normalization to my dataset, however I have the following questions:

1. The total number of genes that were expressed in the experiment is 3622, since the samples have some genes with zero read counts, and their log-fold changes cannot be calculated. I noticed that the reaDGE has a included 3617 genes, but some of the samples have 3567 expressed genes. The edgeR user's guide says that the the genes with zero read counts must be drop prior any analysis. How can I perform this task?

> files <- dir(pattern="*\\.csv$")
> group<- c(1,2,2,3,3,4,4,5,5,6,6,7,7)
> RG<- readDGE(files, group=group, labels=NULL)
>
> RG$samples
files group lib.size norm.factors
ARef ARef.csv 1 2911654 1
CO21 CO21.csv 2 11198927 1
CO224 CO224.csv 2 11294624 1
Light1 Light1.csv 3 12454641 1
Light24 Light24.csv 3 8668049 1
NaCl1 NaCl1.csv 4 6550245 1
NaCl24 NaCl24.csv 4 11475584 1
NaNO31 NaNO31.csv 5 10521157 1
NaNO324 NaNO324.csv 5 9045265 1
pH1 pH1.csv 6 11850679 1
pH24 pH24.csv 6 9275761 1
Temp1 Temp1.csv 7 11726524 1
Temp24 Temp24.csv 7 8120990 1
> keep <-rowSums(cpm(RG)>1) >=1
> RG<- RG[keep, , keep.lib.sizes=FALSE]
> dim(RG)
[1] 3617 13

2. It is understood that the TMMr(r) =1; however, using calcNormFactors selecting the first column as the reference sample, the value is 1.1318411 as seen below. It is right, or I am missing some parameter here?

RG<-calcNormFactors(RG, method=c("TMM"),
+ refColumn=1, logratioTrim=.3, sumTrim=0.05, doWeighting=TRUE,
+ Acutoff=-1e10, p=0.75)
> RG$samples
files group lib.size norm.factors
ARef ARef.csv 1 2911652 1.1318411
CO21 CO21.csv 2 11198918 1.0381266
CO224 CO224.csv 2 11294616 0.8066788
Light1 Light1.csv 3 12454627 0.7167523
Light24 Light24.csv 3 8668044 0.7112044
NaCl1 NaCl1.csv 4 6550232 0.7621304
NaCl24 NaCl24.csv 4 11475570 0.8852015
NaNO31 NaNO31.csv 5 10521142 1.3151826
NaNO324 NaNO324.csv 5 9045259 1.4467104
pH1 pH1.csv 6 11850666 1.0574913
pH24 pH24.csv 6 9275757 0.7936610
Temp1 Temp1.csv 7 11726477 1.2622590
Temp24 Temp24.csv 7 8120977 1.5219508

Any help will be appreciated

normalization • 1.1k views

ADD COMMENT • link 7.8 years ago humberto_munoz • 0

score 2 · Answer 1 · 2016-07-16

Just because a gene has zero in one sample, doesn't mean you have to filter it out. In fact, this would be rather silly because you'd lose strongly DE genes that were expressed in one group and completely silent in another group. Don't worry about the log-fold change; by default (in glmFit and related functions), we add a prior count which ensures that it can still be calculated for genes with zeroes in all samples of one group.

What you should be doing, at the very least, is to remove genes that have zeroes in all samples, because such genes are completely uninformative with respect to differential expression. In practice, we also remove genes that have low counts in all samples, because such genes don't provide a lot of evidence to reject the null hypothesis. (Also, they use up computational time during processing, increase the severity of the multiple testing correction, distort the mean-dispersion trend fitting and compromise some of the statistical approximations in GLM fitting.) This is the motivation for the filtering strategy described in the user's guide.

As for the TMM question; I don't know where you got the idea that the normalization factor for the reference should be 1, because that's not true. While all libraries are normalized against the reference, the final factors are scaled so that the geometric mean of factors across all libraries is equal to unity. (This is permissible as it does not change the relative sizes of the factors.) This is done so as to keep the effective library sizes on the same scale as the original library sizes, and to make the values of the normalization factors comparable across different choices of the reference. As a result, the reported factor for the reference library will not usually be equal to 1.

P.S. In future questions, add the edgeR tag to your post, otherwise the maintainers will not see it.