edgeR for metagenomic data
3
0
Entering edit mode
@eleonoregravier-8219
Last seen 2.5 years ago
France

Hi everyone,

I have data from metagenomic experiments:  pyrosequencing of the 16S gene (454 GS Junior Roche). This gene was sequenced for 44 samples divided in 2 treatment groups A and B. The number of patients studied is 22, and for each patient one sample received the treatment A and another sample received the treatment B. So, we have paired data.

The reads from 71 bacterial species were counted in each of the 44 samples, leading to a count matrix which looks like that (71 rows and 44 paired samples, here just the 6 first rows and columns):

 

                    P1.grpA   P2.grpA   P3.grpA   P1.grpB   P2.grpB   P3.grpB

Species1       0               0               0             0               0               0

Species2       0               0               0             0               0                0

Species3       0               0               104         0               0             107

Species4       0               0               0             0               0                0

Species5       0               0               0             0               0                0

Species6       0               0              0              0                0                0

 

The objective of the study is to detect species differentially expressed between the two groups.

I used edgeR to analyze the data, here is my code and some outputs:

 

>dgeFull <- DGEList(counts=data,remove.zeros=TRUE)

# Removing 24 rows with all zero counts.

>summary(dgeFull$samples$lib.size)

#Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

#343    1096    3564    5957    7417   32600

>dgeFull <- calcNormFactors(dgeFull, method="TMM")

 

>summary(dgeFull$samples$norm.factors)

#Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

#0.1733  0.9240  1.1320  1.0800  1.2640  1.8060

 

>gp <-dgeFull$sampleInfo$gp

>patient <- dgeFull$sampleInfo$patient

>design <- model.matrix(~patient+gp)

 

>y <- estimateGLMCommonDisp(dgeFull,design)

>y$common.dispersion # 3.220813

>sqrt(y$common.dispersion) # 1.79

>y <- estimateGLMTrendedDisp(y,design)

>y <- estimateGLMTagwiseDisp(y,design)

>plotBCV(y)

>fit <- glmFit(y,design)

>lrt <- glmLRT(fit, coef=23)

>topTags(lrt,adjust.method="BH")

 

> sessionInfo()

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

 

locale:

[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252  

[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                 

[5] LC_TIME=French_France.1252   

 

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base    

 

other attached packages:

[1] edgeR_3.8.6  limma_3.22.6 rj_2.0.3-1 

 

loaded via a namespace (and not attached):

[1] rj.gd_2.0.0-1 tools_3.1.2 

 

 

 

My questions are :

  • I am seriously wondering if edgeR (TMM normalization, and computation of the dispersion parameter)  is convenient for my data (or if my data met some problems) because of :
    • In all the count data I have seen analyzed with edgeR, the number of features was high and never with less than 100 features like in my data. Moreover, my dataset contains a lot of zeros (22 samples only have more than 8 strictly positive values among the 71 species and half of the 71 species have <=1 strictly positive values over the 44 samples)
    • The library sizes are very hetoregeneous (from 343 to 32600)
    • The normalization factor are very extreme (with a min value of 0.1),
    • The value of BCV (=1.79) is very high
    • The plotBCV is very strange, with an increasing trend, which seems to be strongly influenced by outliers (without these outliers the trens would be "flat"). May be do I have to NOT perform the estimateGLMTrendedDisp function ?

 

  • Do I have to add a filter for the species, for example with :

keep <- rowSums(cpm(dgeFull) > 1) >= 22 ? In this case, only 5 species will be kept…

 

  • I also had (data not shown) a SAMPLE (NOT a species) with all zero counts, and so the “calcNormFactors” generates this error message:

Error in quantile.default(x, p = p) :

# missing values and NaN's not allowed if 'na.rm' is FALSE

I removed this sample from my dataset, but I am wondering if I could keep it and set artificially a normalization factor of 1 ? Do you think I have to remove it ?

  • Finaly, how exactly is computed the “log2FC” of the TopTags outputs ?

I tried log2(rawCount(mean(gpB)))- log2(rawCount(mean(gpA))) (and also with cpm values rather than rawCount) but I do not found what is generated in the outputs.

 

Thanks a lot for your help,

Eleonore

 

edger metagenomics • 3.8k views
ADD COMMENT
0
Entering edit mode

You're using old versions of R and Bioconductor. It's good practice to use the most recent versions; this ensures that problems in old versions have been fixed, so we don't spend time talking about bugs that have already been dealt with.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

Question 1:

I don't think there's enough non-zero counts for each library for stable normalization. Any zero counts will result in undefined M-values which are automatically removed in calcNormFactors; depending on how many genes are left, there might just not be enough to get a reliable estimate of the normalization factor. Having highly variable library sizes and normalization factors isn't too problematic so long as those values are stable, but this cannot be said for your data set.

As for the BCV; 1.8 is definitely high, but in the bigger scheme of things it isn't too bad. For example, single-cell RNA-seq often gives dispersion values in excess of 5, especially at low abundances. The trend you mention is a bit strange, but it's not too surprising - I'd expect any fitted trend to be rather unstable, given that you've got such a small number of genes with so many zero counts.

Question 2:

We generally recommend filtering - see C: NaN results on some pathways in CAMERA for an explanation of why we do so. However, in your case, there's not enough genes to start with, so I think filtering would just make the analysis more unstable. You could try less aggressive filtering schemes, e.g., reducing the CPM threshold or reducing the minimum number of libraries required. But, given the problems mentioned above, it's a bit like putting a band-aid on bullet wound.

Question 3:

There's no information in a sample with all-zero counts. So, removing it would be the correct decision. Forcing the normalization factor to unity wouldn't help, as the library size would still be zero (so the effective library size would also be zero, regardless of what the normalization factor was). While you could change the library size to some non-zero value, what would you change it to? There's no sensible choice that's obvious to me.

Question 4:

The log-FC is computed like so:

  1. Add a small prior count to each count, to stabilise the final value when the counts are very small.
  2. Fit a GLM to the counts, using the specified design matrix.
  3. Take the value of the coefficient of interest, and divide it by log(2).

This gives you the log2-fold change across your DE comparison of interest. Check out predFC for more details.

ADD COMMENT
0
Entering edit mode
@eleonoregravier-8219
Last seen 2.5 years ago
France

Hi Aaron,

 

Many thanks for your quick and informative answer.

First of all, I am sorry for the old versions of R and Bioconductor. I performed again the analysis with the sessionInfo() at the end of the mail and the results were the same.

Then, I think you are right, I have a lot of species with low/zero counts and so the normalization factor and the dispersion trend are unstable. Do you think it would be better to use another normalization process (RLE for example ?). And for the trend estimte step, do you think it is better to NOT perform this step or may be to performe another type of estimation?

 

Thanks again

Best

Eléonore

 

 

 

 

 

 

 

 

 

 

>sessionInfo()

#R version 3.2.1 (2015-06-18)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows  x64 (build 9600)
#
#locale:
#        [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
#[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#[5] LC_TIME=French_France.1252    
#
#attached base packages:
#        [1] stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
#        [1] edgeR_3.8.6          limma_3.22.6         BiocInstaller_1.16.5
#[4] rj_2.0.4-2          
#
#loaded via a namespace (and not attached):
#        [1] tools_3.2.1   rj.gd_2.0.0-1

 

ADD COMMENT
1
Entering edit mode

In general, the scaling normalization methods tend to behave similarly. I don't think you'll get major improvements by switching to another method from TMM. As for the dispersion estimation, you might get more stable results by turning off trend fitting (i.e., by skipping the estimateGLMTrendedDisp step). This will work with the common dispersion during tagwise estimation. However, this seems to be a superficial solution that papers over the fundamental problem of too few genes and too many zeros.

ADD REPLY
0
Entering edit mode
@joseph-nathaniel-paulson-6442
Last seen 7.8 years ago
United States

Hi Eleonore,

There has been a lot of recent work analyzing the unique characteristics of 16S data and methods developed for this type of data due to the sparsity issues.

I would recommend checking out metagenomeSeq[http://bioconductor.org/packages/devel/bioc/html/metagenomeSeq.html] and the paper[http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html]. Perhaps you'll find them useful - as well as the comparison to edgeR.

ADD COMMENT
0
Entering edit mode

Hi Joseph,

Thanks a lot for your recommendation. I am going to study your package which seems to be useful for analyzing my dataset.

Best

Eleonore

 

 

ADD REPLY
0
Entering edit mode

Hi Joseph,

I studied your paper and applied the metagenomeSeq package to my dataset, thanks again for this work.

As the zero-inflated Gaussian distribution mixture model takes into account the under-sampling phenomenon present in my dataset, I thought that species with a lot of zeros in a group would not be said significant but it was not the case... even if I used the calculateEffectiveSample and uniqueFeatures functions to avoid non valid results.

Briefly, here is my code :

res <- fitZig (obj=mrobj, mod=design2, control=settings)

effectiveSamples<-calculateEffectiveSamples(res)
valid = which(effectiveSamples >= quantile(effectiveSamples, p = 0.5, na.rm = TRUE))

fit2<-eBayes(res$fit)
out<-topTable(fit2,coef=23:26,adjust.method="BH",sort.by="F",number=10,confint=TRUE)

final<-intersect(valid,rownames(out)[out$adj.P.Val<=0.05])

warning<-uniqueFeatures(mrobj,cl=pData(mrobj)$groupe, nsamples = 0, nreads = 0)

# Final results

final2<-setdiff(final,rownames(warning))

Among the species said significant at the final2 step, I have species like that :

res$z["species50",pData(mrobj)$groupe=="A"]
#P1              P2          P3             P4            P5
#0.9259486 0.9980554 0.9723243 0.0000000 0.7804181

res$z["species50",pData(mrobj)$groupe=="B"]
#P6             P7           P8             P9           P10
#0.7724242 0.0000000 0.0000000 0.0000000 0.9886364

Don't you think this species with only 1 non-zero value in group A and all other zero values (in this group) considered to be strongly affected by the undersampling (posterior probability >0.5), should not have to be said significant ?

Thanks a lot for your help

Best

Eléonore

 

 

 

 

 

 

 

 

ADD REPLY
0
Entering edit mode
Hi Eléonore, Thanks for the message. I do believe that features completely not present in one group should be diff abundant if they have high-enough depth. I would also recommend taking a look at the developer version of metagenomeSeq as we have some new methods in development you might find useful and I would love to hear your experience on. On Thu, Jul 9, 2015 at 5:37 AM eleonoregravier [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User eleonoregravier <https: support.bioconductor.org="" u="" 8219=""/> wrote Comment: > edgeR for metagenomic data > <https: support.bioconductor.org="" p="" 68940="" #69704="">: > > Hi Joseph, > > I studied your paper and applied the metagenomeSeq package to my dataset, > thanks again for this work. > > As the zero-inflated Gaussian distribution mixture model takes into > account the under-sampling phenomenon present in my dataset, I thought that > species with a lot of zeros in a group would not be said significant but it > was not the case... even if I used the calculateEffectiveSample and > uniqueFeatures functions to avoid non valid results. > > Briefly, here is my code : > > res <- fitZig (obj=mrobj, mod=design2, control=settings) > > effectiveSamples<-calculateEffectiveSamples(res) > valid = which(effectiveSamples >= quantile(effectiveSamples, p = 0.5, > na.rm = TRUE)) > > fit2<-eBayes(res$fit) > out<-topTable(fit2,coef=23:26,adjust.method="BH",sort.by > ="F",number=10,confint=TRUE) > > final<-intersect(valid,rownames(out)[out$adj.P.Val<=0.05]) > > warning<-uniqueFeatures(mrobj,cl=pData(mrobj)$groupe, nsamples = 0, nreads > = 0) > > # Final results > > final2<-setdiff(final,rownames(warning)) > > Among the species said significant at the final2 step, I have species like > that : > > res$z["species50",pData(mrobj)$groupe=="A"] > #P1 P2 P3 P4 P5 > #0.9259486 0.9980554 0.9723243 0.0000000 0.7804181 > > res$z["species50",pData(mrobj)$groupe=="B"] > #P6 P7 P8 P9 P10 > #0.7724242 0.0000000 0.0000000 0.0000000 0.9886364 > > Don't you think this species with only 1 non-zero value in group A and all > other zero values (in this group) considered to be strongly affected by the > undersampling (posterior probability >0.5), should not have to be said > significant ? > > Thanks a lot for your help > > Best > > Eléonore > > > > > > > > > > > > > > > > > > ------------------------------ > > Post tags: edger, metagenomics > > You may reply via email or visit > C: edgeR for metagenomic data >
ADD REPLY

Login before adding your answer.

Traffic: 771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6