Question

edgeR for metagenomic data

0

Entering edit mode

eleonoregravier ▴ 70

@eleonoregravier-8219

Last seen 21 months ago

France

Hi everyone,

I have data from metagenomic experiments: pyrosequencing of the 16S gene (454 GS Junior Roche). This gene was sequenced for 44 samples divided in 2 treatment groups A and B. The number of patients studied is 22, and for each patient one sample received the treatment A and another sample received the treatment B. So, we have paired data.

The reads from 71 bacterial species were counted in each of the 44 samples, leading to a count matrix which looks like that (71 rows and 44 paired samples, here just the 6 first rows and columns):

P1.grpA P2.grpA P3.grpA P1.grpB P2.grpB P3.grpB

Species1 0 0 0 0 0 0

Species2 0 0 0 0 0 0

Species3 0 0 104 0 0 107

Species4 0 0 0 0 0 0

Species5 0 0 0 0 0 0

Species6 0 0 0 0 0 0

The objective of the study is to detect species differentially expressed between the two groups.

I used edgeR to analyze the data, here is my code and some outputs:

>dgeFull <- DGEList(counts=data,remove.zeros=TRUE)

# Removing 24 rows with all zero counts.

>summary(dgeFull$samples$lib.size)

#Min. 1st Qu. Median Mean 3rd Qu. Max.

#343 1096 3564 5957 7417 32600

>dgeFull <- calcNormFactors(dgeFull, method="TMM")

>summary(dgeFull$samples$norm.factors)

#Min. 1st Qu. Median Mean 3rd Qu. Max.

#0.1733 0.9240 1.1320 1.0800 1.2640 1.8060

>gp <-dgeFull$sampleInfo$gp

>patient <- dgeFull$sampleInfo$patient

>design <- model.matrix(~patient+gp)

>y <- estimateGLMCommonDisp(dgeFull,design)

>y$common.dispersion # 3.220813

>sqrt(y$common.dispersion) # 1.79

>y <- estimateGLMTrendedDisp(y,design)

>y <- estimateGLMTagwiseDisp(y,design)

>plotBCV(y)

>fit <- glmFit(y,design)

>lrt <- glmLRT(fit, coef=23)

>topTags(lrt,adjust.method="BH")

> sessionInfo()

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:

[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252

[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C

[5] LC_TIME=French_France.1252

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:

[1] edgeR_3.8.6 limma_3.22.6 rj_2.0.3-1

loaded via a namespace (and not attached):

[1] rj.gd_2.0.0-1 tools_3.1.2

My questions are :

I am seriously wondering if edgeR (TMM normalization, and computation of the dispersion parameter) is convenient for my data (or if my data met some problems) because of :
- In all the count data I have seen analyzed with edgeR, the number of features was high and never with less than 100 features like in my data. Moreover, my dataset contains a lot of zeros (22 samples only have more than 8 strictly positive values among the 71 species and half of the 71 species have <=1 strictly positive values over the 44 samples)
- The library sizes are very hetoregeneous (from 343 to 32600)
- The normalization factor are very extreme (with a min value of 0.1),
- The value of BCV (=1.79) is very high
- The plotBCV is very strange, with an increasing trend, which seems to be strongly influenced by outliers (without these outliers the trens would be "flat"). May be do I have to NOT perform the estimateGLMTrendedDisp function ?

Do I have to add a filter for the species, for example with :

keep <- rowSums(cpm(dgeFull) > 1) >= 22 ? In this case, only 5 species will be kept…

I also had (data not shown) a SAMPLE (NOT a species) with all zero counts, and so the “calcNormFactors” generates this error message:

Error in quantile.default(x, p = p) :

# missing values and NaN's not allowed if 'na.rm' is FALSE

I removed this sample from my dataset, but I am wondering if I could keep it and set artificially a normalization factor of 1 ? Do you think I have to remove it ?

Finaly, how exactly is computed the “log2FC” of the TopTags outputs ?

I tried log2(rawCount(mean(gpB)))- log2(rawCount(mean(gpA))) (and also with cpm values rather than rawCount) but I do not found what is generated in the outputs.

Thanks a lot for your help,

Eleonore

edger metagenomics • 3.3k views

ADD COMMENT • link updated 8.9 years ago by Joseph Nathaniel Paulson ▴ 280 • written 8.9 years ago by eleonoregravier ▴ 70

0

Entering edit mode

You're using old versions of R and Bioconductor. It's good practice to use the most recent versions; this ensures that problems in old versions have been fixed, so we don't spend time talking about bugs that have already been dealt with.

ADD REPLY • link updated 8.9 years ago by Gordon Smyth 50k • written 8.9 years ago by Aaron Lun ★ 28k

score 1 · Answer 1 · 2015-06-19

Question 1:

I don't think there's enough non-zero counts for each library for stable normalization. Any zero counts will result in undefined M-values which are automatically removed in calcNormFactors; depending on how many genes are left, there might just not be enough to get a reliable estimate of the normalization factor. Having highly variable library sizes and normalization factors isn't too problematic so long as those values are stable, but this cannot be said for your data set.

As for the BCV; 1.8 is definitely high, but in the bigger scheme of things it isn't too bad. For example, single-cell RNA-seq often gives dispersion values in excess of 5, especially at low abundances. The trend you mention is a bit strange, but it's not too surprising - I'd expect any fitted trend to be rather unstable, given that you've got such a small number of genes with so many zero counts.

Question 2:

We generally recommend filtering - see C: NaN results on some pathways in CAMERA for an explanation of why we do so. However, in your case, there's not enough genes to start with, so I think filtering would just make the analysis more unstable. You could try less aggressive filtering schemes, e.g., reducing the CPM threshold or reducing the minimum number of libraries required. But, given the problems mentioned above, it's a bit like putting a band-aid on bullet wound.

Question 3:

There's no information in a sample with all-zero counts. So, removing it would be the correct decision. Forcing the normalization factor to unity wouldn't help, as the library size would still be zero (so the effective library size would also be zero, regardless of what the normalization factor was). While you could change the library size to some non-zero value, what would you change it to? There's no sensible choice that's obvious to me.

Question 4:

The log-FC is computed like so:

Add a small prior count to each count, to stabilise the final value when the counts are very small.
Fit a GLM to the counts, using the specified design matrix.
Take the value of the coefficient of interest, and divide it by log(2).

This gives you the log2-fold change across your DE comparison of interest. Check out predFC for more details.

score 0 · Answer 2 · 2015-06-19

Hi Aaron,

Many thanks for your quick and informative answer.

First of all, I am sorry for the old versions of R and Bioconductor. I performed again the analysis with the sessionInfo() at the end of the mail and the results were the same.

Then, I think you are right, I have a lot of species with low/zero counts and so the normalization factor and the dispersion trend are unstable. Do you think it would be better to use another normalization process (RLE for example ?). And for the trend estimte step, do you think it is better to NOT perform this step or may be to performe another type of estimation?

Thanks again

Best

Eléonore

>sessionInfo()

#R version 3.2.1 (2015-06-18)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows x64 (build 9600)
#
#locale:
#       [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
#[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
#[5] LC_TIME=French_France.1252
#
#attached base packages:
#       [1] stats     graphics grDevices utils     datasets methods   base
#
#other attached packages:
#       [1] edgeR_3.8.6          limma_3.22.6         BiocInstaller_1.16.5
#[4] rj_2.0.4-2
#
#loaded via a namespace (and not attached):
#       [1] tools_3.2.1   rj.gd_2.0.0-1

score 0 · Answer 3 · 2015-06-19

0

Entering edit mode

Joseph Nathaniel Paulson ▴ 280

@joseph-nathaniel-paulson-6442

Last seen 7.1 years ago

United States

Hi Eleonore,

There has been a lot of recent work analyzing the unique characteristics of 16S data and methods developed for this type of data due to the sparsity issues.

I would recommend checking out metagenomeSeq[http://bioconductor.org/packages/devel/bioc/html/metagenomeSeq.html] and the paper[http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html]. Perhaps you'll find them useful - as well as the comparison to edgeR.

ADD COMMENT • link 8.9 years ago Joseph Nathaniel Paulson ▴ 280

0

Entering edit mode

Hi Joseph,

Thanks a lot for your recommendation. I am going to study your package which seems to be useful for analyzing my dataset.

Best

Eleonore

ADD REPLY • link 8.9 years ago eleonoregravier ▴ 70

0

Entering edit mode

Hi Joseph,

I studied your paper and applied the metagenomeSeq package to my dataset, thanks again for this work.

As the zero-inflated Gaussian distribution mixture model takes into account the under-sampling phenomenon present in my dataset, I thought that species with a lot of zeros in a group would not be said significant but it was not the case... even if I used the calculateEffectiveSample and uniqueFeatures functions to avoid non valid results.

Briefly, here is my code :

res <- fitZig (obj=mrobj, mod=design2, control=settings)

effectiveSamples<-calculateEffectiveSamples(res)
valid = which(effectiveSamples >= quantile(effectiveSamples, p = 0.5, na.rm = TRUE))

fit2<-eBayes(res$fit)
out<-topTable(fit2,coef=23:26,adjust.method="BH",sort.by="F",number=10,confint=TRUE)

final<-intersect(valid,rownames(out)[out$adj.P.Val<=0.05])

warning<-uniqueFeatures(mrobj,cl=pData(mrobj)$groupe, nsamples = 0, nreads = 0)

# Final results

final2<-setdiff(final,rownames(warning))

Among the species said significant at the final2 step, I have species like that :

res$z["species50",pData(mrobj)$groupe=="A"]
#P1 P2 P3 P4 P5
#0.9259486 0.9980554 0.9723243 0.0000000 0.7804181

res$z["species50",pData(mrobj)$groupe=="B"]
#P6 P7 P8 P9 P10
#0.7724242 0.0000000 0.0000000 0.0000000 0.9886364

Don't you think this species with only 1 non-zero value in group A and all other zero values (in this group) considered to be strongly affected by the undersampling (posterior probability >0.5), should not have to be said significant ?

Thanks a lot for your help

Best

Eléonore

ADD REPLY • link 8.8 years ago eleonoregravier ▴ 70

0

Entering edit mode

Hi Eléonore, Thanks for the message. I do believe that features completely not present in one group should be diff abundant if they have high-enough depth. I would also recommend taking a look at the developer version of metagenomeSeq as we have some new methods in development you might find useful and I would love to hear your experience on. On Thu, Jul 9, 2015 at 5:37 AM eleonoregravier [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User eleonoregravier <https: support.bioconductor.org="" u="" 8219=""/> wrote Comment: > edgeR for metagenomic data > <https: support.bioconductor.org="" p="" 68940="" #69704="">: > > Hi Joseph, > > I studied your paper and applied the metagenomeSeq package to my dataset, > thanks again for this work. > > As the zero-inflated Gaussian distribution mixture model takes into > account the under-sampling phenomenon present in my dataset, I thought that > species with a lot of zeros in a group would not be said significant but it > was not the case... even if I used the calculateEffectiveSample and > uniqueFeatures functions to avoid non valid results. > > Briefly, here is my code : > > res <- fitZig (obj=mrobj, mod=design2, control=settings) > > effectiveSamples<-calculateEffectiveSamples(res) > valid = which(effectiveSamples >= quantile(effectiveSamples, p = 0.5, > na.rm = TRUE)) > > fit2<-eBayes(res$fit) > out<-topTable(fit2,coef=23:26,adjust.method="BH",sort.by > ="F",number=10,confint=TRUE) > > final<-intersect(valid,rownames(out)[out$adj.P.Val<=0.05]) > > warning<-uniqueFeatures(mrobj,cl=pData(mrobj)$groupe, nsamples = 0, nreads > = 0) > > # Final results > > final2<-setdiff(final,rownames(warning)) > > Among the species said significant at the final2 step, I have species like > that : > > res$z["species50",pData(mrobj)$groupe=="A"] > #P1 P2 P3 P4 P5 > #0.9259486 0.9980554 0.9723243 0.0000000 0.7804181 > > res$z["species50",pData(mrobj)$groupe=="B"] > #P6 P7 P8 P9 P10 > #0.7724242 0.0000000 0.0000000 0.0000000 0.9886364 > > Don't you think this species with only 1 non-zero value in group A and all > other zero values (in this group) considered to be strongly affected by the > undersampling (posterior probability >0.5), should not have to be said > significant ? > > Thanks a lot for your help > > Best > > Eléonore > > > > > > > > > > > > > > > > > > ------------------------------ > > Post tags: edger, metagenomics > > You may reply via email or visit > C: edgeR for metagenomic data >

ADD REPLY • link 8.8 years ago Joseph Nathaniel Paulson ▴ 280