Is there a different way to generate a tree for phyloseq object
2
0
Entering edit mode
mschmidt ▴ 10
@mschmidt-18923
Last seen 16 months ago
Poznan

I am constructing phylogenetic tree as written in Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses [version 2; peer review: 3 approved]. I have 242 samples that contain 41684 ASVs. The code:

dm <- dist.ml(phang.align)


resulted in data object (dm) of 13.9 GB (sic!), the next line cannot be performed due to memory shortage

> treeNJ <- NJ(dm)
Error: vector memory exhausted (limit reached?)


is there any other way to complete the code part? Are there any other solutions available? like splitting the 41k ASVs into batches, making several small trees and later join the trees?

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] phangorn_2.6.3 ape_5.5

loaded via a namespace (and not attached):
[1] nlme_3.1-152                bitops_1.0-7                matrixStats_0.58.0          phyloseq_1.34.0
[5] bit64_4.0.5                 RColorBrewer_1.1-2          progress_1.2.2              httr_1.4.2
[9] GenomeInfoDb_1.26.7         tools_4.0.3                 vegan_2.5-7                 utf8_1.2.1
[13] R6_2.5.0                    mgcv_1.8-35                 DBI_1.1.1                   BiocGenerics_0.36.1
[21] tidyselect_1.1.0            prettyunits_1.1.1           DESeq2_1.30.1               bit_4.0.4
[25] compiler_4.0.3              microbiome_1.13.8           Biobase_2.50.0              DelayedArray_0.16.3
[37] pkgconfig_2.0.3             MatrixGenerics_1.2.1        fastmap_1.1.0               rlang_0.4.10
[41] rstudioapi_0.13             RSQLite_2.2.7               generics_0.1.0              hwriter_1.3.2
[45] jsonlite_1.7.2              BiocParallel_1.24.1         dplyr_1.0.5                 RCurl_1.98-1.3
[49] magrittr_2.0.1              GenomeInfoDbData_1.2.4      biomformat_1.18.0           Matrix_1.3-2
[53] Rhdf5lib_1.12.1             Rcpp_1.0.6                  munsell_0.5.0               S4Vectors_0.28.1
[57] fansi_0.4.2                 lifecycle_1.0.0             stringi_1.5.3               MASS_7.3-53.1
[61] SummarizedExperiment_1.20.0 zlibbioc_1.36.0             rhdf5_2.34.0                Rtsne_0.15
[65] plyr_1.8.6                  grid_4.0.3                  blob_1.2.1                  parallel_4.0.3
[69] crayon_1.4.1                lattice_0.20-41             Biostrings_2.58.0           splines_4.0.3
[73] multtest_2.46.0             annotate_1.68.0             hms_1.0.0                   locfit_1.5-9.4
[77] pillar_1.6.0                igraph_1.2.6                GenomicRanges_1.42.0        geneplotter_1.68.0
[81] reshape2_1.4.4              codetools_0.2-18            stats4_4.0.3                fastmatch_1.1-0
[89] data.table_1.14.0           RcppParallel_5.1.2          png_0.1-7                   vctrs_0.3.7
[93] foreach_1.5.1               tidyr_1.1.3                 gtable_0.3.0                purrr_0.3.4
[97] assertthat_0.2.1            cachem_1.0.4                ggplot2_3.3.3               xfun_0.22
[101] xtable_1.8-4                survival_3.2-11             tibble_3.1.1                iterators_1.0.13
[105] GenomicAlignments_1.26.0    tinytex_0.31                AnnotationDbi_1.52.0        memoise_2.0.0
[109] IRanges_2.24.1              cluster_2.1.2               ellipsis_0.3.1

r dada2 Bioconductor • 1.8k views
2
Entering edit mode
Axel Klenk ▴ 990
@axel-klenk-3224
Last seen 5 days ago
Switzerland

I don't know what and how diverse your samples are, and you do have a lot of them, but 41K ASVs seems pretty much to me. In my own analyses I try to filter them based on the QC plots and procedures described in that workflow until there are (rule of thumb) about 1000 ASVs left.

0
Entering edit mode

Hi Axel! Thanks for your comment, that's something new for me! I do not know what to expect. I thought the higher number the better. I just switched from OTUs. I have human stool microbiome samples, V3-V4 amplicons sequenced with 2x300PE. My QC of R1 and R2 reads are and so I set truncLen = c(260, 181) in the filterAndTrim. Longer sequences resulted in a lower number of ASVs (but still > 36K). Merging R1 and R2 gave sequences from 260 to 429 bp . Can you comment and give me some advice. Best regards, Marcin

1
Entering edit mode

Hi Marcin, ok, human stool is not that exceptional. :-)

I have realized in the meantime that the filtering I thought of is actually done after constructing the phylogenetic tree in the workflow, in the phyloseq section on pages 10-12. I would think that postponing the tree construction is an option in your case although I'm not an expert here.

If you have set truncLen to c(260, 181), then the two groups to the right of your plot (at 422 and 427) will have an overlap of 10-20 bases only and I'd be inclined to remove them as well as the funny ones at 260.

Have you used any of the other arguments to filterAndTrim()? The published workflow is using maxEE = 2 and I have experimented with this one quite a bit.

Usually I aim at removing 10%-20% of the reads during QC and retaining about 1000 ASVs after filtering.

Best,

• axel
0
Entering edit mode

Hi Axel, I am using the settings from DADA2 Pipeline Tutorial. There are maxN=0, maxEE=c(2,2), truncQ=2, and rm.phix=TRUE. Removing the small ones and these >422 seems a good idea. Does lowering the maxEE and truncQ will make the selection more stringent? Best regards, Marcin

1
Entering edit mode

Hi Marcin, yes, lowering these threshold should result in more stringent quality filtering, although you don't have many choices with a default of 2. And of course, how much more stringent will depend on your data. In our in-house example I have checked yesterday, I actually had to raise the threshold to avoid losing more than 50% of the reads (human saliva), especially of the reverse reads.

If that doesn't bring the number of ASVs down far enough, it may help to look at the counts per ASV and sample, and filter ASVs based on minimum counts or mininmum occurrence in samples, depending on your research question.

Different tree-building algorithms may have different memory requirements but I cannot advise on this.

Best regards,

• axel
0
Entering edit mode

Thanks again! I have tried longer sequences truncLen = c(265, 185) to allow more overlap and the number of ASVs dropped to 1368, and after removeBimeraDenovo to 1058. That should be good right now. Best regards, Marcin

0
Entering edit mode

1058 sounds much, much better. :-) Good to hear you got it done.

Cheers,

• axel
0
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

There are usually just two ways to get around that error.

2. Get a computer with more RAM.

For #2 you can often get time on heavy iron at your institution, and barring that, you can always just spin up a big enough EC2 instance on AWS and use that.