Hi there,
I am running into an issue where I am unable to process large numbers of samples at once. One of our study species has 140 samples, but when I try to run these, even on a Linux machine with 128 GB of memory, I hit a memory error. We run the code remotely by submitting a batch Slurm script for each study species separately. Below are the code and the error for the species with 140 samples. Unfortunately I cannot provide sessionInfo() output because the script runs non-interactively on the cluster.
I would also like to mention that I have tried to work around this memory error by splitting the study species' samples in half and concatenating the resulting xset objects later in the processing. However, when I do this the objects lose all of their peak information and cannot be analyzed or annotated further using CAMERA.
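For reference, the combining step looked roughly like this (a sketch from memory, not the verbatim script; xset_half1 and xset_half2 stand for the xcmsSet objects produced by the two half-runs):

# concatenate the two half-runs; c() is documented for xcmsSet objects
xset_all <- c(xset_half1, xset_half2)
# concatenation does not carry the peak grouping over, so the combined
# object has to be regrouped before CAMERA can annotate it
xset_all <- group(xset_all, method = "density", bw = 3, mzwid = 0.025,
                  minfrac = 0.1, minsamp = 1)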
Is there another way around this memory overload? Does XCMS have any functionality that allows this number of samples to be processed more efficiently?
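The only XCMS-level lever I am aware of is the BiocParallel backend it uses for per-file work. I assume (untested on my end) that registering a serial backend, or a smaller worker pool, would trade speed for memory, along these lines:

library(BiocParallel)
# process files serially so only one raw file is held in memory at a time
register(SerialParam())
# or cap the number of workers instead of going fully serial
# register(SnowParam(workers = 2))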
Any help would be greatly appreciated!
Thanks,
Abbey Soule
######CODE######
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly = TRUE)
cat(" START TIME:", date(),"\n")
if (length(args) != 1) {
  stop("Wrong arguments -> need 1!")
} else {
  SPECIES_ID <- as.integer(args[1])
  cat(" SPECIES_ID:", SPECIES_ID, "\n", sep = "")
}
####### Load required packages #######
library(multtest)
library(xcms)
library(vegan)
library(splitstackshape)
library(CAMERA)
library(reshape2)
library(RMySQL)
### Workaround for a CAMERA/xcms namespace clash: repoint CAMERA's imported groups() to xcms::groups ###
imports = parent.env(getNamespace("CAMERA"))
unlockBinding("groups", imports)
imports[["groups"]] = xcms::groups
lockBinding("groups", imports)
####### Load CAMERA file with list of possible adducts #######
adduct_dir_g2 <- "/uufs/chpc.utah.edu/common/home/inga-group2/4_directories_chem_evolution_2019_03_26/extended_adducts_POS_NEG_modes"
adduct_dir_g1 <- "/uufs/chpc.utah.edu/common/home/inga-group1/4_directories_chem_evolution_2019_03_26/extended_adducts_POS_NEG_modes"
adduct_file <- list.files(adduct_dir_g1)[grep("current_neg", list.files(adduct_dir_g1))]
rule_mod <- read.csv(file = paste(adduct_dir_g2, adduct_file, sep = "/"), header = TRUE)
###### Prepare data frame to fill with peak data #######
all_features_long <- data.frame("sample"=NULL,"Species_code_sample"=NULL,"RT"=NULL,"MZ"=NULL,"PC_ID"=NULL,"TIC"=NULL)
# set working directory
setwd("/uufs/chpc.utah.edu/common/home/inga-group2/local_adaptation_speciation/")
# Define absolute path to mzxml files, ending with folder containing folders named by site
mzxml_location <- "/uufs/chpc.utah.edu/common/home/inga-group2/local_adaptation_speciation/data/mzxml_files/"
# set retention time cutoffs (in minutes) for peaks that should be kept in analysis (typically after 1 minute and before 32 minutes
# for C18 column standard run)
min_rt <- 0.5
max_rt <- 14.5
# This code assumes .mzXML files are arranged into folders by species, with species being arranged in folders by site.
# Within each species folder, there should be two folders, one named "Sample" containing all samples for that species
# and the other named "Blank" containing all blanks for that species.
# Create vector with all sites you wish to analyze
sites <- c("Tiputini","Yasuni")
####### Loop through directories containing .mzXML files. Samples and their associated blanks should be in separate folders named "Sample" and "Blank". Depending on how files are arranged, you may or may not need to use the outer loop for site #######
species_sample <- data.frame()
for (i in seq_along(sites)) {
  species <- list.dirs(paste(mzxml_location, sites[i], sep = ""), recursive = FALSE, full.names = FALSE)
  for (j in seq_along(species)) {
    species_sample <- rbind(species_sample, data.frame(site = sites[i], species = species[j]))
  }
}
site <- species_sample[SPECIES_ID, "site"]
species <- species_sample[SPECIES_ID, "species"]
all_cons <- dbListConnections(MySQL())
for (con in all_cons) {
  dbDisconnect(con)
}
myolddb <- dbConnect(MySQL(), user='u*****', password='******', dbname='inga', host='mysql.chpc.utah.edu')
processed_sample <- unique(dbGetQuery(myolddb, "SELECT Species_code_sample FROM `xcms_feature_table_long_speciation`"))[,"Species_code_sample"]
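# reduce each processed sample name to its species code (the part before the first underscore)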
processed <- sapply(processed_sample, function(x) {
temp = unlist(strsplit(x, split = "_"))
return(temp[1])
})
if(!species %in% processed) {
# make vector with relative filepaths (used by XCMS to find files), sample groups (used by XCMS to group files into blank and sample), and sample names (used by XCMS in peak list output)
files_1 <- list.files(paste(mzxml_location, site, "/",species,sep=""),
recursive = TRUE, full.names = TRUE)
files_2 <- files_1[endsWith(files_1, "mzXML")]
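# sample() with a single vector argument returns a random permutation; this only shuffles the file processing order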
files <- sample(files_2)
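# group label for each file = name of its parent folder ("Sample" or "Blank")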
s_groups <- sapply(files, function(x) {
temp = unlist(strsplit(x, split = "/"))
return(temp[length(temp)-1])
})
s_groups[is.na(s_groups)] <- 0  # guard: replace any missing group labels with 0
s_names <- sapply(files, function(x) {
temp = unlist(strsplit(x, split = "/"))
return(sub(temp[length(temp)], pattern = ".mzXML", replacement = "", fixed = TRUE))
})
# set parameters for XCMS functions
# centwave is the method used to perform initial peak picking/finding.
cwparam <- CentWaveParam(ppm=15, peakwidth=c(2,10), snthresh=5, prefilter=c(10,500))
# Obiwarp is the method used for retention time correction
oparam <- ObiwarpParam(binSize=1)
# fillChromPeaks goes back into the raw data of every sample to recover peaks that fell below the thresholds set during peak picking
fpparam <- FillChromPeaksParam(expandMz = 0.25, expandRt = 0.5)
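# (judging from the traceback below, the job is killed during the fillChromPeaks step, which re-reads raw data for every file in parallel workers)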
# peakdensity is the method used to group peaks across samples based on retention time. For the first peak grouping step, use more lenient parameters because retention time has not yet been corrected
dparam1 <- PeakDensityParam(sampleGroups = s_groups, bw=10, binSize=0.05, minSamples=1, minFraction = 0.01)
# second round of peak grouping uses more stringent parameters
dparam2 <- PeakDensityParam(sampleGroups = s_groups, bw = 3, binSize = 0.025, minSamples = 1, minFraction = 0.1)
# read the mzXML files into an OnDiskMSnExp object (spectra stay on disk and are loaded on demand)
raw_data <- readMSData(files, msLevel. = 1, mode="onDisk")
# Run XCMS functions with parameters set above
xcmsexp <- findChromPeaks(object = raw_data, param = cwparam)
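# NOTE: the tail of the script was cut off when pasting. Based on the parameter
# objects defined above, the remaining steps presumably look roughly like this
# (a sketch, not the verbatim script); the traceback shows the error is thrown
# inside fillChromPeaks():
xcmsexp <- groupChromPeaks(xcmsexp, param = dparam1)
xcmsexp <- adjustRtime(xcmsexp, param = oparam)
xcmsexp <- groupChromPeaks(xcmsexp, param = dparam2)
xcmsexp <- fillChromPeaks(xcmsexp, param = fpparam)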
######ERROR######
Error in serialize(data, node$con, xdr = FALSE) :
error writing to connection
Calls: fillChromPeaks ... .send_EXEC -> <Anonymous> -> sendData.SOCK0node -> serialize
Execution halted
Return statement:1
End Date:Fri Apr 9 12:23:43 MDT 2021
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2399234.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.