When working with well-annotated species, I can call MSGFplus directly and run it on the whole FASTA file without any memory issues. However, because the species I am working on has poor protein annotation, I need to use a large FASTA file (>150,000 protein sequences). I have split this FASTA into 20 individual FASTA files and am trying to figure out how to wrap the initial code in a loop.
I don't have much experience writing code. At the moment, for the smaller FASTA files, I am running them manually one at a time and then concatenating the results into one large identification file. Is there a way to cycle through every sample file with one FASTA, then move on to the next FASTA and run through the full list of files again, while writing output with recognizable names? The idea is that parameter set 1 would correspond to fasta1: MSGFplus would run through the samples and produce mzid files for fasta1, then move on to fasta2 and repeat, and so on. I think a loop would work, but I'm not sure how to structure it, given that the parameters have to be set up beforehand.
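library(MSGFplus)

files <- c("file1.mzML", "file2.mzML", "file3.mzML", "file4.mzML")  # etc.
myFastas <- c("fasta_split1.fasta", "fasta_split2.fasta",
              "fasta_split3.fasta", "fasta_split4.fasta")  # etc.

# The idea: build one parameter set per FASTA chunk and search every file against it
peptide_assignment <- vector("list", length(myFastas))
for (i in seq_along(myFastas)) {
  parameters <- msgfPar(database = myFastas[i],
                        tolerance = c(low = "3 ppm", high = "50 ppm"),
                        instrument = "TOF",
                        fragmentation = "HCD",
                        enzyme = "Trypsin")
  mods(parameters)[[1]] <- msgfParModification(name = "Carbamidomethyl",
                                               composition = "C2H3N1O1",
                                               residues = "C",
                                               type = "fix",
                                               position = "any")
  mods(parameters)[[2]] <- msgfParModification(name = "Oxidation",
                                               mass = 15.994915,
                                               residues = "M",
                                               type = "opt",
                                               position = "any")
  nMod(parameters) <- 3
  # Name each result after both the raw file and the FASTA chunk, so that
  # later chunks do not overwrite the .mzid files of earlier ones
  outNames <- paste0(tools::file_path_sans_ext(files), "_",
                     tools::file_path_sans_ext(basename(myFastas[i])), ".mzid")
  peptide_assignment[[i]] <- runMSGF(parameters, files, savenames = outNames)
}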
However, this currently does not work, and I'm wondering: is it just easier to run each group individually?
Iterating over multiple FASTA files is technically possible, but I am not sure it is valid. When your peptides are identified, an identification probability is computed that takes into account the risk of calling wrong peptide-spectrum matches. That probability depends on the search space, i.e. the size of your FASTA file. Running your search on smaller chunks of your database isn't equivalent to running a single search on the full database.
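To make the search-space point concrete, here is a rough sketch. It assumes a deliberately simplified model in which the expected number of random matches is the spectrum-level probability multiplied by the number of candidate peptides. This is not the exact computation MS-GF+ performs, and all numbers are made up for illustration.

```r
# Simplified model (an assumption, not MS-GF+'s exact computation):
# expected random matches = spectrum-level probability * number of candidates
e_value <- function(spec_prob, n_candidates) spec_prob * n_candidates

spec_prob <- 1e-9          # hypothetical spectrum-level probability of one PSM
full_db   <- 5e6           # hypothetical number of candidate peptides, full database
chunk_db  <- full_db / 20  # the same database split into 20 equal chunks

e_full  <- e_value(spec_prob, full_db)
e_chunk <- e_value(spec_prob, chunk_db)

cat("Expected random matches, full database:", e_full, "\n")
cat("Expected random matches, one chunk:", e_chunk, "\n")
cat("Ratio full/chunk:", e_full / e_chunk, "\n")
```

Under this model the same match looks 20 times more significant when scored against a single chunk, which is why scores from chunked searches cannot simply be concatenated and thresholded as if they came from one search.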
That's what I was beginning to think. At the moment, the run is failing with a GC overhead limit error, and I have been trying to figure out whether there is a workaround, such as splitting the FASTA file. Thanks for answering.
You could try to run MSGF+ natively. I have no idea whether running it through the MSGFplus package has an actual overhead in terms of memory, but it is worth a try. If this fails, you'll probably need a computer with more memory.
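A native run could look roughly like the sketch below, which builds the Java command from R and gives the JVM a larger heap with `-Xmx` (a GC overhead limit error is a Java heap problem). The jar location, file names, heap size and tolerance value are placeholders, and the flags are the ones I remember from the MS-GF+ command-line help, so please check them against your version. Modifications would need a separate modification file, which is left out here.

```r
# Build the arguments for a native MS-GF+ run (sketch; all paths are placeholders)
msgf_args <- function(jar, spectrum, fasta, out, heap_gb = 8) {
  c(paste0("-Xmx", heap_gb, "G"),  # maximum Java heap size
    "-jar", jar,
    "-s", spectrum,                # spectrum file
    "-d", fasta,                   # FASTA database
    "-o", out,                     # mzid output file
    "-t", "20ppm",                 # precursor tolerance (example value only)
    "-inst", "2",                  # instrument ID, 2 = TOF
    "-m", "3",                     # fragmentation ID, 3 = HCD
    "-e", "1")                     # enzyme ID, 1 = Trypsin
}

jar      <- "MSGFPlus.jar"         # hypothetical location of the MS-GF+ jar
cmd_args <- msgf_args(jar, "file1.mzML", "full_database.fasta",
                      "file1.mzid", heap_gb = 16)
cat("java", cmd_args, "\n")

# Only launch the search if Java and the jar are actually available
if (nzchar(Sys.which("java")) && file.exists(jar)) {
  system2("java", cmd_args)
}
```

If you would rather stay inside the package, I believe `runMSGF` also takes a `memory` argument (in MB) that sets the Java heap, but check `?runMSGF` to be sure.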