Question

Writing a loop through multiple fasta files and named files (is it possible with MSGFplus?)

laural710 • 0

@laural710-14567

Last seen 6.2 years ago

When working with well-annotated species, I can call MSGFplus directly on the full fasta file without any memory issues. However, due to the poor protein annotation of a species I am working on, I need to use a large fasta file (>150,000 protein sequences). I have split this fasta into 20 individual fasta files, and am trying to figure out how to write a loop around my initial code.

I don't have much experience in writing code. At the moment, for the smaller fasta files, I am running them manually one at a time and then concatenating the results at the end into one large identification file. Is there a way to cycle through every sample file using one fasta, then move to the next fasta and run through the full list of files again, while writing output with names that can be traced back to the fasta used? The idea would be that parameter set 1 corresponds to fasta1: MSGFplus would run through the samples and produce mzid files for fasta1, then move to fasta2 and repeat, and so on. I'm thinking a loop would work, but I'm not sure how to structure it given that the parameters have to be set up in advance.

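library(MSGFplus)

# Sample files and split fasta databases as character vectors
files <- c("file1.mzML", "file2.mzML", "file3.mzML", "file4.mzML")  # etc.
myFastas <- c("fasta_split1.fasta", "fasta_split2.fasta",
              "fasta_split3.fasta", "fasta_split4.fasta")  # etc.

# The idea: build one parameter set per fasta, then search every sample file
peptide_assignment <- list()
for (i in seq_along(myFastas)) {
  parameters <- msgfPar(database = myFastas[i],
                        tolerance = c(low = "3 ppm", high = "50 ppm"),
                        instrument = "TOF",
                        fragmentation = "HCD",
                        enzyme = "Trypsin")
  # Fixed modification; composition uses the letter O (C2H3N1O1), not a zero
  mods(parameters)[[1]] <- msgfParModification("Carbamidomethyl",
                                               composition = "C2H3N1O1",
                                               residues = "C",
                                               type = "fix",
                                               position = "any")
  # Variable modification
  mods(parameters)[[2]] <- msgfParModification("Oxidation",
                                               mass = 15.994915,
                                               residues = "M",
                                               type = "opt",
                                               position = "any")
  nMod(parameters) <- 3
  # Keep the result of each fasta in its own list element
  peptide_assignment[[i]] <- runMSGF(parameters, files)
}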

However, this currently does not work, and I'm wondering whether it is just easier to run each group individually.

MSGF+ Proteomics r • 2.2k views

ADD COMMENT • link updated 6.3 years ago by Martin Morgan 25k • written 6.3 years ago by laural710 • 0

Iterating over multiple fasta files is technically possible, but I am not sure it is valid. When identifying your peptides, an identification probability is computed that takes into account the risk of calling wrong peptide-spectrum matches. That probability depends on the search space, i.e. the size of your fasta file. Running your search on smaller chunks of your database isn't equivalent to running a single search on the full database.
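For the mechanics only (the caveat about the search space still applies), here is a minimal sketch of how the iteration with traceable output names could look. The helper make_savenames() is hypothetical and not part of MSGFplus, the file names are the placeholders from the question, and the sketch assumes runMSGF() accepts a savenames argument as described on its help page; please check ?runMSGF for your installed version.

```r
# Hypothetical helper (not part of MSGFplus): one output name per sample file,
# tagged with the fasta chunk it was searched against.
make_savenames <- function(raw_files, fasta) {
  raw_stem   <- tools::file_path_sans_ext(basename(raw_files))
  fasta_stem <- tools::file_path_sans_ext(basename(fasta))
  paste0(raw_stem, "_", fasta_stem, ".mzid")
}

# Placeholder names taken from the question
files    <- c("file1.mzML", "file2.mzML")
myFastas <- c("fasta_split1.fasta", "fasta_split2.fasta")

# One vector of output names per fasta chunk
savenames_by_fasta <- lapply(myFastas, make_savenames, raw_files = files)
names(savenames_by_fasta) <- myFastas

# Only attempt the search if the package and all input files are present
if (requireNamespace("MSGFplus", quietly = TRUE) &&
    all(file.exists(c(files, myFastas)))) {
  library(MSGFplus)
  results <- list()
  for (fasta in myFastas) {
    # Add the remaining search settings (tolerance, enzyme, mods, ...) here
    par <- msgfPar(database = fasta)
    # Assumption: savenames gives each sample/fasta pair its own .mzid file
    results[[fasta]] <- runMSGF(par, files,
                                savenames = savenames_by_fasta[[fasta]])
  }
}
```

With names built this way, the results for file1.mzML searched against fasta_split1.fasta would go to file1_fasta_split1.mzid, so a later pass with another chunk does not reuse the same output name.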

ADD REPLY • link 6.3 years ago Laurent Gatto 1.6k

That's what I was beginning to think. At the moment, the run is failing with a GC overhead limit error, and I have been trying to figure out whether there is a workaround, such as splitting the fasta file. Thanks for answering.

ADD REPLY • link 6.3 years ago laural710 • 0

You could try to run MSGF+ natively. I have no idea whether running it through the MSGFplus package adds any real overhead in terms of memory, but it is worth a try. If this fails, you'll probably need a computer with more memory.
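As a rough sketch of what a native run could look like when launched from R: the GC overhead limit error comes from the Java virtual machine running short of heap, and the -Xmx option raises the maximum heap size. The helper msgf_args(), the jar location, the file names and the 8000 MB figure are all assumptions for illustration; -s, -d and -o are the MS-GF+ options for the spectrum file, the database and the output file.

```r
# Hypothetical helper: assemble the arguments for a native MS-GF+ call.
msgf_args <- function(jar, spectrum, database, output, heap_mb = 8000) {
  c(sprintf("-Xmx%dM", as.integer(heap_mb)),  # maximum JVM heap in MB
    "-jar", jar,
    "-s", spectrum,   # spectrum file
    "-d", database,   # fasta database
    "-o", output)     # mzid output
}

# Placeholder paths; adjust to your own setup
args <- msgf_args("MSGFPlus.jar", "file1.mzML",
                  "full_database.fasta", "file1.mzid")

# Show the command that would be run
cat("java", args, "\n")

# Launch only if Java and the jar can actually be found
if (nzchar(Sys.which("java")) && file.exists("MSGFPlus.jar")) {
  status <- system2("java", args)
}
```

The heap cannot usefully exceed the physical memory of the machine, so this only helps if there is RAM to spare. It may also be worth checking ?runMSGF for a memory setting before leaving the package altogether.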

ADD REPLY • link 6.3 years ago Laurent Gatto 1.6k