Question: Writing a loop through multiple fasta files and named files (is it possible with MSGFplus?)
laural710 wrote (4 weeks ago):

When working with well-annotated species, I can call MSGFplus directly on the complete fasta file without any memory issues. However, because the species I am working on has poor protein annotation, I need to use a large fasta file (>150,000 protein sequences). I have split this fasta into 20 individual fasta files and am trying to figure out how to wrap the initial code in a loop.

I don't have much experience writing code. At the moment, for the smaller fasta files, I am running them individually by hand and then concatenating the results into one large identification file. Is there a way to cycle through every sample file using one fasta, then move to the next fasta and run through the full list of files again, while writing output with recognisable names? The idea is that parameter set 1 would correspond to fasta1: MSGFplus would run through the samples and produce mzid files for fasta1, then move on to fasta2, and so on sequentially. I think a loop would work, but I'm not sure how to structure it, given that the parameters have to be set up in advance.

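library(MSGFplus)

# Raw data files and split databases (lists abbreviated, as in the original post)
files = c("file1.mzML", "file2.mzML", "file3.mzML", "file4.mzML")  # etc.
myFastas = c("fasta_split1.fasta", "fasta_split2.fasta",
             "fasta_split3.fasta", "fasta_split4.fasta")  # etc.

# The idea: one search per fasta, with the results kept separately per database
peptide_assignment = vector("list", length(myFastas))
for (i in seq_along(myFastas)) {
  # Build a fresh parameter set that points at the i-th fasta
  parameters = msgfPar(database=myFastas[i],
                       tolerance=c(low="3 ppm", high="50 ppm"),
                       instrument="TOF",
                       fragmentation="HCD",
                       enzyme="Trypsin")
  mods(parameters)[[1]] = msgfParModification("Carbamidomethyl",
                                              composition="C2H3N1O1",
                                              residues="C",
                                              type="fix",
                                              position="any")
  mods(parameters)[[2]] = msgfParModification("Oxidation",
                                              mass=15.994915,
                                              residues="M",
                                              type="opt",
                                              position="any")
  nMod(parameters) = 3
  # Name each result after both the raw file and the fasta so nothing is overwritten
  savenames = paste0(tools::file_path_sans_ext(basename(files)), "_",
                     tools::file_path_sans_ext(basename(myFastas[i])), ".mzid")
  peptide_assignment[[i]] = runMSGF(parameters, files, savenames=savenames)
}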

However, I could not get this to work, and I'm wondering whether it is just easier to run each group individually?

proteomics R msgf+ • 59 views
modified 4 weeks ago by Martin Morgan • written 4 weeks ago by laural710

Iterating over multiple fasta files is technically possible, but I am not sure it is valid. When peptides are identified, an identification probability is computed that takes into account the risk of calling wrong peptide-spectrum matches. That probability depends on the search space, i.e. the size of your fasta file. Running your search on smaller chunks of your database isn't equivalent to running a single search on the full database.
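
To make the search-space point concrete, here is a rough arithmetic sketch. It assumes a deliberately simplified model in which a database-level E-value is the spectrum-level probability multiplied by the number of candidate peptides searched; this is an illustration, not MS-GF+'s exact scoring. The values of `spec_prob` and `full_candidates` are made up.

```r
# Simplified model (an assumption, not MS-GF+'s exact scoring):
# database-level E-value = spectrum-level probability * candidates searched.
spec_prob <- 1e-9        # hypothetical spectrum-level probability of one match
full_candidates <- 3e6   # hypothetical number of candidate peptides, full database
n_chunks <- 20           # the database was split into 20 fasta files

evalue_full  <- spec_prob * full_candidates
evalue_chunk <- spec_prob * (full_candidates / n_chunks)

# Under this model the same match looks n_chunks times more significant
# when it is scored against a single chunk instead of the full database.
ratio <- evalue_full / evalue_chunk
print(c(full = evalue_full, chunk = evalue_chunk, ratio = ratio))
```

Under this model, concatenating per-chunk results would overstate the confidence of every match, which is why the chunked searches are not equivalent to one full search.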

written 4 weeks ago by Laurent Gatto

That's what I was beginning to think. At the moment, the run is failing with a GC overhead limit error, and I have been trying to figure out whether there is a workaround, such as splitting the fasta file. Thanks for answering.

written 4 weeks ago by laural710

You could try running MSGF+ natively. I have no idea whether running it through the MSGFplus package adds any actual memory overhead, but it is worth a try. If this fails, you'll probably need a computer with more memory.
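
A native run could be sketched from R roughly as follows. The helper `msgf_args`, the jar path, the heap size, and the file name `full_database.fasta` are all hypothetical placeholders; only the `-Xmx` Java option and the MS-GF+ `-s`, `-d`, and `-o` options are taken as given, and the remaining search settings (tolerance, enzyme, modifications) are left out for brevity.

```r
# Build the argument vector for a native MS-GF+ run.
# `jar` and `heap_mb` are hypothetical defaults; adjust them to your
# installation and to the memory actually available on your machine.
msgf_args <- function(spectrum, database, output,
                      jar = "MSGFPlus.jar", heap_mb = 16000) {
  c(paste0("-Xmx", heap_mb, "M"),  # maximum Java heap size
    "-jar", jar,
    "-s", spectrum,                # spectrum file
    "-d", database,                # fasta database
    "-o", output)                  # mzid output file
}

args <- msgf_args("file1.mzML", "full_database.fasta", "file1.mzid")
cat("java", args, "\n")

# To actually run the search (requires Java and the MS-GF+ jar):
# system2("java", args)
```

If memory serves, `runMSGF()` in the MSGFplus package also accepts a `memory` argument (in MB) for the Java virtual machine, so raising that may be worth trying before switching to the command line.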

written 4 weeks ago by Laurent Gatto