Spectra and PSMatch- trouble in reducePSMs, joinSpectraData, or countIdentifications?
1
0
Entering edit mode
Jane_Huang • 0
@4ce3b8bd
Last seen 26 days ago
United States

Hello,

I used MS-GF+ on the mzML file, using the .faa file as reference, and produce the .mzid file. All files are in the following link: https://figshare.com/s/b65fc594da19f0f9347f . The raw data was produced by Bruker Impact II (Q-TOF MS/MS), and was transformed into mzML file by ProteoWizard msconvert.

However, when I tried to use the function countIdentifications, an error popped up. I am not sure whether there was something wrong about the reducePSMs, joinSpectraData, or countIdentifications.

My codes were as below. The MS-GF+ was run in Mac terminal; other codes were in R Studio.

java -version 
java version "1.8.0_421"

java -Xmx3500M -jar MSGFPlus.jar -s "HH090441864_2024-09-09_2648.mzML" -d "protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_2.mzid
 library(Spectra)
library(PSMatch)
sp<-Spectra("HH090441864_2024-09-09_2648.mzML")
id<-readPSMs("HH090441864_2.mzid")

id_filtered<-filterPSMs(id)
id_f_r<-reducePSMs(id_filtered)

sp_ident <- joinSpectraData(sp, id_f_r,
                      by.x = "spectrumId",
                      by.y = "spectrumID")

sp_ident_count <- countIdentifications(sp_ident)

#Error: BiocParallel errors
 # 1 remote errors, element index: 1
 # 0 unevaluated and other errors
 # first remote error:
#Error in as.vector(x, mode): coercing an AtomicList object to an atomic vector is supported only for
  #objects with top-level elements of length <= 1

Thank you in advance for any thoughts!

ProteomicsWorkflow PSMatch Spectra RforProteomics • 483 views
ADD COMMENT
1
Entering edit mode
@laurent-gatto-5645
Last seen 25 days ago
Belgium

Tthis happens because the sequence spectraVariable is expected to be an atomic vector or a List with elements of length 1, but it's not - some of your MS2 scans have > 1 sequence:

> lengths(sp_ident[["sequence"]])
  [1] 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 2 1 0 1 1 0 1 0
 [38] 0 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
 [75] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 2 1 1 0 0 0 0 1 0 0
> sp_ident[["sequence"]][30]
CharacterList of length 1
[[1]] LERRFGR EFGKGIKR

You should filter out your PSMs first, and only keep the best one for each MS2 scan.

This can be traced back to the identification data:

> which(lengths(id_f_r$sequence) > 1)
scan=1054  scan=109  scan=127 scan=1284  scan=133 scan=1377 scan=1459 scan=1464 
       22        26        72        79        93       106       135       138 
scan=1611 scan=1629 scan=1833 scan=2296 scan=2301 scan=2369 scan=2433 scan=2543 
      190       196       272       419       422       448       474       510 
scan=2547 scan=2586   scan=30 scan=3089 scan=3157 scan=3201  scan=322 scan=3269 
      512       517       650       680       705       720       728       750 
scan=3333 scan=3718 scan=3758 scan=3864 scan=4104 scan=4108 scan=4111 scan=4126 
      775       914       925       956      1035      1037      1038      1044 
scan=4168 scan=4289 scan=4292 scan=4709 scan=4742 scan=4802 scan=4822 scan=4876 
     1053      1090      1091      1230      1241      1267      1272      1297 
scan=4923 scan=4961 scan=5122 scan=5524 scan=5692  scan=571 scan=5713 scan=5718 
     1318      1328      1392      1538      1579      1585      1586      1588 
scan=5948 scan=6086 scan=6177 scan=6206 scan=6234 scan=6414 scan=6422 scan=6424 
     1671      1735      1757      1772      1790      1852      1856      1858 
scan=6429 scan=6463 scan=6618 scan=6668 scan=6788 scan=6943 scan=6954 scan=7136 
     1861      1872      1924      1942      1984      2060      2065      2137 
scan=7152 scan=7171 scan=7282 scan=7369 scan=7650 scan=7660 scan=7896 scan=7904 
     2145      2155      2194      2215      2314      2316      2375      2376 
scan=7934  scan=804 scan=8308   scan=91 
     2380      2412      2498      2568

I suggest you check why this happens.

ADD COMMENT
0
Entering edit mode

Thank you very much for the detailed explanation!! Sorry for the delay response. Indeed, there were multiple MS2 scan matching to more than one peptide- and strangely all peptides were ranked 1.

In fact, all of the matches in my original data were ranked 1, and when I did filterPSMs, no peak was removed due to rank >1. I also didn't use decoy database- can this be the reason why there's a problem with ranking?

I used MSGFPlus for generating my mzID file, and also tried posting the question there (https://github.com/MSGFPlus/msgfplus/issues/156#issuecomment-2380245837), but haven't got a definite solution yet (currently trying to create a new column for redundant sequence).

> id_0927_filtered<-filterPSMs(id_0927)
Starting with 32852 PSMs:
Removed 0 decoy hits.
Removed 0 PSMs with rank > 1.
Removed 29928 shared peptides.
2924 PSMs left.
ADD REPLY
0
Entering edit mode

I used scan 30 as an example:

>test2<-as.data.frame(id_filtered)
>test3<-filter(test2,spectrumID="scan=30")

>test3
  sequence spectrumID chargeState rank passThreshold experimentalMassToCharge calculatedMassToCharge
1  LERRFGR    scan=30           3    1          TRUE                 312.1834               311.8507
2 EFGKGIKR    scan=30           3    1          TRUE                 312.1834               312.1871
    peptideRef modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
1  Pep_LERRFGR      0   FALSE    M   R   147 153 XP_023107941.2         321            
2 Pep_EFGKGIKR      0   FALSE    G   R   872 879 XP_044889830.1         935            
                                                                                  DatabaseDescription
1 XP_023107941.2 LOW QUALITY PROTEIN: putative testis-specific Y-encoded-like protein 3 [Felis catus]
2               XP_044889830.1 xin actin-binding repeat-containing protein 2 isoform X4 [Felis catus]
  scan.number.s. scan.start.time acquisitionNum                     spectrumFile                idFile
1             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
2             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
  MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
1             20                33     8.410246e-07     9.655486          <NA>    <NA>      NA          NA
2             20                33     8.410246e-07     9.804787          <NA>    <NA>      NA          NA
  subOriginalResidue subReplacementResidue subLocation
1               <NA>                  <NA>          NA
2               <NA>                  <NA>          NA
ADD REPLY
1
Entering edit mode

Hmm, it is a bit surprising that you have so many the same scores. I would try to look at the actual spectra, and check the matching peaks, while filtering them out and continue with the unique matches.

ADD REPLY

Login before adding your answer.

Traffic: 633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6