Which preprocessing steps are needed before running limpa?
2
0
Entering edit mode
@5b6cc501
Last seen 9 days ago
Slovakia

I have DIA experiment with 16 subjects, 96 samples in total (two independent groups, and each subject measured at 6 time points). The data were first processed with quantms nextflow pipeline (--min_peptide_length 7 --max_peptide_length 30 --allowed_missing_cleavages 1 --targeted_only false --min_pr_mz 400 --max_pr_mz 1000) that produced 'diann_report.tsv' output file, first row of which (when transposed) looks like

File.Name                 "6007_TIMS3_RSLC-110m_TFA_DIA_031_1_BB1_1_36147.d"                  
Run                       "6007_TIMS3_RSLC-110m_TFA_DIA_031_1_BB1_1_36147"                    
Protein.Group             "P36578"                                                            
Protein.Ids               "P36578"                                                            
Protein.Names             "RL4_HUMAN"                                                         
Genes                     "RPL4"                                                              
PG.Quantity               "117.003"                                                           
PG.Normalised             "103.357"                                                           
PG.MaxLFQ                 "463.488"                                                           
Genes.Quantity            "117.003"                                                           
Genes.Normalised          "103.357"                                                           
Genes.MaxLFQ              "463.488"                                                           
Genes.MaxLFQ.Unique       "463.488"                                                           
Modified.Sequence         "AAAAAAALQAK"                                                       
Stripped.Sequence         "AAAAAAALQAK"                                                       
Precursor.Id              "AAAAAAALQAK2"                                                      
Precursor.Charge          "2"                                                                 
Q.Value                   "0.000165728"                                                       
PEP                       "0.012917"                                                          
Global.Q.Value            "0.211434"                                                          
Protein.Q.Value           "0.0979681"                                                         
PG.Q.Value                "0.0983379"                                                         
Global.PG.Q.Value         "0.00277264"                                                        
GG.Q.Value                "0.0983379"                                                         
Translated.Q.Value        "0"                                                                 
Proteotypic               "1"                                                                 
Precursor.Quantity        "117.003"                                                           
Precursor.Normalised      "103.357"                                                           
Precursor.Translated      "132.89"                                                            
Translated.Quality        NA                                                                  
Ms1.Translated            "171.507"                                                           
Quantity.Quality          "0.946464"                                                          
RT                        "35.1009"                                                           
RT.Start                  "35.0157"                                                           
RT.Stop                   "35.1577"                                                           
iRT                       "-13.7274"                                                          
Predicted.RT              "35.0973"                                                           
Predicted.iRT             "-13.7144"                                                          
First.Protein.Description "Large ribosomal subunit protein uL4"                               
Lib.Q.Value               "0.00548697"                                                        
Lib.PG.Q.Value            "0.00129199"                                                        
Ms1.Profile.Corr          "0.649789"                                                          
Ms1.Area                  "151.004"                                                           
Evidence                  "1.59567"                                                           
Spectrum.Similarity       "0.0838866"                                                         
Averagine                 "0.035785"                                                          
Mass.Evidence             "1.36832"                                                           
CScore                    "0.987797"                                                          
Decoy.Evidence            "0"                                                                 
Decoy.CScore              "-1e+07"                                                            
Fragment.Quant.Raw        "117.003;0;0;0;63.0011;47.0011;74.0021;0;0;62.0011;56.0011;0;"      
Fragment.Quant.Corrected  "117.003;0;0;0;63.0011;47.0011;74.0021;0;0;62.0011;56.0011;0;"      
Fragment.Correlations     "0.946464;0;0;0;0.691234;0.244415;0.649222;0;0;0.691234;0.244415;0;"
MS2.Scan                  "37086"                                                             
IM                        "0.848068"                                                          
iIM                       "0.852708"                                                          
Predicted.IM              "0.846738"                                                          
Predicted.iIM             "0.853832"

Now

library(limpa)

y.peptide <- readDIANN("diann_report.tsv", q.cutoffs = 0.01, q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"))
print(dim(y.peptide))
[1] 16974    96

After import, limma::plotDensities() plot looks like plotDensities after import

and regarding missing values

y.peptide$E %>% is.na() %>% apply(2, mean) %>% plot()

plot3

# some basic filters
y.peptide <- filterNonProteotypicPeptides(y.peptide)
y.peptide <- filterCompoundProteins(y.peptide)
print(dim(y.peptide))
[1] 15471    96

dpc gives 0.63, dpcON(robust=TRUE) gives 0.64, dpcCN gives 1.08. I chosen dpcCN:

y.protein <- dpcQuant(y.peptide, "Protein.Group", dpc=dpcCN(y.peptide))
dim(y.protein)
[1] 2002   96

Now densities look like

plotDensities after dpcQuant

Should I normalize my data or remove outlying samples before using limpa? And if yes, at which stage - immediately after import, before dpc(), before dpcQuant(), or before dpcDE()?

Normalization limpa • 284 views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

Please see the example limpa analyses given at http://github.com/smythlab/limpa, especially the kidney cancer case study.

limpa does not generally require any preprocessing. If normalization is required, it is applied after dpcQuant() and before dpcDE(). We do not do automatic normalization, because the requirement for normalization is dependent on the dataset and on the scientific purpose of the study. For example, in Corso el al 2026 we used limpa to compare IPs to IgG controls, and normalization would have been completely inappropriate because it would remove the global differences in detection between the IP and the control.

Outlier samples can be removed, but are usually retained and handled automatically with downweighting by dpcDE(). I would judge outliers from a plotMDSUsingSEs() rather than from a density plot.

I usually keep all the data in until after dpcQuant(), then I remove proteins that are detected in very few samples. For example, if I want to keep proteins that were detected in least 20 samples, I would use:

keep <- rowSums(y$other$n.observations > 0) >= 20
yfilt <- y[keep,]

If I then want to make every sample have the same density of log-intensities values, I could apply quantile normalization:

yfilt$E <- normalizeQuantiles(yfilt$E)

I assume you are inputing precursor ions rather than actual peptides. The second density plot you show looks surprising, but I can't comment on plots in isolation, without any accompanying code or QC plots. If samples have radically different percentages of missing values, then it is quite correct that they will not have the same distribution of log-intensities after quantification. I would be able to give more meaningful advice if I knew the percentage of missing values across samples and what dpc was estimated. I also like to see actual code. I don't put a log of emphasis on density plots, like you the ones you have shown, because there is no assumption in limma or limpa that log-intensities need to have a normal-like densities for each sample.

ADD COMMENT
0
Entering edit mode

Hi Gordon Smyth, thank you a lot for very fast and valuable response. I edited the question to provide more details.

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

It looks to me that everything is working ok and limpa has done the right thing. You data has a lot of missing values, with the proportion of missing values being very high, around 50% or more, for all but two samples. There are two outier samples that have only 20-30% missing values. Missing values are evidence that the underlying protein expression is low, so limpa has naturally concluded that the two outlier samples with few NAs have higher average expression than the other samples. The two samples with few NAs show up as outlier peaks in the density plots. This behavior is as intended.

The question now is, what is biologically different about the two samples with fewer NAs? Does the higher detection rate have a biological cause, or is it a technical effect, for example a function of the total amount of protein/peptide material you started with for those samples. If you think it is likely to be a technical effect, then should normalize the protein quantifications, using either quantile or cyclic-loess normalization. If it is a genuine biological effect, that you need to be more cautious about normalization.

By the way, I think that the nice looking densities plots from y.peptide is a bit of a mirage. The samples differ greatly in terms of percentage of missing values, so there is no mathematical reason why the observed intensities should have the same distribution for every sample. One would expect the observed values for samples with lots of NAs to be a more selective subset. I suspect that the DIA-NN normalization has made the precursor-level density plots look more uniform than they really should.

ADD COMMENT

Login before adding your answer.

Traffic: 1422 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6