Question

msmsTests, NA values in test.results output

0

Entering edit mode

laural710 • 0

@laural710-14567

Last seen 6.2 years ago

Hi All

I'm strugglin with the analysis of DEP in a protein dataset i have. I have managed to analyse to this point in R using variants of MSGF etc and quantified via SI. Having run the samples via DAPAR and Prostar, i have knowldege of the number (or at least the region) of DEP to expect in this particular comparison (eg. CTRL to 0.01 mg/L). Using msmsTests, i create the hypothesis tests (H0 and H1), subset the data set to only these two groups and then run msms.EdgeR() with div = to the colSums().

This all runs fine but when i get to the portion or adding FDR to the data via test.results, something weird happens where instead of getting a dataset (so to speak) of protein id, raw spectral counts for control condition, raw spectral conts for comparitive condition, logfold change (LFC.AV) (logfc, LR, p.value etc), the second column (which is the raw spectral counts for the ctrl condition is always NA irrespective of which concentration i choose. I''m unsure why this is the case, when the msms.EdgeR runs ok with the comparisons. Has this happened to anyone else? Is there a particular reason this might occur that i can check to see if my data has been entered incorrectly?

I'm scratching my head over what would cause this. I did wonder whether it had anything to do with my estimates of dispersion used in the msms.EdgeR model build but i'm unsure how to proceed now.

Any help would be much appreciated. I can of course provide the code but i am hoping someone else has stumbled across this.

Many thanks :)

Proteomics msmstests FDR • 2.6k views

ADD COMMENT • link updated 7.3 years ago by Laurent Gatto 1.6k • written 7.3 years ago by laural710 • 0

0

Entering edit mode

Do you have any `NA` values in your original count data by any chance?

ADD REPLY • link 7.3 years ago Laurent Gatto 1.6k

0

Entering edit mode

No, this is what is frustrating me. As part of the preparation of my experimental MSnSet, i remove any lines with NA (pNA=1/3 and then impute any remaining NA values with knn). When combining the data set, i use imputation QRILC to ensure that there are no NA values and then check with is.na(exprs(e2)) || exprs(e2) < 0) which is always False.

ADD REPLY • link 7.3 years ago laural710 • 0

0

Entering edit mode

It would probably be helpful to provide code, detailed output and possibly example data. Also, don't forget to add the output of `sessionInfo()`.

ADD REPLY • link 7.3 years ago Laurent Gatto 1.6k

0

Entering edit mode

Many thanks for your quick reply. I've attached the code that i have used but i'm unsure if it is in the correct format? Any help would be greatly appreicated

ADD REPLY • link 7.3 years ago laural710 • 0

0

Entering edit mode

BaP_Exp consists of 12 samples, 3 biological replicates of 4 conditions. The data below as been subsetted to two comparisons. BaP_exp is a large MSnSet.

Code for analysis mix1=BaP_exp[,BaP_exp$Group %in% c("Ctrl","5")]
new=pp.msms.data(mix1) pData(new) ##6 samples
H0="y~" H1="y~Group" dispersion=colSums(exprs(new))

check for na values

is.na(exprs(new))|| exprs(new) < 0 #always FALSE
mixed=msms.edgeR(new,H0,H1,div=dispersion, fnm="Group")

returns NA values or more recently error: negative counts not allowed

SessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:

[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252

[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:

[1] parallel statsgraphic grDevices utils datasets methods base

other attached packages:

[1] msmsTests_1.18.0 msmsEDA_1.18.0 MSnID_1.14.0 MSGFplus_1.14.0 BiocInstaller_1.30.0 MSnbase_2.6.1 
[7] ProtGenerics_1.12.0 BiocParallel_1.14.1 mzR_2.14.0 Rcpp_0.12.19 Biobase_2.40.0 BiocGenerics_0.26.0

loaded via a namespace (and not attached):

Error in x[["Version"]] : subscript out of bounds

In addition: Warning messages:

1: In FUN(X[[i]], ...) :

DESCRIPTION file of package 'IRanges' is missing or broken

2: In FUN(X[[i]], ...) :

DESCRIPTION file of package 'limma' is missing or broken

ADD REPLY • link updated 7.3 years ago by Laurent Gatto 1.6k • written 7.3 years ago by laural710 • 0

0

Entering edit mode

It's difficult to say much without the actual code and it's output, as it is executed in R. Two things though

there are warnings at the end of your sessionInfo; I would recommend fixing them befoe doing anything else. Re-installing the packages light help. If you have issues or more questions about this step, please ask them in a new question on the forum.
Following exactly what is presented in the msmsTests vignette, your H0 should be defined as H0 = "y~1". I don't know if this is the reason for your issue though.

ADD REPLY • link 7.3 years ago Laurent Gatto 1.6k

0

Entering edit mode

How can i provide the code as its produced in R? Do i just copy it under the line of run code?

I've never been able to attach code before on this forum and am slightly lost. Sorry for the inconvenience

ADD REPLY • link 7.3 years ago laural710 • 0

0

Entering edit mode

I didn't mean to attach it, but run all the commands in your R session, then copy the whole lot and their output and paste it here. For example:

> library(msdata)
> library(MSnbase)
> f <- msdata::proteomics(full.names = TRUE, pattern = "TMT11")
> f
[1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/3.5/msdata/proteomics/MS3TMT11.mzML"
> x <- readMSData(f, mode = "onDisk")
> x
MSn experiment data ("OnDiskMSnExp")
Object size in memory: 0.5 Mb
- - - Spectra data - - -
 MS level(s): 1 2 3
 Number of spectra: 994
 MSn retention times: 45:27 - 47:6 minutes
- - - Processing information - - -
Data loaded [Thu Oct 18 15:37:06 2018]
 MSnbase version: 2.7.8
- - - Meta data  - - -
phenoData
  rowNames: MS3TMT11.mzML
  varLabels: sampleNames
  varMetadata: labelDescription
Loaded from:
  MS3TMT11.mzML
protocolData: none
featureData
  featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
  fvarLabels: fileIdx spIdx ... spectrum (30 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'

ADD REPLY • link 7.3 years ago Laurent Gatto 1.6k

0

Entering edit mode

Many thanks for your quick reply. I've ran the code again and am now getting a new error message. I've updated my packages since i first asked my question, but i don't think that is what is going wrong. Library(MSGFplus, MSnbase, MSnID,vsn, imputeLCMD)

> t1=c("20171124VS37.mzML")

> q1=c("20171124VS37.mzid")
> bap_0a=readMSData(t1, mode="onDisk")
> bap_0a=addIdentificationData(bap_0a,q1)
> t2=c("20171124VS38.mzML")
> q2=c("20171124VS38.mzid")
> bap_0b=readMSData(t2, mode="onDisk")
> bap_0b=addIdentificationData(bap_0b,q2)
> bap_0c=readMSData(t3, mode="onDisk") .... (3 MORE SAMPLES)

> nms=c(paste0("bap_",0,c("a","b","c")),
+ paste0("bap_",5,c("a","b","c")))
> tmp <- sapply(nms, function(.bap) {
+ cat("Processing", .bap, "... ")
+ x <- get(.bap, envir = .GlobalEnv)
+ x=quantify(x,method="SI")
+ x <- topN(x, groupBy = fData(x)$DatabaseAccess, n = 3)
+ nPeps <- nQuants(x, fData(x)$DatabaseAccess)
+ x <- combineFeatures(x, fData(x)$DatabaseAccess,
+ redundancy.handler = "unique",
+ fun="sum", na.rm = TRUE, CV=FALSE)
+ exprs(x) <- exprs(x) * (3/nPeps)
+ x <- updateFvarLabels(x, .bap)
+ varnm <- sub("bap", "bap", .bap)
+ assign(varnm, x, envir = .GlobalEnv)
+ cat("done\n")
+ })
> ctrl=combine(bap_0a,bap_0b)
> ctrl=combine(ctrl, bap_0c)
> ctrl=filterNA(ctrl, pNA=1/3)
> ctrl=impute(ctrl, method="knn")
> sampleNames(ctrl)=c("CTRL.a","CTRL.b","CTRL.c")
> bap5=combine(bap_5a,bap_5b)
> bap5=combine(bap5,bap_5c)
> bap5=filterNA(bap5,pNA=1/3)
> bap5=impute(bap5, method="knn")
> sampleNames(bap5)=c("BaP.5a","BaP.5b","BaP.5c")
> BaP_exp=combine(ctrl,bap5)
> BaP_exp=impute(BaP_exp, "QRILC")
> BaP_exp
MSnSet (storageMode: lockedEnvironment)
assayData: 2265 features, 6 samples
element names: exprs
protocolData: none
phenoData
sampleNames: CTRL.a CTRL.b ... BaP.5c (6 total)
varLabels: sampleNames
varMetadata: labelDescription
featureData
featureNames: sp|P56839|PEPM_MYTED sp|P91958|TPM_MYTGA ... tr|U3PWG3|U3PWG3_9BIVA (2265 total)
fvarLabels: fileIdx.bap_0a spIdx.bap_0a ... CV.20171124VS42.mzML.1.bap_5c (390 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
- - - Processing information - - -
Combined [1461,3] and [1285,3] MSnSets Thu Oct 18 15:33:34 2018
Data imputation using QRILC Thu Oct 18 15:33:41 2018
Using default parameters
MSnbase version: 2.6.1

ADD REPLY • link 7.3 years ago laural710 • 0

0

Entering edit mode

DEP (load msmsTests & msmsEDA)

pData(BaP_exp)$Group=rep(c("Ctrl","5"),each=3)
> pData(BaP_exp)
sampleNames Group
CTRL.a 20171124VS37.mzML Ctrl
CTRL.b 20171124VS38.mzML Ctrl
CTRL.c 20171124VS39.mzML Ctrl
BaP.5a 20171124VS40.mzML 5
BaP.5b 20171124VS41.mzML 5
BaP.5c 20171124VS42.mzML 5

> bap_data=pp.msms.data(BaP_exp)
> processingData(BaP_exp)
- - - Processing information - - -
Combined [1461,3] and [1285,3] MSnSets Thu Oct 18 15:33:34 2018
Data imputation using QRILC Thu Oct 18 15:33:41 2018
Using default parameters
MSnbase version: 2.6.1

> H0="y~1"

> H1="y~Group"
> disp2=colSums(exprs(bap_data))
> disp2
CTRL.a CTRL.b CTRL.c BaP.5a BaP.5b BaP.5c
40684.87 40461.76 40564.24 38956.89 39035.39 38819.56

> is.na(exprs(bap_data)) || exprs(bap_data) < 0
[1] FALSE

mixed=msms.edgeR(bap_data,H0,H1,div=disp2, facs=facs)
Error in glmLRT(fit, coef = vc) :
Need at least two columns for design, usually the first is the intercept column

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] imputeLCMD_2.0 impute_1.54.0 pcaMethods_1.72.0 norm_1.0-9.5 tmvtnorm_1.4-10 gmm_1.6-2
[7] sandwich_2.5-0 Matrix_1.2-14 mvtnorm_1.0-8 vsn_3.48.1 MSnID_1.14.0 MSGFplus_1.14.0
[13] msmsTests_1.18.0 msmsEDA_1.18.0 BiocInstaller_1.30.0 MSnbase_2.6.1 ProtGenerics_1.12.0 BiocParallel_1.14.1
[19] mzR_2.14.0 Rcpp_0.12.19 Biobase_2.40.0 BiocGenerics_0.26.0

ADD REPLY • link 7.3 years ago laural710 • 0

score 2 · Answer 1 · 2018-10-19

I'll attempt an answer here. I would suggest you reformulate your logical operator that checks whether you have any missing or negative values. The double operators && and || can be very confusing. Here's an example showing that they don't necessarily work in cases such as yours.

> x
   F1  F2   F3
P1  1 1.2  1.3
P2  1 1.0  1.0
P3  1 1.0  1.0
P4  1 1.0 -1.0
>  is.na(x) || x < 0
[1] FALSE
> anyNA(x)
[1] FALSE
> any(x < 0)
[1] TRUE

Also, care should be taken in the presence of NAs

> x[4, 3] <- 1
> x[2, 2] <- NA
> x
   F1  F2  F3
P1  1 1.2 1.3
P2  1  NA 1.0
P3  1 1.0 1.0
P4  1 1.0 1.0
>  is.na(x) || x < 0
[1] FALSE
> anyNA(x)
[1] TRUE
> any(x < 0)
[1] NA

You can also use any(..., na.rm = TRUE).

I would suggest to use the following syntax

> x
   F1  F2  F3
P1  1 1.2 1.3
P2  1  NA 1.0
P3  1 1.0 1.0
P4  1 1.0 1.0
> any(is.na(x) | x < 0)
[1] TRUE
> x[2, 2] <- 1
> x[4, 3] <- -1
> x
   F1  F2   F3
P1  1 1.2  1.3
P2  1 1.0  1.0
P3  1 1.0  1.0
P4  1 1.0 -1.0
> any(is.na(x) | x < 0)
[1] TRUE

Re the double form of the logical operators, have a look at the manual page (?& will get you there), which says:

‘&’ and ‘&&’ indicate logical AND and ‘|’ and ‘||’ indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in ‘if’ clauses.

Hope this help!