Getting data into (and out of) the MSnSet object
3
0
Entering edit mode
moldach ▴ 20
@moldach-8829
Last seen 15 months ago

I would like to use the msmsTests package for differential expression analysis of proteins (spectral counts); however, the vignette only covers working with a MSnSet class and not how to get your data into that class.

While the pRoloc vignette does mention how to convert csv files to the R data class MSnSet using the readMSnSet constructor function (section 2.2.2) the problem is when trying to follow either the msmsTests or RforProteomics vignettes they use a different type of MSnSet object which produces different results, for example:

## pRoloc creation of MSnSet:

f1 <- dir(system.file("extdata", package = "pRolocdata"),
full.names = TRUE, pattern = "exprsFile.csv")
f2 <- dir(system.file("extdata", package = "pRolocdata"),
full.names = TRUE, pattern = "fdataFile.csv")
f3 <- dir(system.file("extdata", package = "pRolocdata"),
full.names = TRUE, pattern = "pdataFile.csv")
featureDataFile = f2,
phenoDataFile = f3,
sep = ",")
pData(tan2009r1)

## RforProteomics creation of MSnSet:

library("msmsEDA")
library("msmsTests")
data(msms.dataset)
## Pre-process expression matrix
e <- pp.msms.data(msms.dataset)
pData(e)


As you can tell the pData() result is different for both of these. It would be nice if I could figure out how to convert the msms.dataset (MSnSet class) to something I can investigate to see how these objects differ - they won't let you investigate using head() or tail(). If I could see what it looks like I could figure out how to convert my data into this class for use in differential analysis and visualization.

proteomics MSnbase MSnSet readMSnSet msmsTests • 1.3k views
0
Entering edit mode
@laurent-gatto-5645
Last seen 2 days ago
Belgium

There are multiple ways to create MSnSet objects, but I'll assume you have a spreadsheet.

The easiest way is the to use readMSnSet2, that only takes one file as input and populates the quantitative data (i.e. what is returned by exprs) and feature meta-data (i.e fData). The sample meta-data (i.e. pData) need to be added manually, as this information is not available in that single spreadsheet. From ?readMSnSet2:

> ?readMSnSet2
> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+                full.names = TRUE,
+                pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
> basename(f0)
[1] "hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"
> res <- readMSnSet2(f0, ecol = 8:27)
> res
MSnSet (storageMode: lockedEnvironment)
assayData: 5033 features, 20 samples
element names: exprs
protocolData: none
phenoData: none
featureData
featureNames: 1 2 ... 5033 (5033 total)
fvarLabels: UniProt.Accession.for.Protein.Group
UniProt.ID.for.Protein.Group ... X.35 (25 total)
experimentData: use 'experimentData(object)'
Annotation:
- - - Processing information - - -
MSnbase version: 2.5.5
> pData(res)
data frame with 0 columns and 20 rows


That dataset has 20 columns/samples. Assuming you have 2 groups (A and B) with 10 samples each, you could populate the pData slots with:

> ## set simple sample names
> sampleNames(res) <- 1:20
> ## update pData
> res$group <- rep(c("A", "B"), each = 10) > head(pData(res)) group 1 A 2 A 3 A 4 A 5 A 6 A > tail(pData(res)) group 15 B 16 B 17 B 18 B 19 B 20 B  You can as many sample variables as you need; for example replicate status: > res$rep <- rep(1:2, 10)
group rep
1     A   1
2     A   2
3     A   1
4     A   2
5     A   1
6     A   2


Hope this helps.

0
Entering edit mode

Hello and thank you for trying to help me Laurent.

I'll put this into two comments for clarity.
One thing I wanted to know was how to actually inspect the sample data provided in the pRoloc data set. I'll use hyperLOPIT-SIData-ms3-rep12-intersect.csv since this is what you showed above.

I figured out how to inspect it by doing the following:

> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+                full.names = TRUE,
+                pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")

>f0 # Shows where the file from the package is located on hard drive
"C:/Users/moldach/Documents/R/win-library/3.4/pRolocdata/extdata/hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"

Then you can simply read that location in via read.csv() as a data.frame and inspect what it looks like with head(), tail(), str() etc.

>csv <- read.csv("C:/Users/moldach/Documents/R/win-library/3.4/pRolocdata/extdata/hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz")
UniProt.Accession.for.Protein.Group UniProt.ID.for.Protein.Group
1
2                              Q9JHU4                  DYHC1_MOUSE
UniProt.Protein.Description Quantified.Peptides            X
1                                                                                                Experiment 1 Experiment 2
2 Cytoplasmic dynein 1 heavy chain 1 OS=Mus musculus GN=Dync1h1 PE=1 SV=2 - [DYHC1_MOUSE]                 175          166
Quantified.PSMs..Rep.1.          X.1 Normalized.TMT.10.plex.reporter.ion.distribution..Replicate.1.   X.2   X.3   X.4   X.5   X.6
1            Experiment 1 Experiment 2                                                        126.000  127N  127C  128N  128C  129N
2                     322          273                                                          0.028 0.034 0.024 0.014 0.026 0.045
X.7   X.8   X.9    X.10 Normalized.TMT.10.plex.reporter.ion.distribution..Replicate.2.  X.11  X.12  X.13  X.14  X.15  X.16  X.17
1  129C  130N  130C 131.000                                                        126.000  127N  127C  128N  128C  129N  129C  130N
2 0.107 0.341 0.059   0.321                                                          0.037 0.064 0.058 0.059 0.067 0.078 0.140 0.208
X.18    X.19 Phenotype.discovery.and.SVM.classification.of.protein.localization              X.20                      X.21
1  130C 131.000                                                   phenoDisco Input phenoDisco Output Curated phenoDisco Output
2 0.141   0.147                                                            unknown           unknown                   unknown
X.22               X.23      X.24                              X.25                          X.26
1 SVM marker set SVM classification SVM score SVM classification (top quartile) Final Localization Assignment
2        unknown         Peroxisome     0.303                      unclassified                  unclassified
X.27 Subcellular.Curation                    X.28                 X.29              X.30
1 First localization evidence?   Curated Organelles Cytoskeletal Components Trafficking Proteins Protein Complexes
2                            N                                       Dynein                                 Dynein
X.31             X.32              X.33             X.34                  X.35
1 Signaling Cascades Oct4 Interactome Nanog Interactome Sox2 Interactome Cell Surface Proteins
2 

What may not be clear to future people visiting this page is that when you did

res <- readMSnSet2(f0, ecol = 8:27)

8:27 are referring to the columns of NSAF (normalized) spectral counts.

Okay now that's clear to me let's move to the second part of the issue as this example you provided does not work with the msmsTests package (see comment below).

0
Entering edit mode

Just one clarification to avoid confusion to other readers - the protein expression data (in columns 8 to 27) aren't NSAF but TMT 10-plex data.

0
Entering edit mode
moldach ▴ 20
@moldach-8829
Last seen 15 months ago

So the example from the msmsTests vignette for how to do differential expression analysis with edgeR goes like this:

## Example
library(msmsTests)
data(msms.dataset)
e <- pp.msms.data(msms.dataset)
e
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e),2,sum)
res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat")
str(res)
head(res)

If we glance at what it looks like with pData() it looks like this

          treat batch
U2.2502.1  U200  2502
U2.2502.2  U200  2502
U2.2502.3  U200  2502
U2.2502.4  U200  2502
U6.2502.1  U600  2502
U6.2502.2  U600  2502

Compare this to what your example above looks like:

> head(pData(res))   group rep 1     A   1 2     A   2 3     A   1 4     A   2 5     A   1 6     A   2

By making a few adjustment to the msmsTests() example to match yours, it throws an error:

>null.f <- "y~rep"
>alt.f <- "y~group+rep"
>div <- apply(exprs(res),2,sum)
>res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
>res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in .isAllZero(counts) :
count matrix must be integer or double-precision



I guess the problem I don't know how to investigate what e looks like nor why their example works but not yours?

Similarly if I load my data like your example and run msms.edgeR() it doesn't work either (but I get a different error):

>head(csv)
Gene.names     Treat_1     Treat_2      Ctrl_1      Ctrl_2
1       PLEC 0.000000000 0.000000000 0.000061100 0.000065400
2      PRKDC 0.000587599 0.000532021 0.000238412 0.000207283
3      AHNAK 0.000045800 0.000077700 0.000000000 0.000000000
4    DYNC1H1 0.000000000 0.000000000 0.000177959 0.000166366
5      MKI67 0.000055200 0.000112417 0.000152282 0.000141068
6     SPTAN1 0.000218052 0.000222106 0.000211693 0.000200792
>res <- readMSnSet2(csv, ecol = 2:5)
>sampleNames(res)<- 1:4
>res$group <- rep(c("Treatment","Control"),each=2) >res$rep <- rep(1:2,2)
group rep
1 Treatment   1
2 Treatment   2
3   Control   1
4   Control   2
> res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in dispCoxReid(y, design = design, offset = offset, subset = subset,  :
no data rows with required number of counts

Again, I really appreciate your help!

0
Entering edit mode

The reason for this is that the data that you try to analyse with msmsTests isn't count data (see my comment above) - at least I think that's the reason thing go wrong. Here's how you can pretend it is; but first, let's reproduce the error:

> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+           full.names = TRUE,
+           pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
> res <- readMSnSet2(f0, ecol = 8:27, header = TRUE, skip = 1)
> sampleNames(res) <- 1:20
> res$group <- rep(c("A", "B"), each = 10) > res$rep <- rep(1:2, 10)
group rep
1     A   1
2     A   2
3     A   1
4     A   2
5     A   1
6     A   2
1     2     3     4     5     6     7     8     9    10    11    12    13
1 0.028 0.034 0.024 0.014 0.026 0.045 0.107 0.341 0.059 0.321 0.037 0.064 0.058
2 0.039 0.134 0.095 0.053 0.084 0.121 0.107 0.128 0.122 0.117 0.033 0.073 0.074
3 0.021 0.013 0.014 0.009 0.024 0.054 0.116 0.257 0.209 0.284 0.026 0.017 0.023
4 0.120 0.255 0.148 0.091 0.135 0.095 0.041 0.057 0.014 0.043 0.111 0.181 0.141
5 0.055 0.139 0.078 0.050 0.077 0.098 0.093 0.171 0.079 0.160 0.062 0.108 0.091
14    15    16    17    18    19    20
1 0.059 0.067 0.078 0.140 0.208 0.141 0.147
2 0.062 0.081 0.142 0.190 0.069 0.151 0.125
3 0.029 0.039 0.071 0.105 0.171 0.304 0.215
4 0.144 0.152 0.119 0.075 0.028 0.017 0.033
5 0.086 0.099 0.111 0.117 0.095 0.144 0.087
[ reached getOption("max.print") -- omitted 1 row ]

> null.f <- "y~rep"
> alt.f <- "y~group+rep"
> div <- apply(exprs(res), 2, sum)
>  msms.edgeR(res, alt.f, null.f, div = div, fnm = "group")
Error in dispCoxReid(y, design = design, offset = offset, subset = subset,  :
no data rows with required number of counts

If you look closely, it's not the same error as you have. Somehow, your exprs were characters (at least I believe so) - just make sure that before proceeding, they are numerics

> mode(exprs(res))
[1] "numeric"

If at this stage you get "character", stop and and report back.

Let's now convert the data to counts:

> exprs(res) <- round(exprs(res) * 1000)
X126  X127N  X127C  X128N  X128C  X129N  X129C  X130N  X130C   X131 X126.1
1  28000  34000  24000  14000  26000  45000 107000 341000  59000 321000  37000
2  39000 134000  95000  53000  84000 121000 107000 128000 122000 117000  33000
3  21000  13000  14000   9000  24000  54000 116000 257000 209000 284000  26000
4 120000 255000 148000  91000 135000  95000  41000  57000  14000  43000 111000
5  55000 139000  78000  50000  77000  98000  93000 171000  79000 160000  62000
X127N.1 X127C.1 X128N.1 X128C.1 X129N.1 X129C.1 X130N.1 X130C.1 X131.1
1   64000   58000   59000   67000   78000  140000  208000  141000 147000
2   73000   74000   62000   81000  142000  190000   69000  151000 125000
3   17000   23000   29000   39000   71000  105000  171000  304000 215000
4  181000  141000  144000  152000  119000   75000   28000   17000  33000
5  108000   91000   86000   99000  111000  117000   95000  144000  87000
[ reached getOption("max.print") -- omitted 1 row ]

And repeat the test:

> x <- msms.edgeR(res, alt.f, null.f, div = div, fnm = "group")
LogFC           LR   p.value
1  0.30452241 0.2343101026 0.6283462
2 -0.08757739 0.0247167596 0.8750749
3  0.07259539 0.0100301864 0.9202246
4 -0.01883603 0.0008671924 0.9765072
5  0.04444376 0.0069856511 0.9333902
6 -0.22945977 0.0473566608 0.8277284

Let me know how it goes.

0
Entering edit mode

Thank you very much,

exprs(res) <- round(exprs(res) * 1000)

appears to work although I'm not exactly sure why?

To be clear my data is normalized spectral counts (i.e. NSAF) prior to multiplication by 1000 and rounding. Is that the correct data format to be inputting?

0
Entering edit mode

The rounding is meant to get something that looks like count data, as the original values are scaled between 0 and 1. I would advise to check the distribution of you data and make sure it fits what is expected when running edgeR. If you could get the raw counts instead of NSAF, then you would get what is expected in msmsTests.

0
Entering edit mode
sutturka • 0
@sutturka-14580
Last seen 7 weeks ago
United States

I have a simple counts data corresponding to raw spectral counts. I followed the msmsTests Vignette and above suggestions and imported data as follows:

Data format: CSV

Accession,Control1,Control2,Treatment1,Treatment2
Protein1,103,16,46,48
Protein2,81,0,1,0
Protein3,81,0,1,0
Protein4,36,2,193,19
Protein5,36,2,193,19
Protein6,34,0,87,0
Protein7,34,0,87,0



R code:

f0 = read.csv("data.csv", sep = ",", header = TRUE, quote = "")

### fnames = 1 ensures the IDs are read as row-names and retained in subsequent steps.

dataObj$group <- rep(c("Control", "Treatment"), each = 2) ### Remove all zero rows dataObj <- pp.msms.data(dataObj) dim(dataObj) null.f <- "y~1" alt.f <- "y~group" div <- apply(exprs(dataObj),2,sum) ResultsObj <- msms.edgeR(dataObj, alt.f, null.f, div=div, fnm="group") str(ResultsObj) DE_results = as.data.frame(ResultsObj) ### DEPs on multitest adjusted p-values DE_results$adjp <- p.adjust(DE_results$p.value,method="BH") sum(DE_results$adjp<=0.05)


Generating the volcano plot:

### You will notice using fname =1 was handy as the IDs are retained in the results table
ql.tbl <- test.results(ResultsObj,dataObj,pData(dataObj)$group,"Control","Treatment",div, alpha=0.05,minSpC=2,minLFC=1,method="BH")$tres

res.volcanoplot(ql.tbl,
max.pval=0.05, min.LFC=1, maxx=3, maxy=NULL,
ylbls=3)