Search
Question: Getting data into (and out of) the MSnSet object
0
gravatar for moldach
4 months ago by
moldach10
United States
moldach10 wrote:

I would like to use the msmsTests package for differential expression analysis of proteins (spectral counts); however, the vignette only covers working with a MSnSet class and not how to get your data into that class.

While the pRoloc vignette does mention how to convert csv files to the R data class MSnSet using the readMSnSet constructor function (section 2.2.2) the problem is when trying to follow either the msmsTests or RforProteomics vignettes they use a different type of MSnSet object which produces different results, for example:

pRoloc creation of MSnSet:

f1 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "exprsFile.csv")
f2 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "fdataFile.csv")
f3 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "pdataFile.csv")
          tan2009r1 <- readMSnSet(exprsFile = f1,
                        featureDataFile = f2,
                        phenoDataFile = f3,
                        sep = ",")
pData(tan2009r1)

RforProteomics creation of MSnSet:

library("msmsEDA")
library("msmsTests")
data(msms.dataset)
## Pre-process expression matrix
e <- pp.msms.data(msms.dataset)
pData(e)

As you can tell the pData() result is different for both of these. It would be nice if I could figure out how to convert the msms.dataset (MSnSet class) to something I can investigate to see how these objects differ - they won't let you investigate using head() or tail(). If I could see what it looks like I could figure out how to convert my data into this class for use in differential analysis and visualization.

 

 

ADD COMMENTlink modified 3 months ago • written 4 months ago by moldach10
0
gravatar for Laurent Gatto
4 months ago by
Laurent Gatto950
United Kingdom
Laurent Gatto950 wrote:

There are multiple ways to create MSnSet objects, but I'll assume you have a spreadsheet.

The easiest way is the to use readMSnSet2, that only takes one file as input and populates the quantitative data (i.e. what is returned by exprs) and feature meta-data (i.e fData). The sample meta-data (i.e. pData) need to be added manually, as this information is not available in that single spreadsheet. From ?readMSnSet2:

> ?readMSnSet2
> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+                full.names = TRUE,
+                pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
> basename(f0)
[1] "hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"
> res <- readMSnSet2(f0, ecol = 8:27)
> res
MSnSet (storageMode: lockedEnvironment)
assayData: 5033 features, 20 samples 
  element names: exprs 
protocolData: none
phenoData: none
featureData
  featureNames: 1 2 ... 5033 (5033 total)
  fvarLabels: UniProt.Accession.for.Protein.Group
    UniProt.ID.for.Protein.Group ... X.35 (25 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  
- - - Processing information - - -
 MSnbase version: 2.5.5 
> pData(res)
data frame with 0 columns and 20 rows

That dataset has 20 columns/samples. Assuming you have 2 groups (A and B) with 10 samples each, you could populate the pData slots with:

> ## set simple sample names
> sampleNames(res) <- 1:20
> ## update pData
> res$group <- rep(c("A", "B"), each = 10)
> head(pData(res))
  group
1     A
2     A
3     A
4     A
5     A
6     A
> tail(pData(res))
   group
15     B
16     B
17     B
18     B
19     B
20     B

You can as many sample variables as you need; for example replicate status:

> res$rep <- rep(1:2, 10)
> head(pData(res))
  group rep
1     A   1
2     A   2
3     A   1
4     A   2
5     A   1
6     A   2

Hope this helps.

ADD COMMENTlink written 4 months ago by Laurent Gatto950

Hello and thank you for trying to help me Laurent.

I'll put this into two comments for clarity.
One thing I wanted to know was how to actually inspect the sample data provided in the pRoloc data set. I'll use hyperLOPIT-SIData-ms3-rep12-intersect.csv since this is what you showed above.

I figured out how to inspect it by doing the following:

> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+                full.names = TRUE,
+                pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")

>f0 # Shows where the file from the package is located on hard drive
"C:/Users/moldach/Documents/R/win-library/3.4/pRolocdata/extdata/hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"

Then you can simply read that location in via read.csv() as a data.frame and inspect what it looks like with head(), tail(), str() etc.

>csv <- read.csv("C:/Users/moldach/Documents/R/win-library/3.4/pRolocdata/extdata/hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz")
>head(csv,n=2)
  UniProt.Accession.for.Protein.Group UniProt.ID.for.Protein.Group
1                                                                 
2                              Q9JHU4                  DYHC1_MOUSE
                                                              UniProt.Protein.Description Quantified.Peptides            X
1                                                                                                Experiment 1 Experiment 2
2 Cytoplasmic dynein 1 heavy chain 1 OS=Mus musculus GN=Dync1h1 PE=1 SV=2 - [DYHC1_MOUSE]                 175          166
  Quantified.PSMs..Rep.1.          X.1 Normalized.TMT.10.plex.reporter.ion.distribution..Replicate.1.   X.2   X.3   X.4   X.5   X.6
1            Experiment 1 Experiment 2                                                        126.000  127N  127C  128N  128C  129N
2                     322          273                                                          0.028 0.034 0.024 0.014 0.026 0.045
    X.7   X.8   X.9    X.10 Normalized.TMT.10.plex.reporter.ion.distribution..Replicate.2.  X.11  X.12  X.13  X.14  X.15  X.16  X.17
1  129C  130N  130C 131.000                                                        126.000  127N  127C  128N  128C  129N  129C  130N
2 0.107 0.341 0.059   0.321                                                          0.037 0.064 0.058 0.059 0.067 0.078 0.140 0.208
   X.18    X.19 Phenotype.discovery.and.SVM.classification.of.protein.localization              X.20                      X.21
1  130C 131.000                                                   phenoDisco Input phenoDisco Output Curated phenoDisco Output
2 0.141   0.147                                                            unknown           unknown                   unknown
            X.22               X.23      X.24                              X.25                          X.26
1 SVM marker set SVM classification SVM score SVM classification (top quartile) Final Localization Assignment
2        unknown         Peroxisome     0.303                      unclassified                  unclassified
                          X.27 Subcellular.Curation                    X.28                 X.29              X.30
1 First localization evidence?   Curated Organelles Cytoskeletal Components Trafficking Proteins Protein Complexes
2                            N                                       Dynein                                 Dynein
                X.31             X.32              X.33             X.34                  X.35
1 Signaling Cascades Oct4 Interactome Nanog Interactome Sox2 Interactome Cell Surface Proteins
2 

What may not be clear to future people visiting this page is that when you did 

res <- readMSnSet2(f0, ecol = 8:27)

8:27 are referring to the columns of NSAF (normalized) spectral counts.

Okay now that's clear to me let's move to the second part of the issue as this example you provided does not work with the msmsTests package (see comment below).

ADD REPLYlink written 4 months ago by moldach10

Just one clarification to avoid confusion to other readers - the protein expression data (in columns 8 to 27) aren't NSAF but TMT 10-plex data.

ADD REPLYlink written 4 months ago by Laurent Gatto950
0
gravatar for moldach
4 months ago by
moldach10
United States
moldach10 wrote:

So the example from the msmsTests vignette for how to do differential expression analysis with edgeR goes like this:

## Example
library(msmsTests)
data(msms.dataset)
e <- pp.msms.data(msms.dataset)
e
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e),2,sum)
res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat")
str(res)
head(res)

If we glance at what it looks like with pData() it looks like this

>head(pData(e))

          treat batch
U2.2502.1  U200  2502
U2.2502.2  U200  2502
U2.2502.3  U200  2502
U2.2502.4  U200  2502
U6.2502.1  U600  2502
U6.2502.2  U600  2502

Compare this to what your example above looks like:

> head(pData(res))
  group rep
1     A   1
2     A   2
3     A   1
4     A   2
5     A   1
6     A   2

By making a few adjustment to the msmsTests() example to match yours, it throws an error:

>null.f <- "y~rep"
>alt.f <- "y~group+rep"
>div <- apply(exprs(res),2,sum)
>res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
>res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in .isAllZero(counts) :
  count matrix must be integer or double-precision

I guess the problem I don't know how to investigate what e looks like nor why their example works but not yours?

Similarly if I load my data like your example and run msms.edgeR() it doesn't work either (but I get a different error):

>head(csv)
  Gene.names     Treat_1     Treat_2      Ctrl_1      Ctrl_2
1       PLEC 0.000000000 0.000000000 0.000061100 0.000065400
2      PRKDC 0.000587599 0.000532021 0.000238412 0.000207283
3      AHNAK 0.000045800 0.000077700 0.000000000 0.000000000
4    DYNC1H1 0.000000000 0.000000000 0.000177959 0.000166366
5      MKI67 0.000055200 0.000112417 0.000152282 0.000141068
6     SPTAN1 0.000218052 0.000222106 0.000211693 0.000200792
>res <- readMSnSet2(csv, ecol = 2:5)
>sampleNames(res)<- 1:4
>res$group <- rep(c("Treatment","Control"),each=2)
>res$rep <- rep(1:2,2)
> head(pData(res))
      group rep
1 Treatment   1
2 Treatment   2
3   Control   1
4   Control   2
> res <- msms.edgeR(res,alt.f,null.f,div=div,fnm="group")
Error in dispCoxReid(y, design = design, offset = offset, subset = subset,  : 
  no data rows with required number of counts

Again, I really appreciate your help!

ADD COMMENTlink modified 3 months ago • written 4 months ago by moldach10

The reason for this is that the data that you try to analyse with msmsTests isn't count data (see my comment above) - at least I think that's the reason thing go wrong. Here's how you can pretend it is; but first, let's reproduce the error:

> library("pRolocdata")
> f0 <- dir(system.file("extdata", package = "pRolocdata"),
+           full.names = TRUE,
+           pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
> res <- readMSnSet2(f0, ecol = 8:27, header = TRUE, skip = 1)
> sampleNames(res) <- 1:20
> res$group <- rep(c("A", "B"), each = 10)
> res$rep <- rep(1:2, 10)
> head(pData(res))
  group rep
1     A   1
2     A   2
3     A   1
4     A   2
5     A   1
6     A   2
> head(exprs(res))
      1     2     3     4     5     6     7     8     9    10    11    12    13
1 0.028 0.034 0.024 0.014 0.026 0.045 0.107 0.341 0.059 0.321 0.037 0.064 0.058
2 0.039 0.134 0.095 0.053 0.084 0.121 0.107 0.128 0.122 0.117 0.033 0.073 0.074
3 0.021 0.013 0.014 0.009 0.024 0.054 0.116 0.257 0.209 0.284 0.026 0.017 0.023
4 0.120 0.255 0.148 0.091 0.135 0.095 0.041 0.057 0.014 0.043 0.111 0.181 0.141
5 0.055 0.139 0.078 0.050 0.077 0.098 0.093 0.171 0.079 0.160 0.062 0.108 0.091
     14    15    16    17    18    19    20
1 0.059 0.067 0.078 0.140 0.208 0.141 0.147
2 0.062 0.081 0.142 0.190 0.069 0.151 0.125
3 0.029 0.039 0.071 0.105 0.171 0.304 0.215
4 0.144 0.152 0.119 0.075 0.028 0.017 0.033
5 0.086 0.099 0.111 0.117 0.095 0.144 0.087
[ reached getOption("max.print") -- omitted 1 row ]

> null.f <- "y~rep"
> alt.f <- "y~group+rep"
> div <- apply(exprs(res), 2, sum)
>  msms.edgeR(res, alt.f, null.f, div = div, fnm = "group")
Error in dispCoxReid(y, design = design, offset = offset, subset = subset,  :
  no data rows with required number of counts

If you look closely, it's not the same error as you have. Somehow, your exprs were characters (at least I believe so) - just make sure that before proceeding, they are numerics

> mode(exprs(res))
[1] "numeric"

If at this stage you get "character", stop and and report back.

Let's now convert the data to counts:

> exprs(res) <- round(exprs(res) * 1000)
> head(exprs(res))
    X126  X127N  X127C  X128N  X128C  X129N  X129C  X130N  X130C   X131 X126.1
1  28000  34000  24000  14000  26000  45000 107000 341000  59000 321000  37000
2  39000 134000  95000  53000  84000 121000 107000 128000 122000 117000  33000
3  21000  13000  14000   9000  24000  54000 116000 257000 209000 284000  26000
4 120000 255000 148000  91000 135000  95000  41000  57000  14000  43000 111000
5  55000 139000  78000  50000  77000  98000  93000 171000  79000 160000  62000
  X127N.1 X127C.1 X128N.1 X128C.1 X129N.1 X129C.1 X130N.1 X130C.1 X131.1
1   64000   58000   59000   67000   78000  140000  208000  141000 147000
2   73000   74000   62000   81000  142000  190000   69000  151000 125000
3   17000   23000   29000   39000   71000  105000  171000  304000 215000
4  181000  141000  144000  152000  119000   75000   28000   17000  33000
5  108000   91000   86000   99000  111000  117000   95000  144000  87000
[ reached getOption("max.print") -- omitted 1 row ]

And repeat the test:

> x <- msms.edgeR(res, alt.f, null.f, div = div, fnm = "group")
> head(x)
        LogFC           LR   p.value
1  0.30452241 0.2343101026 0.6283462
2 -0.08757739 0.0247167596 0.8750749
3  0.07259539 0.0100301864 0.9202246
4 -0.01883603 0.0008671924 0.9765072
5  0.04444376 0.0069856511 0.9333902
6 -0.22945977 0.0473566608 0.8277284

Let me know how it goes.

ADD REPLYlink modified 4 months ago • written 4 months ago by Laurent Gatto950

Thank you very much,

exprs(res) <- round(exprs(res) * 1000)

appears to work although I'm not exactly sure why?

To be clear my data is normalized spectral counts (i.e. NSAF) prior to multiplication by 1000 and rounding. Is that the correct data format to be inputting?

ADD REPLYlink written 3 months ago by moldach10

The rounding is meant to get something that looks like count data, as the original values are scaled between 0 and 1. I would advise to check the distribution of you data and make sure it fits what is expected when running edgeR. If you could get the raw counts instead of NSAF, then you would get what is expected in msmsTests.

ADD REPLYlink written 3 months ago by Laurent Gatto950
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 305 users visited in the last hour