Question: Extracting Data from MSn experiment data ("OnDiskMSnExp")
jamesrgraham wrote, 3 months ago:

Hello All,

I am doing some targeted metabolomics and was wondering whether there is a quicker way to extract data from an MSn experiment ("OnDiskMSnExp") object.

I read in a number of files:

raw_data <- readMSData(files = FILES, pdata = new("NAnnotatedDataFrame", pd),
               mode = "onDisk", centroided = FALSE, msLevel = 1)

And perform peakPicking:

comp_sg_cent_mz <- raw_data %>%
  smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
  pickPeaks(refineMz = "descendPeak") %>%
  filterRt(initial_rtr) %>%
  filterMz(initial_mzr)

I then write out the data:

write.table(comp_sg_cent_mz, file = main_peak_file_name, row.names = FALSE, append = TRUE, col.names = TRUE, sep = "\t")

And get something like this:

"file"  "rt"    "mz"    "i"
1       404.2169952     391.283958025663        14271.6536796537
1       404.7310068     391.283864868878        14570.7012987013
1       405.245991      391.2839380729  13788.5194805195
1       405.760002      391.28338580945 10999.5714285714

Which I then process further.

The issue is that the write.table function takes at least one minute to write out (which is problematic with many files and compounds). Is there a faster way to access this data?

Thanks for any and all advice! james

Tags: msnbase
Answer: Extracting Data from MSn experiment data ("OnDiskMSnExp")
Johannes Rainer (Italy) wrote, 3 months ago:

Hi James,

The reason the write.table call takes so long is that all processing steps are applied to the data at that point. In on-disk mode, all data manipulation operations are cached and only applied when you actually access the data, which in your case is when you call write.table (which in turn, I guess, calls as.data.frame). This means that when you call e.g. smooth on your data, the smooth function is only added to a lazy processing queue and is not applied to the data (because the data is not kept in memory, it cannot be changed/modified in place). Each time you access intensity or m/z values, the data is re-imported from the original (mzML) files and the smooth function is applied before the values are returned.

To speed this up, you have two options:

1) Call filterRt and filterMz before you call smooth and pickPeaks. That way the processing is only applied to the subset you are actually interested in. With your current code you perform the smoothing and peak picking on the full data set for each compound.
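A minimal sketch of this reordering, assuming the objects from the question (raw_data, initial_rtr, initial_mzr) are already defined:

```r
library(MSnbase)
library(magrittr)

## Subset first, then smooth and centroid only the (much smaller)
## region of interest. Operations are still lazily queued, but the
## expensive steps now apply only to the subset.
comp_sg_cent_mz <- raw_data %>%
  filterRt(initial_rtr) %>%
  filterMz(initial_mzr) %>%
  smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
  pickPeaks(refineMz = "descendPeak")
```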

2) Call smooth and pickPeaks once on the full data set and export the result as mzML files (with writeMSData). Then re-read these files and call filterRt and filterMz on the already processed data.
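A sketch of this second option; the "processed" output directory is a placeholder:

```r
library(MSnbase)

## Smooth and centroid the full data once.
proc <- pickPeaks(smooth(raw_data, method = "SavitzkyGolay",
                         halfWindowSize = 4L),
                  refineMz = "descendPeak")

## Write the processed spectra back out as mzML, one file per
## original input file.
dir.create("processed", showWarnings = FALSE)
out_files <- file.path("processed", basename(fileNames(proc)))
writeMSData(proc, file = out_files, copy = TRUE)

## Re-read the centroided data; subsequent filterRt/filterMz calls
## are then cheap, because no processing queue has to be replayed.
proc_data <- readMSData(out_files, mode = "onDisk", msLevel = 1)
```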

Hope this helps.

cheers, jo


Thank you, jo! I will try this out.

— jamesrgraham
Answer: Extracting Data from MSn experiment data ("OnDiskMSnExp")
Laurent Gatto (Belgium) wrote, 3 months ago:

You can use the respective accessors to extract this information from the object. Below, I use the serine object created in the MSnbase centroiding vignette:

> str(head(intensity(serine)))
List of 6
 $ F1.S628: num [1:10] 0 48 0 48 48 48 0 144 48 0
 $ F1.S629: num [1:12] 0 43 43 87 0 0 173 43 0 0 ...
 $ F1.S630: num [1:9] 0 84 84 42 84 0 42 42 0
 $ F1.S631: num [1:9] 0 90 134 90 45 90 45 45 0
 $ F1.S632: num [1:7] 0 42 42 42 42 42 0
 $ F1.S633: num [1:8] 0 37 0 111 74 37 148 0
> str(rtime(serine))
 Named num [1:43] 175 175 176 176 176 ...
 - attr(*, "names")= chr [1:43] "F1.S628" "F1.S629" "F1.S630" "F1.S631" ...
> str(head(mz(serine)))
List of 6
 $ F1.S628: num [1:10] 106 106 106 106 106 ...
 $ F1.S629: num [1:12] 106 106 106 106 106 ...
 $ F1.S630: num [1:9] 106 106 106 106 106 ...
 $ F1.S631: num [1:9] 106 106 106 106 106 ...
 $ F1.S632: num [1:7] 106 106 106 106 106 ...
 $ F1.S633: num [1:8] 106 106 106 106 106 ...
> head(fromFile(serine))
F1.S628 F1.S629 F1.S630 F1.S631 F1.S632 F1.S633 
      1       1       1       1       1       1 

And if you need a data.frame, you can coerce your object with

> head(as(serine, "data.frame"))
  file      rt       mz  i
1    1 175.212 106.0407  0
2    1 175.212 106.0422 48
3    1 175.212 106.0437  0
4    1 175.212 106.0451 48
5    1 175.212 106.0466 48
6    1 175.212 106.0480 48

In addition, you can read multiple files at once with readMSData; processing will then be performed in parallel, on a file-by-file basis, using BiocParallel.
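A sketch of setting up such a parallel backend before reading; the worker count of 4 is arbitrary:

```r
library(BiocParallel)
library(MSnbase)

## Register a parallel backend; MSnbase uses the registered backend
## for its per-file operations. On Windows, use SnowParam() instead
## of MulticoreParam().
register(MulticoreParam(workers = 4))

## FILES is assumed to be a character vector of mzML file paths,
## as in the original question.
raw_data <- readMSData(files = FILES, mode = "onDisk", msLevel = 1)
```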


Thank you, Laurent! I will do a bunch of testing.

— jamesrgraham