flowCore: using read.FCS with which.lines is not time efficient?
skiaphrene ▴ 10
@skiaphrene-6914
Last seen 9.3 years ago
New Zealand

Dear flowCore team,

I have recently started using flowCore (and other packages) to analyse flow cytometry data.

I have a collection of 8 FCS files of 80-180 MB each, and I can easily load these into R using read.FCS.

However, to practice with the packages at first, I wanted to limit the number of events read. According to the read.FCS help page, the which.lines parameter does this, and I expected it to make reading the files faster. The opposite was true.

 

Using

ff <- read.FCS( my.fcs.file, transformation=FALSE)

takes 4 to 8 seconds per file.

 

However, both

ff <- read.FCS( my.fcs.file, transformation=FALSE, which.lines=1:100000)
ff <- read.FCS( my.fcs.file, transformation=FALSE, which.lines=100000)

were much slower (neither had finished after 2 minutes for the first file).

 

So effectively, reading in the full files and then sub-selecting rows with either

ff <- ff[1:100000,]
ff <- ff[sample.int(nrow(ff),10000),]

is much faster (though, obviously, has higher memory requirements)!

 

Is this normal? Am I missing something? Should I just stick to my workaround?

 

Thanks in advance for your help!

Best regards,

-- Alex


My R session info is:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] flowType_2.4.0  BH_1.54.0-4     Rcpp_0.11.3     flowCore_1.32.1

loaded via a namespace (and not attached):
 [1] Biobase_2.26.0      BiocGenerics_0.12.0 clue_0.3-48         cluster_1.15.3      coda_0.16-1         corpcor_1.6.7      
 [7] DEoptimR_1.0-2      feature_1.2.10      flowClust_3.4.0     flowMeans_1.18.0    flowMerge_2.14.0    flowViz_1.30.0     
[13] graph_1.44.0        grid_3.1.1          hexbin_1.27.0       IDPmisc_1.1.17      KernSmooth_2.23-13  ks_1.9.2           
[19] lattice_0.20-29     latticeExtra_0.6-26 MASS_7.3-35         MCMCpack_1.3-3      misc3d_0.8-4        mvtnorm_1.0-0      
[25] parallel_3.1.1      pcaPP_1.9-60        RColorBrewer_1.0-5  rgl_0.93.1098       Rgraphviz_2.10.0    robustbase_0.91-1  
[31] rrcov_1.3-4         sfsmisc_1.0-26      stats4_3.1.1        tools_3.1.1
flowcore read.FCS flowcytometry • 1.5k views
Jiang, Mike ★ 1.3k
@jiang-mike-4886
Last seen 2.5 years ago
(Private Address)

The data section of an FCS file is stored as a stream of raw bytes, so reading the entire data chunk in one go is more efficient.

`which.lines` is provided mainly for cases where there is not enough memory to read a single FCS file in full (which almost never happens nowadays). As you have experienced, it takes more time because multiple disk I/O operations are involved. There is also extra overhead in R for calculating the location of each data slab and concatenating the slabs afterwards.
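To make the I/O argument concrete, here is a rough base-R illustration. It is only a sketch: it uses a toy binary file of 4-byte floats rather than a real FCS file, it is not flowCore's actual implementation, and `read_all` and `read_rows` are hypothetical names made up for this example. One function reads the whole data block in a single call; the other seeks to each requested event, reads it separately and binds the pieces together afterwards.

```r
# Toy data: 20000 "events" with 8 "parameters", stored event by event
# as 4-byte floats. This only mimics the layout of a list-mode data segment.
n_events <- 20000L
n_params <- 8L

set.seed(1)
values <- round(runif(n_events * n_params) * 1000)  # whole numbers survive the float round trip
path <- tempfile(fileext = ".bin")
writeBin(values, path, size = 4L)

# Strategy 1: read the whole data block in one call.
read_all <- function(path, n_events, n_params) {
  con <- file(path, "rb")
  on.exit(close(con))
  raw_values <- readBin(con, what = "numeric", n = n_events * n_params, size = 4L)
  matrix(raw_values, ncol = n_params, byrow = TRUE)
}

# Strategy 2: compute the offset of each requested event, seek to it,
# read that one event, and concatenate all the slabs at the end.
read_rows <- function(path, rows, n_params) {
  con <- file(path, "rb")
  on.exit(close(con))
  row_bytes <- n_params * 4L
  slabs <- lapply(rows, function(i) {
    seek(con, where = (i - 1) * row_bytes, origin = "start")
    readBin(con, what = "numeric", n = n_params, size = 4L)
  })
  do.call(rbind, slabs)
}

rows <- sort(sample.int(n_events, 2000L))

t_all  <- system.time(full <- read_all(path, n_events, n_params))
t_rows <- system.time(subset_direct <- read_rows(path, rows, n_params))

# Reading everything and subsetting in memory gives the same events.
subset_after <- full[rows, , drop = FALSE]
same_result <- identical(subset_direct, subset_after)

print(rbind(read_all = t_all, read_rows = t_rows)[, "elapsed"])
```

Both strategies return the same events. The timings depend on the machine, the disk and the operating system's file cache, so treat the printed numbers as indicative only.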

Therefore, using `which.lines` is not recommended unless you have to. (I may add this note to the `read.FCS` help page.)
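In practice this amounts to the workaround from the question: read the whole file, then subset in memory. A minimal sketch of that pattern follows. Note that `subsample_events` is a hypothetical helper written for this example and is not part of flowCore. It relies only on `nrow()` and row indexing with `[`, both of which the question already uses on a flowFrame, so a plain matrix stands in for the flowFrame here.

```r
# Hypothetical helper (not a flowCore function): keep at most `n` randomly
# chosen events from an already loaded object, preserving their original order.
subsample_events <- function(x, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # optional, for reproducible subsets
  n_total <- nrow(x)
  keep <- sort(sample.int(n_total, min(n, n_total)))
  x[keep, ]
}

# Stand-in for a loaded flowFrame: 50 events, 3 parameters.
events <- matrix(seq_len(50 * 3), ncol = 3,
                 dimnames = list(NULL, c("FSC", "SSC", "FL1")))

small <- subsample_events(events, 10, seed = 42)
dim(small)  # 10 rows, 3 columns

# With flowCore loaded, the same call would follow a full read, e.g.:
#   ff <- read.FCS(my.fcs.file, transformation = FALSE)
#   ff <- subsample_events(ff, 100000)
```

Asking for more events than the object contains simply returns all of them, so the helper is safe to apply to a mix of small and large files.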

 
