Question: Proper way to read in a DataFrame with CharacterList columns that was saved to a text file
Leonardo Collado Torres (United States) wrote, 2.2 years ago:

Hi,

What is the proper way to read in a DataFrame from a text file that has CharacterList columns? With the code below, I can see that write.table() writes the text file in such a way that the CharacterList column has c() calls. I'm guessing that there's a simple argument change or a function that then allows you to read this information, but I'm not finding it.

Thank you,

Leonardo

 

> library('S4Vectors')
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply,
    parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library('GenomicRanges')
Loading required package: IRanges
Loading required package: GenomeInfoDb
There were 12 warnings (use warnings() to see them)
> df <- DataFrame(x = 1:5, y = CharacterList(lapply(1:5, function(i) {
+     letters[seq_len(i)]}
+ )))
> 
> write.table(df, file = 'test.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
> system('head test.tsv')
x    y
1    a
2    c("a", "b")
3    c("a", "b", "c")
4    c("a", "b", "c", "d")
5    c("a", "b", "c", "d", "e")
> 
> df2 <- read.table('test.tsv', header = TRUE, sep = '\t', stringsAsFactors = FALSE)
> df2
  x                y
1 1                a
2 2          c(a, b)
3 3       c(a, b, c)
4 4    c(a, b, c, d)
5 5 c(a, b, c, d, e)
> 
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                 
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0                  
 ui       AQUA                                  
 language (EN)                                  
 collate  en_US.UTF-8                           
 tz       America/New_York                      
 date     2016-06-16                            

Packages ---------------------------------------------------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.19.1  2016-06-11 Bioconductor  
 devtools        1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest          0.6.9   2016-01-08 CRAN (R 3.3.0)
 GenomeInfoDb  * 1.9.1   2016-05-13 Bioconductor  
 GenomicRanges * 1.25.4  2016-06-10 Bioconductor  
 IRanges       * 2.7.6   2016-06-10 Bioconductor  
 memoise         1.0.0   2016-01-29 CRAN (R 3.3.0)
 S4Vectors     * 0.11.4  2016-06-11 Bioconductor  
 withr           1.0.1   2016-02-04 CRAN (R 3.3.0)
 XVector         0.13.0  2016-05-05 Bioconductor  
 zlibbioc        1.19.0  2016-05-05 Bioconductor  


## Doesn't work to simply use DataFrame

> DataFrame(df2)
DataFrame with 5 rows and 2 columns
          x                y
  <integer>      <character>
1         1                a
2         2          c(a, b)
3         3       c(a, b, c)
4         4    c(a, b, c, d)
5         5 c(a, b, c, d, e)

Michael Lawrence (United States) wrote, 2.2 years ago:

Calling write.table() implies as.data.frame(), which coerces the CharacterList to a list. write.table() does not actually handle list columns (what should it do?), but as it turns out, the coercion from DataFrame to data.frame classes the list columns as "AsIs", which coincidentally ends up coercing the list to a character vector at write time. There's no obvious way to coerce a list to a character vector, and the current implementation just deparses each element, much as dput() would, which is where the c() calls in the file come from.
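
To make that concrete, here is a minimal base-R sketch of the mechanism described above. It does not use S4Vectors at all: the I(list(...)) column is an assumed stand-in for what as.data.frame() produces from the CharacterList, and the temporary file is just a scratch location.

```r
# Assumption: an "AsIs" list column stands in for the coerced CharacterList.
col <- I(list("a", c("a", "b"), c("a", "b", "c")))

# Coercing the list to character deparses every element longer than one.
encoded <- as.character(col)
print(encoded)

# write.table() hits the same coercion for classed (non-factor) columns.
df_plain <- data.frame(x = 1:3, y = col)
tsv <- tempfile(fileext = ".tsv")  # hypothetical scratch file
write.table(df_plain, file = tsv, sep = "\t", row.names = FALSE, quote = FALSE)
written <- readLines(tsv)
print(written)
```

The deparsed strings are only text once they are in the file, so read.table() has no way of knowing they used to be vectors.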

I would generally avoid writing list columns (is expand() an option?), but if you have to, list columns are typically encoded as comma-separated cells in tabular text. You could of course use strsplit() and unstrsplit() to move back and forth. It might be a good idea for read.table() to support compound cells. I think data.table::fread() already does. But it's definitely pushing the limits of tabular text.
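
A rough sketch of that comma-separated round trip, in base R only so that it runs without Bioconductor. It assumes no element contains a comma or a tab; the names flat, flat2, y2 and the temporary file are made up for illustration.

```r
# The same ragged data as in the question, as a plain list.
y <- lapply(1:5, function(i) letters[seq_len(i)])

# Encode: collapse each list element into one comma-separated string.
# With S4Vectors loaded, unstrsplit(y, sep = ",") is meant for this step.
flat <- data.frame(
    x = 1:5,
    y = vapply(y, paste, character(1), collapse = ","),
    stringsAsFactors = FALSE
)

tsv <- tempfile(fileext = ".tsv")  # hypothetical scratch file
write.table(flat, file = tsv, sep = "\t", row.names = FALSE, quote = FALSE)

# Decode: read the cells back as character and split each one on the comma.
flat2 <- read.table(tsv, header = TRUE, sep = "\t",
                    colClasses = c("integer", "character"))
y2 <- strsplit(flat2$y, ",", fixed = TRUE)

# The list survives the trip through the text file.
stopifnot(identical(y2, y))

# With the packages from the question loaded, the DataFrame could then be
# rebuilt along the lines of:
#   DataFrame(x = flat2$x, y = CharacterList(y2))
```

Doing the collapse explicitly before write.table() keeps the file format under your control instead of relying on the incidental deparsing.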

 

 


Thanks for the info Michael. If I need to read these files, I'll use `strsplit()`.

Best,
Leonardo

Reply written 2.2 years ago by Leonardo Collado Torres