Proper way to read in a DataFrame with CharacterList columns that was saved to a text file
@lcolladotor

Hi,

What is the proper way to read in a DataFrame from a text file that has CharacterList columns? With the code below, I can see that write.table() writes the text file in such a way that the CharacterList column has c() calls. I'm guessing that there's a simple argument change or a function that then allows you to read this information, but I'm not finding it.

Thank you,

Leonardo

 

> library('S4Vectors')
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply,
    parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library('GenomicRanges')
Loading required package: IRanges
Loading required package: GenomeInfoDb
There were 12 warnings (use warnings() to see them)
> df <- DataFrame(x = 1:5, y = CharacterList(lapply(1:5, function(i) {
+     letters[seq_len(i)]}
+ )))
> 
> write.table(df, file = 'test.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
> system('head test.tsv')
x    y
1    a
2    c("a", "b")
3    c("a", "b", "c")
4    c("a", "b", "c", "d")
5    c("a", "b", "c", "d", "e")
> 
> df2 <- read.table('test.tsv', header = TRUE, sep = '\t', stringsAsFactors = FALSE)
> df2
  x                y
1 1                a
2 2          c(a, b)
3 3       c(a, b, c)
4 4    c(a, b, c, d)
5 5 c(a, b, c, d, e)
> 
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                 
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0                  
 ui       AQUA                                  
 language (EN)                                  
 collate  en_US.UTF-8                           
 tz       America/New_York                      
 date     2016-06-16                            

Packages ---------------------------------------------------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.19.1  2016-06-11 Bioconductor  
 devtools        1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest          0.6.9   2016-01-08 CRAN (R 3.3.0)
 GenomeInfoDb  * 1.9.1   2016-05-13 Bioconductor  
 GenomicRanges * 1.25.4  2016-06-10 Bioconductor  
 IRanges       * 2.7.6   2016-06-10 Bioconductor  
 memoise         1.0.0   2016-01-29 CRAN (R 3.3.0)
 S4Vectors     * 0.11.4  2016-06-11 Bioconductor  
 withr           1.0.1   2016-02-04 CRAN (R 3.3.0)
 XVector         0.13.0  2016-05-05 Bioconductor  
 zlibbioc        1.19.0  2016-05-05 Bioconductor  


## Doesn't work to simply use DataFrame

> DataFrame(df2)
DataFrame with 5 rows and 2 columns
          x                y
  <integer>      <character>
1         1                a
2         2          c(a, b)
3         3       c(a, b, c)
4         4    c(a, b, c, d)
5         5 c(a, b, c, d, e)
@michael-lawrence-3846

Calling write.table() implies as.data.frame(), which coerces the CharacterList to a list. write.table() does not actually handle list columns (what should it do?), but as it turns out, the coercion from DataFrame to data.frame classes the list columns as "AsIs", which coincidentally ends up coercing the list to a character vector at write time. There's no obvious way to coerce a list to a character vector, and the current implementation just uses dput(), which is where the c() calls in your file come from.
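To make the coercion concrete, here is a minimal base R sketch. It uses a plain list as a stand-in for the list column (the name `cells` is made up for illustration) and only shows what coercing a list to character does; it is not a trace of the exact code path inside write.table().

```r
# Hypothetical stand-in for the list column produced by as.data.frame()
cells <- list("a", c("a", "b"), c("a", "b", "c"))

# Coercing a list to character deparses every multi-element entry,
# so the result contains literal c(...) calls as text
flat <- as.character(cells)
print(flat)
```

Single-element entries pass through unchanged, which matches the first row of the file above.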

I would generally avoid writing list columns (is expand() an option?), but if you have to, list columns are typically encoded as comma-separated cells in tabular text. You could of course use unstrsplit() before writing and strsplit() after reading to move back and forth. It might be a good idea for read.table() to support compound cells; I think data.table::fread() already does. But it's definitely pushing the limits of tabular text.
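A rough sketch of that round trip, reusing the example from the question. It assumes no element of the CharacterList contains a comma or a tab, and the temporary file and the object names (`out`, `tsv`, `restored`) are only for illustration.

```r
library(S4Vectors)
library(IRanges)  # provides CharacterList

df <- DataFrame(x = 1:5, y = CharacterList(lapply(1:5, function(i) {
    letters[seq_len(i)]
})))

# Collapse each list element into one comma-separated string before writing
out <- data.frame(x = df$x, y = unstrsplit(df$y, sep = ","),
                  stringsAsFactors = FALSE)
tsv <- tempfile(fileext = ".tsv")
write.table(out, file = tsv, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the flat strings back and split them into a CharacterList again
df2 <- read.table(tsv, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
restored <- DataFrame(x = df2$x,
                      y = CharacterList(strsplit(df2$y, ",", fixed = TRUE)))
restored
```

If the values themselves can contain commas, a different separator (for example ";") would be needed on both sides.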

 

 


Thanks for the info Michael. If I need to read these files, I'll use `strsplit()`.

Best,
Leonardo
