Proper way to read in a DataFrame with CharacterList columns that was saved to a text file


What is the proper way to read in a DataFrame from a text file that has CharacterList columns? With the code below, I can see that write.table() writes the text file in such a way that the CharacterList column has c() calls. I'm guessing that there's a simple argument change or a function that then allows you to read this information, but I'm not finding it.

Thank you,



> library('S4Vectors')
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, cbind, colnames, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmin, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library('GenomicRanges')
Loading required package: IRanges
Loading required package: GenomeInfoDb
There were 12 warnings (use warnings() to see them)
> df <- DataFrame(x = 1:5, y = CharacterList(lapply(1:5, function(i) {
+     letters[seq_len(i)]}
+ )))
> write.table(df, file = 'test.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
> system('head test.tsv')
x    y
1    a
2    c("a", "b")
3    c("a", "b", "c")
4    c("a", "b", "c", "d")
5    c("a", "b", "c", "d", "e")
> df2 <- read.table('test.tsv', header = TRUE, sep = '\t', stringsAsFactors = FALSE)
> df2
  x                y
1 1                a
2 2          c(a, b)
3 3       c(a, b, c)
4 4    c(a, b, c, d)
5 5 c(a, b, c, d, e)
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                 
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0                  
 ui       AQUA                                  
 language (EN)                                  
 collate  en_US.UTF-8                           
 tz       America/New_York                      
 date     2016-06-16                            

Packages ---------------------------------------------------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.19.1  2016-06-11 Bioconductor  
 devtools        1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest          0.6.9   2016-01-08 CRAN (R 3.3.0)
 GenomeInfoDb  * 1.9.1   2016-05-13 Bioconductor  
 GenomicRanges * 1.25.4  2016-06-10 Bioconductor  
 IRanges       * 2.7.6   2016-06-10 Bioconductor  
 memoise         1.0.0   2016-01-29 CRAN (R 3.3.0)
 S4Vectors     * 0.11.4  2016-06-11 Bioconductor  
 withr           1.0.1   2016-02-04 CRAN (R 3.3.0)
 XVector         0.13.0  2016-05-05 Bioconductor  
 zlibbioc        1.19.0  2016-05-05 Bioconductor  

## Simply wrapping the result in DataFrame() does not restore the CharacterList column

> DataFrame(df2)
DataFrame with 5 rows and 2 columns
          x                y
  <integer>      <character>
1         1                a
2         2          c(a, b)
3         3       c(a, b, c)
4         4    c(a, b, c, d)
5         5 c(a, b, c, d, e)

Calling write.table() implies as.data.frame(), which coerces the CharacterList to a list. write.table() does not actually handle list columns (what should it do?), but as it turns out, the coercion from DataFrame to data.frame classes the list columns as "AsIs", which coincidentally ends up coercing the list to a character vector at write time. There's no obvious way to coerce a list to a character vector, and the current implementation just uses dput().
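
To see where the c() text comes from without going through write.table(), here is a minimal base R sketch. It assumes the list column ends up being passed through as.character() at write time; `cells` and `deparsed` are hypothetical names, and the plain list only stands in for the coerced CharacterList.

```r
# A plain list standing in for the coerced CharacterList column.
cells <- list("a", c("a", "b"), c("a", "b", "c"))

# as.character() on a list keeps length-one strings as they are and
# deparses longer elements, which yields the c("a", "b") style text.
deparsed <- as.character(cells)
print(deparsed)
```

The deparsed strings are ordinary text as far as read.table() is concerned, which is why the column comes back as plain character.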

I would generally avoid writing list columns (is expand() an option?), but if you have to, list columns are typically encoded as comma-separated cells in tabular text. You could of course use strsplit() and unstrsplit() to move back and forth. It might be a good idea for read.table() to support compound cells. I think data.table::fread() already does. But it's definitely pushing the limits of tabular text.
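
A minimal sketch of the comma-separated round trip, using only base R so it runs without Bioconductor. The names `flat`, `flat2`, `y2`, and `tsv` are hypothetical, and paste() stands in here for unstrsplit(). It assumes no element contains a comma or a tab and that no element is empty.

```r
# Toy data mirroring the question; the list column is a plain list here.
x <- 1:5
y <- lapply(1:5, function(i) letters[seq_len(i)])

# Encode: collapse each list element into one comma-separated cell.
flat <- data.frame(
    x = x,
    y = vapply(y, paste, character(1), collapse = ","),
    stringsAsFactors = FALSE
)
tsv <- tempfile(fileext = ".tsv")
write.table(flat, file = tsv, sep = "\t", row.names = FALSE, quote = FALSE)

# Decode: read the cells back as strings, then split each one on the comma.
flat2 <- read.table(tsv, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
y2 <- strsplit(flat2$y, ",", fixed = TRUE)

# TRUE when the list column survived the round trip unchanged.
print(identical(y2, y))
```

With IRanges loaded, wrapping the result as CharacterList(y2) should give back the original column type.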




Thanks for the info Michael. If I need to read these files, I'll use `strsplit()`.


