Question: bioconductor packages for importing custom annotation tables into R?
k.askland20 wrote, 4.1 years ago, United States:

Hello,

I would like to know if there is a Bioconductor (or more general R) package that facilitates efficient importing/reading of very large genomic annotation files. I have a custom annotation set of ~10,000 columns for 1.1M BP positions (rows).

The read.table and read.csv options fail because of the table size (though they work if I only import 100,000 rows at a time). I know that Bioconductor has many packages for annotating sequences, etc., but I couldn't find anything that streamlines the process of bringing your own annotations into the workspace for further processing/manipulation.

Thanks for any suggestions,

Kathleen

ANNOT <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, 
    colClasses=classes, skip=1, na.strings="", sep="\t", nrows=1112880)
#Error: cannot allocate vector of size 8.5 Mb
#In addition: Warning messages:
#1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)

system.time(ANNOT100000 <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, 
    colClasses=classes, skip=1, na.strings="", sep="\t", nrows=100000))

   #user  system elapsed 
 #377.74   15.71  393.54

sessionInfo() #I'm using Revolution R, which has R version 3.0.3, but I have the same problem in R 3.1.1

R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Revobase_7.2.0   RevoMods_7.2.0   RevoScaleR_7.2.0 lattice_0.20-27 
[5] rpart_4.1-5     

loaded via a namespace (and not attached):
[1] codetools_0.2-9   foreach_1.4.2     grid_3.0.3        iterators_1.0.7  
[5] pkgXMLBuilder_1.0 revoIpe_1.0       tools_3.0.3       XML_3.98-1.1 
modified 4.1 years ago by Martin Morgan ♦♦ 22k • written 4.1 years ago by k.askland20
Martin Morgan ♦♦ 22k wrote, 4.1 years ago, United States:

Hi Kathleen -- one solution is to use a database (SQLite via the RSQLite package is a good bet for cross-platform portability; it's used in all of our annotation packages). You would query the database for the relevant information rather than reading the entire file in at once. I really don't know whether this is suitable for the size (especially the number of columns) you're talking about. The data.table package (which may require in-memory data, so is likely to be as problematic as other approaches) and dplyr are also very effective at dealing with large data.frame-style data.
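For the data.table route, fread() is usually much faster and more memory-frugal than read.table(), and with 10,000 columns its select= argument can pull in only the columns actually needed. A minimal sketch (the path is from your question; "POS" and "GO0008150" are hypothetical column names standing in for your real ones):

```r
## a sketch, assuming data.table is installed; column names are made up
library(data.table)
ANNOT <- fread("D:/DATA/ANNOT.txt", sep = "\t", na.strings = "",
               select = c("POS", "GO0008150"))
```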

It's hard to imagine what 10,000 annotation columns represent. Perhaps many of them are really more like a matrix (e.g., of SNPs across samples)? One could then use something like rhdf5.
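As a rough sketch of the rhdf5 approach (toy dimensions and names, not your data): create the dataset once, then write and read arbitrary slices without ever holding the full matrix in memory.

```r
## a sketch; install with BiocManager::install("rhdf5")
library(rhdf5)
h5 <- tempfile(fileext = ".h5")
h5createFile(h5)
## dataset sized for the full matrix, written and read in pieces
h5createDataset(h5, "annot", dims = c(100, 26),
                storage.mode = "integer", chunk = c(10, 26))
block <- matrix(0L, 10, 26)
h5write(block, h5, "annot", index = list(1:10, 1:26))
m <- h5read(h5, "annot", index = list(1:5, 1:3))  # a 5 x 3 slice
```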

It might help if you provided a little bit more information on what your data looks like.

From your answer, it sounds like you have an 'adjacency matrix'. Likely it is very sparse (few '1' values). Here's some toy data:

    m <- matrix(as.integer(runif(100 * 26) < .05), 100,
                dimnames=list(paste0("rs", 1:100), LETTERS))
    fl <- tempfile()
    write.csv(m, fl, quote=FALSE)

A strategy is to represent it instead as a matrix of 'from' and 'to' edges, and to populate that matrix by iterating through the large file in chunks (below, chunks of size N = 10; in practice maybe N = 100000). I implemented this as:

    ## start with an empty two-column 'edge list'
    fromto <- matrix(character(), 0, 2,
                     dimnames=list(NULL, c("from", "to")))
    N <- 10

    con <- file(fl)
    open(con)
    ## first line holds the column names; drop the row-name field
    colnames <- strsplit(readLines(con, 1), ",")[[1]][-1]
    repeat {
        ## read the next N rows; at end-of-file read.csv() errors,
        ## so return a 0-row object to end the loop
        val <- tryCatch({
            read.csv(con, header=FALSE, row.names=1, nrows=N)
        }, error=function(err) {
            message("error? ", conditionMessage(err))
            matrix(integer(), 0, 0)
        })
        if (nrow(val) == 0)
            break
        ## record a (rs id, annotation) pair for each '1' in this chunk
        idx <- val == 1
        m <- matrix(c(rownames(val)[row(val)[idx]],
                      colnames[col(val)[idx]]), ncol=2)
        fromto <- rbind(fromto, m)
    }
    close(con)

To work with these, one might use a list of some sort, or perhaps a 'sparse' matrix from the Matrix package or a graph from the graph package.

    split(fromto[,2], fromto[, 1])
    IRanges::splitAsList(fromto[,2], fromto[,1])
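For the Matrix route, the edge list can be turned into a sparse 0/1 matrix directly. A sketch with a tiny hand-made 'fromto' (assumes the Matrix package is installed):

```r
library(Matrix)
fromto <- matrix(c("rs1", "A",
                   "rs1", "C",
                   "rs7", "B"),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("from", "to")))
i <- factor(fromto[, "from"])
j <- factor(fromto[, "to"])
## rows are rs ids, columns are annotation terms; only the 1s are stored
sm <- sparseMatrix(as.integer(i), as.integer(j), x = 1,
                   dimnames = list(levels(i), levels(j)))
```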

Maybe there are some ideas in there that are helpful...

modified 4.0 years ago • written 4.1 years ago by Martin Morgan ♦♦ 22k

Hi Martin, Thanks for your response.

The problem I am having is that I am not using the annotation file to annotate a set of SNPs or genes, as is usually the case. I actually want to bring the entire file in because I want to do some processing on the matrix itself.

I tried taking a look at rhdf5, but I'm a bit confused as to whether .txt files can be converted to this format. Moreover, it seems that, even so, it would require importing the file into R in order to convert it to a new file type, and I can't get the file imported. The other option, obviously, is to break the file up into smaller chunks and import them separately, but I was hoping to avoid that.

written 4.1 years ago by k.askland20

The idea would be to do a one-time transformation of the data from its unwieldy plain-text representation to a more reasonable one; for sql or hdf5 you'd likely do this with 'standard' tools outside R, or iterate through the file in (not break it up into) chunks in R, appending each chunk to an existing sql / hdf5 file. Are these actually SNPs? Because then you might want to look at the snpStats package. Again, it will help to be more explicit about what your 'annotation' data looks like.
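The chunk-and-append idea might look something like this (a sketch only; the path, table name, and chunk size are illustrative, and you'd probably want to add an index on your query column afterwards):

```r
library(DBI); library(RSQLite)

db <- dbConnect(SQLite(), "annot.sqlite")
con <- file("D:/DATA/ANNOT.txt"); open(con)
header <- strsplit(readLines(con, 1), "\t")[[1]]
repeat {
    ## read.table() errors at end-of-file; treat that as 'done'
    chunk <- tryCatch(
        read.table(con, sep = "\t", na.strings = "", nrows = 100000,
                   col.names = header, stringsAsFactors = FALSE),
        error = function(e) data.frame())
    if (nrow(chunk) == 0) break
    dbWriteTable(db, "annot", chunk, append = TRUE)
}
close(con)
dbDisconnect(db)
```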

written 4.1 years ago by Martin Morgan ♦♦ 22k

Thanks again, Martin. Sorry -- it's 1.1M SNPs (rs IDs) with 10,000 gene ontology annotations: the gene in which each SNP resides is or is not annotated to that gene ontology term.

written 4.1 years ago by k.askland20

Hi Kathleen -- I added to my answer with some suggestions; not sure if these are helpful...

written 4.0 years ago by Martin Morgan ♦♦ 22k

Hi Martin,

I apologize for the delay. I was away from the office.

I will see if I can implement the function you provided.

Thanks again,

Kathleen


written 4.0 years ago by k.askland20