Question

Extracting first member of each element of a matrix of lists

Sean Davis 21k

@sean-davis-490

United States

I have a matrix of lists in a geno() field of a VCF:

> class(geno(vcf2)$NR)
[1] "matrix"
> mode(geno(vcf2)$NR)
[1] "list"

I am trying to find a quick way of manipulating this matrix of lists into a numeric matrix of the same dimensions. In almost every case, the list is of length 1, so I would simply like to extract the first element of each matrix cell list and drop all others. Any suggestions for how best to do this? The matrix is rather large, >5M rows.

variantannotation vcf • 1.2k views

updated 9.0 years ago by Michael Lawrence ★ 11k • written 9.0 years ago by Sean Davis 21k

Stephanie M. Gogarten ▴ 870

@stephanie-m-gogarten-5121

University of Washington

VariantAnnotation has an internal function .matrixOfListsToArray() that gets used in genotypeToSnpMatrix(). But Martin's solution is probably faster.

.matrixOfListsToArray <- function(x) {
    # find number of elements of each cell of x
    # (elementLengths() was later renamed elementNROWS() in S4Vectors)
    n <- elementNROWS(x)
    maxn <- max(n)
 
    # for cells with less than the max number of elements, add NAs
    idx <- n < maxn
    x[idx] <- lapply(x[idx], function(a){c(a, rep(NA, maxn-length(a)))})
 
    # unlist and convert to array
    x <- array(unlist(x), dim=c(maxn, nrow(x), ncol(x)),
               dimnames=list(NULL, rownames(x), colnames(x)))
    x <- aperm(x, c(2,3,1))

    x
}
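
To get only the first element of each cell from this padding approach, take the first slice along the third dimension of the result. The following is a minimal sketch on a toy matrix, not the package's own code: `pad_to_array` is a hypothetical name, base R's `lengths()` is used to count cell sizes, and the toy data assumes each cell is a list of scalars.

```r
# Hypothetical re-implementation for illustration; not the package's code.
pad_to_array <- function(x) {
    n <- lengths(x)                 # number of elements in each cell
    maxn <- max(n)
    idx <- which(n < maxn)          # cells that need NA padding
    x[idx] <- lapply(x[idx], function(a) c(a, rep(NA, maxn - length(a))))
    arr <- array(unlist(x, use.names = FALSE),
                 dim = c(maxn, nrow(x), ncol(x)),
                 dimnames = list(NULL, rownames(x), colnames(x)))
    aperm(arr, c(2, 3, 1))          # rows x columns x element index
}

# Toy stand-in for geno(vcf2)$NR (assumed structure)
m <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)
first <- pad_to_array(m)[, , 1]     # first element of every cell
first
```

On this toy input `first` is a 2 x 2 numeric matrix. Note that the padding step builds a full array of size `max(n)` times the original, which may matter for a matrix with more than 5M rows.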


score 3 · Accepted Answer · 2015-05-07

IRanges already does this. But first you need to tell it more about your data. If the list matrix is numeric, then coerce to NumericList:

library(IRanges)
l <- NumericList(m) # 'm' is from Martin's answer

Then ask for the first element of each list element:

l1 <- phead(l, 1)

Finally, convert the result back to a matrix:

m1 <- unlist(l1)
dim(m1) <- dim(m)

score 2 · Accepted Answer · 2015-04-29

I think this is the data structure:

> l = list(list(1), list(2), list(31, 32), list(4))
> m = matrix(l, 2)
> m
     [,1]   [,2]  
[1,] List,1 List,2
[2,] List,1 List,1
> class(m)
[1] "matrix"
> mode(m)
[1] "list"

You can efficiently query the length of each element

> lengths(m)
[1] 1 1 2 1

and then turn these into the index of each cell's first value in the unlisted m

> cumsum(c(1L, lengths(m)[-length(m)]))
[1] 1 2 3 5

Then actually do the unlisting (avoiding the cost of creating / copying names, if any) and subsetting

> unlist(m, use.names=FALSE)[cumsum(c(1L, lengths(m)[-length(m)]))]
[1]  1  2 31  4

This could be re-shaped into a matrix by adding the original dimensions

> n = unlist(m, use.names=FALSE)[cumsum(c(1L, lengths(m)[-length(m)]))]
> dim(n) = dim(m); dimnames(n) = dimnames(m); n
     [,1] [,2]
[1,]    1   31
[2,]    2    4
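
The steps above can be collected into one small function. This is a sketch only: `first_of_each` is a hypothetical name, and it assumes every cell is non-empty and holds scalars of a single type (an empty cell would shift the index onto the next cell's first value).

```r
# Hypothetical helper: first element of every cell of a list-matrix.
# Assumes each cell is non-empty and holds only scalars of one type.
first_of_each <- function(m) {
    len <- as.vector(lengths(m))                 # elements per cell
    starts <- cumsum(c(1L, len[-length(len)]))   # position of each cell's first value
    out <- unlist(m, use.names = FALSE)[starts]
    dim(out) <- dim(m)                           # restore the matrix shape
    dimnames(out) <- dimnames(m)
    out
}

# Toy input with made-up row and column names
m <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2,
            dimnames = list(c("r1", "r2"), c("s1", "s2")))
n <- first_of_each(m)
n
```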

An alternative, I guess, would be to subset with vapply() (assuming the content of NR is numeric())

n = vapply(m, `[[`, numeric(1), 1L)

I don't honestly know which is faster; maybe a microbenchmark() on real data would give some future guidance? Also there might be some built-in cleverness for the cumsum() step, since it seems like it would be common in S4Vectors land.
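
A rough way to compare the two approaches is sketched below on simulated data. The data are made up (mostly length-1 cells with about 1% length-2 cells), `by_cumsum` and `by_vapply` are hypothetical names, and base `system.time()` stands in for microbenchmark() to keep the example dependency-free. Timings on a real geno() matrix may well differ.

```r
set.seed(1)
n_cells <- 2e5
two <- runif(n_cells) < 0.01          # roughly 1% of cells get a second value
cells <- lapply(seq_len(n_cells), function(i) {
    if (two[i]) list(as.numeric(i), i + 0.5) else list(as.numeric(i))
})
m <- matrix(cells, ncol = 4)          # simulated stand-in for geno(vcf2)$NR

by_cumsum <- function(m) {
    len <- as.vector(lengths(m))
    out <- unlist(m, use.names = FALSE)[cumsum(c(1L, len[-length(len)]))]
    dim(out) <- dim(m)
    out
}
by_vapply <- function(m) {
    out <- vapply(m, `[[`, numeric(1), 1L)
    dim(out) <- dim(m)
    out
}

stopifnot(identical(by_cumsum(m), by_vapply(m)))  # both give the same matrix
system.time(by_cumsum(m))
system.time(by_vapply(m))
```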

Hmm, it seems like there are at least two other structures that are consistent with your description, and these pose different challenges

l = list(list(1), list(2), list(c(31, 32)), list(4))
l = list(1, 2, c(31, 32), 4)

Maybe a bit of clarification on the actual data structure?
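
The three candidate structures can be told apart by inspecting one multi-valued cell with `str()` and by comparing `lengths()`. The sketch below builds all three as toy matrices and adds a hypothetical extractor, `first_flat`, that flattens each cell before taking its first value, so it does not depend on which layout is present. It assumes every cell is non-empty and numeric.

```r
a <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)     # cells: lists of scalars
b <- matrix(list(list(1), list(2), list(c(31, 32)), list(4)), 2)  # cells: length-1 lists holding vectors
d <- matrix(list(1, 2, c(31, 32), 4), 2)                          # cells: plain numeric vectors

# Inspect the multi-valued cell and the per-cell lengths of each layout.
str(a[[3]]); str(b[[3]]); str(d[[3]])
lengths(a); lengths(b); lengths(d)

# Hypothetical extractor: flatten each cell, then keep its first value.
# Assumes non-empty numeric cells.
first_flat <- function(m) {
    out <- vapply(m, function(cell) unlist(cell, use.names = FALSE)[1L], numeric(1))
    dim(out) <- dim(m)
    out
}
first_flat(b)
```

The distinction matters because the cumsum() indexing relies on `lengths()` matching the number of unlisted values per cell, which does not hold for the second layout, where every cell has length 1 but one of them unlists to two values.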