Question: Extracting first member of each element of a matrix of lists
Sean Davis (United States) wrote, 4.6 years ago:

I have a matrix of lists in a geno() field of a VCF:

> class(geno(vcf2)$NR)
[1] "matrix"
> mode(geno(vcf2)$NR)
[1] "list"

I am trying to find a quick way of manipulating this matrix of lists into a numeric matrix of the same dimensions.  In almost every case, the list is of length 1, so I would simply like to extract the first element of each matrix cell list and drop all others.  Any suggestions for how best to do this?  The matrix is rather large, >5M rows.

Answer: Extracting first member of each element of a matrix of lists
Michael Lawrence (United States) wrote, 4.5 years ago:

IRanges already does this. But first you need to tell it more about your data. If the list matrix is numeric, then coerce to NumericList:

library(IRanges)
l <- NumericList(m) # 'm' is from Martin's answer

Then ask for the first element of each list element:

l1 <- phead(l, 1)


Finally, convert the result back to a matrix:

m1 <- unlist(l1)
dim(m1) <- dim(m)


Answer: Extracting first member of each element of a matrix of lists
Martin Morgan (United States) wrote, 4.6 years ago:

I think this is the data structure:

> l = list(list(1), list(2), list(31, 32), list(4))
> m = matrix(l, 2)
> m
     [,1]   [,2]
[1,] List,1 List,2
[2,] List,1 List,1
> class(m)
[1] "matrix"
> mode(m)
[1] "list"

You can efficiently query the length of each element

> lengths(m)
[1] 1 1 2 1

and then turn these lengths into the position of each cell's first value within the unlisted m

> cumsum(c(1L, lengths(m)[-length(m)]))

Then actually do the unlisting (avoiding the cost of creating / copying names, if any) and subsetting

> unlist(m, use.names=FALSE)[cumsum(c(1L, lengths(m)[-length(m)]))]
[1]  1  2 31  4

This could be re-shaped into a matrix by adding the original dimensions

> n = unlist(m, use.names=FALSE)[cumsum(c(1, lengths(m)[-length(m)]))]
> dim(n) = dim(m); dimnames(n) = dimnames(m); n
     [,1] [,2]
[1,]    1   31
[2,]    2    4
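
The steps above can be collected into one function. This is only a sketch: first_of_each() is a made-up name, and it assumes every cell has at least one element and that unlist() yields exactly lengths(m) values per cell.

```r
# Hypothetical helper (not part of any package): first element of every
# cell of a list-matrix, returned as an atomic matrix of the same shape.
first_of_each <- function(m) {
    # elements per cell; as.vector() drops dims if lengths() keeps them
    len <- as.vector(lengths(m))
    # the offset trick breaks if any cell is empty
    stopifnot(all(len > 0L))
    # position of each cell's first value in the unlisted data
    starts <- cumsum(c(1L, len[-length(len)]))
    n <- unlist(m, use.names = FALSE)[starts]
    dim(n) <- dim(m)
    dimnames(n) <- dimnames(m)
    n
}

m <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)
first_of_each(m)
```

On the toy matrix the start positions are 1, 2, 3, 5, which select 1, 2, 31 and 4 from the unlisted values 1, 2, 31, 32, 4.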

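An alternative, I guess, would be to use vapply() with the [[ extractor on each cell (assuming that the content of NR is numeric()). The result is a plain vector, so the dimensions would need to be re-added as above.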

n = vapply(m, `[[`, numeric(1), 1L)

I don't honestly know which is faster; maybe a microbenchmark() on real data would give some future guidance? Also there might be some built-in cleverness for the cumsum() step, since it seems like it would be common in S4Vectors land.
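
A rough way to compare the two approaches, using base R's system.time() rather than microbenchmark() so nothing extra needs installing. The data are synthetic (sizes and values are made up, and by_offsets / by_vapply are illustrative names), so the timings say nothing definite about a real geno() matrix.

```r
# Synthetic stand-in for the real data: cells are lists of 1 or 2 numbers.
set.seed(1)
cells <- lapply(seq_len(2e5), function(i) as.list(runif(sample(1:2, 1))))
m <- matrix(cells, ncol = 4)

# Approach 1: unlist once, then pick each cell's first value by offset.
by_offsets <- function(m) {
    len <- as.vector(lengths(m))
    unlist(m, use.names = FALSE)[cumsum(c(1L, len[-length(len)]))]
}

# Approach 2: extract the first element cell by cell.
by_vapply <- function(m) vapply(m, `[[`, numeric(1), 1L)

system.time(a <- by_offsets(m))
system.time(b <- by_vapply(m))

# Both should give the same values (compared as plain vectors).
all.equal(as.vector(a), as.vector(b))
```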

Hmm, it seems like there are at least two other structures that are consistent with your description, and these pose different challenges

l = list(list(1), list(2), list(c(31, 32)), list(4))
l = list(1, 2, c(31, 32), 4)

Maybe a bit of clarification on the actual data structure?
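
To make the difference concrete, here is a small sketch that checks whether the offset approach is safe for a given structure. offsets_ok() and first_robust() are hypothetical names; the three matrices are just the toy structures mentioned above.

```r
# The three candidate structures, each as a 2 x 2 list-matrix.
a  <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)     # lists of scalars
b  <- matrix(list(list(1), list(2), list(c(31, 32)), list(4)), 2)  # lists holding one vector
c3 <- matrix(list(1, 2, c(31, 32), 4), 2)                          # plain numeric vectors

# The offset trick needs lengths() to match what unlist() yields per cell.
offsets_ok <- function(m) {
    per_cell <- vapply(m, function(cell) length(unlist(cell)), integer(1))
    identical(as.vector(lengths(m)), as.vector(per_cell))
}

# Slower fallback that works for all three: flatten each cell, take value 1.
first_robust <- function(m) {
    n <- vapply(m, function(cell) unlist(cell, use.names = FALSE)[1], numeric(1))
    dim(n) <- dim(m)
    n
}

offsets_ok(a); offsets_ok(b); offsets_ok(c3)
first_robust(b)
```

For the second structure lengths() reports 1 for the cell that actually holds two values, so offsets computed from lengths() would pick the wrong elements after that cell.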

Answer: Extracting first member of each element of a matrix of lists
1
4.5 years ago by
University of Washington
Stephanie M. Gogarten740 wrote:

VariantAnnotation has an internal function .matrixOfListsToArray() that gets used in genotypeToSnpMatrix(). But Martin's solution is probably faster.

.matrixOfListsToArray <- function(x) {
    # find number of elements of each cell of x
    n <- elementLengths(x)
    maxn <- max(n)

    # for cells with less than the max number of elements, add NAs
    idx <- n < maxn
    x[idx] <- lapply(x[idx], function(a){c(a, rep(NA, maxn-length(a)))})

    # unlist and convert to array
    x <- array(unlist(x), dim=c(maxn, nrow(x), ncol(x)),
               dimnames=list(NULL, rownames(x), colnames(x)))
    x <- aperm(x, c(2,3,1))

    x
}
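
For the original question, the first slice of such an array is the matrix of first elements. Below is a base R sketch of the same padding idea, not VariantAnnotation's code: pad_to_array() is a made-up name, and it uses lengths() in place of elementLengths() so that it runs without any Bioconductor packages.

```r
# Hypothetical base R re-sketch of the pad-and-reshape idea.
pad_to_array <- function(x) {
    n <- as.vector(lengths(x))   # elements per cell
    maxn <- max(n)
    # flatten each cell and pad short ones with NA up to maxn values
    padded <- lapply(x, function(a) c(unlist(a), rep(NA, maxn - length(a))))
    # values vary fastest within a cell, then by row, then by column
    arr <- array(unlist(padded), dim = c(maxn, nrow(x), ncol(x)),
                 dimnames = list(NULL, rownames(x), colnames(x)))
    aperm(arr, c(2, 3, 1))       # reorder to row x column x element
}

m <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)
arr <- pad_to_array(m)
arr[, , 1]   # first element of every cell
arr[, , 2]   # second elements, NA where a cell had only one
```

Padding every cell costs memory proportional to the longest cell, which may matter for a matrix with millions of rows.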