Extracting first member of each element of a matrix of lists
@sean-davis-490

I have a matrix of lists in a geno() field of a VCF:

> class(geno(vcf2)$NR)
[1] "matrix"
> mode(geno(vcf2)$NR)
[1] "list"

I am trying to find a quick way of manipulating this matrix of lists into a numeric matrix of the same dimensions.  In almost every case, the list is of length 1, so I would simply like to extract the first element of each matrix cell list and drop all others.  Any suggestions for how best to do this?  The matrix is rather large, >5M rows.

 

variantannotation vcf
@michael-lawrence-3846

IRanges already does this. But first you need to tell it more about your data. If the list matrix is numeric, then coerce to NumericList:

library(IRanges)
l <- NumericList(m) # 'm' is from Martin's answer

Then ask for the first element of each list element:

l1 <- phead(l, 1)

Finally, convert the result back to a matrix:

m1 <- unlist(l1)
dim(m1) <- dim(m)

 

@martin-morgan-1513

I think this is the data structure:

> l = list(list(1), list(2), list(31, 32), list(4))
> m = matrix(l, 2)
> m
     [,1]   [,2]  
[1,] List,1 List,2
[2,] List,1 List,1
> class(m)
[1] "matrix"
> mode(m)
[1] "list"

You can efficiently query the length of each element

> lengths(m)
[1] 1 1 2 1

and then think of these as indexes into the unlisted m

> cumsum(c(1L, lengths(m)[-length(m)]))
[1] 1 2 3 5

Then actually do the unlisting (avoiding the cost of creating / copying names, if any) and subsetting

> unlist(m, use.names=FALSE)[cumsum(c(1L, lengths(m)[-length(m)]))]
[1]  1  2 31  4

This could be re-shaped into a matrix by adding the original dimensions

> n = unlist(m, use.names=FALSE)[cumsum(c(1L, lengths(m)[-length(m)]))]
> dim(n) = dim(m); dimnames(n) = dimnames(m); n
     [,1] [,2]
[1,]    1   31
[2,]    2    4
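
The steps above can be wrapped into one helper. This is only a minimal sketch: `first_of_each()` is a made-up name, not something from base R or Bioconductor, and it assumes that no cell is empty and that lengths() counts every value stored in a cell (true for cells like list(31, 32) or c(31, 32), not for list(c(31, 32))). The second stopifnot() is there to catch the case where that assumption fails.

```r
# Hypothetical helper: first value of every cell of a list-matrix,
# returned as an atomic matrix with the same dim and dimnames.
first_of_each <- function(m) {
    len <- lengths(m, use.names = FALSE)
    flat <- unlist(m, use.names = FALSE)
    # Assumption 1: every cell holds at least one value.
    # Assumption 2: lengths() counts every value that unlist() produces.
    stopifnot(all(len > 0L), length(flat) == sum(len))
    # Position of the first value of each cell within the flattened vector.
    start <- cumsum(c(1L, len[-length(len)]))
    out <- flat[start]
    dim(out) <- dim(m)
    dimnames(out) <- dimnames(m)
    out
}

m <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)
n <- first_of_each(m)
n
```

Failing loudly on an unexpected structure seemed safer than silently returning values from the wrong cells, since every offset after a miscounted cell would be shifted.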

An alternative, I guess, would be to subset with vapply() (guessing that the content of NR is numeric()). The result is a plain vector, so the dimensions would need to be restored as above.

n = vapply(m, `[[`, numeric(1), 1L)

I don't honestly know which is faster; maybe a microbenchmark() on real data would give some future guidance? Also there might be some built-in cleverness for the cumsum() step, since it seems like it would be common in S4Vectors land.
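
A rough way to compare the two approaches is sketched below. It uses base system.time() instead of microbenchmark() so that no extra package is needed, and the data are synthetic: the matrix size, the share of length-2 cells, and the flat numeric-vector cell structure are all assumptions, so timings on the real NR matrix may look quite different.

```r
# Synthetic list-matrix: most cells hold one number, every 1000th holds two.
# The sizes are made up; the real matrix has > 5M rows.
n_row <- 1e5
n_col <- 4
cells <- lapply(seq_len(n_row * n_col), function(i) {
    if (i %% 1000L == 0L) c(i, i + 0.5) else as.numeric(i)
})
m <- matrix(cells, n_row, n_col)

# Approach 1: unlist once, then pick the first value of each cell by offset.
t_cumsum <- system.time({
    len <- lengths(m, use.names = FALSE)
    a <- unlist(m, use.names = FALSE)[cumsum(c(1L, len[-length(len)]))]
})

# Approach 2: extract the first value cell by cell.
t_vapply <- system.time(
    b <- vapply(m, `[[`, numeric(1), 1L)
)

# Both approaches should agree before the timings mean anything.
stopifnot(identical(a, b))
c(cumsum = t_cumsum[["elapsed"]], vapply = t_vapply[["elapsed"]])
```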

Hmm, it seems like there are at least two other structures that are consistent with your description, and these pose different challenges

l = list(list(1), list(2), list(c(31, 32)), list(4))
l = list(1, 2, c(31, 32), 4)

Maybe a bit of clarification on the actual data structure?
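
To make the difference between the three candidate structures concrete, here is a small sketch that runs both approaches on each of them. The names `first_by_offset()`, `m_a`, `m_b` and `m_c` are made up for illustration. The last line shows one slower fallback that does not depend on which structure is present; it is just one option, not necessarily the best one.

```r
# Hypothetical wrapper around the cumsum()-offset approach.
first_by_offset <- function(m) {
    len <- lengths(m, use.names = FALSE)
    unlist(m, use.names = FALSE)[cumsum(c(1L, len[-length(len)]))]
}

m_a <- matrix(list(list(1), list(2), list(31, 32), list(4)), 2)     # list of scalars per cell
m_b <- matrix(list(list(1), list(2), list(c(31, 32)), list(4)), 2)  # list wrapping a vector
m_c <- matrix(list(1, 2, c(31, 32), 4), 2)                          # plain vector per cell

first_by_offset(m_a)   # 1 2 31 4
first_by_offset(m_c)   # 1 2 31 4
# For m_b every cell has length 1 but unlist() yields 5 values, so the
# offsets drift and the last cell is wrong:
first_by_offset(m_b)   # 1 2 31 32

# vapply() with `[[` works for m_a and m_c, but fails for m_b because the
# third cell returns a length-2 vector.
vapply_a <- vapply(m_a, `[[`, numeric(1), 1L)
vapply_c <- vapply(m_c, `[[`, numeric(1), 1L)
vapply_b <- tryCatch(vapply(m_b, `[[`, numeric(1), 1L),
                     error = function(e) conditionMessage(e))

# Fallback that handles all three: flatten each cell, then take one value.
robust_b <- vapply(m_b, function(x) unlist(x, use.names = FALSE)[1L], numeric(1))
```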

@stephanie-m-gogarten-5121

VariantAnnotation has an internal function .matrixOfListsToArray() that gets used in genotypeToSnpMatrix(). But Martin's solution is probably faster.

.matrixOfListsToArray <- function(x) {
    # find number of elements of each cell of x
    # (elementLengths() was later renamed elementNROWS() in S4Vectors)
    n <- elementNROWS(x)
    maxn <- max(n)
 
    # for cells with less than the max number of elements, add NAs
    idx <- n < maxn
    x[idx] <- lapply(x[idx], function(a){c(a, rep(NA, maxn-length(a)))})
 
    # unlist and convert to array
    x <- array(unlist(x), dim=c(maxn, nrow(x), ncol(x)),
               dimnames=list(NULL, rownames(x), colnames(x)))
    x <- aperm(x, c(2,3,1))

    x
}
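
For the original question, the first slice of the padded array is the matrix of first elements. Below is a base-R sketch of the same padding idea; it is not the VariantAnnotation code itself. `matrix_of_lists_to_array()` is a made-up name, lengths() stands in for the S4Vectors helper, and the cells are assumed to be numeric vectors or lists of scalars.

```r
# Base-R variant of the padding approach (hypothetical name).
matrix_of_lists_to_array <- function(x) {
    # number of values in each cell
    n <- lengths(x, use.names = FALSE)
    maxn <- max(n)

    # pad shorter cells with NA so that every cell has maxn values
    idx <- n < maxn
    x[idx] <- lapply(x[idx], function(a) c(a, rep(NA, maxn - length(a))))

    # unlist into a maxn x nrow x ncol array, then move the cell index last
    arr <- array(unlist(x, use.names = FALSE),
                 dim = c(maxn, nrow(x), ncol(x)),
                 dimnames = list(NULL, rownames(x), colnames(x)))
    aperm(arr, c(2, 3, 1))
}

m <- matrix(list(1, 2, c(31, 32), 4), 2)
a <- matrix_of_lists_to_array(m)
first <- a[, , 1]    # first value of every cell
second <- a[, , 2]   # second value, NA where a cell had only one
```

With >5M rows and almost all cells of length 1, the padding step touches nearly every cell whenever a single longer cell exists, which is probably why the unlist-and-index approach is expected to be faster.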