Question

Extract GRanges for gene data frame

1

Entering edit mode

mdeea123 ▴ 10

@mdeea123-11297

Last seen 7.7 years ago

I have a set of genes with associated data that I want to locate on the mouse genome

My approach is to

extract the genes in the genome
subset the genes to include only my list
add the extra data for my list into the GRanges as metadata
export the relevant parts of the GRanges object.

MouseGenesGR <- genes(Mus.musculus, columns="SYMBOL") 

head(means)
          Hmeans   Lmeans    Nmeans
Aspn    3.779835 7.795773 11.534907
Angpt1 10.251626 8.220874  4.713242
Gm773   2.867759 6.118540 10.058305
Lifr    6.828385 8.136443  5.265472
Il1rl1  5.832741 8.427293 11.550407
Ogn     7.865109 9.429068 12.435698

dim(means)
[1] 1000    3

MySet1 <- MouseGenesGR[any(mcols(MouseGenesGR)$SYMBOL %in% row.names(means))]  #subset my genes

length(MySet1)
[1] 911

What happened to the other 89?

And how do I query that?

Now I would like to add in the metadata with

mcols(MySet1)$HMeans <- means$Hmeans 
mcols(MySet1)$LMeans <- means$Lmeans 
mcols(MySet1)$NMeans <- means$Nmeans

but I can't because the two objects are now of different lengths

Error in `[[<-`(`*tmp*`, name, value = c(3.77983495075, 10.2516255123333,  : 
  1000 elements in value to replace 911 elements

Any help appreciated.

Mitch

GRange • 2.3k views

ADD COMMENT • link updated 7.7 years ago by Martin Morgan 25k • written 7.7 years ago by mdeea123 ▴ 10

score 0 · Answer 1 · 2016-08-15

I'm not sure if you're trying to end up with a data.frame or a GRanges object but either way you might be better off coercing to/from a data frame as it makes the join operation easier. For example:

library(EnsDb.Hsapiens.v75)
library(GenomicRanges)
library(dplyr)
library(tibble)

#info on genes
gene_info <- genes(EnsDb.Hsapiens.v75, filter=GenebiotypeFilter('protein_coding'))
gene_info <- as.data.frame(gene_info) %>% tbl_df()

#dummy mean info
mean_info <- matrix(nrow=7, rnorm(n=21), dimnames = list(c('BRAF','EGFR','NRAS','TP53','PTEN','KRAS', 'NOMATCHGENE'), c('Hmeans','Lmeans','Nmeans')))
mean_info <- mean_info %>% as.data.frame() %>% rownames_to_column('gene_name')

#join operation
merged_df <- gene_info %>% inner_join(mean_info, by='gene_name')
merged_gr <- merged_df %>% as.data.frame() %>% as('GRanges')
names(merged_gr) <-  merged_gr$gene_id

If you want to see where your missing symbols are coming from, you can just substitute the inner_join with a left_join to ensure that you receive a data frame the same length as the left hand data frame:

> lj_df <- mean_info %>% left_join(gene_info, by='gene_name')
> lj_df %>% dplyr::filter(is.na(gene_id)) %>% dplyr::select(1:10)
    gene_name   Hmeans    Lmeans     Nmeans seqnames start end width strand gene_id
1 NOMATCHGENE -0.23859 0.3558788 -0.9541681     <NA>    NA  NA    NA   <NA>    <NA>

score 0 · Answer 2 · 2016-08-15

Here's something reproducible

library(Mus.musculus)
gr = genes(Mus.musculus, columns="SYMBOL")
df = data.frame(1:5, row.names=unlist(gr$SYMBOL[1:5]))

Use merge() to merge the two objects

all = merge(gr, df, by.x="SYMBOL", by.y="row.names", sort=FALSE)

Coerce the result back to a genomic ranges for future computation

as(all, "GRanges")

As for the missing symbols, you could ask about the (asymetric) setdiff(unlist(gr$SYMBOLS), rownames(df)) or similar; one might have more luck with column="ALIAS", though the merge would require more work -- each element of gr might match to 0 or more rows of df.

For SYMBOL, I'd probably tweak gr after checking that the 'CharacterList' representation of SYMBOL is not needed

> all(lengths(gr$SYMBOL) == 1)
[1] TRUE
> gr$SYMBOL = unlist(gr$SYMBOL)