Question: Extract GRanges for gene data frame
gravatar for mdeea123
23 months ago by
mdeea1230 wrote:

I have a set of genes with associated data that I want to locate on the mouse genome

My approach is to

  • extract the genes in the genome
  • subset the genes to include only my list
  • add the extra data for my list into the GRanges as metadata
  • export the relevant parts of the GRanges object.
MouseGenesGR <- genes(Mus.musculus, columns="SYMBOL") 

          Hmeans   Lmeans    Nmeans
Aspn    3.779835 7.795773 11.534907
Angpt1 10.251626 8.220874  4.713242
Gm773   2.867759 6.118540 10.058305
Lifr    6.828385 8.136443  5.265472
Il1rl1  5.832741 8.427293 11.550407
Ogn     7.865109 9.429068 12.435698

[1] 1000    3

MySet1 <- MouseGenesGR[any(mcols(MouseGenesGR)$SYMBOL %in% row.names(means))]  #subset my genes

[1] 911

What happened to the other 89?

And how do I query that?

Now I would like to add in the metadata with 

mcols(MySet1)$HMeans <- means$Hmeans 
mcols(MySet1)$LMeans <- means$Lmeans 
mcols(MySet1)$NMeans <- means$Nmeans 

but I can't because the two objects are now of different lengths

Error in `[[<-`(`*tmp*`, name, value = c(3.77983495075, 10.2516255123333,  : 
  1000 elements in value to replace 911 elements

Any help appreciated.


ADD COMMENTlink modified 23 months ago by Martin Morgan ♦♦ 22k • written 23 months ago by mdeea1230
gravatar for phil.chapman
23 months ago by
United Kingdom
phil.chapman120 wrote:

I'm not sure if you're trying to end up with a data.frame or a GRanges object but either way you might be better off coercing to/from a data frame as it makes the join operation easier. For example:


#info on genes
gene_info <- genes(EnsDb.Hsapiens.v75, filter=GenebiotypeFilter('protein_coding'))
gene_info <- %>% tbl_df()

#dummy mean info
mean_info <- matrix(nrow=7, rnorm(n=21), dimnames = list(c('BRAF','EGFR','NRAS','TP53','PTEN','KRAS', 'NOMATCHGENE'), c('Hmeans','Lmeans','Nmeans')))
mean_info <- mean_info %>% %>% rownames_to_column('gene_name')

#join operation
merged_df <- gene_info %>% inner_join(mean_info, by='gene_name')
merged_gr <- merged_df %>% %>% as('GRanges')
names(merged_gr) <-  merged_gr$gene_id

If you want to see where your missing symbols are coming from, you can just substitute the inner_join with a left_join to ensure that you receive a data frame the same length as the left hand data frame:

> lj_df <- mean_info %>% left_join(gene_info, by='gene_name')
> lj_df %>% dplyr::filter( %>% dplyr::select(1:10)
    gene_name   Hmeans    Lmeans     Nmeans seqnames start end width strand gene_id
1 NOMATCHGENE -0.23859 0.3558788 -0.9541681     <NA>    NA  NA    NA   <NA>    <NA>





ADD COMMENTlink written 23 months ago by phil.chapman120
gravatar for Martin Morgan
23 months ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

Here's something reproducible

gr = genes(Mus.musculus, columns="SYMBOL")
df = data.frame(1:5, row.names=unlist(gr$SYMBOL[1:5]))

Use merge() to merge the two objects

all = merge(gr, df, by.x="SYMBOL", by.y="row.names", sort=FALSE)

Coerce the result back to a genomic ranges for future  computation

as(all, "GRanges")

As for the missing symbols, you could ask about the (asymetric) setdiff(unlist(gr$SYMBOLS), rownames(df)) or similar; one might have more luck with column="ALIAS", though the merge would require more work -- each element of gr might match to 0 or more rows of df.

For SYMBOL, I'd probably tweak gr after checking that the 'CharacterList' representation of SYMBOL is not needed

> all(lengths(gr$SYMBOL) == 1)
[1] TRUE
> gr$SYMBOL = unlist(gr$SYMBOL)



ADD COMMENTlink modified 23 months ago • written 23 months ago by Martin Morgan ♦♦ 22k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 162 users visited in the last hour