Question

Mapping genome regions to gene symbols

6

Entering edit mode

NS ▴ 60

@ns-7498

Last seen 6.0 years ago

United States

I have just downloaded CNV level 3 files from TCGA database.

As you know, in these files there are three columns: chromosome, start, and end which presents the coordinates of genes.

Now, I would like to map them to gene symbols, but I don't know.

I appreciate any help.

Thanks

tcga cnv gene symbol coordinate • 25k views

ADD COMMENT • link updated 2.3 years ago by James W. MacDonald 68k • written 9.9 years ago by NS ▴ 60

Martin Morgan · Accepted Answer · 2015-04-28

13

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 7 weeks ago

United States

I wrote two helper functions, explained below

geneRanges <- 
    function(db, column="ENTREZID")
{
    g <- genes(db, columns=column)
    col <- mcols(g)[[column]]
    genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
    mcols(genes)[[column]] <- as.character(unlist(col))
    genes
}

splitColumnByOverlap <-
    function(query, subject, column="ENTREZID", ...)
{
    olaps <- findOverlaps(query, subject, ...)
    f1 <- factor(subjectHits(olaps),
                 levels=seq_len(subjectLength(olaps)))
    splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}

Load the GenomicRanges package

library(GenomicRanges)

Get the CNV regions into a GRanges instance. For instance, if the data is in a BED file, perhaps using the rtracklayer package

cnv = rtracklayer::import("my.bed")

or if in a tab-delimited file

df = read.delim("my.tsv")
cnv = makeGRangesFromDataFrame(df)

(if you're a fan of dplyr / magrittr style pipes then cnv = read.delim("my.tsv") %>% makeGRangesFromDataFrame() works, too). Here I make a couple of genomic ranges 'by hand'

> cnv = GRanges(c("chr1", "chr2"), IRanges(94312388,244006886))
> cnv
GRanges object with 2 ranges and 0 metadata columns:
      seqnames                ranges strand
         <Rle>             <IRanges>  <Rle>
  [1]     chr1 [94312388, 244006886]      *
  [2]     chr2 [94312388, 244006886]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Load the Homo.sapiens package

library(Homo.sapiens)

Find the genomic coordinates for each (entrez) gene, as well as the gene symbol

> gns = geneRanges(Homo.sapiens, column="SYMBOL")
> gns
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand   |      SYMBOL
           <Rle>              <IRanges>  <Rle>   | <character>
      1    chr19 [ 58858172,  58874214]      -   |        A1BG
     10     chr8 [ 18248755,  18258723]      +   |        NAT2
    100    chr20 [ 43248163,  43280376]      -   |         ADA
   1000    chr18 [ 25530930,  25757445]      -   |        CDH2
  10000     chr1 [243651535, 244006886]      -   |        AKT3
    ...      ...                    ...    ... ...         ...
   9991     chr9 [114979995, 115095944]      -   |       PTBP3
   9992    chr21 [ 35736323,  35743440]      +   |       KCNE2
   9993    chr22 [ 19023795,  19109967]      -   |       DGCR2
   9994     chr6 [ 90539619,  90584155]      +   |    CASP8AP2
   9997    chr22 [ 50961997,  50964905]      -   |        SCO2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Find which genes overlap which copy number regions

> symInCnv = splitByOverlap(gns, cnv, "SYMBOL")
> symInCnv
CharacterList of length 2
[["1"]] AKT3 MIR942 MIR190B MIR760 MIR921 ... FAM20B LPGAT1 KIF14 RBM8A NR1I3
[["2"]] TANK SNORD11B MIR933 LOC100129175 CCNT2-AS1 ... ZEB2 FARP2 TLK1 CD302

The result is a CharacterList (list of character vectors) where each element contains the genes overlapping the corresponding CNV region.

A CharacterList is very convenient to work with, but use as.vector(symInCnv) to work as a plain list-of-characters, or unstrsplit(symInCnv, sep=", ") to paste the symbols in each CNV together.

Here's the first helper function, disected

geneRanges <- 
    function(db, column="ENTREZID")

'db' is a so-called OrganismDb, containing information about genes (e.g., ENTREZ or SYMBOL gene ids) as well as gene models (e.g., TXSTART, TXEND). The Homo.sapiens OrangismDb is based on the org.Hs.eg.db gene annotation package, and the TxDb.Hsapiens.UCSC.hg19.knownGene package; it is relatively easy to make a custom package if you have different annotations.

'column' is the type of gene identifier you are interested in mapping to. See columns(Homo.sapiens) for some ideas.

{
    g <- genes(db, columns=column)

This line extracts the coordinates (min and max) of each gene, as well as the column (e.g., SYMBOL associated with that gene.

The TxDb is organized around a primary key (the ENTREZID), and there may be several gene SYMBOLS per ENTREZID. The next few lines makes each SYMBOL map to a single genomic location.

    col <- mcols(g)[[column]]
    genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
    mcols(genes)[[column]] <- as.character(unlist(col))
    genes
}

The end result is a GRanges instance, with a metadata column corresponding to the type of gene identifier of interest.

The basic idea behind splitByOverlap is that we can find which gene coordinates overlap which copy number variant coordinates, and then split the column of gene identifiers into lists corresponding to the regions of overlap

splitByOverlap <-
    function(x, f, column="ENTREZID", ...)

'query' is the gene coordinates, 'subject' the copy number coordinates. 'column' needs to match 'column' in the geneRanges() function.

{
    olaps <- findOverlaps(query, subject, ...)

findOverlaps() is a very powerful function. It returns a 'Hits' object that has two parallel vectors. The vectors can be extracted with queryHits() and subjectHits(). queryHits() are the indexes of the queries (the gene coordinates) that overlap the corresponding subjectHits(), i.e., the indexes of the subjects, the copy number regions.

The next lines line up the query column identifier (e.g., gene SYMBOL) that overlaps each subject.

    f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
    splitAsList(mcols(x)[[column]][queryHits(olaps)], f1)

the use of factor() with exactly as many levels as there are subjects ensures that the splitAsList() command returns a 1:1 mapping between the subjects (copy number regions) and the genes in the corresponding CharacterList.

ADD COMMENT • link 9.9 years ago Martin Morgan 25k

1

Entering edit mode

function

elementLengths()

was deprecated in S4Vectors version update 0.16.0 (in 2016?). \ So I used:

elementNROWS()

https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/S4Vectors/NEWS

ADD REPLY • link 2.6 years ago michaelwegert ▴ 10

0

Entering edit mode

@Martin Thank you so much.

It was really helpful, I learned many thing!

ADD REPLY • link 9.9 years ago NS ▴ 60

0

Entering edit mode

Thanks a lot for this post. This is really helpful. I was trying to annotate cnv regions using the code snippets you provided.

For some of the regions the annotation appears to be incorrect or I am doing something wrong.

Example:

chr10

80763309

81342462

1

0.5693

ZEB2-AS1,NXPH2,TEX41,LRP1B,ARHGAP15,YY1P2,PABPC1P2,GTDC1,KYNU,ACVR2A,ZEB2

The genes returned are not in chr 10 . Could you please guide me to understand what I am doing wrong?

What I have:

>cnv

GRanges object with 15663 ranges and 0 metadata columns:
          seqnames                 ranges strand
             <Rle>              <IRanges>  <Rle>
      [1]     chr1 [152275770, 152286617]      *
      [2]     chr1 [152327008, 152328775]      *
      [3]     chr1 [225688735, 225695788]      *
      [4]     chr1 [226923138, 226925166]      *
      [5]     chr1 [228466864, 228481334]      *
      ...      ...                    ...    ...
  [15659]     chr7 [155187653, 155251762]      *
  [15660]     chr8 [145486495, 145488470]      *
  [15661]     chr8 [ 37707064,  37732844]      *
  [15662]     chr8 [145486495, 145488470]      *
  [15663]     chrY [  3447306,   4905323]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

>gns

cnv
GRanges object with 15663 ranges and 0 metadata columns:
          seqnames                 ranges strand
             <Rle>              <IRanges>  <Rle>
      [1]     chr1 [152275770, 152286617]      *
      [2]     chr1 [152327008, 152328775]      *
      [3]     chr1 [225688735, 225695788]      *
      [4]     chr1 [226923138, 226925166]      *
      [5]     chr1 [228466864, 228481334]      *
      ...      ...                    ...    ...
  [15659]     chr7 [155187653, 155251762]      *
  [15660]     chr8 [145486495, 145488470]      *
  [15661]     chr8 [ 37707064,  37732844]      *
  [15662]     chr8 [145486495, 145488470]      *
  [15663]     chrY [  3447306,   4905323]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

symInCnv = splitColumnByOverlap(gns, cnv, "SYMBOL")

ADD REPLY • link 8.6 years ago mkaushal • 0

1

Entering edit mode

By way of reproducible example, I loaded the packages and created the range that you indicate is problematic

library(GenomicRanges)
library(Homo.sapiens)
cnv = GRanges("chr10:80763309-81342462")

I then copy and pasted the two functions above (updated here) into my R session, and calculated the gene ranges and overlaps of my test range

gns = geneRanges(Homo.sapiens, column="SYMBOL")
xx = splitColumnByOverlap(gns, cnv, "SYMBOL")

I get

> gns[gns$SYMBOL %in% xx[[1]]]
GRanges object with 6 ranges and 1 metadata column:
         seqnames               ranges strand |      SYMBOL
            <Rle>            <IRanges>  <Rle> | <character>
   10105    chr10 [81107220, 81115089]      + |        PPIF
  143244    chr10 [81272357, 81276192]      + |     EIF5AL1
  219654    chr10 [81142083, 81205383]      - |     ZCCHC24
  283050    chr10 [80703083, 80828536]      - |   ZMIZ1-AS1
   57178    chr10 [80828792, 81076285]      + |       ZMIZ1
  729238    chr10 [81315608, 81320163]      - |      SFTPA2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

The versions of the various packages in use are

> sessionInfo()
R version 3.3.1 Patched (2016-07-20 r70947)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] Homo.sapiens_1.3.1                     
 [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [3] org.Hs.eg.db_3.3.0                     
 [4] GO.db_3.3.0                            
 [5] OrganismDbi_1.14.1                     
 [6] GenomicFeatures_1.24.4                 
 [7] AnnotationDbi_1.34.4                   
 [8] Biobase_2.32.0                         
 [9] GenomicRanges_1.24.2                   
[10] GenomeInfoDb_1.8.3                     
[11] IRanges_2.6.1                          
[12] S4Vectors_0.10.2                       
[13] BiocGenerics_0.18.0                    
[14] BiocInstaller_1.22.3                   

loaded via a namespace (and not attached):
 [1] graph_1.50.0               XVector_0.12.0            
 [3] zlibbioc_1.18.0            GenomicAlignments_1.8.4   
 [5] BiocParallel_1.6.2         tools_3.3.1               
 [7] SummarizedExperiment_1.2.3 DBI_0.4-1                 
 [9] RBGL_1.48.1                rtracklayer_1.32.1        
[11] bitops_1.0-6               RCurl_1.95-4.8            
[13] biomaRt_2.28.0             RSQLite_1.0.0             
[15] Biostrings_2.40.2          Rsamtools_1.24.0          
[17] XML_3.98-1.4

What do you get, what is your sessionInfo(), and which gene(s) are not on chr10?

ADD REPLY • link 8.6 years ago Martin Morgan 25k

0

Entering edit mode

It worked this time around. Not very sure what mistake I was doing last time. LRP1B was listed as gene on chr 10, which is not correct. It worked perfectly fine this time. Thanks a lot for your help.

ADD REPLY • link 8.6 years ago mkaushal • 0

0

Entering edit mode

It works fine @Martin, thank you very much. But I need to export the GRanges object with the SYMBOL column to a data frame, or to print it in a text file, but I can`t, it is possible? thank you very much in advance.

ADD REPLY • link 7.9 years ago julrodr80 ▴ 30

0

Entering edit mode

Convert a GRanges gr to data.frame with as.data.frame(gr). Paste the elements of a CharacterList x into a character vector with paste(x, collapse=", "). Think twice about whether either of these is something you really want to do.

ADD REPLY • link 7.9 years ago Martin Morgan 25k

0

Entering edit mode

Thank you very much @MartinMorgan, but if I do that, the name of the genes is substituted by numbers, why? How can I solve that?

Thank you very much and sorry the annoyances

ADD REPLY • link 7.9 years ago julrodr80 ▴ 30

0

Entering edit mode

Maybe the relevant column is a factor() not a character() vector.

ADD REPLY • link 7.9 years ago Martin Morgan 25k

0

Entering edit mode

Yes, is a factor... I`ve already solve the problem. Thanks!!

ADD REPLY • link 7.9 years ago julrodr80 ▴ 30

0

Entering edit mode

Something to consider is that this code uses the Homo.sapiens package which uses hg19 as its reference genome. This method makes it unusable if your data uses hg38.

ADD REPLY • link 2.3 years ago mflab • 0

0

Entering edit mode

Adding comments to ancient posts is not particularly useful, given the changes that may have happened in the intervening period. In addition, while what you state is technically correct, it is not correct in a larger sense, as you can swap in any TxDb version you might like.

ADD REPLY • link 2.3 years ago James W. MacDonald 68k

0

Entering edit mode

@Martin thank you so much

I need to trasnform

"symInCnv" into a data frame with two columns, exactly like the print

["1"]] AKT3 MIR942 MIR190B MIR760 MIR921 ... FAM20B LPGAT1 KIF14 RBM8A NR1I3
[["2"]] TANK SNORD11B MIR933 LOC100129175 CCNT2-AS1 ... ZEB2 FARP2 TLK1 CD302

because I need the genes for each cnv to joint as a column to other data frame (where I have my original cnvs with their chromosomal ranges)

Thank you very much

ADD REPLY • link 7.9 years ago julrodr80 ▴ 30

0

Entering edit mode

It would make more sense to have the CNV regions as GRanges objects, and simply add the copy number columns

cnv = GRanges(c("chr1:100-200", "chr10:200-300"))
cnv$genes = symInCnv

ADD REPLY • link 7.9 years ago Martin Morgan 25k

1

Entering edit mode

ok, thanks. It works perfect. Thank you very much!!

Just one thing, you should put the update version of the helper functions (updated here) in your first post, because is confusing... ;)

thank you very much!!

regards

ADD REPLY • link 7.9 years ago julrodr80 ▴ 30

0

Entering edit mode

@Martin I'm trying to map CNV to genes, I had the following error when I applied your code

symInCnv = splitByOverlap(gns, cnv, "SYMBOL")
Error in findOverlaps(query, subject, ...) : object 'query' not found

what should I do also I have 24 chromosomes , I would like to apply the code to the whole CNV file.

ADD REPLY • link updated 7.7 years ago by Martin Morgan 25k • written 7.7 years ago by mms140130 • 0

0

Entering edit mode

Use the function definitions at the top of the response -- splitColumnsByOverlap() -- the code and explanation seem to have drifted a bit.

ADD REPLY • link 7.7 years ago Martin Morgan 25k

0

Entering edit mode

@Martin Thanks for your answer, I got the following error

xx = splitColumnByOverlap(gns, cnv, "SYMBOL")
Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

my cnv is the copy number variation file for tumor tissue only

GRanges object with 284458 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 3218610, 95674710] *
[2] 1 [ 95676511, 95676518] *
[3] 1 [ 95680124, 167057183] *
[4] 1 [167057495, 167059336] *
[5] 1 [167059760, 181602002] *
... ... ... ...
[284454] 19 [ 284018, 58878226] *
[284455] 20 [ 455764, 62219837] *
[284456] 21 [15347621, 47678774] *
[284457] 22 [17423930, 49331012] *
[284458] 23 [ 3157107, 154905589] *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths

my data file look like the following(part of it)

Sample	Chromosome	Start	End	Num_Probes	Segment_Mean
TCGA-3C-AAAU-01A-11D-A41E-01	1	3218610	63469503	33369	0.1791
TCGA-3C-AAAU-01A-11D-A41E-01	1	63471492	63472103	3	-0.8257
TCGA-3C-AAAU-01A-11D-A41E-01	1	63472868	85632596	13663	0.2994
TCGA-3C-AAAU-01A-11D-A41E-01	1	85635413	85878136	146	0.6498
TCGA-3C-AAAU-01A-11D-A41E-01	1	85881278	149890533	20618	0.2781

ADD REPLY • link 7.7 years ago mms140130 • 0

0

Entering edit mode

compare seqlevels(gns) and seqlevels(cnv). see ?"seqlevelsStyle<-" to coerce between 'styles' (e.g,. NCBI, UCSC), but make sure it is sensible to do so (e.g., gns and cnv based on the same annotation). You're really asking a different (i.e., new) question, so start a new question rather than extending this comment.

ADD REPLY • link 7.7 years ago Martin Morgan 25k

score 5 · Accepted Answer · 2015-05-05

Another approach uses Steffen Durink's fine biomaRt package, a contribution to Bioconductor of many year's standing. In its simplest form just two lines of code gets you all the genes in one copy number region:

library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
results <- getBM(attributes = c("hgnc_symbol", "chromosome_name", 
                                 "start_position", "end_position"),
                 filters = c("chromosome_name", "start", "end"), 
                 values=list(1, 94312388, 96000000),
                 mart=mart)
dim(results)  # 34 hits, only 12 with gene symbols

If you are reading your data from a file (the TCGA level 3 data you mentioned), and if you wish to filter the regions, this approach may be useful:

library(GenomicRanges)
filename <- "TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_A01_155716.nocnv_hg19.seg.txt"
tbl <- read.table(filename, sep="\t", as.is=TRUE, header=TRUE);
gr <- makeGRangesFromDataFrame(tbl)

With the data in a standard Bioconductor GRanges object, filtering the data becomes easy. Let's submit just very short copy number regions to biomaRt:

gr.short <- subset(gr, width < 100)
length(gr) # 117 regions
length(gr.short) # just 2 regions
gr.short
GRanges object with 2 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]        2 [109993208, 109993264]      *
  [2]       10 [114452689, 114452764]      *

We need a different biomaRt filter in order to retrieve gene symbols from multiple regions with one call, and we have to specify the copy number the regions in a different format:

regions <- paste(seqnames(gr.short), start(gr.short), end(gr.short), sep=":")
regions
[1] "2:109993208:109993264"  "10:114452689:114452764"

results <- getBM(attributes = c("hgnc_symbol", "chromosome_name", 
                                "start_position","end_position"),
                 filters = c("chromosomal_region"),
                 values=regions,
                 mart=mart)

biomaRt returns two genes, one each from the two short regions:

hgnc_symbol chromosome_name start_position end_position
1 ABLIM1 10 114431113 114685003
2 LINC01123 2 109987063 109996140

You may wish to create a GRanges object from this data.frame in order to simplify subsequent analyses. In general we recommend this: the standard Bioc data types promote interoperability between different packages. In this particular case -- gene symbols mapped to copy number regions -- you may wish to preserve the original TCGA-reported regions in a GRanges object, and then represent the mapped genes as GRanges metadata. However (and as Herve' helpfully pointed out to me) this relationship could be one-to-many, or many-to-one, and the data structure can get a little complicated - though still very useful. I will followup with more detail on this later.