Question

Mapping regions to gene symbols

0

Entering edit mode

mms140130 • 0

@mms140130-13360

Last seen 6.7 years ago

I have used the code in another question https://support.bioconductor.org/p/67118/#97616,but it didn't work with me, I have those Segment_Mean, a start and end I want to know which genes overlap with those regions.

xx = splitColumnByOverlap(gns, cnv, "SYMBOL")
Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

my cnv is the copy number variation file for tumor tissue only

GRanges object with 284458 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 3218610, 95674710] *
[2] 1 [ 95676511, 95676518] *
[3] 1 [ 95680124, 167057183] *
[4] 1 [167057495, 167059336] *
[5] 1 [167059760, 181602002] *
... ... ... ...
[284454] 19 [ 284018, 58878226] *
[284455] 20 [ 455764, 62219837] *
[284456] 21 [15347621, 47678774] *
[284457] 22 [17423930, 49331012] *
[284458] 23 [ 3157107, 154905589] *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths

my data file look like the following(part of it)

Sample	Chromosome	Start	End	Num_Probes	Segment_Mean
TCGA-3C-AAAU-01A-11D-A41E-01	1	3218610	63469503	33369	0.1791
TCGA-3C-AAAU-01A-11D-A41E-01	1	63471492	63472103	3	-0.8257
TCGA-3C-AAAU-01A-11D-A41E-01	1	63472868	85632596	13663	0.2994
TCGA-3C-AAAU-01A-11D-A41E-01	1	85635413	85878136	146	0.6498
TCGA-3C-AAAU-01A-11D-A41E-01	1	85881278	149890533	20618	0.2781

annotation R copynumber • 1.6k views

ADD COMMENT • link updated 6.8 years ago by Vincent J. Carey, Jr. 6.7k • written 6.8 years ago by mms140130 • 0

score 0 · Answer 1 · 2017-07-01

For all its apparent simplicity, this question is not very clear. One warning sign is:

seqinfo: 23 sequences from an unspecified genome; no seqlengths

How are we supposed to know what reference genome defines the genomic coordinates? This is essential information for mapping between characteristics of genomic regions and gene annotation.

A very simple approach: use the 'mygr' given later in this post and then

subsetByOverlaps(genes(EnsDb.Hsapiens.v75), mygr)

--- however, before thinking of that I got all tangled up in the following:

I tried to produce a bit of code that allows flexible resolution. There are several ways to proceed and I hope annotation experts in the group will weigh in.

addSyms = function (chr, start, end, txdb, mapresource, keytype, columntag, 
    restag = "gene_id", slstyle = "UCSC") 
{
    requireNamespace("GenomeInfoDb")
    ingr = GRanges(chr, IRanges(start, end))
    seqlevelsStyle(ingr) = slstyle
    so = subsetByOverlaps(genes(txdb), ingr)
    df = select(mapresource, keys = mcols(so)[[restag]], columns = columntag, 
        keytype = keytype)
    mcols(so) = cbind(mcols(so), df)
    so
}

Your data snippet can be read as a data.frame that I'll call ss

> ss
                        Sample Chromosome    Start       End Num_Probes
1 TCGA-3C-AAAU-01A-11D-A41E-01          1  3218610  63469503      33369
2 TCGA-3C-AAAU-01A-11D-A41E-01          1 63471492  63472103          3
3 TCGA-3C-AAAU-01A-11D-A41E-01          1 63472868  85632596      13663
4 TCGA-3C-AAAU-01A-11D-A41E-01          1 85635413  85878136        146
5 TCGA-3C-AAAU-01A-11D-A41E-01          1 85881278 149890533      20618
  Segment_Mean
1       0.1791
2      -0.8257
3       0.2994
4       0.6498
5       0.2781

Simple invocation of the function with UCSC location->symbol, hg19 reference

r2syms = addSyms( chr=ss$Chromosome, start=ss$Start, end=ss$End,
   txdb=TxDb.Hsapiens.UCSC.hg19.knownGene, mapresource= Homo.sapiens,
   keytype="ENTREZID", columntag="SYMBOL")

> r2syms
GRanges object with 1197 ranges and 3 metadata columns:
            seqnames                 ranges strand |     gene_id    ENTREZID
               <Rle>              <IRanges>  <Rle> | <character> <character>
  100126331     chr1 [117637265, 117637350]      + |   100126331   100126331
  100126348     chr1 [ 94312388,  94312467]      + |   100126348   100126348
  100128071     chr1 [ 32826871,  32827844]      - |   100128071   100128071
  100128787     chr1 [101092606, 101112560]      - |   100128787   100128787
  100129046     chr1 [ 94057525,  94065587]      + |   100129046   100129046
        ...      ...                    ...    ... .         ...         ...
       9923     chr1 [ 22778344,  22857650]      + |        9923        9923
       9927     chr1 [ 12040238,  12073572]      + |        9927        9927
       9939     chr1 [145507557, 145513535]      + |        9939        9939
       9967     chr1 [ 36690017,  36770957]      + |        9967        9967
        998     chr1 [ 22379120,  22419436]      + |         998         998
                  SYMBOL
             <character>
  100126331       MIR942
  100126348       MIR760
  100128071      FAM229A
  100128787    LINC01349
  100129046 LOC100129046
        ...          ...
       9923       ZBTB40
       9927         MFN2
       9939        RBM8A
       9967       THRAP3
        998        CDC42
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Different annotation resources can be "plugged in":

r2syms_e = addSyms( chr=ss$Chromosome, start=ss$Start, end=ss$End,
   txdb=EnsDb.Hsapiens.v75, mapresource= EnsDb.Hsapiens.v75, slstyle="Ensembl",
   keytype="GENEID", columntag="SYMBOL")

The "results" differ.

If it is of interest to coordinate the original intervals with the lists of genes,

mygr = GRanges(ss$Chromosome, IRanges(ss$Start, ss$End))
fo = findOverlaps(mygr, r2syms_e)

is a start. But a little more work may be needed.