Search
Question: Mapping regions to gene symbols
0
gravatar for mms140130
4 months ago by
mms1401300
mms1401300 wrote:

I have used the code in another question https://support.bioconductor.org/p/67118/#97616,but it didn't work with me, I have those Segment_Mean, a start and end I want to know which genes overlap with those regions.

xx = splitColumnByOverlap(gns, cnv, "SYMBOL")
Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

my cnv is the copy number variation file for tumor tissue only

 GRanges object with 284458 ranges and 0 metadata columns:
           seqnames                 ranges strand
              <Rle>              <IRanges>  <Rle>
       [1]        1 [  3218610,  95674710]      *
       [2]        1 [ 95676511,  95676518]      *
       [3]        1 [ 95680124, 167057183]      *
       [4]        1 [167057495, 167059336]      *
       [5]        1 [167059760, 181602002]      *
       ...      ...                    ...    ...
  [284454]       19  [  284018,  58878226]      *
  [284455]       20  [  455764,  62219837]      *
  [284456]       21  [15347621,  47678774]      *
  [284457]       22  [17423930,  49331012]      *
  [284458]       23  [ 3157107, 154905589]      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

my data file look like the following(part of it)

Sample Chromosome Start End Num_Probes Segment_Mean
TCGA-3C-AAAU-01A-11D-A41E-01 1 3218610 63469503 33369 0.1791
TCGA-3C-AAAU-01A-11D-A41E-01 1 63471492 63472103 3 -0.8257
TCGA-3C-AAAU-01A-11D-A41E-01 1 63472868 85632596 13663 0.2994
TCGA-3C-AAAU-01A-11D-A41E-01 1 85635413 85878136 146 0.6498
TCGA-3C-AAAU-01A-11D-A41E-01 1 85881278 149890533 20618 0.2781
ADD COMMENTlink modified 4 months ago by Vincent J. Carey, Jr.6.2k • written 4 months ago by mms1401300
0
gravatar for Vincent J. Carey, Jr.
4 months ago by
United States
Vincent J. Carey, Jr.6.2k wrote:

For all its apparent simplicity, this question is not very clear.  One warning sign is:

seqinfo: 23 sequences from an unspecified genome; no seqlengths

How are we supposed to know what reference genome defines the genomic coordinates?  This is essential information for mapping between characteristics of genomic regions and gene annotation.

A very simple approach: use the 'mygr' given later in this post and then

subsetByOverlaps(genes(EnsDb.Hsapiens.v75), mygr)

--- however, before thinking of that I got all tangled up in the following:

I tried to produce a bit of code that allows flexible resolution.  There are several ways to proceed and I hope annotation experts in the group will weigh in.

addSyms = function (chr, start, end, txdb, mapresource, keytype, columntag, 
    restag = "gene_id", slstyle = "UCSC") 
{
    requireNamespace("GenomeInfoDb")
    ingr = GRanges(chr, IRanges(start, end))
    seqlevelsStyle(ingr) = slstyle
    so = subsetByOverlaps(genes(txdb), ingr)
    df = select(mapresource, keys = mcols(so)[[restag]], columns = columntag, 
        keytype = keytype)
    mcols(so) = cbind(mcols(so), df)
    so
}

Your data snippet can be read as a data.frame that I'll call ss

> ss
                        Sample Chromosome    Start       End Num_Probes
1 TCGA-3C-AAAU-01A-11D-A41E-01          1  3218610  63469503      33369
2 TCGA-3C-AAAU-01A-11D-A41E-01          1 63471492  63472103          3
3 TCGA-3C-AAAU-01A-11D-A41E-01          1 63472868  85632596      13663
4 TCGA-3C-AAAU-01A-11D-A41E-01          1 85635413  85878136        146
5 TCGA-3C-AAAU-01A-11D-A41E-01          1 85881278 149890533      20618
  Segment_Mean
1       0.1791
2      -0.8257
3       0.2994
4       0.6498
5       0.2781

Simple invocation of the function with UCSC location->symbol, hg19 reference

r2syms = addSyms( chr=ss$Chromosome, start=ss$Start, end=ss$End,
   txdb=TxDb.Hsapiens.UCSC.hg19.knownGene, mapresource= Homo.sapiens,
   keytype="ENTREZID", columntag="SYMBOL")

> r2syms
GRanges object with 1197 ranges and 3 metadata columns:
            seqnames                 ranges strand |     gene_id    ENTREZID
               <Rle>              <IRanges>  <Rle> | <character> <character>
  100126331     chr1 [117637265, 117637350]      + |   100126331   100126331
  100126348     chr1 [ 94312388,  94312467]      + |   100126348   100126348
  100128071     chr1 [ 32826871,  32827844]      - |   100128071   100128071
  100128787     chr1 [101092606, 101112560]      - |   100128787   100128787
  100129046     chr1 [ 94057525,  94065587]      + |   100129046   100129046
        ...      ...                    ...    ... .         ...         ...
       9923     chr1 [ 22778344,  22857650]      + |        9923        9923
       9927     chr1 [ 12040238,  12073572]      + |        9927        9927
       9939     chr1 [145507557, 145513535]      + |        9939        9939
       9967     chr1 [ 36690017,  36770957]      + |        9967        9967
        998     chr1 [ 22379120,  22419436]      + |         998         998
                  SYMBOL
             <character>
  100126331       MIR942
  100126348       MIR760
  100128071      FAM229A
  100128787    LINC01349
  100129046 LOC100129046
        ...          ...
       9923       ZBTB40
       9927         MFN2
       9939        RBM8A
       9967       THRAP3
        998        CDC42
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Different annotation resources can be "plugged in":

r2syms_e = addSyms( chr=ss$Chromosome, start=ss$Start, end=ss$End,
   txdb=EnsDb.Hsapiens.v75, mapresource= EnsDb.Hsapiens.v75, slstyle="Ensembl",
   keytype="GENEID", columntag="SYMBOL")

The "results" differ.

If it is of interest to coordinate the original intervals with the lists of genes,

mygr = GRanges(ss$Chromosome, IRanges(ss$Start, ss$End))
fo = findOverlaps(mygr, r2syms_e)

is a start.  But a little more work may be needed.

 

   

ADD COMMENTlink modified 4 months ago • written 4 months ago by Vincent J. Carey, Jr.6.2k

 

@Vincent J. Carey, Jr.

I had the following error too 

subsetByOverlaps(genes(EnsDb.Hsapiens.v75), mygr)
Error in genes(EnsDb.Hsapiens.v75) : 
  object 'EnsDb.Hsapiens.v75' not found

in addition to I have tried the code and I used the r2syms as you wrote then

> mygr = GRanges(ss$Chromosome, IRanges(ss$Start, ss$End))
> fo = findOverlaps(mygr, r2syms)
Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

what should I do?

ADD REPLYlink modified 4 months ago • written 4 months ago by mms1401300
look for the difference between seqinfo(mygr) and seqinfo(r2syms) seqlevelsStyle(mygr) and seqlevelsStyle(r2syms) get them to coincide. it can take some work. simplest is often seqlevelsStyle(mygr) = seqlevelsStyle(r2syms) assuming GenomeInfoDb package is attached On Sat, Jul 1, 2017 at 3:22 PM, mms140130 [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User mms140130 <https: support.bioconductor.org="" u="" 13360=""/> wrote Comment: > Mapping regions to gene symbols > <https: support.bioconductor.org="" p="" 97664="" #97674="">: > > > > I have tried the code and I used the r2syms as you wrote then > > > mygr = GRanges(ss$Chromosome, IRanges(ss$Start, ss$End)) > > fo = findOverlaps(mygr, r2syms) > Warning message: > In .Seqinfo.mergexy(x, y) : > The 2 combined objects have no sequence levels in common. (Use > suppressWarnings() to suppress this warning.) > > what should I do? > > ------------------------------ > > Post tags: annotation, R, copynumber > > You may reply via email or visit https://support.bioconductor. > org/p/97664/#97674 >
ADD REPLYlink written 4 months ago by Vincent J. Carey, Jr.6.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 140 users visited in the last hour