GRanges/GFeatures: Retain all parent columns in cdsByOverlaps output
Entering edit mode
vanbelj ▴ 10
Last seen 7 months ago

I used DiffBind to get a GRange object with differentially bound genomic sites. This GRange also contains several numeric columns with useful statistics:

GRanges object with 945 ranges and 6 metadata columns:
       seqnames        ranges strand |      Conc Conc_NoSalt Conc_PostSalt_0      Fold   p-value       FDR
          <Rle>     <IRanges>  <Rle> | <numeric>   <numeric>       <numeric> <numeric> <numeric> <numeric>
  5309  chrXIII 482721-482821      * |      8.72        7.35            9.41     -2.06  2.92e-42  2.13e-38
  2432    chrVI 107743-107843      * |      9.19        8.00            9.84     -1.83  8.45e-38  3.09e-34
  6320    chrXV 430498-430598      * |      8.53        7.21            9.20     -2.00  1.57e-36  3.82e-33
  4835   chrXII 783983-784083      * |      9.01        7.92            9.62     -1.71  7.88e-33  1.44e-29
  6286    chrXV 383445-383545      * |      9.51        8.56           10.08     -1.52  1.13e-28  1.66e-25
   ...      ...           ...    ... .       ...         ...             ...       ...       ...       ...
  1873    chrIX 247430-247530      * |      8.71        8.91            8.47      0.43   0.00634    0.0492
  4284    chrXI 483176-483276      * |      9.37        9.19            9.52     -0.33   0.00637    0.0494
  5806   chrXIV 392558-392658      * |      9.10        8.88            9.28     -0.40   0.00638    0.0494
  3376  chrVIII 271580-271680      * |      7.44        7.76            7.03      0.73   0.00643    0.0498
  6262    chrXV 349087-349187      * |      8.69        8.47            8.88     -0.41   0.00645    0.0499
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

Next, I use GenomicFeatures::cdsByOverlaps to retrieve meaningful annotations about the identified genomic regions (such as Gene Name) with the following bit of code:

HitsCDS <- cdsByOverlaps(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, Salt.DB,
                               maxgap = -1L, minoverlap = 0L,
                               type = c("any", "start", "end"),
                               columns = c("tx_id", "tx_name"))

However, the new GRange object lacks the statistics columns, and I haven't found a way to include them in the output (or merge them with the output):

GRanges object with 289 ranges and 2 metadata columns:
        seqnames        ranges strand |         tx_id         tx_name
           <Rle>     <IRanges>  <Rle> | <IntegerList> <CharacterList>
    [1]     chrI   31567-32940      + |             7         YAL062W
    [2]     chrI   71786-73288      + |            20         YAL038W
    [3]     chrI 169375-170295      + |            44         YAR015W
    [4]     chrI   42881-45022      - |            69         YAL054C
    [5]     chrI 139503-141431      - |           103         YAL005C
    ...      ...           ...    ... .           ...             ...
  [285]   chrXVI 697147-698265      - |          6599         YPR078C
  [286]   chrXVI 702114-703970      - |          6600         YPR081C
  [287]   chrXVI 813156-814058      - |          6627         YPR139C
  [288]   chrXVI 824690-824926      - |          6631       YPR145C-A
  [289]   chrXVI 825347-825676      - |          6632         YPR146C
  seqinfo: 17 sequences (1 circular) from sacCer3 genome

I tried simply adding additional column names, such as in the following, but this throws an error. I'm guessing because this references the Tx.Db GRange and not my DiffBind GRange?

columns = c("tx_id", "tx_name", "Conc"))

I'm sure there is a simple way to accomplish this, but I'm fairly inexperienced with R, GRanges, Bioconductor and haven't found a solution. Thanks for any help!

GRanges GenomicRanges GenomicFeatures cdsByOverlaps DiffBind • 394 views
Entering edit mode
Last seen 2 days ago
United States

The cdsByOverlaps function is intended to get you a set of cds that overlap your ranges. So the new GRanges object won't have any of the statistics because the only way your differentially bound ranges come into play is to see if a given cds region overlaps it.

If instead you want to say what genes a given range overlaps, where you could define 'overlap' in many different ways, depending on how close the binding region has to be to the gene, or if you just care about promoter sites or whatever, then you should use findOverlaps instead, which will give you a Hits object that provides an index for each of your GRanges objects.

So as an example, using completely untested code because I don't have your data, you would do something like

## I would use transcripts rather than cds, but you choose
> txs <- transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
## here I call your GRanges from diffbind 'diffbindGR'
> fo <- findOverlaps(txs, diffbindGR)
## add column to diffbindGR
> mcols(diffbindGR)$gene_id <- NA
## and populate with gene ids
> mcols(diffbindGR)$gene_id[subjectHits(fo)] <- names(txs)[queryHits(fo)]

You could also use the names of the txs object to get the gene symbols if that's better for you, doing something like

> library(org.Sc.sgd.db)
> gns <- mapIds(org.Sc.sgd.db, names(txs), "GENENAME", "ORF")
'select()' returned 1:1 mapping between keys and columns
> head(gns)
     Q0010      Q0032      Q0055      Q0075      Q0080      Q0085 
        NA         NA      "AI2" "AI5_BETA"     "ATP8"     "ATP6" 
> mcols(diffbindGR)$genename <- NA
> mcols(diffbindGR)$genename[subjectHits(fo)] <- gns[queryHits(fo)]
Entering edit mode

Thanks James, this works for me!

Entering edit mode

Cool. I know nothing about yeast, so this might not be applicable, but there is a promoters function that gives you the region around the transcription start site, which might be relevant to what you are doing (assuming some sort of ChIP-seq for transcription factor binding?)

Entering edit mode

I actually use the promoter function, but unfortunately can only define the promoter regions based on position relative to the transcription start site. The yeast genome is fairly compact, so there are many instances where a reasonably promoter length would actually run into an upstream transcript.

Do you know if there is a way to define promoters from an annotation file? I'm thinking of making a set of rules to define promoters in a gene specific manner, which would result in an annotation file with both transcripts and promoters that would need to be made into a TxDb object (or perhaps something else from a different package?)


Entering edit mode

Thanks for bringing this use case to our attention. It's currently not possible to store an explicit set of promoters within a TxDb package. The easiest work around would be a separate BED file of promoters with names corresponding to the genes.

Entering edit mode

I can make the bed file pretty easily. Can you give me piece of example code for how I would reference the bed file that contains the promoter annotations?

For instance, what object type is "promoters" in the following:

promoters <- promoters(transcripts(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, columns=columns, use.names=TRUE), upstream=200, downstream=50)

If I could import the bed file with my promoter annotations and save it as a similar object-type in R, I think I could use the following code to find peak-overlaps with the promoters:

PromoterHits <- subsetByOverlaps(promoters, diffbindGR)

Think that would work?


Entering edit mode

Just use rtracklayer::import() to read the BED as a GRanges.


Login before adding your answer.

Traffic: 378 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6