Search
Question: Make window around TSS and get values from another data.frame for bins of the specified window
0
gravatar for ssabri
5 months ago by
ssabri20
ssabri20 wrote:

I would like to take the "start_position" for each gene in my data.frame and draw a window of 100000 bases upstream and 100000 bases downstream (each in 200 base windows, so 500 windows upstream/downstream of the "start_position") and intersect the window with a granges object that has a score column corresponding to 200 base bin positions. 

The gene data.frame looks like this: 

> head(gene)
  ensembl_transcript_id length eff_length est_counts         tpm    ensembl_gene_id mgi_symbol start_position end_position strand chromosome_name
1    ENSMUST00000000001   3262       3063   7876.000  78.1469000 ENSMUSG00000000001      Gnai3      107910198    107949064      -            chr3
2    ENSMUST00000000003    902        703      0.000   0.0000000 ENSMUSG00000000003       Pbsn       75083240     75098962      -            chrX
3    ENSMUST00000000010   2574       2375    502.564   6.4310300 ENSMUSG00000020875      Hoxb9       96132771     96137909      +           chr11
4    ENSMUST00000000028   2143       1944    570.259   8.9151500 ENSMUSG00000000028      Cdc45       18780540     18812080      -           chr16
5    ENSMUST00000000033   3708       3509  26338.300 228.1170000 ENSMUSG00000048583       Igf2      149836673    149852721      -            chr7
6    ENSMUST00000000049   1190        991      3.000   0.0920027 ENSMUSG00000000049       Apoh      108204668    108275710      +           chr11

The granges object looks like this: 

> score_data
UCSC track 'MEF_K27AC.downsampled.sorted'
UCSCData object with 13274466 ranges and 1 metadata column:
             seqnames               ranges strand |     score
                <Rle>            <IRanges>  <Rle> | <numeric>
         [1]     chr1          [  1,  200]      * |         0
         [2]     chr1          [201,  400]      * |         0
         [3]     chr1          [401,  600]      * |         0
         [4]     chr1          [601,  800]      * |         0
         [5]     chr1          [801, 1000]      * |         0
         ...      ...                  ...    ... .       ...
  [13274462]     chrY [15901401, 15901600]      * |         0
  [13274463]     chrY [15901601, 15901800]      * |         0
  [13274464]     chrY [15901801, 15902000]      * |         0
  [13274465]     chrY [15902001, 15902200]      * |         0
  [13274466]     chrY [15902201, 15902400]      * |         0
  -------
  seqinfo: 21 sequences from mm9 genome​

The goal here is to compute for each gene TSS (start_position), get the score values that are upstream and downstream 500 windows. I'd like to add these windows as columns to the gene data.frame is possible. Ideally I would like the data in some form resembling: 

  ensembl_transcript_id tpm upstream_500 upstream_499 downstream_499 downstream_500
1 ENSMUST00000000001 78.1469          
2 ENSMUST00000000003 0          
3 ENSMUST00000000010 6.43103          
4 ENSMUST00000000028 8.91515          
5 ENSMUST00000000033 228.117          
6 ENSMUST00000000049 0.0920027          

Any help would be much appreciated. I've posted links to the data and current code below: 

Gene data.frame (this can be read in the code as the exp variable)

Score object (this can be read in the code as the mef_chip variable)

Current code that should work with the inputs above

 

ADD COMMENTlink modified 5 months ago by Michael Lawrence10.0k • written 5 months ago by ssabri20
2
gravatar for Michael Lawrence
5 months ago by
Michael Lawrence10.0k
United States
Michael Lawrence10.0k wrote:

Use promoters() to generate the promoters, tile() to break them up into windows, and findOverlaps() to find which score windows overlap which promoter windows. Then you need to decide how to aggregate when multiple score windows overlap a single promoter window.

ADD COMMENTlink written 5 months ago by Michael Lawrence10.0k

Hi Michael, 

I was wondering if you could help me out here. I am creating a new Granges object with the expression data and reducing the range to only the TSS, then I draw a window around the TSS. The exp_granges object looks correct to me. Now I want to subset the scores data.frame with these windows. The code below seems to work with the first two rows of the exp_granges object but when I use the entire object I only get a faction of the total rows in the expression data. I think this is because some of the ranges overlap? or the TSS locations are the same? Any help would be much appreciated! 

extend <- function(x, upstream=0, downstream=0) {
  if (any(strand(x) == "*")){ warning("'*' ranges were treated as '+'") }
  on_plus <- strand(x) == "+" | strand(x) == "*"
  new_start <- start(x) - ifelse(on_plus, upstream, downstream)
  new_end <- end(x) + ifelse(on_plus, downstream, upstream)
  ranges(x) <- IRanges(new_start, new_end)
  # trim(x)
  x
}

reduce_to_tss <- function(x) {
  return( resize(x, 1, fix="start", use.names=TRUE, ignore.strand=FALSE) ) 
}
### Add chip signal data +/- 100kb from TSS
exp_granges <- GRanges(seqnames=Rle(exp$chromosome_name),
                ranges=IRanges(start=exp$start_position, end=exp$end_position, names=seq(1,nrow(exp))),
                strand=Rle(exp$strand),
                symbol=toupper(exp$ensembl_transcript_id))
genome(exp_granges) <- 'mm9'
seqlengths(exp_granges) <- chr_lens[names(seqlengths(exp_granges))]
exp_granges <- reduce_to_tss(exp_granges)
exp_granges <- extend(exp_granges, upstream = 1000, downstream = 1000)

exp_granges <- exp_granges[1:2,]
exp_granges

GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |             symbol
         <Rle>              <IRanges>  <Rle> |        <character>
  [1]     chr3 [107948064, 107950064]      - | ENSMUST00000000001
  [2]     chrX [ 75097962,  75099962]      - | ENSMUST00000000003
  -------
  seqinfo: 21 sequences from mm9 genome

scores_around_tss <- score(subsetByOverlaps(chip, exp_granges))
scores_around_tss_df <- do.call(rbind.data.frame, split(scores_around_tss, ceiling(seq_along(scores_around_tss_df)/11)))

 

 

ADD REPLYlink written 5 months ago by ssabri20

I think the problem is that some transcripts have the same start and end position and these are not preserved as separate entries in subsetByOverlap. This seems to work but is VERY slow: 

scores_list <- list()
for(r in seq(1,length(exp_granges))){
  scores_around_tss <- score(subsetByOverlaps(chip, exp_granges[r,]))
  scores_list[[r]] <- scores_around_tss
}
output <- do.call(rbind.data.frame, scores_list)

ADD REPLYlink written 5 months ago by ssabri20

I'll again recommend using promoters() to form exp_granges. subsetByOverlaps() will just return the elements of chip that overlap any range in exp_granges. Instead, I think you want to match the ranges up using findOverlaps() or similar.  But you probably want to group the scores by TSS, so something like:

mat <- as.matrix(extractList(score(chip), as(findOverlaps(exp_granges, chip), "List")))

I think that will only work though if you first round off the TSS windows so that they align with the chip windows. That way, there's the same number of chip windows for every TSS window.

ADD REPLYlink written 5 months ago by Michael Lawrence10.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 150 users visited in the last hour