Question

Make window around TSS and get values from another data.frame for bins of the specified window

0

Entering edit mode

ssabri ▴ 20

@ssabri-9464

Last seen 4.0 years ago

I would like to take the "start_position" for each gene in my data.frame and draw a window of 100000 bases upstream and 100000 bases downstream (each in 200 base windows, so 500 windows upstream/downstream of the "start_position") and intersect the window with a granges object that has a score column corresponding to 200 base bin positions.

The gene data.frame looks like this:

> head(gene)
  ensembl_transcript_id length eff_length est_counts         tpm    ensembl_gene_id mgi_symbol start_position end_position strand chromosome_name
1    ENSMUST00000000001   3262       3063   7876.000  78.1469000 ENSMUSG00000000001      Gnai3      107910198    107949064      -            chr3
2    ENSMUST00000000003    902        703      0.000   0.0000000 ENSMUSG00000000003       Pbsn       75083240     75098962      -            chrX
3    ENSMUST00000000010   2574       2375    502.564   6.4310300 ENSMUSG00000020875      Hoxb9       96132771     96137909      +           chr11
4    ENSMUST00000000028   2143       1944    570.259   8.9151500 ENSMUSG00000000028      Cdc45       18780540     18812080      -           chr16
5    ENSMUST00000000033   3708       3509  26338.300 228.1170000 ENSMUSG00000048583       Igf2      149836673    149852721      -            chr7
6    ENSMUST00000000049   1190        991      3.000   0.0920027 ENSMUSG00000000049       Apoh      108204668    108275710      +           chr11

The granges object looks like this:

> score_data
UCSC track 'MEF_K27AC.downsampled.sorted'
UCSCData object with 13274466 ranges and 1 metadata column:
             seqnames               ranges strand |     score
                <Rle>            <IRanges>  <Rle> | <numeric>
         [1]     chr1          [  1,  200]      * |         0
         [2]     chr1          [201,  400]      * |         0
         [3]     chr1          [401,  600]      * |         0
         [4]     chr1          [601,  800]      * |         0
         [5]     chr1          [801, 1000]      * |         0
         ...      ...                  ...    ... .       ...
  [13274462]     chrY [15901401, 15901600]      * |         0
  [13274463]     chrY [15901601, 15901800]      * |         0
  [13274464]     chrY [15901801, 15902000]      * |         0
  [13274465]     chrY [15902001, 15902200]      * |         0
  [13274466]     chrY [15902201, 15902400]      * |         0
  -------
  seqinfo: 21 sequences from mm9 genome

The goal here is to compute for each gene TSS (start_position), get the score values that are upstream and downstream 500 windows. I'd like to add these windows as columns to the gene data.frame is possible. Ideally I would like the data in some form resembling:

	ensembl_transcript_id	tpm	upstream_500	upstream_499	…	downstream_499	downstream_500
1	ENSMUST00000000001	78.1469
2	ENSMUST00000000003	0
3	ENSMUST00000000010	6.43103
4	ENSMUST00000000028	8.91515
5	ENSMUST00000000033	228.117
6	ENSMUST00000000049	0.0920027

Any help would be much appreciated. I've posted links to the data and current code below:

Gene data.frame (this can be read in the code as the exp variable)

Score object (this can be read in the code as the mef_chip variable)

Current code that should work with the inputs above

R granges genomicranges chipseq • 1.6k views

ADD COMMENT • link updated 6.3 years ago by Michael Lawrence ★ 11k • written 6.3 years ago by ssabri ▴ 20

score 2 · Answer 1 · 2018-01-11

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

Use promoters() to generate the promoters, tile() to break them up into windows, and findOverlaps() to find which score windows overlap which promoter windows. Then you need to decide how to aggregate when multiple score windows overlap a single promoter window.

ADD COMMENT • link 6.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael,

I was wondering if you could help me out here. I am creating a new Granges object with the expression data and reducing the range to only the TSS, then I draw a window around the TSS. The exp_granges object looks correct to me. Now I want to subset the scores data.frame with these windows. The code below seems to work with the first two rows of the exp_granges object but when I use the entire object I only get a faction of the total rows in the expression data. I think this is because some of the ranges overlap? or the TSS locations are the same? Any help would be much appreciated!

extend <- function(x, upstream=0, downstream=0) {
  if (any(strand(x) == "*")){ warning("'*' ranges were treated as '+'") }
  on_plus <- strand(x) == "+" | strand(x) == "*"
  new_start <- start(x) - ifelse(on_plus, upstream, downstream)
  new_end <- end(x) + ifelse(on_plus, downstream, upstream)
  ranges(x) <- IRanges(new_start, new_end)
  # trim(x)
  x
}

reduce_to_tss <- function(x) {
  return( resize(x, 1, fix="start", use.names=TRUE, ignore.strand=FALSE) ) 
}

### Add chip signal data +/- 100kb from TSS
exp_granges <- GRanges(seqnames=Rle(exp$chromosome_name),
                ranges=IRanges(start=exp$start_position, end=exp$end_position, names=seq(1,nrow(exp))),
                strand=Rle(exp$strand),
                symbol=toupper(exp$ensembl_transcript_id))
genome(exp_granges) <- 'mm9'
seqlengths(exp_granges) <- chr_lens[names(seqlengths(exp_granges))]
exp_granges <- reduce_to_tss(exp_granges)
exp_granges <- extend(exp_granges, upstream = 1000, downstream = 1000)

exp_granges <- exp_granges[1:2,]
exp_granges

GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |             symbol
         <Rle>              <IRanges>  <Rle> |        <character>
  [1]     chr3 [107948064, 107950064]      - | ENSMUST00000000001
  [2]     chrX [ 75097962,  75099962]      - | ENSMUST00000000003
  -------
  seqinfo: 21 sequences from mm9 genome

scores_around_tss <- score(subsetByOverlaps(chip, exp_granges))
scores_around_tss_df <- do.call(rbind.data.frame, split(scores_around_tss, ceiling(seq_along(scores_around_tss_df)/11)))

ADD REPLY • link 6.3 years ago ssabri ▴ 20

0

Entering edit mode

I think the problem is that some transcripts have the same start and end position and these are not preserved as separate entries in subsetByOverlap. This seems to work but is VERY slow:

scores_list <- list()
for(r in seq(1,length(exp_granges))){
  scores_around_tss <- score(subsetByOverlaps(chip, exp_granges[r,]))
  scores_list[[r]] <- scores_around_tss
}
output <- do.call(rbind.data.frame, scores_list)

ADD REPLY • link 6.3 years ago ssabri ▴ 20

0

Entering edit mode

I'll again recommend using promoters() to form exp_granges. subsetByOverlaps() will just return the elements of chip that overlap any range in exp_granges. Instead, I think you want to match the ranges up using findOverlaps() or similar. But you probably want to group the scores by TSS, so something like:

mat <- as.matrix(extractList(score(chip), as(findOverlaps(exp_granges, chip), "List")))

I think that will only work though if you first round off the TSS windows so that they align with the chip windows. That way, there's the same number of chip windows for every TSS window.

ADD REPLY • link 6.3 years ago Michael Lawrence ★ 11k