Question

trying to annotate overlaps with external file information

1

Entering edit mode

chrisclarkson100 ▴ 50

@chrisclarkson100-11114

Last seen 23 months ago

United Kingdom

I have found the overlapping regions between the genomic regions I am interested in and genes with 'intersectBed'- external of R.

intersectBed -a regions_of_interest.bed -b genes.bed > overlap_genes.bed

So now the file looks like this

>head overlap_genes.bed

chr1    180110925    180113591    Region_665    0    0

chr1    180110925    180113591    Region_665    0    0

chr1    180110925    180113591    Region_665    0    0

chr11    43113461    43117525    Region_837    0    0

This however is no good to me as I also need to include columns in this bed file which specify the gene that is being overlapped and where it starts and ends....

head genes.bed

chr1 109408122 109422169 + Serpinb2

chr1 174405505 174408706 + Slamf9

chr1 133724589 133743546 + Slc41a1

chr1 173847813 173848810 + Slamf6

i.e.. what I would like to see is:

head overlap_genes.bed

chr1    180110925    180113591    Region_665    0    0    gene_name .....

chr1    180110925    180113591    Region_665    0    0    another_gene 

chr1    180110925    180113591    Region_665    0    0    one_more_gene

So how do I bind the according gene_names and their start, end and orientation to the overlap_genes object? My first thought was to iterate through each row and check if there were overlaps between the i-th range intervals and if so- bind the concerned gene-name and extra info to that i-th overlap_genes row.

genes=read.table('genes.bed', sep='\t', header=F, stringsAsFactors=F) overlap_genes=read.table('overlap_genes.bed', sep='\t', header=F, stringsAsFactors=F)

informed=data.frame() for(i in nrow(overlap_genes)){ apply(genes,1, function(x){tryCatch(

if(intersect(seq(x[2], x[3]), seq(overlap_genes[i,1], overlap_genes[i,2])))

{rbind(informed,cbind(overlap_genes[i,],x))}

, error=function(e) e)}) }

This however has not worked for me ... The informed data frame comes up empty.... which is obviously wrong as the 'overlap_genes' is the product of being overlapped with genes so all of its rows should overlap with 'genes'... I have tried using Iranges but it seems to posses the same functionality as intersectBed but cannot retain extra columns from external data.....

Does anyone know of a way around this?? Thanks

iranges granges • 1.3k views

ADD COMMENT • link updated 7.0 years ago by Michael Lawrence ★ 11k • written 7.0 years ago by chrisclarkson100 ▴ 50

score 3 · Accepted Answer · 2017-05-06

It looks like you want what comes from the bedtools -wb option. It would write out the gene ranges after the intersections. The HelloRanges package maps bedtools command lines to IRanges code. For example,

> bedtools_intersect("-a regions_of_interest.bed -b genes.bed -wb")
{
    genome <- Seqinfo(genome = NA_character_)
    gr_a <- import("regions_of_interest.bed", genome = genome)
    gr_b <- import("genes.bed", genome = genome)
    pairs <- findOverlapPairs(gr_a, gr_b, ignore.strand = TRUE)
    ans <- Pairs(pintersect(pairs, ignore.strand = TRUE), second(pairs))
    ans
}

That gives a Pairs object that pairs up the intersections and genes (as GRanges). You could also just stick the gene GRanges as a meta column on the intersection GRanges, like:

intersection <- first(ans)
intersection$genes <- second(ans)

Hopefully this gets you on the right path. Oh, and note the call to Seqinfo(). You should fill in your genome identifier, eg, "GRCh38" to get the full benefit of the framework.