Question

Is there a way to remove or take the first overlapping gene(s) when using makeTxDbFromGFF?

0

Entering edit mode

endrebak85 ▴ 40

@endrebak85-10660

Last seen 5.3 years ago

github.com/endrebak/

If there are multiple genes that overlap, I only want the first gene. Is there a way to do this in genomicfeatures already or do I need to preprocess the GTF file before calling makeTxDbFromGFF?

Thanks.

genomicfeatures • 1.7k views

ADD COMMENT • link updated 8.5 years ago by Hervé Pagès 16k • written 8.5 years ago by endrebak85 ▴ 40

score 2 · Accepted Answer · 2016-05-23

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.0 years ago

United States

No need to preprocess any files. Just import the file as a GRanges, subset it, and pass it to makeTxDbFromGRanges.

ADD COMMENT • link 8.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

How do I remove overlapping genes by subsetting granges?

ADD REPLY • link 8.5 years ago endrebak85 ▴ 40

1

Entering edit mode

It would be difficult, because I think it would need to be iterative. For example, you could have this situation:

----
  ----
    ----
      ----

You do not want to simply remove ranges 2,3,4. Rather, I think you want to keep 1 and 3. Thus, the algorithm would need to remove range 2, decide to keep range 3 since 2 is gone, and remove range 4.

Maybe something this (untested) will help you get started:

m <- as.matrix(findOverlaps(gr))
drop <- integer(0L)
while(nrow(m) > 0L) {
    last <- max(m)
    drop <- c(drop, last)
    m <- m[m[,1] != last & m[,2] != last,]
}
gr <- gr[-drop]

Please consider whether findOverlaps should have ignore.strand=TRUE or not.

ADD REPLY • link 8.5 years ago Michael Lawrence ★ 11k

score 2 · Accepted Answer · 2016-05-23

Hi,

makeTxDbFromGFF() has nothing to do with finding overlaps. It just imports the genes, transcripts, exons, and CDS from a GFF file into a TxDb object. Once you've done this, you can extract the genomic coordinates of the genes, transcripts, exons, or CDS from the TxDb object with genes(), transcripts(), exons(), or cds():

txdb <- makeTxDbFromGFF("path/to/GFF/file")
gn <- genes(txdb)  # extract the genes (see ?genes for more information)

Then you can use findOverlaps() to find the overlaps between a set of genomic ranges (e.g. some aligned reads) and gn:

findOverlaps(reads, gn)

If you only want the 1st overlapping gene for each range in reads, call findOverlaps() with select="first":

findOverlaps(reads, gn, select="first")

Please see ?findOverlaps for more information.

Cheers,

H.