Search
Question: How to remove out of bound rows from granges
2
gravatar for vinod.acear
3.0 years ago by
vinod.acear30
India
vinod.acear30 wrote:

Hi 

I want to remove rows from out of chromosome length  ranges from granges. Please suggest

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by vinod.acear30

Hi i have ext_grn my badGr which ahve 253 ranges, m trying to remove location 6 and 10 that are out of bound. when i applied the above process on it it is still having 253 ranges.

> ext_grn
GRanges object with 253 ranges and 2 metadata columns:
        seqnames           ranges strand   |            id     score
           <Rle>        <IRanges>  <Rle>   |   <character> <numeric>
    [1]        1 [   250,   1250]      -   |   eaton_acs_1   24.6037
    [2]        1 [ 30518,  31518]      +   |   eaton_acs_2   18.0581
    [3]        1 [ 41560,  42560]      +   |   eaton_acs_3   15.4399
    [4]        1 [ 69917,  70917]      -   |   eaton_acs_4   10.2180
    [5]        1 [124012, 125012]      -   |   eaton_acs_5   19.1554
    ...      ...              ...    ... ...           ...       ...
  [249]       16 [776581, 777581]      -   | eaton_acs_249   18.2426
  [250]       16 [818826, 819826]      -   | eaton_acs_250   20.8951
  [251]       16 [880423, 881423]      +   | eaton_acs_251   21.7175
  [252]       16 [932651, 933651]      -   | eaton_acs_252   15.2781
  [253]       16 [941943, 942943]      +   | eaton_acs_253   21.6356
  -------
  seqinfo: 16 sequences from sacCer2 genome

> a=Seqinfo(genome="sacCer2")
> seqlevels(a,force=TRUE) <- c("chrI" ,   "chrII" ,  "chrIII" , "chrIV"  , "chrV"  ,  "chrVI" ,  "chrVII" , "chrVIII" ,"chrIX"  , "chrX" , "chrXI"  , "chrXII" , "chrXIII" ,"chrXIV" , "chrXV"  , "chrXVI" )

> a=renameSeqlevels(a,as.character(c(1:16)))
> a

Seqinfo object with 16 sequences from sacCer2 genome:
  seqnames seqlengths isCircular  genome
  1            230208      FALSE sacCer2
  2            813178      FALSE sacCer2
  3            316617      FALSE sacCer2
  4           1531919      FALSE sacCer2
  5            576869      FALSE sacCer2
  ...             ...        ...     ...
  12          1078175      FALSE sacCer2
  13           924429      FALSE sacCer2
  14           784333      FALSE sacCer2
  15          1091289      FALSE sacCer2
  16           948062      FALSE sacCer2

> seqinfo(ext_grn) <- a
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 2 out-of-bound ranges located on sequences 6 and 10. Note
  that only ranges located on a non-circular sequence whose length is not NA can be
  considered out-of-bound (use seqlengths() and isCircular() to get the lengths and
  circularity flags of the underlying sequences). You can use trim() to trim these
  ranges. See ?`trim,GenomicRanges-method` for more information.
> ext_grn <- trim(ext_grn)
> ext_grn
GRanges object with 253 ranges and 2 metadata columns:
        seqnames           ranges strand   |            id     score
           <Rle>        <IRanges>  <Rle>   |   <character> <numeric>
    [1]        1 [   250,   1250]      -   |   eaton_acs_1   24.6037
    [2]        1 [ 30518,  31518]      +   |   eaton_acs_2   18.0581
    [3]        1 [ 41560,  42560]      +   |   eaton_acs_3   15.4399
    [4]        1 [ 69917,  70917]      -   |   eaton_acs_4   10.2180
    [5]        1 [124012, 125012]      -   |   eaton_acs_5   19.1554
    ...      ...              ...    ... ...           ...       ...
  [249]       16 [776581, 777581]      -   | eaton_acs_249   18.2426
  [250]       16 [818826, 819826]      -   | eaton_acs_250   20.8951
  [251]       16 [880423, 881423]      +   | eaton_acs_251   21.7175
  [252]       16 [932651, 933651]      -   | eaton_acs_252   15.2781
  [253]       16 [941943, 942943]      +   | eaton_acs_253   21.6356
  -------
  seqinfo: 16 sequences from sacCer2 genome

 

 

ADD REPLYlink written 3.0 years ago by vinod.acear30

I see. I'm not sure there's a super-easy way to drop ranges that fall outside of the bounds. trim() is expected to return an object of the same length as its input, so we might need a new function. Maybe Herve has some ideas?

ADD REPLYlink written 3.0 years ago by Michael Lawrence10k
2

Hi,

trim() uses internal helper GenomicRanges:::get_out_of_bound_index() to find the out-of-bound ranges. So you could use this on your GRanges object:

idx <- GenomicRanges:::get_out_of_bound_index(ext_grn)
if (length(idx) != 0L)
    ext_grn <- ext_grn[-idx]

We should probably expose and document get_out_of_bound_index(), and rename it in the process.

H.

ADD REPLYlink written 3.0 years ago by Hervé Pagès ♦♦ 13k
1
gravatar for James W. MacDonald
3.0 years ago by
United States
James W. MacDonald48k wrote:

There might be an easy way to make a GRanges object for the chromosomes, but it wasn't readily apparent to me, so I did it the hard way. But the idea is to have a GRanges that encompasses the extent of the chromosomes, and then use subsetByOverlaps().

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## get chrlengths - there must actually be an accessor I don't know abou
> con <- dbConnect(SQLite(), paste0(path.package("TxDb.Hsapiens.UCSC.hg19.knownGene"), "/extdata/TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite"))
> chrs <- dbGetQuery(con, "select chrom, length from chrominfo limit 24;")
> chrGR <- GRanges(chrs[,1], IRanges(rep(1, 24), chrs[,2]))
## so chrGR is a GRanges that has start and end for all 24 chromosomes
> badGR <- GRanges(paste0("chr", c(1,2,3,4,5)), IRanges(c(249250600, 3,4,5,6), c(249250645, 6,7,8,9)))
## a 'bad' GRanges, where the first range extends past the end of chr1
> badGR
GRanges object with 5 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [249250600, 249250645]      *
  [2]     chr2 [        3,         6]      *
  [3]     chr3 [        4,         7]      *
  [4]     chr4 [        5,         8]      *
  [5]     chr5 [        6,         9]      *
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
## now remove all ranges that are not 'within' the chromosome boundaries
> subsetByOverlaps(badGR, chrGR, type="within")
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    [3, 6]      *
  [2]     chr3    [4, 7]      *
  [3]     chr4    [5, 8]      *
  [4]     chr5    [6, 9]      *
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

ADD COMMENTlink written 3.0 years ago by James W. MacDonald48k

Hi Jim,

FWIW, getting the sequence lengths out of a TxDb object is a one-liner:

seqlengths(TxDb.Hsapiens.UCSC.hg19.knownGene)

If you want the full Seqinfo:

seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)

If you want it as a GRanges object (your chrGR object):

as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene), "GRanges")

Too easy I know ;-)

H.

ADD REPLYlink written 3.0 years ago by Hervé Pagès ♦♦ 13k

Thanks Hervé! I figured it had to be something easy...

ADD REPLYlink written 3.0 years ago by James W. MacDonald48k
0
gravatar for Michael Lawrence
3.0 years ago by
United States
Michael Lawrence10k wrote:

If your genome build is hg19 or some other key known to UCSC, you can do something like:

seqinfo(badGR) <- Seqinfo(genome="hg19")
goodGR <- trim(badGR)

 

ADD COMMENTlink written 3.0 years ago by Michael Lawrence10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 197 users visited in the last hour