Hi
I want to remove rows from out of chromosome length ranges from granges. Please suggest
Hi
I want to remove rows from out of chromosome length ranges from granges. Please suggest
There might be an easy way to make a GRanges object for the chromosomes, but it wasn't readily apparent to me, so I did it the hard way. But the idea is to have a GRanges that encompasses the extent of the chromosomes, and then use subsetByOverlaps().
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## get chrlengths - there must actually be an accessor I don't know abou
> con <- dbConnect(SQLite(), paste0(path.package("TxDb.Hsapiens.UCSC.hg19.knownGene"), "/extdata/TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite"))
> chrs <- dbGetQuery(con, "select chrom, length from chrominfo limit 24;")
> chrGR <- GRanges(chrs[,1], IRanges(rep(1, 24), chrs[,2]))
## so chrGR is a GRanges that has start and end for all 24 chromosomes
> badGR <- GRanges(paste0("chr", c(1,2,3,4,5)), IRanges(c(249250600, 3,4,5,6), c(249250645, 6,7,8,9)))
## a 'bad' GRanges, where the first range extends past the end of chr1
> badGR
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [249250600, 249250645] *
[2] chr2 [ 3, 6] *
[3] chr3 [ 4, 7] *
[4] chr4 [ 5, 8] *
[5] chr5 [ 6, 9] *
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
## now remove all ranges that are not 'within' the chromosome boundaries
> subsetByOverlaps(badGR, chrGR, type="within")
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 [3, 6] *
[2] chr3 [4, 7] *
[3] chr4 [5, 8] *
[4] chr5 [6, 9] *
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
Hi Jim,
FWIW, getting the sequence lengths out of a TxDb object is a one-liner:
seqlengths(TxDb.Hsapiens.UCSC.hg19.knownGene)
If you want the full Seqinfo:
seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)
If you want it as a GRanges object (your chrGR object):
as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene), "GRanges")
Too easy I know ;-)
H.
If your genome build is hg19 or some other key known to UCSC, you can do something like:
seqinfo(badGR) <- Seqinfo(genome="hg19")
goodGR <- trim(badGR)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi i have ext_grn my badGr which ahve 253 ranges, m trying to remove location 6 and 10 that are out of bound. when i applied the above process on it it is still having 253 ranges.
> ext_grn GRanges object with 253 ranges and 2 metadata columns: seqnames ranges strand | id score <Rle> <IRanges> <Rle> | <character> <numeric> [1] 1 [ 250, 1250] - | eaton_acs_1 24.6037 [2] 1 [ 30518, 31518] + | eaton_acs_2 18.0581 [3] 1 [ 41560, 42560] + | eaton_acs_3 15.4399 [4] 1 [ 69917, 70917] - | eaton_acs_4 10.2180 [5] 1 [124012, 125012] - | eaton_acs_5 19.1554 ... ... ... ... ... ... ... [249] 16 [776581, 777581] - | eaton_acs_249 18.2426 [250] 16 [818826, 819826] - | eaton_acs_250 20.8951 [251] 16 [880423, 881423] + | eaton_acs_251 21.7175 [252] 16 [932651, 933651] - | eaton_acs_252 15.2781 [253] 16 [941943, 942943] + | eaton_acs_253 21.6356 ------- seqinfo: 16 sequences from sacCer2 genome > a=Seqinfo(genome="sacCer2") > seqlevels(a,force=TRUE) <- c("chrI" , "chrII" , "chrIII" , "chrIV" , "chrV" , "chrVI" , "chrVII" , "chrVIII" ,"chrIX" , "chrX" , "chrXI" , "chrXII" , "chrXIII" ,"chrXIV" , "chrXV" , "chrXVI" ) > a=renameSeqlevels(a,as.character(c(1:16))) > a Seqinfo object with 16 sequences from sacCer2 genome: seqnames seqlengths isCircular genome 1 230208 FALSE sacCer2 2 813178 FALSE sacCer2 3 316617 FALSE sacCer2 4 1531919 FALSE sacCer2 5 576869 FALSE sacCer2 ... ... ... ... 12 1078175 FALSE sacCer2 13 924429 FALSE sacCer2 14 784333 FALSE sacCer2 15 1091289 FALSE sacCer2 16 948062 FALSE sacCer2 > seqinfo(ext_grn) <- a Warning message: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) : GRanges object contains 2 out-of-bound ranges located on sequences 6 and 10. Note that only ranges located on a non-circular sequence whose length is not NA can be considered out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information. > ext_grn <- trim(ext_grn) > ext_grn GRanges object with 253 ranges and 2 metadata columns: seqnames ranges strand | id score <Rle> <IRanges> <Rle> | <character> <numeric> [1] 1 [ 250, 1250] - | eaton_acs_1 24.6037 [2] 1 [ 30518, 31518] + | eaton_acs_2 18.0581 [3] 1 [ 41560, 42560] + | eaton_acs_3 15.4399 [4] 1 [ 69917, 70917] - | eaton_acs_4 10.2180 [5] 1 [124012, 125012] - | eaton_acs_5 19.1554 ... ... ... ... ... ... ... [249] 16 [776581, 777581] - | eaton_acs_249 18.2426 [250] 16 [818826, 819826] - | eaton_acs_250 20.8951 [251] 16 [880423, 881423] + | eaton_acs_251 21.7175 [252] 16 [932651, 933651] - | eaton_acs_252 15.2781 [253] 16 [941943, 942943] + | eaton_acs_253 21.6356 ------- seqinfo: 16 sequences from sacCer2 genomeI see. I'm not sure there's a super-easy way to drop ranges that fall outside of the bounds. trim() is expected to return an object of the same length as its input, so we might need a new function. Maybe Herve has some ideas?
Hi,
trim()uses internal helperGenomicRanges:::get_out_of_bound_index()to find the out-of-bound ranges. So you could use this on your GRanges object:idx <- GenomicRanges:::get_out_of_bound_index(ext_grn) if (length(idx) != 0L) ext_grn <- ext_grn[-idx]We should probably expose and document
get_out_of_bound_index(), and rename it in the process.H.