Question: Not getting the expected answers from GenomicRanges follow/precede
gravatar for fruce_ki
13 months ago by
University of Dundee
fruce_ki10 wrote:


I'm getting the opposite answer than I would expect. This has nothing to do with getting opposite behaviour between '+' and '-' strand. My problem here is all on the  '+' strand.


# Subject
ref <- GenomicRanges::makeGRangesFromDataFrame(
    structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "mitochondria", "chloroplast")), 
                    start = c(3631L, 5928L, 11649L, 23146L, 28500L, 31170L, 33379L, 38752L, 44677L, 45296L), 
                    end = c(5899L, 8737L, 13714L, 31227L, 28706L, 33153L,  37871L, 40944L, 44787L, 47019L), 
                    width = c(2269L, 2810L, 2066L, 8082L, 207L, 1984L, 4493L, 2193L, 111L, 1724L), 
                    strand = structure(c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L), class = "factor", .Label = c("+", "-", "*")), 
                    gene_id = c("AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040", "AT1G01046", "AT1G01050", "AT1G01060", "AT1G01070", "AT1G01073", "AT1G01080")), 
                .Names = c("seqnames", "start", "end", "width", "strand", "gene_id"), 
            row.names = c("AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040", "AT1G01046", "AT1G01050", "AT1G01060", "AT1G01070", "AT1G01073", "AT1G01080"), 
            class = "data.frame") )
# Query
qry <- GenomicRanges::makeGRangesFromDataFrame(
    structure(list(seqnames = structure(1L, class = "factor", .Label = "Chr1"), 
                   start = 31137L, end = 31183L, width = 47L, 
                   strand = structure(1L, class = "factor", .Label = c("+", "-", "*")), 
                   value = 9.90299791781146, area = 465.440902137139, 
                   indexStart = 39L, indexEnd = 85L, cluster = 2L, clusterL = 47), 
              .Names = c("seqnames", "start", "end", "width", "strand", "value", "area", "indexStart", "indexEnd", "cluster", "clusterL"), 
              row.names = "2", class = "data.frame") )

# Overlap looks as expected. 
# The 4th reference directly overlaps and is on the correct strand. 
# The 6th would also overlap but is on the wrong strand.
GenomicRanges::findOverlaps(qry, ref, ignore.strand=FALSE)

# Getting the upstream gene does not produce the expected result!
# Region 9 is to the RIGHT, DOWNstream of my query.
GenomicRanges::precede(qry, ref, ignore.strand=FALSE)

# Getting the downstream gene does not produce the expected result either!
# Region 5 is to the LEFT, UPstream of my query!
GenomicRanges::follow(qry, ref, ignore.strand=FALSE)


Either follow/precede don't do what I think they do, or they have a very convoluted way of defining which way is up and down.

My session (Issue also reproducible on older versions of Bioconductor):

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] derfinder_1.10.6

loaded via a namespace (and not attached):
 [1] Biobase_2.36.2             bit64_0.9-7                splines_3.4.1              foreach_1.4.3              GenomicFiles_1.12.0       
 [6] Formula_1.2-2              bumphunter_1.16.0          stats4_3.4.1               latticeExtra_0.6-28        doRNG_1.6.6               
[11] blob_1.1.0                 BSgenome_1.44.1            GenomeInfoDbData_0.99.0    Rsamtools_1.28.0           RSQLite_2.0               
[16] backports_1.1.0            lattice_0.20-35            digest_0.6.12              GenomicRanges_1.28.4       RColorBrewer_1.1-2        
[21] XVector_0.16.0             checkmate_1.8.3            qvalue_2.8.0               colorspace_1.3-2           htmltools_0.3.6           
[26] Matrix_1.2-11              plyr_1.8.4                 XML_3.98-1.9               biomaRt_2.32.1             zlibbioc_1.22.0           
[31] xtable_1.8-2               scales_0.5.0               BiocParallel_1.10.1        htmlTable_1.9              tibble_1.3.4              
[36] pkgmaker_0.22              IRanges_2.10.3             ggplot2_2.2.1              SummarizedExperiment_1.6.3 GenomicFeatures_1.28.4    
[41] nnet_7.3-12                BiocGenerics_0.22.0        lazyeval_0.2.0             survival_2.41-3            magrittr_1.5              
[46] memoise_1.1.0              foreign_0.8-69             tools_3.4.1                registry_0.3               data.table_1.10.4         
[51] matrixStats_0.52.2         stringr_1.2.0              S4Vectors_0.14.3           locfit_1.5-9.1             munsell_0.4.3             
[56] rngtools_1.2.4             cluster_2.0.6              DelayedArray_0.2.7         AnnotationDbi_1.38.2       Biostrings_2.44.2         
[61] compiler_3.4.1             GenomeInfoDb_1.12.2        rlang_0.1.2                grid_3.4.1                 RCurl_1.95-4.8            
[66] iterators_1.0.8            VariantAnnotation_1.22.3   htmlwidgets_0.9            bitops_1.0-6               base64enc_0.1-3           
[71] derfinderHelper_1.10.0     gtable_0.2.0               codetools_0.2-15           DBI_0.7                    reshape2_1.4.2            
[76] GenomicAlignments_1.12.2   gridExtra_2.2.1            knitr_1.17                 rtracklayer_1.36.4         bit_1.1-12                
[81] Hmisc_4.0-3                stringi_1.1.5              parallel_3.4.1             Rcpp_0.12.12               rpart_4.1-11              
[86] acepack_1.4.1 
ADD COMMENTlink modified 13 months ago by Valerie Obenchain ♦♦ 6.6k • written 13 months ago by fruce_ki10
gravatar for Valerie Obenchain
13 months ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

I agree this can be confusing. The behavior of precede() and follow() were established years ago and are documented in both IRanges and GRanges on ?precede or ?follow. These definitions are from the GenomicRanges package, those in IRanges are very similar:

        • precede: For each range in ‘x’, ‘precede’ returns the index
          of the range in ‘subject’ that is directly preceded by the
          range in ‘x’. Overlapping ranges are excluded.  ‘NA’ is
          returned when there are no qualifying ranges in ‘subject’.

        • follow: The opposite of ‘precede’, ‘follow’ returns the index
          of the range in ‘subject’ that is directly followed by the
          range in ‘x’. Overlapping ranges are excluded. ‘NA’ is
          returned when there are no qualifying ranges in ‘subject’.

In your example ( "+" strand) we expect the result of precede() to be the index of a range in 'subject' that is preceded by the range in 'qry'. In other words, the range from 'subject' is behind, or "to the right of" the range in 'qry'. Here is an example with one range in 'x' and 2 in 'subject'.

Position 5 is preceded by position 3:

> gr1 <- GRanges("chr1", IRanges(3,width=1))
> gr2 <- GRanges("chr1", IRanges(c(1,5), width=1))
> gr1
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 3]      *
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  [2]     chr1    [5, 5]      *
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> precede(gr1, gr2)
[1] 2

Position 1 is followed by position 3.

> follow(gr1, gr2)
[1] 1


ADD COMMENTlink written 13 months ago by Valerie Obenchain ♦♦ 6.6k

I did look at the docs repeatedly, though not carefully enough it seems. I focused on the strand orientation bit, which can also have the effect of giving the seemingly opposite answer. Friday evening overtime probably not the best time to be learning new things.

So the query precedes or follows the subject. That is an equally valid approach, but in that case, the term �subject� is misleading, as it is not the actual subject of the action implied by the function name but rather the object of it.

So, thanks for the clarification! I�m relatively new to Bioconductor and the learning curve is steep. Sent from my Windows 10 phone

ADD REPLYlink modified 13 months ago by Martin Morgan ♦♦ 22k • written 13 months ago by fruce_ki10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 139 users visited in the last hour