Not getting the expected answers from GenomicRanges follow/precede
1
1
Entering edit mode
fruce_ki ▴ 20
@fruce_ki-13220
Last seen 4.8 years ago
Austria/Vienna/Research Institute for M…

 

I'm getting the opposite answer than I would expect. This has nothing to do with getting opposite behaviour between '+' and '-' strand. My problem here is all on the  '+' strand.

 

# Subject
ref <- GenomicRanges::makeGRangesFromDataFrame(
    structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "mitochondria", "chloroplast")), 
                    start = c(3631L, 5928L, 11649L, 23146L, 28500L, 31170L, 33379L, 38752L, 44677L, 45296L), 
                    end = c(5899L, 8737L, 13714L, 31227L, 28706L, 33153L,  37871L, 40944L, 44787L, 47019L), 
                    width = c(2269L, 2810L, 2066L, 8082L, 207L, 1984L, 4493L, 2193L, 111L, 1724L), 
                    strand = structure(c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L), class = "factor", .Label = c("+", "-", "*")), 
                    gene_id = c("AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040", "AT1G01046", "AT1G01050", "AT1G01060", "AT1G01070", "AT1G01073", "AT1G01080")), 
                .Names = c("seqnames", "start", "end", "width", "strand", "gene_id"), 
            row.names = c("AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040", "AT1G01046", "AT1G01050", "AT1G01060", "AT1G01070", "AT1G01073", "AT1G01080"), 
            class = "data.frame") )
# Query
qry <- GenomicRanges::makeGRangesFromDataFrame(
    structure(list(seqnames = structure(1L, class = "factor", .Label = "Chr1"), 
                   start = 31137L, end = 31183L, width = 47L, 
                   strand = structure(1L, class = "factor", .Label = c("+", "-", "*")), 
                   value = 9.90299791781146, area = 465.440902137139, 
                   indexStart = 39L, indexEnd = 85L, cluster = 2L, clusterL = 47), 
              .Names = c("seqnames", "start", "end", "width", "strand", "value", "area", "indexStart", "indexEnd", "cluster", "clusterL"), 
              row.names = "2", class = "data.frame") )

# Overlap looks as expected. 
# The 4th reference directly overlaps and is on the correct strand. 
# The 6th would also overlap but is on the wrong strand.
GenomicRanges::findOverlaps(qry, ref, ignore.strand=FALSE)

# Getting the upstream gene does not produce the expected result!
# Region 9 is to the RIGHT, DOWNstream of my query.
GenomicRanges::precede(qry, ref, ignore.strand=FALSE)

# Getting the downstream gene does not produce the expected result either!
# Region 5 is to the LEFT, UPstream of my query!
GenomicRanges::follow(qry, ref, ignore.strand=FALSE)

 

Either follow/precede don't do what I think they do, or they have a very convoluted way of defining which way is up and down.


My session (Issue also reproducible on older versions of Bioconductor):

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] derfinder_1.10.6

loaded via a namespace (and not attached):
 [1] Biobase_2.36.2             bit64_0.9-7                splines_3.4.1              foreach_1.4.3              GenomicFiles_1.12.0       
 [6] Formula_1.2-2              bumphunter_1.16.0          stats4_3.4.1               latticeExtra_0.6-28        doRNG_1.6.6               
[11] blob_1.1.0                 BSgenome_1.44.1            GenomeInfoDbData_0.99.0    Rsamtools_1.28.0           RSQLite_2.0               
[16] backports_1.1.0            lattice_0.20-35            digest_0.6.12              GenomicRanges_1.28.4       RColorBrewer_1.1-2        
[21] XVector_0.16.0             checkmate_1.8.3            qvalue_2.8.0               colorspace_1.3-2           htmltools_0.3.6           
[26] Matrix_1.2-11              plyr_1.8.4                 XML_3.98-1.9               biomaRt_2.32.1             zlibbioc_1.22.0           
[31] xtable_1.8-2               scales_0.5.0               BiocParallel_1.10.1        htmlTable_1.9              tibble_1.3.4              
[36] pkgmaker_0.22              IRanges_2.10.3             ggplot2_2.2.1              SummarizedExperiment_1.6.3 GenomicFeatures_1.28.4    
[41] nnet_7.3-12                BiocGenerics_0.22.0        lazyeval_0.2.0             survival_2.41-3            magrittr_1.5              
[46] memoise_1.1.0              foreign_0.8-69             tools_3.4.1                registry_0.3               data.table_1.10.4         
[51] matrixStats_0.52.2         stringr_1.2.0              S4Vectors_0.14.3           locfit_1.5-9.1             munsell_0.4.3             
[56] rngtools_1.2.4             cluster_2.0.6              DelayedArray_0.2.7         AnnotationDbi_1.38.2       Biostrings_2.44.2         
[61] compiler_3.4.1             GenomeInfoDb_1.12.2        rlang_0.1.2                grid_3.4.1                 RCurl_1.95-4.8            
[66] iterators_1.0.8            VariantAnnotation_1.22.3   htmlwidgets_0.9            bitops_1.0-6               base64enc_0.1-3           
[71] derfinderHelper_1.10.0     gtable_0.2.0               codetools_0.2-15           DBI_0.7                    reshape2_1.4.2            
[76] GenomicAlignments_1.12.2   gridExtra_2.2.1            knitr_1.17                 rtracklayer_1.36.4         bit_1.1-12                
[81] Hmisc_4.0-3                stringi_1.1.5              parallel_3.4.1             Rcpp_0.12.12               rpart_4.1-11              
[86] acepack_1.4.1 
GenomicRanges • 1.1k views
ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

I agree this can be confusing. The behavior of precede() and follow() were established years ago and are documented in both IRanges and GRanges on ?precede or ?follow. These definitions are from the GenomicRanges package, those in IRanges are very similar:

        • precede: For each range in ‘x’, ‘precede’ returns the index
          of the range in ‘subject’ that is directly preceded by the
          range in ‘x’. Overlapping ranges are excluded.  ‘NA’ is
          returned when there are no qualifying ranges in ‘subject’.

        • follow: The opposite of ‘precede’, ‘follow’ returns the index
          of the range in ‘subject’ that is directly followed by the
          range in ‘x’. Overlapping ranges are excluded. ‘NA’ is
          returned when there are no qualifying ranges in ‘subject’.

In your example ( "+" strand) we expect the result of precede() to be the index of a range in 'subject' that is preceded by the range in 'qry'. In other words, the range from 'subject' is behind, or "to the right of" the range in 'qry'. Here is an example with one range in 'x' and 2 in 'subject'.
 

Position 5 is preceded by position 3:

> gr1 <- GRanges("chr1", IRanges(3,width=1))
> gr2 <- GRanges("chr1", IRanges(c(1,5), width=1))
> gr1
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 3]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  [2]     chr1    [5, 5]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> precede(gr1, gr2)
[1] 2

Position 1 is followed by position 3.

> follow(gr1, gr2)
[1] 1

Valerie

ADD COMMENT
0
Entering edit mode

I did look at the docs repeatedly, though not carefully enough it seems. I focused on the strand orientation bit, which can also have the effect of giving the seemingly opposite answer. Friday evening overtime probably not the best time to be learning new things.

So the query precedes or follows the subject. That is an equally valid approach, but in that case, the term �subject� is misleading, as it is not the actual subject of the action implied by the function name but rather the object of it.

So, thanks for the clarification! I�m relatively new to Bioconductor and the learning curve is steep. Sent from my Windows 10 phone

ADD REPLY

Login before adding your answer.

Traffic: 564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6