function precede() not working with GRanges

0

Entering edit mode

Jeremiah Degenhardt ▴ 50

@jeremiah-degenhardt-4417

Last seen 9.6 years ago

Hello Bioconductor, I am trying to use the function precede() with two GRanges objects to find the closest element from gr1 to an element in gr2. precede() shows that it should work with a GRanges object, however, when I try to use it it returns only a vector of "NA". Pasted below is a simplified bit of code to reproduce the error and my session info. Additionally, it would be even nicer if the function nearest() were implemented for a GRanges object. Thanks in advance for the help Jeremiah > gr1 <- GRanges("chr1", IRanges(1,10)) > gr2<- GRanges("chr1", IRanges(20,30)) > precede(gr2, gr1) [1] NA > > sessionInfo() R version 2.13.0 Under development (unstable) (2010-12-07 r53809) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 [13] IRanges_1.9.16 multicore_0.1-3 loaded via a namespace (and not attached): [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 [[alternative HTML version deleted]]

• 1.9k views

ADD COMMENT • link updated 13.3 years ago by Martin Morgan 25k • written 13.3 years ago by Jeremiah Degenhardt ▴ 50

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: > Hello Bioconductor, > > I am trying to use the function precede() with two GRanges objects to find > the closest element from gr1 to an element in gr2. precede() shows that it > should work with a GRanges object, however, when I try to use it it returns > only a vector of "NA". > > Pasted below is a simplified bit of code to reproduce the error and my > session info. > > Additionally, it would be even nicer if the function nearest() were > implemented for a GRanges object. > > Thanks in advance for the help Hi Jeremiah -- Provide strand information for at least one of the reads, e.g., > gr1 <- GRanges("chr1", IRanges(1,10), "-") > gr2 <- GRanges("chr1", IRanges(20,30)) > precede(gr2, gr1) [1] 1 or do the comparison on IRanges with > precede(ranges(gr1), ranges(gr2)) [1] 1 nearest will be implemented. Martin > Jeremiah > >> gr1 <- GRanges("chr1", IRanges(1,10)) >> gr2<- GRanges("chr1", IRanges(20,30)) >> precede(gr2, gr1) > [1] NA >> > >> sessionInfo() > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 > [13] IRanges_1.9.16 multicore_0.1-3 > > loaded via a namespace (and not attached): > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD COMMENT • link 13.3 years ago Martin Morgan 25k

0

Entering edit mode

On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: > > Hello Bioconductor, > > > > I am trying to use the function precede() with two GRanges objects to > find > > the closest element from gr1 to an element in gr2. precede() shows that > it > > should work with a GRanges object, however, when I try to use it it > returns > > only a vector of "NA". > > > > Pasted below is a simplified bit of code to reproduce the error and my > > session info. > > > > Additionally, it would be even nicer if the function nearest() were > > implemented for a GRanges object. > > > > Thanks in advance for the help > > Hi Jeremiah -- > > Provide strand information for at least one of the reads, e.g., > > > gr1 <- GRanges("chr1", IRanges(1,10), "-") > > gr2 <- GRanges("chr1", IRanges(20,30)) > > precede(gr2, gr1) > [1] 1 > > or do the comparison on IRanges with > > > precede(ranges(gr1), ranges(gr2)) > [1] 1 > > This does not work in general, because it discards the chromosome. GRanges is often treated as a flat RangesList/RangedData, with the strand set to "*". Jeremiah and I think it would be reasonable to treat a '*' range as a '+' range when comparing to another '*' range. That is, assume strand is unknown/irrelevant and just go by the starts and ends as if it were a Ranges object, except considering the sequence. Obviously, one could often just explicitly coerce the strand to '+', but the user is not going to expect that such a coercion is required. Especially people like us who are too lazy to figure out how to find the documentation on an S4 method. Michael nearest will be implemented. > > Martin > > > Jeremiah > > > >> gr1 <- GRanges("chr1", IRanges(1,10)) > >> gr2<- GRanges("chr1", IRanges(20,30)) > >> precede(gr2, gr1) > > [1] NA > >> > > > >> sessionInfo() > > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] grid stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 > > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 > > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 > > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 > > [13] IRanges_1.9.16 multicore_0.1-3 > > > > loaded via a namespace (and not attached): > > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

On 01/04/2011 01:15 PM, Michael Lawrence wrote: > > > On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: > > Hello Bioconductor, > > > > I am trying to use the function precede() with two GRanges objects > to find > > the closest element from gr1 to an element in gr2. precede() shows > that it > > should work with a GRanges object, however, when I try to use it > it returns > > only a vector of "NA". > > > > Pasted below is a simplified bit of code to reproduce the error and my > > session info. > > > > Additionally, it would be even nicer if the function nearest() were > > implemented for a GRanges object. > > > > Thanks in advance for the help > > Hi Jeremiah -- > > Provide strand information for at least one of the reads, e.g., > > > gr1 <- GRanges("chr1", IRanges(1,10), "-") > > gr2 <- GRanges("chr1", IRanges(20,30)) > > precede(gr2, gr1) > [1] 1 > > or do the comparison on IRanges with > > > precede(ranges(gr1), ranges(gr2)) > [1] 1 > > > This does not work in general, because it discards the chromosome. > GRanges is often treated as a flat RangesList/RangedData, with the > strand set to "*". > > Jeremiah and I think it would be reasonable to treat a '*' range as a > '+' range when comparing to another '*' range. That is, assume strand is > unknown/irrelevant and just go by the starts and ends as if it were a > Ranges object, except considering the sequence. Was wondering about use case details, e.g., are all query & subject '*'? Martin > Obviously, one could often just explicitly coerce the strand to '+', but > the user is not going to expect that such a coercion is required. > Especially people like us who are too lazy to figure out how to find the > documentation on an S4 method. > > Michael > > nearest will be implemented. > > Martin > > > Jeremiah > > > >> gr1 <- GRanges("chr1", IRanges(1,10)) > >> gr2<- GRanges("chr1", IRanges(20,30)) > >> precede(gr2, gr1) > > [1] NA > >> > > > >> sessionInfo() > > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] grid stats graphics grDevices utils datasets > methods > > [8] base > > > > other attached packages: > > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 > > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 > > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 > > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 > > [13] IRanges_1.9.16 multicore_0.1-3 > > > > loaded via a namespace (and not attached): > > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 13.3 years ago Martin Morgan 25k

0

Entering edit mode

On Wed, Jan 5, 2011 at 10:44 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 01/04/2011 01:15 PM, Michael Lawrence wrote: > > > > > > On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan <mtmorgan@fhcrc.org> > <mailto:mtmorgan@fhcrc.org>> wrote: > > > > On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: > > > Hello Bioconductor, > > > > > > I am trying to use the function precede() with two GRanges objects > > to find > > > the closest element from gr1 to an element in gr2. precede() shows > > that it > > > should work with a GRanges object, however, when I try to use it > > it returns > > > only a vector of "NA". > > > > > > Pasted below is a simplified bit of code to reproduce the error and > my > > > session info. > > > > > > Additionally, it would be even nicer if the function nearest() were > > > implemented for a GRanges object. > > > > > > Thanks in advance for the help > > > > Hi Jeremiah -- > > > > Provide strand information for at least one of the reads, e.g., > > > > > gr1 <- GRanges("chr1", IRanges(1,10), "-") > > > gr2 <- GRanges("chr1", IRanges(20,30)) > > > precede(gr2, gr1) > > [1] 1 > > > > or do the comparison on IRanges with > > > > > precede(ranges(gr1), ranges(gr2)) > > [1] 1 > > > > > > This does not work in general, because it discards the chromosome. > > GRanges is often treated as a flat RangesList/RangedData, with the > > strand set to "*". > > > > Jeremiah and I think it would be reasonable to treat a '*' range as a > > '+' range when comparing to another '*' range. That is, assume strand is > > unknown/irrelevant and just go by the starts and ends as if it were a > > Ranges object, except considering the sequence. > > Was wondering about use case details, e.g., are all query & subject '*'? > Well it turns out that in our use case, they weren't all supposed to be "*", but there are many cases where one does not have strand information. For example, one might have two sets of peaks, for different TF's, and want to find those where one precedes the other (and then see if they are upstream of a positive strand gene, then do the same with follow and negative strand genes). As was mentioned, one could coerce the strand to "+"/"-" in that case to achieve the desired effect, but it's not clear why that is necessary and whether it even makes sense (peaks do not have a direction). One thing that often confuses me with GenomicRanges is the dual meaning of strand. It seems to indicate both a direction and a coordinate. It makes sense to me to have resize() and precede() consider strand as a direction. But findOverlaps() treats strands as coordinates such that items on different strands do not overlap. This makes less sense. Another good example is gaps() which will yield sequence-long ranges on the strands not represented in the original object. This is very often undesirable. But of course my perspective is limited. Michael Martin > > > Obviously, one could often just explicitly coerce the strand to '+', but > > the user is not going to expect that such a coercion is required. > > Especially people like us who are too lazy to figure out how to find the > > documentation on an S4 method. > > > > Michael > > > > nearest will be implemented. > > > > Martin > > > > > Jeremiah > > > > > >> gr1 <- GRanges("chr1", IRanges(1,10)) > > >> gr2<- GRanges("chr1", IRanges(20,30)) > > >> precede(gr2, gr1) > > > [1] NA > > >> > > > > > >> sessionInfo() > > > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > > > attached base packages: > > > [1] grid stats graphics grDevices utils datasets > > methods > > > [8] base > > > > > > other attached packages: > > > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 > > > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 > > > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 > > > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 > > > [13] IRanges_1.9.16 multicore_0.1-3 > > > > > > loaded via a namespace (and not attached): > > > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > > Computational Biology > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > > > Location: M1-B861 > > Telephone: 206 667-2793 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > [[alternative HTML version deleted]]

ADD REPLY • link 13.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, On 01/05/2011 11:30 AM, Michael Lawrence wrote: > On Wed, Jan 5, 2011 at 10:44 AM, Martin Morgan<mtmorgan at="" fhcrc.org=""> wrote: > >> On 01/04/2011 01:15 PM, Michael Lawrence wrote: >>> >>> >>> On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan<mtmorgan at="" fhcrc.org="">>> <mailto:mtmorgan at="" fhcrc.org="">> wrote: >>> >>> On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: >>> > Hello Bioconductor, >>> > >>> > I am trying to use the function precede() with two GRanges objects >>> to find >>> > the closest element from gr1 to an element in gr2. precede() shows >>> that it >>> > should work with a GRanges object, however, when I try to use it >>> it returns >>> > only a vector of "NA". >>> > >>> > Pasted below is a simplified bit of code to reproduce the error and >> my >>> > session info. >>> > >>> > Additionally, it would be even nicer if the function nearest() were >>> > implemented for a GRanges object. >>> > >>> > Thanks in advance for the help >>> >>> Hi Jeremiah -- >>> >>> Provide strand information for at least one of the reads, e.g., >>> >>> > gr1<- GRanges("chr1", IRanges(1,10), "-") >>> > gr2<- GRanges("chr1", IRanges(20,30)) >>> > precede(gr2, gr1) >>> [1] 1 >>> >>> or do the comparison on IRanges with >>> >>> > precede(ranges(gr1), ranges(gr2)) >>> [1] 1 >>> >>> >>> This does not work in general, because it discards the chromosome. >>> GRanges is often treated as a flat RangesList/RangedData, with the >>> strand set to "*". >>> >>> Jeremiah and I think it would be reasonable to treat a '*' range as a >>> '+' range when comparing to another '*' range. That is, assume strand is >>> unknown/irrelevant and just go by the starts and ends as if it were a >>> Ranges object, except considering the sequence. >> >> Was wondering about use case details, e.g., are all query& subject '*'? >> > > Well it turns out that in our use case, they weren't all supposed to be "*", > but there are many cases where one does not have strand information. For > example, one might have two sets of peaks, for different TF's, and want to > find those where one precedes the other (and then see if they are upstream > of a positive strand gene, then do the same with follow and negative strand > genes). As was mentioned, one could coerce the strand to "+"/"-" in that > case to achieve the desired effect, but it's not clear why that is necessary > and whether it even makes sense (peaks do not have a direction). How general is the "peaks do not have a direction" statement? In my experience (RNA-seq experiments), peaks are "stranded features" i.e. they belong to a particular strand. > > One thing that often confuses me with GenomicRanges is the dual meaning of > strand. It seems to indicate both a direction and a coordinate. It makes > sense to me to have resize() and precede() consider strand as a direction. > But findOverlaps() treats strands as coordinates such that items on > different strands do not overlap. This makes less sense. Why? IMO a positive read (i.e. a read that was aligned to the plus strand) shouldn't hit features (genes/transcripts/exons) that are located on the minus strand. If for whatever reason this is not what the user wants, then s/he can always set the strand of the query and subject to '*' but I wouldn't say this is the typical usecase. > Another good > example is gaps() which will yield sequence-long ranges on the strands not > represented in the original object. This is very often undesirable. But of > course my perspective is limited. I agree that the behaviour of gaps() on GRanges is awkward. You not only get those sequence-long ranges on the strands not represented in the original object but you also get them for sequences not represented at all (as long as they are listed in levels(seqnames(x))). But if gaps() wasn't doing this, gaps() wouldn't be reversible anymore (i.e. 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an important property for gaps(). In particular, gaps(x) would give the same result whether 'x' as a sequence-long range on one strand or not (very unlikely to happen with real data though). Maybe an extra arg to gaps() would help control this? Suggestions are welcome. I'll make the change. Thanks! H. > > Michael > > Martin >> >>> Obviously, one could often just explicitly coerce the strand to '+', but >>> the user is not going to expect that such a coercion is required. >>> Especially people like us who are too lazy to figure out how to find the >>> documentation on an S4 method. >>> >>> Michael >>> >>> nearest will be implemented. >>> >>> Martin >>> >>> > Jeremiah >>> > >>> >> gr1<- GRanges("chr1", IRanges(1,10)) >>> >> gr2<- GRanges("chr1", IRanges(20,30)) >>> >> precede(gr2, gr1) >>> > [1] NA >>> >> >>> > >>> >> sessionInfo() >>> > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) >>> > Platform: x86_64-unknown-linux-gnu (64-bit) >>> > >>> > locale: >>> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>> > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> > [9] LC_ADDRESS=C LC_TELEPHONE=C >>> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> > >>> > attached base packages: >>> > [1] grid stats graphics grDevices utils datasets >>> methods >>> > [8] base >>> > >>> > other attached packages: >>> > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 >>> > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 >>> > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 >>> > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 >>> > [13] IRanges_1.9.16 multicore_0.1-3 >>> > >>> > loaded via a namespace (and not attached): >>> > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 >>> > >>> > [[alternative HTML version deleted]] >>> > >>> > _______________________________________________ >>> > Bioconductor mailing list >>> > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> -- >>> Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 >>> Telephone: 206 667-2793 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> -- >> Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> >> Location: M1-B861 >> Telephone: 206 667-2793 >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.3 years ago Hervé Pagès 16k

0

Entering edit mode

2011/1/11 Hervé Pagès <hpages@fhcrc.org> > Hi Michael, > > > On 01/05/2011 11:30 AM, Michael Lawrence wrote: > >> On Wed, Jan 5, 2011 at 10:44 AM, Martin Morgan<mtmorgan@fhcrc.org> >> wrote: >> >> On 01/04/2011 01:15 PM, Michael Lawrence wrote: >>> >>>> >>>> >>>> On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan<mtmorgan@fhcrc.org>>>> <mailto:mtmorgan@fhcrc.org>> wrote: >>>> >>>> On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote: >>>> > Hello Bioconductor, >>>> > >>>> > I am trying to use the function precede() with two GRanges >>>> objects >>>> to find >>>> > the closest element from gr1 to an element in gr2. precede() >>>> shows >>>> that it >>>> > should work with a GRanges object, however, when I try to use it >>>> it returns >>>> > only a vector of "NA". >>>> > >>>> > Pasted below is a simplified bit of code to reproduce the error >>>> and >>>> >>> my >>> >>>> > session info. >>>> > >>>> > Additionally, it would be even nicer if the function nearest() >>>> were >>>> > implemented for a GRanges object. >>>> > >>>> > Thanks in advance for the help >>>> >>>> Hi Jeremiah -- >>>> >>>> Provide strand information for at least one of the reads, e.g., >>>> >>>> > gr1<- GRanges("chr1", IRanges(1,10), "-") >>>> > gr2<- GRanges("chr1", IRanges(20,30)) >>>> > precede(gr2, gr1) >>>> [1] 1 >>>> >>>> or do the comparison on IRanges with >>>> >>>> > precede(ranges(gr1), ranges(gr2)) >>>> [1] 1 >>>> >>>> >>>> This does not work in general, because it discards the chromosome. >>>> GRanges is often treated as a flat RangesList/RangedData, with the >>>> strand set to "*". >>>> >>>> Jeremiah and I think it would be reasonable to treat a '*' range as a >>>> '+' range when comparing to another '*' range. That is, assume strand is >>>> unknown/irrelevant and just go by the starts and ends as if it were a >>>> Ranges object, except considering the sequence. >>>> >>> >>> Was wondering about use case details, e.g., are all query& subject '*'? >>> >>> >> Well it turns out that in our use case, they weren't all supposed to be >> "*", >> but there are many cases where one does not have strand information. For >> example, one might have two sets of peaks, for different TF's, and want to >> find those where one precedes the other (and then see if they are upstream >> of a positive strand gene, then do the same with follow and negative >> strand >> genes). As was mentioned, one could coerce the strand to "+"/"-" in that >> case to achieve the desired effect, but it's not clear why that is >> necessary >> and whether it even makes sense (peaks do not have a direction). >> > > How general is the "peaks do not have a direction" statement? > In my experience (RNA-seq experiments), peaks are "stranded features" > i.e. they belong to a particular strand. > > RNA-seq typically does not deal with the concepts of peaks... > >> One thing that often confuses me with GenomicRanges is the dual meaning of >> strand. It seems to indicate both a direction and a coordinate. It makes >> sense to me to have resize() and precede() consider strand as a direction. >> But findOverlaps() treats strands as coordinates such that items on >> different strands do not overlap. This makes less sense. >> > > Why? IMO a positive read (i.e. a read that was aligned to the plus > strand) shouldn't hit features (genes/transcripts/exons) that are > located on the minus strand. If for whatever reason this is not what > the user wants, then s/he can always set the strand of the query and > subject to '*' but I wouldn't say this is the typical usecase. > > In my experience, it is very typical. Although strand-specific protocols exist, they are far from the norm. Thus, one usually has reads from both strands, even in RNA-seq. Sure, it's always possible to coerce the strand, but that is not convenient, because it discards the strand information, unless a copy is made. It would be nice to have an option for some of these functions to specify whether strand is considered a separate coordinate/space (defaulting to TRUE). I realize that GRanges was initially motivated by the desire to perform calculations separately by strand, but GRanges is now the de facto data structure for all genomic interval manipulations. > Another good >> example is gaps() which will yield sequence-long ranges on the strands not >> represented in the original object. This is very often undesirable. But of >> course my perspective is limited. >> > > I agree that the behaviour of gaps() on GRanges is awkward. You not only > get those sequence-long ranges on the strands not represented in the > original object but you also get them for sequences not represented at > all (as long as they are listed in levels(seqnames(x))). But if gaps() > wasn't doing this, gaps() wouldn't be reversible anymore (i.e. > 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an > important property for gaps(). In particular, gaps(x) would give > the same result whether 'x' as a sequence-long range on one strand > or not (very unlikely to happen with real data though). > > Maybe an extra arg to gaps() would help control this? Suggestions are > welcome. I'll make the change. Thanks! > > > H. > > >> Michael >> >> Martin >> >>> >>> Obviously, one could often just explicitly coerce the strand to '+', but >>>> the user is not going to expect that such a coercion is required. >>>> Especially people like us who are too lazy to figure out how to find the >>>> documentation on an S4 method. >>>> >>>> Michael >>>> >>>> nearest will be implemented. >>>> >>>> Martin >>>> >>>> > Jeremiah >>>> > >>>> >> gr1<- GRanges("chr1", IRanges(1,10)) >>>> >> gr2<- GRanges("chr1", IRanges(20,30)) >>>> >> precede(gr2, gr1) >>>> > [1] NA >>>> >> >>>> > >>>> >> sessionInfo() >>>> > R version 2.13.0 Under development (unstable) (2010-12-07 r53809) >>>> > Platform: x86_64-unknown-linux-gnu (64-bit) >>>> > >>>> > locale: >>>> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>> > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>> > [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> > >>>> > attached base packages: >>>> > [1] grid stats graphics grDevices utils datasets >>>> methods >>>> > [8] base >>>> > >>>> > other attached packages: >>>> > [1] motifRG_0.0.2 rtracklayer_1.11.8 RCurl_1.5-0 >>>> > [4] bitops_1.0-4.1 seqLogo_1.17.0 chipseq_1.1.2 >>>> > [7] ShortRead_1.9.8 Rsamtools_1.3.11 lattice_0.18-3 >>>> > [10] BSgenome_1.19.2 Biostrings_2.19.2 GenomicRanges_1.3.5 >>>> > [13] IRanges_1.9.16 multicore_0.1-3 >>>> > >>>> > loaded via a namespace (and not attached): >>>> > [1] Biobase_2.11.6 hwriter_1.3 tools_2.13.0 XML_3.2-0 >>>> > >>>> > [[alternative HTML version deleted]] >>>> > >>>> > _______________________________________________ >>>> > Bioconductor mailing list >>>> > Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> > Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> -- >>>> Computational Biology >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: M1-B861 >>>> Telephone: 206 667-2793 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>> >>> -- >>> Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 >>> Telephone: 206 667-2793 >>> >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, I'm not Michael, but: > How general is the "peaks do not have a direction" statement? > In my experience (RNA-seq experiments), peaks are "stranded features" > i.e. they belong to a particular strand. I'm actually a bit surprised by that. It was my impression that most of the rna-seq data to date have been done with a protocol that doesn't honor the strand of the reads. Recently, I've come across several protocols where strandedness is maintained and I'm sure that these will be increasingly the norm as time goes on. I've been working on/with different variants of some (rna) tag-sequencing protocols, and these do typically keep strand info, but I think the "old" Illumina-driven transcriptome-wide rna-seq data that "most" people think of as RNA-seq is unstranded. Still. I like that GenomicRanges can honor strandedness in overlap/etc. functions when the strand is set. >> One thing that often confuses me with GenomicRanges is the dual meaning of >> strand. It seems to indicate both a direction and a coordinate. It makes >> sense to me to have resize() and precede() consider strand as a direction. >> But findOverlaps() treats strands as coordinates such that items on >> different strands do not overlap. This makes less sense. > > Why? IMO a positive read (i.e. a read that was aligned to the plus > strand) shouldn't hit features (genes/transcripts/exons) that are > located on the minus strand. If for whatever reason this is not what > the user wants, then s/he can always set the strand of the query and > subject to '*' but I wouldn't say this is the typical usecase. > >> Another good >> example is gaps() which will yield sequence-long ranges on the strands not >> represented in the original object. This is very often undesirable. But of >> course my perspective is limited. > > I agree that the behaviour of gaps() on GRanges is awkward. You not only > get those sequence-long ranges on the strands not represented in the > original object but you also get them for sequences not represented at > all (as long as they are listed in levels(seqnames(x))). But if gaps() > wasn't doing this, gaps() wouldn't be reversible anymore (i.e. > 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an > important property for gaps(). In particular, gaps(x) would give > the same result whether 'x' as a sequence-long range on one strand > or not (very unlikely to happen with real data though). > > Maybe an extra arg to gaps() would help control this? Suggestions are > welcome. I'll make the change. Thanks! I'd be in favor of adding a param to GRanges::gaps() that would allow me to explicitly turn off this "feature" ;-) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, On 01/11/2011 07:20 PM, Steve Lianoglou wrote: > Hi, > > I'm not Michael, but: > >> How general is the "peaks do not have a direction" statement? >> In my experience (RNA-seq experiments), peaks are "stranded features" >> i.e. they belong to a particular strand. > > I'm actually a bit surprised by that. It was my impression that most > of the rna-seq data to date have been done with a protocol that > doesn't honor the strand of the reads. Just to clarify, I was not talking about protocols that honor or not the original strand of the reads. I was talking about the strand to which each read aligns. This information is stored in the SAM/BAM file and propagates to the GRanges object you get when loading this file (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. The strand information is here whatever type of *-seq experiment it is (RNA-seq, CHIP-seq, ...) > > Recently, I've come across several protocols where strandedness is > maintained and I'm sure that these will be increasingly the norm as > time goes on. > > I've been working on/with different variants of some (rna) > tag-sequencing protocols, and these do typically keep strand info, but > I think the "old" Illumina-driven transcriptome-wide rna-seq data that > "most" people think of as RNA-seq is unstranded. > > Still. I like that GenomicRanges can honor strandedness in > overlap/etc. functions when the strand is set. > >>> One thing that often confuses me with GenomicRanges is the dual meaning of >>> strand. It seems to indicate both a direction and a coordinate. It makes >>> sense to me to have resize() and precede() consider strand as a direction. >>> But findOverlaps() treats strands as coordinates such that items on >>> different strands do not overlap. This makes less sense. >> >> Why? IMO a positive read (i.e. a read that was aligned to the plus >> strand) shouldn't hit features (genes/transcripts/exons) that are >> located on the minus strand. If for whatever reason this is not what >> the user wants, then s/he can always set the strand of the query and >> subject to '*' but I wouldn't say this is the typical usecase. >> >>> Another good >>> example is gaps() which will yield sequence-long ranges on the strands not >>> represented in the original object. This is very often undesirable. But of >>> course my perspective is limited. >> >> I agree that the behaviour of gaps() on GRanges is awkward. You not only >> get those sequence-long ranges on the strands not represented in the >> original object but you also get them for sequences not represented at >> all (as long as they are listed in levels(seqnames(x))). But if gaps() >> wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an >> important property for gaps(). In particular, gaps(x) would give >> the same result whether 'x' as a sequence-long range on one strand >> or not (very unlikely to happen with real data though). >> >> Maybe an extra arg to gaps() would help control this? Suggestions are >> welcome. I'll make the change. Thanks! > > I'd be in favor of adding a param to GRanges::gaps() that would allow > me to explicitly turn off this "feature" ;-) OK, will see what I can do. Thanks for your feedback! H. > > -steve > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.3 years ago Hervé Pagès 16k

0

Entering edit mode

2011/1/12 Hervé Pagès <hpages@fhcrc.org> > Hi Steve, > > > On 01/11/2011 07:20 PM, Steve Lianoglou wrote: > >> Hi, >> >> I'm not Michael, but: >> >> How general is the "peaks do not have a direction" statement? >>> In my experience (RNA-seq experiments), peaks are "stranded features" >>> i.e. they belong to a particular strand. >>> >> >> I'm actually a bit surprised by that. It was my impression that most >> of the rna-seq data to date have been done with a protocol that >> doesn't honor the strand of the reads. >> > > Just to clarify, I was not talking about protocols that honor or not > the original strand of the reads. I was talking about the strand to > which each read aligns. This information is stored in the SAM/BAM file > and propagates to the GRanges object you get when loading this file > (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. > The strand information is here whatever type of *-seq experiment it > is (RNA-seq, CHIP-seq, ...) > > Right, but if you're getting reads from both strands of a transcript, you certainly don't want to ignore the reads coming from the strand opposite transcription. Steve is right about stranded protocols becoming more popular, but in many cases their advantages might not outweigh the increased cost/complexity. Time will tell, I guess. Anyway, let's not make GenomicRanges the RNA-seq package. Suffice it to say that we deal RNA- seq, ChIP-seq, other DNA-seqs and up till now we always need to remove strand before looking for overlaps. > > >> Recently, I've come across several protocols where strandedness is >> maintained and I'm sure that these will be increasingly the norm as >> time goes on. >> >> I've been working on/with different variants of some (rna) >> tag-sequencing protocols, and these do typically keep strand info, but >> I think the "old" Illumina-driven transcriptome-wide rna-seq data that >> "most" people think of as RNA-seq is unstranded. >> >> Still. I like that GenomicRanges can honor strandedness in >> overlap/etc. functions when the strand is set. >> >> One thing that often confuses me with GenomicRanges is the dual meaning >>>> of >>>> strand. It seems to indicate both a direction and a coordinate. It makes >>>> sense to me to have resize() and precede() consider strand as a >>>> direction. >>>> But findOverlaps() treats strands as coordinates such that items on >>>> different strands do not overlap. This makes less sense. >>>> >>> >>> Why? IMO a positive read (i.e. a read that was aligned to the plus >>> strand) shouldn't hit features (genes/transcripts/exons) that are >>> located on the minus strand. If for whatever reason this is not what >>> the user wants, then s/he can always set the strand of the query and >>> subject to '*' but I wouldn't say this is the typical usecase. >>> >>> Another good >>>> example is gaps() which will yield sequence-long ranges on the strands >>>> not >>>> represented in the original object. This is very often undesirable. But >>>> of >>>> course my perspective is limited. >>>> >>> >>> I agree that the behaviour of gaps() on GRanges is awkward. You not only >>> get those sequence-long ranges on the strands not represented in the >>> original object but you also get them for sequences not represented at >>> all (as long as they are listed in levels(seqnames(x))). But if gaps() >>> wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >>> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an >>> important property for gaps(). In particular, gaps(x) would give >>> the same result whether 'x' as a sequence-long range on one strand >>> or not (very unlikely to happen with real data though). >>> >>> Maybe an extra arg to gaps() would help control this? Suggestions are >>> welcome. I'll make the change. Thanks! >>> >> >> I'd be in favor of adding a param to GRanges::gaps() that would allow >> me to explicitly turn off this "feature" ;-) >> > > OK, will see what I can do. Thanks for your feedback! > > H. > > >> -steve >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, 2011/1/12 Michael Lawrence <lawrence.michael at="" gene.com="">: [snip] > Time will tell, I guess. Anyway, let's not make > GenomicRanges the RNA-seq package. Yeah, let's not do that, but let's also keep the strand-awaredness. Not that I think you're arguing to ignore it, I just wanted to let it be known that I like it's ability to do so. I didn't see your email last night before sending off the one I sent, and I think your suggestion of including a param to the 'strand-aware' function that flip their ability to honor/ignore strand (default to honoring strand) is a great idea. > Suffice it to say that we deal RNA-seq, > ChIP-seq, other DNA-seqs and up till now we always need to remove strand > before looking for overlaps. I typically deal with this right from the get-go as I read data from my BAM files. If I know the data is coming from a non-stranded protocol, I just set the strand to '*' as I'm building up my GRanges objects from the scanBam results ... -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Michael, On 01/12/2011 05:45 AM, Michael Lawrence wrote: > > > 2011/1/12 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Steve, > > > On 01/11/2011 07:20 PM, Steve Lianoglou wrote: > > Hi, > > I'm not Michael, but: > > How general is the "peaks do not have a direction" statement? > In my experience (RNA-seq experiments), peaks are "stranded > features" > i.e. they belong to a particular strand. > > > I'm actually a bit surprised by that. It was my impression that most > of the rna-seq data to date have been done with a protocol that > doesn't honor the strand of the reads. > > > Just to clarify, I was not talking about protocols that honor or not > the original strand of the reads. I was talking about the strand to > which each read aligns. This information is stored in the SAM/BAM file > and propagates to the GRanges object you get when loading this file > (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. > The strand information is here whatever type of *-seq experiment it > is (RNA-seq, CHIP-seq, ...) > > > Right, but if you're getting reads from both strands of a transcript, > you certainly don't want to ignore the reads coming from the strand > opposite transcription. Point taken. I somehow overlooked the fact that with a non-stranded RNA-seq protocol, whatever strand a transcript is coming from, it will generate reads that align to both strands. Thanks for the reminder! > Steve is right about stranded protocols becoming > more popular, but in many cases their advantages might not outweigh the > increased cost/complexity. Time will tell, I guess. Anyway, let's not > make GenomicRanges the RNA-seq package. Suffice it to say that we deal > RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to > remove strand before looking for overlaps. Why only before looking for overlaps? I mean, if the strand is meaningless, why not just remove the strand from the very beginning i.e. when you make your GRanges object from your SAM/BAM file (like Steve does)? So you only need to do strand(gr) <- "*" once. Isn't that easier/safer than having to remember to set an additional 'ignore.strand' parameter to FALSE every time you need to call a strand-aware function like findOverlaps? Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and probably to other strand-aware GRanges methods if we start to go that route) is not a big deal (just 2 lines of code in each method), but still, that means adding yet another argument (findOverlaps already has 9), and documenting it for all these methods. I'm just wondering if it's worth it... H. > > > > Recently, I've come across several protocols where strandedness is > maintained and I'm sure that these will be increasingly the norm as > time goes on. > > I've been working on/with different variants of some (rna) > tag-sequencing protocols, and these do typically keep strand > info, but > I think the "old" Illumina-driven transcriptome-wide rna-seq > data that > "most" people think of as RNA-seq is unstranded. > > Still. I like that GenomicRanges can honor strandedness in > overlap/etc. functions when the strand is set. > > One thing that often confuses me with GenomicRanges is > the dual meaning of > strand. It seems to indicate both a direction and a > coordinate. It makes > sense to me to have resize() and precede() consider > strand as a direction. > But findOverlaps() treats strands as coordinates such > that items on > different strands do not overlap. This makes less sense. > > > Why? IMO a positive read (i.e. a read that was aligned to > the plus > strand) shouldn't hit features (genes/transcripts/exons) > that are > located on the minus strand. If for whatever reason this is > not what > the user wants, then s/he can always set the strand of the > query and > subject to '*' but I wouldn't say this is the typical usecase. > > Another good > example is gaps() which will yield sequence-long ranges > on the strands not > represented in the original object. This is very often > undesirable. But of > course my perspective is limited. > > > I agree that the behaviour of gaps() on GRanges is awkward. > You not only > get those sequence-long ranges on the strands not > represented in the > original object but you also get them for sequences not > represented at > all (as long as they are listed in levels(seqnames(x))). But > if gaps() > wasn't doing this, gaps() wouldn't be reversible anymore (i.e. > 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind > of an > important property for gaps(). In particular, gaps(x) would give > the same result whether 'x' as a sequence-long range on one > strand > or not (very unlikely to happen with real data though). > > Maybe an extra arg to gaps() would help control this? > Suggestions are > welcome. I'll make the change. Thanks! > > > I'd be in favor of adding a param to GRanges::gaps() that would > allow > me to explicitly turn off this "feature" ;-) > > > OK, will see what I can do. Thanks for your feedback! > > H. > > > -steve > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.3 years ago Hervé Pagès 16k

0

Entering edit mode

2011/1/12 Hervé Pagès <hpages@fhcrc.org> > Hi Michael, > > On 01/12/2011 05:45 AM, Michael Lawrence wrote: > >> >> >> 2011/1/12 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Steve, >> >> >> On 01/11/2011 07:20 PM, Steve Lianoglou wrote: >> >> Hi, >> >> I'm not Michael, but: >> >> How general is the "peaks do not have a direction" statement? >> In my experience (RNA-seq experiments), peaks are "stranded >> features" >> i.e. they belong to a particular strand. >> >> >> I'm actually a bit surprised by that. It was my impression that >> most >> of the rna-seq data to date have been done with a protocol that >> doesn't honor the strand of the reads. >> >> >> Just to clarify, I was not talking about protocols that honor or not >> the original strand of the reads. I was talking about the strand to >> which each read aligns. This information is stored in the SAM/BAM file >> and propagates to the GRanges object you get when loading this file >> (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. >> The strand information is here whatever type of *-seq experiment it >> is (RNA-seq, CHIP-seq, ...) >> >> >> Right, but if you're getting reads from both strands of a transcript, >> you certainly don't want to ignore the reads coming from the strand >> opposite transcription. >> > > Point taken. I somehow overlooked the fact that with a non-stranded > RNA-seq protocol, whatever strand a transcript is coming from, it will > generate reads that align to both strands. Thanks for the reminder! > > > Steve is right about stranded protocols becoming >> more popular, but in many cases their advantages might not outweigh the >> increased cost/complexity. Time will tell, I guess. Anyway, let's not >> make GenomicRanges the RNA-seq package. Suffice it to say that we deal >> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to >> remove strand before looking for overlaps. >> > > Why only before looking for overlaps? I mean, if the strand is > meaningless, why not just remove the strand from the very beginning > i.e. when you make your GRanges object from your SAM/BAM file (like > Steve does)? So you only need to do strand(gr) <- "*" once. Isn't > that easier/safer than having to remember to set an additional > 'ignore.strand' parameter to FALSE every time you need to call a > strand-aware function like findOverlaps? > > For RNA-seq, that's reasonable. There are cases where the strand of the reads is important. Fragment length estimation algorithms, mostly designed for enrichment experiments, use the strand (see chipseq estimate.mean.fraglen). For ChIP-seq, we detect peaks before doing any overlaps, but for MEDIP-seq, we deal with reads. Anyway, the point is that there are a variety of use cases. I'll admit that an extra parameter is not entirely justified at this point. Just brought this up for discussion. Thanks, Michael > Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and > probably to other strand-aware GRanges methods if we start to go that > route) is not a big deal (just 2 lines of code in each method), but > still, that means adding yet another argument (findOverlaps already > has 9), and documenting it for all these methods. I'm just wondering > if it's worth it... > > H. > > >> >> >> Recently, I've come across several protocols where strandedness is >> maintained and I'm sure that these will be increasingly the norm as >> time goes on. >> >> I've been working on/with different variants of some (rna) >> tag-sequencing protocols, and these do typically keep strand >> info, but >> I think the "old" Illumina-driven transcriptome-wide rna-seq >> data that >> "most" people think of as RNA-seq is unstranded. >> >> Still. I like that GenomicRanges can honor strandedness in >> overlap/etc. functions when the strand is set. >> >> One thing that often confuses me with GenomicRanges is >> the dual meaning of >> strand. It seems to indicate both a direction and a >> coordinate. It makes >> sense to me to have resize() and precede() consider >> strand as a direction. >> But findOverlaps() treats strands as coordinates such >> that items on >> different strands do not overlap. This makes less sense. >> >> >> Why? IMO a positive read (i.e. a read that was aligned to >> the plus >> strand) shouldn't hit features (genes/transcripts/exons) >> that are >> located on the minus strand. If for whatever reason this is >> not what >> the user wants, then s/he can always set the strand of the >> query and >> subject to '*' but I wouldn't say this is the typical usecase. >> >> Another good >> example is gaps() which will yield sequence-long ranges >> on the strands not >> represented in the original object. This is very often >> undesirable. But of >> course my perspective is limited. >> >> >> I agree that the behaviour of gaps() on GRanges is awkward. >> You not only >> get those sequence-long ranges on the strands not >> represented in the >> original object but you also get them for sequences not >> represented at >> all (as long as they are listed in levels(seqnames(x))). But >> if gaps() >> wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind >> of an >> important property for gaps(). In particular, gaps(x) would >> give >> the same result whether 'x' as a sequence-long range on one >> strand >> or not (very unlikely to happen with real data though). >> >> Maybe an extra arg to gaps() would help control this? >> Suggestions are >> welcome. I'll make the change. Thanks! >> >> >> I'd be in favor of adding a param to GRanges::gaps() that would >> allow >> me to explicitly turn off this "feature" ;-) >> >> >> OK, will see what I can do. Thanks for your feedback! >> >> H. >> >> >> -steve >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I seem to be a bit late in joining the discussion (been away), but I want to lend my support to Michael's suggestion (which I have been advocating in older emails). If we differentiate between the 'true' signal and what comes out of a specific assay, it is clear that true signals can be stranded (say, a gene) or unstranded (nucleosome positions, histone modifications) and it is also possible to have an unstranded assay for a stranded true signal (the standard Illumina RNA-Seq prep). On top of this, mapped reads are always stranded (at least, I don't know of an unstranded mapper) which may or may represent the signal. So doing comparisons between stranded and unstranded data/annotation is a fundamental issue. Now, it has been suggested in this thread and in the past that one could just change the strand of the objects we compare (and then do the comparison). However, that is a destructive modification and will almost alway mean that we will have two objects lying around (the original one and the strand-modified one). I don't like that both from a memory and a clutter perspective. For example, while I could set my RNA-Seq reads to all have strand *, this precludes me from obtaining the original nucleotide sequence for the read if all I have is chr, start, end and no strand. So now I need two sets of RNA-Seq reads.... Furthermore, I often have quite a few objects lying around representing different kinds of signals and data and some of these are strand specific, others are not. So this really becomes an issue when you deal with an integrative analysis. While it introduces an additional argument to many functions dealing with GRanges, I find that solution to be better, from a user perspective (it is clearly more work for a developer). Furthermore, someone (and I am happy to do so), should explicitly write down how strand comparisons ought to be interpreted in a section in the vignette so we all have it for future reference. This includes how to treat objects that have a mix of *, + and -. An explicit reference will help us all make sure that all functions use the same underlying logic (comments in this thread seems to indicate that this is not the case). Even if an additional ignore.strand argument is not implemented, it will be a nice section for future reference and bug hunting. I believe it is important to be very precise about this. In Genominator, we do have a ignoreStrand argument to all relevant functions and we also interpret a strand of * as "present on both strands" so that a 1-4,* overlaps 1-4,+. (It is my understanding that right now, GenomicRanges essentially treat * as a separate strand). So my comments above about preferring an ignoreStrand argument from a user perspective is based on actual experience. Kasper 2011/1/12 Michael Lawrence <lawrence.michael at="" gene.com="">: > 2011/1/12 Hervé Pagès <hpages at="" fhcrc.org=""> > >> Hi Michael, >> >> On 01/12/2011 05:45 AM, Michael Lawrence wrote: >> >>> >>> >>> 2011/1/12 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> >>> >>> >>> ? ?Hi Steve, >>> >>> >>> ? ?On 01/11/2011 07:20 PM, Steve Lianoglou wrote: >>> >>> ? ? ? ?Hi, >>> >>> ? ? ? ?I'm not Michael, but: >>> >>> ? ? ? ? ? ?How general is the "peaks do not have a direction" statement? >>> ? ? ? ? ? ?In my experience (RNA-seq experiments), peaks are "stranded >>> ? ? ? ? ? ?features" >>> ? ? ? ? ? ?i.e. they belong to a particular strand. >>> >>> >>> ? ? ? ?I'm actually a bit surprised by that. It was my impression that >>> most >>> ? ? ? ?of the rna-seq data to date have been done with a protocol that >>> ? ? ? ?doesn't honor the strand of the reads. >>> >>> >>> ? ?Just to clarify, I was not talking about protocols that honor or not >>> ? ?the original strand of the reads. I was talking about the strand to >>> ? ?which each read aligns. This information is stored in the SAM/BAM file >>> ? ?and propagates to the GRanges object you get when loading this file >>> ? ?(with something like 'as(read.GappedAlignments(myfile), "GRanges")'. >>> ? ?The strand information is here whatever type of *-seq experiment it >>> ? ?is (RNA-seq, CHIP-seq, ...) >>> >>> >>> Right, but if you're getting reads from both strands of a transcript, >>> you certainly don't want to ignore the reads coming from the strand >>> opposite transcription. >>> >> >> Point taken. I somehow overlooked the fact that with a non-stranded >> RNA-seq protocol, whatever strand a transcript is coming from, it will >> generate reads that align to both strands. Thanks for the reminder! >> >> >> ?Steve is right about stranded protocols becoming >>> more popular, but in many cases their advantages might not outweigh the >>> increased cost/complexity. Time will tell, I guess. Anyway, let's not >>> make GenomicRanges the RNA-seq package. Suffice it to say that we deal >>> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to >>> remove strand before looking for overlaps. >>> >> >> Why only before looking for overlaps? I mean, if the strand is >> meaningless, why not just remove the strand from the very beginning >> i.e. when you make your GRanges object from your SAM/BAM file (like >> Steve does)? So you only need to do strand(gr) <- "*" once. Isn't >> that easier/safer than having to remember to set an additional >> 'ignore.strand' parameter to FALSE every time you need to call a >> strand-aware function like findOverlaps? >> >> > For RNA-seq, that's reasonable. There are cases where the strand of the > reads is important. Fragment length estimation algorithms, mostly designed > for enrichment experiments, use the strand (see chipseq > estimate.mean.fraglen). For ChIP-seq, we detect peaks before doing any > overlaps, but for MEDIP-seq, we deal with reads. Anyway, the point is that > there are a variety of use cases. I'll admit that an extra parameter is not > entirely justified at this point. Just brought this up for discussion. > > Thanks, > Michael > > > >> Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and >> probably to other strand-aware GRanges methods if we start to go that >> route) is not a big deal (just 2 lines of code in each method), but >> still, that means adding yet another argument (findOverlaps already >> has 9), and documenting it for all these methods. I'm just wondering >> if it's worth it... >> >> H. >> >> >>> >>> >>> ? ? ? ?Recently, I've come across several protocols where strandedness is >>> ? ? ? ?maintained and I'm sure that these will be increasingly the norm as >>> ? ? ? ?time goes on. >>> >>> ? ? ? ?I've been working on/with different variants of some (rna) >>> ? ? ? ?tag-sequencing protocols, and these do typically keep strand >>> ? ? ? ?info, but >>> ? ? ? ?I think the "old" Illumina-driven transcriptome-wide rna- seq >>> ? ? ? ?data that >>> ? ? ? ?"most" people think of as RNA-seq is unstranded. >>> >>> ? ? ? ?Still. I like that GenomicRanges can honor strandedness in >>> ? ? ? ?overlap/etc. functions when the strand is set. >>> >>> ? ? ? ? ? ? ? ?One thing that often confuses me with GenomicRanges is >>> ? ? ? ? ? ? ? ?the dual meaning of >>> ? ? ? ? ? ? ? ?strand. It seems to indicate both a direction and a >>> ? ? ? ? ? ? ? ?coordinate. It makes >>> ? ? ? ? ? ? ? ?sense to me to have resize() and precede() consider >>> ? ? ? ? ? ? ? ?strand as a direction. >>> ? ? ? ? ? ? ? ?But findOverlaps() treats strands as coordinates such >>> ? ? ? ? ? ? ? ?that items on >>> ? ? ? ? ? ? ? ?different strands do not overlap. This makes less sense. >>> >>> >>> ? ? ? ? ? ?Why? IMO a positive read (i.e. a read that was aligned to >>> ? ? ? ? ? ?the plus >>> ? ? ? ? ? ?strand) shouldn't hit features (genes/transcripts/exons) >>> ? ? ? ? ? ?that are >>> ? ? ? ? ? ?located on the minus strand. If for whatever reason this is >>> ? ? ? ? ? ?not what >>> ? ? ? ? ? ?the user wants, then s/he can always set the strand of the >>> ? ? ? ? ? ?query and >>> ? ? ? ? ? ?subject to '*' but I wouldn't say this is the typical usecase. >>> >>> ? ? ? ? ? ? ? ?Another good >>> ? ? ? ? ? ? ? ?example is gaps() which will yield sequence-long ranges >>> ? ? ? ? ? ? ? ?on the strands not >>> ? ? ? ? ? ? ? ?represented in the original object. This is very often >>> ? ? ? ? ? ? ? ?undesirable. But of >>> ? ? ? ? ? ? ? ?course my perspective is limited. >>> >>> >>> ? ? ? ? ? ?I agree that the behaviour of gaps() on GRanges is awkward. >>> ? ? ? ? ? ?You not only >>> ? ? ? ? ? ?get those sequence-long ranges on the strands not >>> ? ? ? ? ? ?represented in the >>> ? ? ? ? ? ?original object but you also get them for sequences not >>> ? ? ? ? ? ?represented at >>> ? ? ? ? ? ?all (as long as they are listed in levels(seqnames(x))). But >>> ? ? ? ? ? ?if gaps() >>> ? ? ? ? ? ?wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >>> ? ? ? ? ? ?'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind >>> ? ? ? ? ? ?of an >>> ? ? ? ? ? ?important property for gaps(). In particular, gaps(x) would >>> give >>> ? ? ? ? ? ?the same result whether 'x' as a sequence-long range on one >>> ? ? ? ? ? ?strand >>> ? ? ? ? ? ?or not (very unlikely to happen with real data though). >>> >>> ? ? ? ? ? ?Maybe an extra arg to gaps() would help control this? >>> ? ? ? ? ? ?Suggestions are >>> ? ? ? ? ? ?welcome. I'll make the change. Thanks! >>> >>> >>> ? ? ? ?I'd be in favor of adding a param to GRanges::gaps() that would >>> ? ? ? ?allow >>> ? ? ? ?me to explicitly turn off this "feature" ;-) >>> >>> >>> ? ?OK, will see what I can do. Thanks for your feedback! >>> >>> ? ?H. >>> >>> >>> ? ? ? ?-steve >>> >>> >>> >>> ? ?-- >>> ? ?Hervé Pagès >>> >>> ? ?Program in Computational Biology >>> ? ?Division of Public Health Sciences >>> ? ?Fred Hutchinson Cancer Research Center >>> ? ?1100 Fairview Ave. N, M2-B876 >>> ? ?P.O. Box 19024 >>> ? ?Seattle, WA 98109-1024 >>> >>> ? ?E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >>> >>> ? ?Phone: ?(206) 667-5791 >>> ? ?Fax: ? ?(206) 667-1319 >>> >>> >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: ?(206) 667-5791 >> Fax: ? ?(206) 667-1319 >> > > ? ? ? ?[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 13.3 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thanks all for the input on this. We'll add a flag to the relevant functions. It'll happen before the next release. And Kasper if you want to initiate > Furthermore, someone (and I am happy to do so), should explicitly > write down how strand comparisons ought to be interpreted in a section > in the vignette so we all have it for future reference. This includes that would be a really valuable addition Martin On 01/13/2011 05:30 AM, Kasper Daniel Hansen wrote: > I seem to be a bit late in joining the discussion (been away), but I > want to lend my support to Michael's suggestion (which I have been > advocating in older emails). > > If we differentiate between the 'true' signal and what comes out of a > specific assay, it is clear that true signals can be stranded (say, a > gene) or unstranded (nucleosome positions, histone modifications) and > it is also possible to have an unstranded assay for a stranded true > signal (the standard Illumina RNA-Seq prep). On top of this, mapped > reads are always stranded (at least, I don't know of an unstranded > mapper) which may or may represent the signal. So doing comparisons > between stranded and unstranded data/annotation is a fundamental > issue. > > Now, it has been suggested in this thread and in the past that one > could just change the strand of the objects we compare (and then do > the comparison). However, that is a destructive modification and will > almost alway mean that we will have two objects lying around (the > original one and the strand-modified one). I don't like that both > from a memory and a clutter perspective. For example, while I could > set my RNA-Seq reads to all have strand *, this precludes me from > obtaining the original nucleotide sequence for the read if all I have > is chr, start, end and no strand. So now I need two sets of RNA-Seq > reads.... Furthermore, I often have quite a few objects lying around > representing different kinds of signals and data and some of these are > strand specific, others are not. So this really becomes an issue when > you deal with an integrative analysis. > > While it introduces an additional argument to many functions dealing > with GRanges, I find that solution to be better, from a user > perspective (it is clearly more work for a developer). > > Furthermore, someone (and I am happy to do so), should explicitly > write down how strand comparisons ought to be interpreted in a section > in the vignette so we all have it for future reference. This includes > how to treat objects that have a mix of *, + and -. An explicit > reference will help us all make sure that all functions use the same > underlying logic (comments in this thread seems to indicate that this > is not the case). Even if an additional ignore.strand argument is not > implemented, it will be a nice section for future reference and bug > hunting. I believe it is important to be very precise about this. > > In Genominator, we do have a ignoreStrand argument to all relevant > functions and we also interpret a strand of * as "present on both > strands" so that a 1-4,* overlaps 1-4,+. (It is my understanding that > right now, GenomicRanges essentially treat * as a separate strand). > So my comments above about preferring an ignoreStrand argument from a > user perspective is based on actual experience. > > Kasper > > 2011/1/12 Michael Lawrence <lawrence.michael at="" gene.com="">: >> 2011/1/12 Hervé Pagès <hpages at="" fhcrc.org=""> >> >>> Hi Michael, >>> >>> On 01/12/2011 05:45 AM, Michael Lawrence wrote: >>> >>>> >>>> >>>> 2011/1/12 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> >>>> >>>> >>>> Hi Steve, >>>> >>>> >>>> On 01/11/2011 07:20 PM, Steve Lianoglou wrote: >>>> >>>> Hi, >>>> >>>> I'm not Michael, but: >>>> >>>> How general is the "peaks do not have a direction" statement? >>>> In my experience (RNA-seq experiments), peaks are "stranded >>>> features" >>>> i.e. they belong to a particular strand. >>>> >>>> >>>> I'm actually a bit surprised by that. It was my impression that >>>> most >>>> of the rna-seq data to date have been done with a protocol that >>>> doesn't honor the strand of the reads. >>>> >>>> >>>> Just to clarify, I was not talking about protocols that honor or not >>>> the original strand of the reads. I was talking about the strand to >>>> which each read aligns. This information is stored in the SAM/BAM file >>>> and propagates to the GRanges object you get when loading this file >>>> (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. >>>> The strand information is here whatever type of *-seq experiment it >>>> is (RNA-seq, CHIP-seq, ...) >>>> >>>> >>>> Right, but if you're getting reads from both strands of a transcript, >>>> you certainly don't want to ignore the reads coming from the strand >>>> opposite transcription. >>>> >>> >>> Point taken. I somehow overlooked the fact that with a non- stranded >>> RNA-seq protocol, whatever strand a transcript is coming from, it will >>> generate reads that align to both strands. Thanks for the reminder! >>> >>> >>> Steve is right about stranded protocols becoming >>>> more popular, but in many cases their advantages might not outweigh the >>>> increased cost/complexity. Time will tell, I guess. Anyway, let's not >>>> make GenomicRanges the RNA-seq package. Suffice it to say that we deal >>>> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to >>>> remove strand before looking for overlaps. >>>> >>> >>> Why only before looking for overlaps? I mean, if the strand is >>> meaningless, why not just remove the strand from the very beginning >>> i.e. when you make your GRanges object from your SAM/BAM file (like >>> Steve does)? So you only need to do strand(gr) <- "*" once. Isn't >>> that easier/safer than having to remember to set an additional >>> 'ignore.strand' parameter to FALSE every time you need to call a >>> strand-aware function like findOverlaps? >>> >>> >> For RNA-seq, that's reasonable. There are cases where the strand of the >> reads is important. Fragment length estimation algorithms, mostly designed >> for enrichment experiments, use the strand (see chipseq >> estimate.mean.fraglen). For ChIP-seq, we detect peaks before doing any >> overlaps, but for MEDIP-seq, we deal with reads. Anyway, the point is that >> there are a variety of use cases. I'll admit that an extra parameter is not >> entirely justified at this point. Just brought this up for discussion. >> >> Thanks, >> Michael >> >> >> >>> Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and >>> probably to other strand-aware GRanges methods if we start to go that >>> route) is not a big deal (just 2 lines of code in each method), but >>> still, that means adding yet another argument (findOverlaps already >>> has 9), and documenting it for all these methods. I'm just wondering >>> if it's worth it... >>> >>> H. >>> >>> >>>> >>>> >>>> Recently, I've come across several protocols where strandedness is >>>> maintained and I'm sure that these will be increasingly the norm as >>>> time goes on. >>>> >>>> I've been working on/with different variants of some (rna) >>>> tag-sequencing protocols, and these do typically keep strand >>>> info, but >>>> I think the "old" Illumina-driven transcriptome-wide rna- seq >>>> data that >>>> "most" people think of as RNA-seq is unstranded. >>>> >>>> Still. I like that GenomicRanges can honor strandedness in >>>> overlap/etc. functions when the strand is set. >>>> >>>> One thing that often confuses me with GenomicRanges is >>>> the dual meaning of >>>> strand. It seems to indicate both a direction and a >>>> coordinate. It makes >>>> sense to me to have resize() and precede() consider >>>> strand as a direction. >>>> But findOverlaps() treats strands as coordinates such >>>> that items on >>>> different strands do not overlap. This makes less sense. >>>> >>>> >>>> Why? IMO a positive read (i.e. a read that was aligned to >>>> the plus >>>> strand) shouldn't hit features (genes/transcripts/exons) >>>> that are >>>> located on the minus strand. If for whatever reason this is >>>> not what >>>> the user wants, then s/he can always set the strand of the >>>> query and >>>> subject to '*' but I wouldn't say this is the typical usecase. >>>> >>>> Another good >>>> example is gaps() which will yield sequence-long ranges >>>> on the strands not >>>> represented in the original object. This is very often >>>> undesirable. But of >>>> course my perspective is limited. >>>> >>>> >>>> I agree that the behaviour of gaps() on GRanges is awkward. >>>> You not only >>>> get those sequence-long ranges on the strands not >>>> represented in the >>>> original object but you also get them for sequences not >>>> represented at >>>> all (as long as they are listed in levels(seqnames(x))). But >>>> if gaps() >>>> wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >>>> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind >>>> of an >>>> important property for gaps(). In particular, gaps(x) would >>>> give >>>> the same result whether 'x' as a sequence-long range on one >>>> strand >>>> or not (very unlikely to happen with real data though). >>>> >>>> Maybe an extra arg to gaps() would help control this? >>>> Suggestions are >>>> welcome. I'll make the change. Thanks! >>>> >>>> >>>> I'd be in favor of adding a param to GRanges::gaps() that would >>>> allow >>>> me to explicitly turn off this "feature" ;-) >>>> >>>> >>>> OK, will see what I can do. Thanks for your feedback! >>>> >>>> H. >>>> >>>> >>>> -steve >>>> >>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M2-B876 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >>>> >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >>>> >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M2-B876 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >> >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 13.3 years ago Martin Morgan 25k

0

Entering edit mode

Hi Steve, On 01/11/2011 07:20 PM, Steve Lianoglou wrote: > Hi, > > I'm not Michael, but: > >> How general is the "peaks do not have a direction" statement? >> In my experience (RNA-seq experiments), peaks are "stranded features" >> i.e. they belong to a particular strand. > > I'm actually a bit surprised by that. It was my impression that most > of the rna-seq data to date have been done with a protocol that > doesn't honor the strand of the reads. Just to clarify, I was not talking about protocols that honor or not the original strand of the reads. I was talking about the strand to which each read aligns. This information is stored in the SAM/BAM file and propagates to the GRanges object you get when loading this file (with something like 'as(read.GappedAlignments(myfile), "GRanges")'. The strand information is here whatever type of *-seq experiment it is (RNA-seq, CHIP-seq, ...) > > Recently, I've come across several protocols where strandedness is > maintained and I'm sure that these will be increasingly the norm as > time goes on. > > I've been working on/with different variants of some (rna) > tag-sequencing protocols, and these do typically keep strand info, but > I think the "old" Illumina-driven transcriptome-wide rna-seq data that > "most" people think of as RNA-seq is unstranded. > > Still. I like that GenomicRanges can honor strandedness in > overlap/etc. functions when the strand is set. > >>> One thing that often confuses me with GenomicRanges is the dual meaning of >>> strand. It seems to indicate both a direction and a coordinate. It makes >>> sense to me to have resize() and precede() consider strand as a direction. >>> But findOverlaps() treats strands as coordinates such that items on >>> different strands do not overlap. This makes less sense. >> >> Why? IMO a positive read (i.e. a read that was aligned to the plus >> strand) shouldn't hit features (genes/transcripts/exons) that are >> located on the minus strand. If for whatever reason this is not what >> the user wants, then s/he can always set the strand of the query and >> subject to '*' but I wouldn't say this is the typical usecase. >> >>> Another good >>> example is gaps() which will yield sequence-long ranges on the strands not >>> represented in the original object. This is very often undesirable. But of >>> course my perspective is limited. >> >> I agree that the behaviour of gaps() on GRanges is awkward. You not only >> get those sequence-long ranges on the strands not represented in the >> original object but you also get them for sequences not represented at >> all (as long as they are listed in levels(seqnames(x))). But if gaps() >> wasn't doing this, gaps() wouldn't be reversible anymore (i.e. >> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an >> important property for gaps(). In particular, gaps(x) would give >> the same result whether 'x' as a sequence-long range on one strand >> or not (very unlikely to happen with real data though). >> >> Maybe an extra arg to gaps() would help control this? Suggestions are >> welcome. I'll make the change. Thanks! > > I'd be in favor of adding a param to GRanges::gaps() that would allow > me to explicitly turn off this "feature" ;-) OK, will see what I can do. Thanks for your feedback! H. > > -steve > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.3 years ago Hervé Pagès 16k

Login before adding your answer.