conditional merge of duplicated rows in data.frame
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hi all!

I have the following problem.

chr13 1260 1275 chr13_38134720_38136919
chr13 1261 1276 chr13_38134720_38136919
chr15 839 854 chr15_63332831_63335030
chr15 840 856 chr15_63332831_63335030
chr15 837 852 chr15_63332831_63335030
chr15 842 857 chr15_63332831_63335030

In the 2. and 3. column are positions which I want to combine whenever the value in column 4 is the same. For example, I would want:

chr13 1260 1276 chr13_38134720_38136919
chr15 837 857 chr15_63332831_63335030

Any help is highly appreciated!!!

-- output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

--
Sent via the guest posting facility at bioconductor.org.
@martin-morgan-1513
Last seen 5 days ago
United States
On 01/20/2014 08:54 AM, Ninni Nahm [guest] wrote:
> Hi all!
>
> I have the following problem.
>
> chr13 1260 1275 chr13_38134720_38136919
> chr13 1261 1276 chr13_38134720_38136919
> chr15 839 854 chr15_63332831_63335030
> chr15 840 856 chr15_63332831_63335030
> chr15 837 852 chr15_63332831_63335030
> chr15 842 857 chr15_63332831_63335030
>
> In the 2. and 3. column are positions which I want to combine whenever
> the value in column 4 is the same. For example, I would want:
>
> chr13 1260 1276 chr13_38134720_38136919
> chr15 837 857 chr15_63332831_63335030
>
> Any help is highly appreciated!!!

Hi -- Once you've read in the data

    > df = read.table(stdin())
    0: chr13 1260 1275 chr13_38134720_38136919
    1: chr13 1261 1276 chr13_38134720_38136919
    2: chr15 839 854 chr15_63332831_63335030
    3: chr15 840 856 chr15_63332831_63335030
    4: chr15 837 852 chr15_63332831_63335030
    6: chr15 842 857 chr15_63332831_63335030
    7:

you could use the GenomicRanges package to make a 'GRanges' object with the chromosome coordinates

    > library(GenomicRanges)
    > gr = with(df, GRanges(V1, IRanges(V2, V3)))

then split gr by the fourth column, reduce() the overlapping or adjacent ranges within each group, and (if there is one range per group) unlist to a GRanges. Optionally, you might wish to coerce back to a data.frame (though it will often make sense to continue your analysis with GRanges)

    > as.data.frame(unlist(reduce(split(gr, df$V4))))
                            seqnames start  end width strand
    chr13_38134720_38136919    chr13  1260 1276    17      *
    chr15_63332831_63335030    chr15   837  857    21      *

Hope that helps,

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
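For readers wondering what reduce() does inside each group, and why the "one range per group" caveat matters, here is a rough base-R sketch of the same merging rule. It is only an illustration: reduce_intervals() is a hypothetical helper written for this post, not a GenomicRanges function, and the "gapped" coordinates are made up to show the case where a group does not collapse to a single range.

```r
# Base-R sketch of the merging that reduce() performs on one group of ranges.
# reduce_intervals() is a hypothetical helper, not part of any package.
reduce_intervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # Furthest end position seen so far, scanning left to right
  reach <- cummax(end)
  # A new merged range begins where a start lies beyond the previous reach;
  # the "+ 1" means directly adjacent ranges are merged as well
  is_first <- c(TRUE, start[-1] > head(reach, -1) + 1)
  is_last <- c(is_first[-1], TRUE)
  data.frame(start = start[is_first], end = reach[is_last])
}

# The four chr15 rows from the question overlap, so they collapse to one range
chr15 <- reduce_intervals(c(839, 840, 837, 842), c(854, 856, 852, 857))
print(chr15)

# Made-up coordinates with a gap: two ranges survive for this group
gapped <- reduce_intervals(c(837, 839, 900), c(852, 854, 910))
print(gapped)
```

With the data in the question every group reduces to a single range, so unlist() gives one row per group. If a group contained a gap like the made-up one above, the unlisted result would carry several rows with the same group name.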
Thank you! Works perfectly! I did not know about the split function.

-Ninni

On Mon, Jan 20, 2014 at 9:03 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:
> then split gr by the fourth column, reduce() the adjacent ranges within
> each group, and (if there is one range per group) unlist to a GRanges.
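Since split() was the new piece here, a small base-R sketch of the grouping step on its own may be useful. It assumes the default column names V1..V4 that read.table() assigns, rebuilds the data inline so it runs without stdin(), and simply takes the smallest start and largest end per group. Note that this is a different technique from reduce(): it spans any gaps inside a group instead of keeping separate ranges.

```r
# Sketch: group rows with split(), then take min start / max end per group.
# Column names V1..V4 are read.table() defaults; the data are from the question.
df <- read.table(text = "
chr13 1260 1275 chr13_38134720_38136919
chr13 1261 1276 chr13_38134720_38136919
chr15 839 854 chr15_63332831_63335030
chr15 840 856 chr15_63332831_63335030
chr15 837 852 chr15_63332831_63335030
chr15 842 857 chr15_63332831_63335030
", stringsAsFactors = FALSE)

# split() returns a named list with one data frame per distinct V4 value
groups <- split(df, df$V4)

# Summarise each group to a single row, then stack the rows
merged <- do.call(rbind, lapply(groups, function(g) {
  data.frame(chr = g$V1[1], start = min(g$V2), end = max(g$V3),
             id = g$V4[1], stringsAsFactors = FALSE)
}))
rownames(merged) <- NULL
print(merged)
```

For this input the result has the two rows asked for in the question (chr13 1260 1276 and chr15 837 857). The GRanges route is likely the better choice when later steps involve overlaps or other range operations.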