Repeat masker sequences as GRanges object
3
0
Entering edit mode
@hermann-norpois-5726
Last seen 9.0 years ago
Germany
Hello, I would like to have repeat sequences as GRanges object I started with ... library (BSgenome.Hsapiens.UCSC.hg19) ch1 <- Hsapiens$chr1 active (masks (ch1)) AGAPS AMB RM TRF TRUE TRUE FALSE FALSE active (masks(ch1))["RM"] <- TRUE active (masks (ch1)) AGAPS AMB RM TRF TRUE TRUE TRUE FALSE Can anyboldy give me a hint how to continue. Thanks hermann [[alternative HTML version deleted]]
• 2.7k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 18 minutes ago
United States
Hi Hermann, How about this: > library(AnnotationHub) > hub <- AnnotationHub() > hub$goldenpath.hg19.database.rmsk_0.0.1.RData GRanges with 5298130 ranges and 2 metadata columns: seqnames ranges strand | name <rle> <iranges> <rle> | <character> [1] chr1 [16777161, 16777470] + | AluSp [2] chr1 [25165801, 25166089] - | AluY [3] chr1 [33553607, 33554646] + | L2b [4] chr1 [50330064, 50332153] + | L1PA10 [5] chr1 [58720068, 58720973] - | L1PA2 ... ... ... ... ... ... [5298126] chr21_gl000210_random [25379, 25875] + | MER74B [5298127] chr21_gl000210_random [26438, 26596] - | MIRc [5298128] chr21_gl000210_random [26882, 27022] - | MIRc [5298129] chr21_gl000210_random [27297, 27447] + | HAL1-2a_MD [5298130] chr21_gl000210_random [27469, 27682] + | HAL1-2a_MD score <numeric> [1] 2147 [2] 2626 [3] 626 [4] 12545 [5] 8050 ... ... [5298126] 1674 [5298127] 308 [5298128] 475 [5298129] 371 [5298130] 370 --- seqlengths: chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 This is a GRanges of all features from UCSC's Repeat Masker table. Best, Jim On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at="" gmail.com=""> wrote: > Hello, > > I would like to have repeat sequences as GRanges object > I started with ... > > library (BSgenome.Hsapiens.UCSC.hg19) > ch1 <- Hsapiens$chr1 > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE FALSE FALSE > active (masks(ch1))["RM"] <- TRUE > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE TRUE FALSE > > Can anyboldy give me a hint how to continue. > > Thanks > hermann > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi, On 09/12/2014 06:30 AM, James W. MacDonald wrote: > Hi Hermann, > > How about this: > >> library(AnnotationHub) >> hub <- AnnotationHub() >> hub$goldenpath.hg19.database.rmsk_0.0.1.RData > GRanges with 5298130 ranges and 2 metadata columns: > seqnames ranges strand | > name > <rle> <iranges> <rle> | > <character> > [1] chr1 [16777161, 16777470] + | > AluSp > [2] chr1 [25165801, 25166089] - | > AluY > [3] chr1 [33553607, 33554646] + | > L2b > [4] chr1 [50330064, 50332153] + | > L1PA10 > [5] chr1 [58720068, 58720973] - | > L1PA2 > ... ... ... ... ... > ... > [5298126] chr21_gl000210_random [25379, 25875] + | > MER74B > [5298127] chr21_gl000210_random [26438, 26596] - | > MIRc > [5298128] chr21_gl000210_random [26882, 27022] - | > MIRc > [5298129] chr21_gl000210_random [27297, 27447] + | > HAL1-2a_MD > [5298130] chr21_gl000210_random [27469, 27682] + | > HAL1-2a_MD > score > <numeric> > [1] 2147 > [2] 2626 > [3] 626 > [4] 12545 > [5] 8050 > ... ... > [5298126] 1674 > [5298127] 308 > [5298128] 475 > [5298129] 371 > [5298130] 370 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > This is a GRanges of all features from UCSC's Repeat Masker table. Nice to see the "name" and "score" metadata cols but I wonder why the original UCSC names for these cols (which are "repName" and "swScore") were not preserved. Also, other UCSC cols in the rmsk table at UCSC might be of interest (e.g. "repClass" and "repFamily"). FWIW, here is one way to get these cols: local_file <- tempfile() download.file("http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database /rmsk.txt.gz", local_file) ## Get the col names from ## http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", "genoName", "genoStart", "genoEnd", "genoLeft", "strand", "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id") library(GenomicRanges) df <- read.table(local_file, col.names=COLNAMES) rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, seqnames.field="genoName", start.field="genoStart", end.field="genoEnd", strand.field="strand", starts.in.df.are.0based=TRUE) Then: > head(rmsk) GRanges with 6 ranges and 13 metadata columns: seqnames ranges strand | bin swScore milliDiv milliDel <rle> <iranges> <rle> | <integer> <integer> <integer> <integer> [1] chr1 [10001, 10468] + | 585 1504 13 4 [2] chr1 [10469, 11447] - | 585 3612 114 270 [3] chr1 [11504, 11675] - | 585 437 235 186 [4] chr1 [11678, 11780] - | 585 239 294 19 [5] chr1 [15265, 15355] - | 585 318 230 38 [6] chr1 [16713, 16749] + | 585 203 162 0 milliIns genoLeft repName repClass repFamily repStart <integer> <integer> <factor> <factor> <factor> <integer> [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat 1 [2] 13 -249239174 TAR1 Satellite telo -399 [3] 35 -249238946 L1MC LINE L1 -2236 [4] 10 -249238841 MER5B DNA hAT-Charlie -74 [5] 0 -249235266 MIR3 SINE MIR -119 [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat 1 repEnd repLeft id <integer> <integer> <integer> [1] 463 0 1 [2] 1712 483 2 [3] 5646 5449 3 [4] 104 1 4 [5] 143 49 5 [6] 37 0 6 I also tried with rtracklayer but got the following error: > library(rtracklayer) > session <- browserSession() > genome(session) <- "hg19" > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") > system.time(rmsk <- getTable(query)) Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 3813855 did not have 17 elements Timing stopped at: 551.027 7.24 1304.386 Could be due to my flaky internet connection though... Cheers, H. > sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 loaded via a namespace (and not attached): [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 [10] digest_0.6.4 fail_1.2 foreach_1.4.2 [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 > > Best, > > Jim > > > > > On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at="" gmail.com=""> wrote: > >> Hello, >> >> I would like to have repeat sequences as GRanges object >> I started with ... >> >> library (BSgenome.Hsapiens.UCSC.hg19) >> ch1 <- Hsapiens$chr1 >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE FALSE FALSE >> active (masks(ch1))["RM"] <- TRUE >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE TRUE FALSE >> >> Can anyboldy give me a hint how to continue. >> >> Thanks >> hermann >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
1
Entering edit mode
Pfft. Downloading. Ain't nobody got time fo' dat! ;-D library(RMySQL) library(GenomicRanges) con <- dbConnect("MySQL", user="genome", host="genome- mysql.cse.ucsc.edu", db="hg19") rmsk <- dbGetQuery(con, "select * from rmsk;") rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, seqnames.field="genoName", start.field="genoStart", end.field="genoEnd", strand.field="strand", starts.in.df.are.0based=TRUE) Best, Jim On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi, > > On 09/12/2014 06:30 AM, James W. MacDonald wrote: > >> Hi Hermann, >> >> How about this: >> >> library(AnnotationHub) >>> hub <- AnnotationHub() >>> hub$goldenpath.hg19.database.rmsk_0.0.1.RData >>> >> GRanges with 5298130 ranges and 2 metadata columns: >> seqnames ranges strand | >> name >> <rle> <iranges> <rle> | >> <character> >> [1] chr1 [16777161, 16777470] + | >> AluSp >> [2] chr1 [25165801, 25166089] - | >> AluY >> [3] chr1 [33553607, 33554646] + | >> L2b >> [4] chr1 [50330064, 50332153] + | >> L1PA10 >> [5] chr1 [58720068, 58720973] - | >> L1PA2 >> ... ... ... ... ... >> ... >> [5298126] chr21_gl000210_random [25379, 25875] + | >> MER74B >> [5298127] chr21_gl000210_random [26438, 26596] - | >> MIRc >> [5298128] chr21_gl000210_random [26882, 27022] - | >> MIRc >> [5298129] chr21_gl000210_random [27297, 27447] + | >> HAL1-2a_MD >> [5298130] chr21_gl000210_random [27469, 27682] + | >> HAL1-2a_MD >> score >> <numeric> >> [1] 2147 >> [2] 2626 >> [3] 626 >> [4] 12545 >> [5] 8050 >> ... ... >> [5298126] 1674 >> [5298127] 308 >> [5298128] 475 >> [5298129] 371 >> [5298130] 370 >> --- >> seqlengths: >> chr1 chr2 ... chr18_gl000207_random >> 249250621 243199373 ... 4262 >> >> This is a GRanges of all features from UCSC's Repeat Masker table. >> > > Nice to see the "name" and "score" metadata cols but I wonder why the > original UCSC names for these cols (which are "repName" and "swScore") > were not preserved. Also, other UCSC cols in the rmsk table at UCSC > might be of interest (e.g. "repClass" and "repFamily"). > > FWIW, here is one way to get these cols: > > local_file <- tempfile() > > download.file("http://hgdownload.soe.ucsc.edu/ > goldenPath/hg19/database/rmsk.txt.gz", local_file) > > ## Get the col names from > ## http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql > COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", > "genoName", "genoStart", "genoEnd", "genoLeft", > "strand", "repName", "repClass", "repFamily", > "repStart", "repEnd", "repLeft", "id") > > library(GenomicRanges) > df <- read.table(local_file, col.names=COLNAMES) > rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, > seqnames.field="genoName", > start.field="genoStart", > end.field="genoEnd", > strand.field="strand", > starts.in.df.are.0based=TRUE) > > Then: > > > head(rmsk) > GRanges with 6 ranges and 13 metadata columns: > seqnames ranges strand | bin swScore milliDiv > milliDel > <rle> <iranges> <rle> | <integer> <integer> <integer> > <integer> > [1] chr1 [10001, 10468] + | 585 1504 13 > 4 > [2] chr1 [10469, 11447] - | 585 3612 114 > 270 > [3] chr1 [11504, 11675] - | 585 437 235 > 186 > [4] chr1 [11678, 11780] - | 585 239 294 > 19 > [5] chr1 [15265, 15355] - | 585 318 230 > 38 > [6] chr1 [16713, 16749] + | 585 203 162 > 0 > milliIns genoLeft repName repClass repFamily repStart > <integer> <integer> <factor> <factor> <factor> > <integer> > [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat 1 > [2] 13 -249239174 TAR1 Satellite telo -399 > [3] 35 -249238946 L1MC LINE L1 -2236 > [4] 10 -249238841 MER5B DNA hAT-Charlie -74 > [5] 0 -249235266 MIR3 SINE MIR -119 > [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat 1 > repEnd repLeft id > <integer> <integer> <integer> > [1] 463 0 1 > [2] 1712 483 2 > [3] 5646 5449 3 > [4] 104 1 4 > [5] 143 49 5 > [6] 37 0 6 > > I also tried with rtracklayer but got the following error: > > > library(rtracklayer) > > session <- browserSession() > > genome(session) <- "hg19" > > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") > > system.time(rmsk <- getTable(query)) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 3813855 did not have 17 elements > Timing stopped at: 551.027 7.24 1304.386 > > Could be due to my flaky internet connection though... > > Cheers, > H. > > > sessionInfo() > R version 3.1.1 (2014-07-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 > [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 > [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 > [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 > [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 > [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 > [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 > > >> Best, >> >> Jim >> >> >> >> >> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at="" gmail.com=""> >> wrote: >> >> Hello, >>> >>> I would like to have repeat sequences as GRanges object >>> I started with ... >>> >>> library (BSgenome.Hsapiens.UCSC.hg19) >>> ch1 <- Hsapiens$chr1 >>> active (masks (ch1)) >>> AGAPS AMB RM TRF >>> TRUE TRUE FALSE FALSE >>> active (masks(ch1))["RM"] <- TRUE >>> active (masks (ch1)) >>> AGAPS AMB RM TRF >>> TRUE TRUE TRUE FALSE >>> >>> Can anyboldy give me a hint how to continue. >>> >>> Thanks >>> hermann >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 09/12/2014 12:29 PM, James W. MacDonald wrote: > Pfft. Downloading. Ain't nobody got time fo' dat! ;-D > > library(RMySQL) > library(GenomicRanges) > con <- dbConnect("MySQL", user="genome", host="genome- mysql.cse.ucsc.edu > <http: genome-mysql.cse.ucsc.edu="">", db="hg19") > rmsk <- dbGetQuery(con, "select * from rmsk;") > rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, > seqnames.field="genoName", > start.field="genoStart", > end.field="genoEnd", > strand.field="strand", > starts.in.df.are.0based=TRUE) Yeah, much better :) Thanks Jim. Although I guess you still kind of "download" the data in a way. The SQL query approach has at least 2 benefits that are really worth emphasizing: 1) You get a data.frame with the col names already set for you. 2) You don't need to download the full table: > dbGetQuery(con, "select * from rmsk limit 4;") bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft 1 585 1504 13 4 13 chr1 10000 10468 -249240153 2 585 3612 114 270 13 chr1 10468 11447 -249239174 3 585 437 235 186 35 chr1 11503 11675 -249238946 4 585 239 294 19 10 chr1 11677 11780 -249238841 strand repName repClass repFamily repStart repEnd repLeft id 1 + (CCCTAA)n Simple_repeat Simple_repeat 1 463 0 1 2 - TAR1 Satellite telo -399 1712 483 2 3 - L1MC LINE L1 -2236 5646 5449 3 4 - MER5B DNA hAT-Charlie -74 104 1 4 H. > > Best, > > Jim > > > > > On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi, > > On 09/12/2014 06:30 AM, James W. MacDonald wrote: > > Hi Hermann, > > How about this: > > library(AnnotationHub) > hub <- AnnotationHub() > hub$goldenpath.hg19.database.__rmsk_0.0.1.RData > > GRanges with 5298130 ranges and 2 metadata columns: > seqnames ranges strand | > name > <rle> <iranges> <rle> | > <character> > [1] chr1 [16777161, 16777470] + | > AluSp > [2] chr1 [25165801, 25166089] - | > AluY > [3] chr1 [33553607, 33554646] + | > L2b > [4] chr1 [50330064, 50332153] + | > L1PA10 > [5] chr1 [58720068, 58720973] - | > L1PA2 > ... ... ... ... ... > ... > [5298126] chr21_gl000210_random [25379, 25875] + | > MER74B > [5298127] chr21_gl000210_random [26438, 26596] - | > MIRc > [5298128] chr21_gl000210_random [26882, 27022] - | > MIRc > [5298129] chr21_gl000210_random [27297, 27447] + | > HAL1-2a_MD > [5298130] chr21_gl000210_random [27469, 27682] + | > HAL1-2a_MD > score > <numeric> > [1] 2147 > [2] 2626 > [3] 626 > [4] 12545 > [5] 8050 > ... ... > [5298126] 1674 > [5298127] 308 > [5298128] 475 > [5298129] 371 > [5298130] 370 > --- > seqlengths: > chr1 chr2 ... > chr18_gl000207_random > 249250621 243199373 ... > 4262 > > This is a GRanges of all features from UCSC's Repeat Masker table. > > > Nice to see the "name" and "score" metadata cols but I wonder why the > original UCSC names for these cols (which are "repName" and "swScore") > were not preserved. Also, other UCSC cols in the rmsk table at UCSC > might be of interest (e.g. "repClass" and "repFamily"). > > FWIW, here is one way to get these cols: > > local_file <- tempfile() > > download.file("http://__hgdownload.soe.ucsc.edu/__goldenPath/hg1 9/database/rmsk.__txt.gz > <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.tx="" t.gz="">", > local_file) > > ## Get the col names from > ## > http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql > <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.sql=""> > COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", > "genoName", "genoStart", "genoEnd", "genoLeft", > "strand", "repName", "repClass", "repFamily", > "repStart", "repEnd", "repLeft", "id") > > library(GenomicRanges) > df <- read.table(local_file, col.names=COLNAMES) > rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, > seqnames.field="genoName", > start.field="genoStart", > end.field="genoEnd", > strand.field="strand", > starts.in.df.are.0based=TRUE) > > Then: > > > head(rmsk) > GRanges with 6 ranges and 13 metadata columns: > seqnames ranges strand | bin swScore > milliDiv milliDel > <rle> <iranges> <rle> | <integer> <integer> > <integer> <integer> > [1] chr1 [10001, 10468] + | 585 1504 > 13 4 > [2] chr1 [10469, 11447] - | 585 3612 > 114 270 > [3] chr1 [11504, 11675] - | 585 437 > 235 186 > [4] chr1 [11678, 11780] - | 585 239 > 294 19 > [5] chr1 [15265, 15355] - | 585 318 > 230 38 > [6] chr1 [16713, 16749] + | 585 203 > 162 0 > milliIns genoLeft repName repClass repFamily > repStart > <integer> <integer> <factor> <factor> <factor> > <integer> > [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat > 1 > [2] 13 -249239174 TAR1 Satellite telo > -399 > [3] 35 -249238946 L1MC LINE L1 > -2236 > [4] 10 -249238841 MER5B DNA hAT- Charlie > -74 > [5] 0 -249235266 MIR3 SINE MIR > -119 > [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat > 1 > repEnd repLeft id > <integer> <integer> <integer> > [1] 463 0 1 > [2] 1712 483 2 > [3] 5646 5449 3 > [4] 104 1 4 > [5] 143 49 5 > [6] 37 0 6 > > I also tried with rtracklayer but got the following error: > > > library(rtracklayer) > > session <- browserSession() > > genome(session) <- "hg19" > > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") > > system.time(rmsk <- getTable(query)) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 3813855 did not have 17 elements > Timing stopped at: 551.027 7.24 1304.386 > > Could be due to my flaky internet connection though... > > Cheers, > H. > > > sessionInfo() > R version 3.1.1 (2014-07-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 > [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 > [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 > [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 > [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 > [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 > [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 > > > Best, > > Jim > > > > > On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois > <hnorpois at="" gmail.com="" <mailto:hnorpois="" at="" gmail.com="">> wrote: > > Hello, > > I would like to have repeat sequences as GRanges object > I started with ... > > library (BSgenome.Hsapiens.UCSC.hg19) > ch1 <- Hsapiens$chr1 > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE FALSE FALSE > active (masks(ch1))["RM"] <- TRUE > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE TRUE FALSE > > Can anyboldy give me a hint how to continue. > > Thanks > hermann > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode

In the time since this helpful reply was written, is there a more AnnotationHub-way to do this? That is, in the third line of the below code chunk, is there a way to specify which additional columns are added as mcols to repeats?

library(AnnotationHub)
ah <- AnnotationHub()
repeats <- query(ah, c("mm10", "RepeatMasker"))[[1L]]
ADD REPLY
0
Entering edit mode
On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi, > > > On 09/12/2014 06:30 AM, James W. MacDonald wrote: > >> Hi Hermann, >> >> How about this: >> >> library(AnnotationHub) >>> hub <- AnnotationHub() >>> hub$goldenpath.hg19.database.rmsk_0.0.1.RData >>> >> GRanges with 5298130 ranges and 2 metadata columns: >> seqnames ranges strand | >> name >> <rle> <iranges> <rle> | >> <character> >> [1] chr1 [16777161, 16777470] + | >> AluSp >> [2] chr1 [25165801, 25166089] - | >> AluY >> [3] chr1 [33553607, 33554646] + | >> L2b >> [4] chr1 [50330064, 50332153] + | >> L1PA10 >> [5] chr1 [58720068, 58720973] - | >> L1PA2 >> ... ... ... ... ... >> ... >> [5298126] chr21_gl000210_random [25379, 25875] + | >> MER74B >> [5298127] chr21_gl000210_random [26438, 26596] - | >> MIRc >> [5298128] chr21_gl000210_random [26882, 27022] - | >> MIRc >> [5298129] chr21_gl000210_random [27297, 27447] + | >> HAL1-2a_MD >> [5298130] chr21_gl000210_random [27469, 27682] + | >> HAL1-2a_MD >> score >> <numeric> >> [1] 2147 >> [2] 2626 >> [3] 626 >> [4] 12545 >> [5] 8050 >> ... ... >> [5298126] 1674 >> [5298127] 308 >> [5298128] 475 >> [5298129] 371 >> [5298130] 370 >> --- >> seqlengths: >> chr1 chr2 ... chr18_gl000207_random >> 249250621 243199373 ... 4262 >> >> This is a GRanges of all features from UCSC's Repeat Masker table. >> > > Nice to see the "name" and "score" metadata cols but I wonder why the > original UCSC names for these cols (which are "repName" and "swScore") > were not preserved. Also, other UCSC cols in the rmsk table at UCSC > might be of interest (e.g. "repClass" and "repFamily"). > > FWIW, here is one way to get these cols: > > local_file <- tempfile() > > download.file("http://hgdownload.soe.ucsc.edu/ > goldenPath/hg19/database/rmsk.txt.gz", local_file) > > ## Get the col names from > ## http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql > COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", > "genoName", "genoStart", "genoEnd", "genoLeft", > "strand", "repName", "repClass", "repFamily", > "repStart", "repEnd", "repLeft", "id") > > library(GenomicRanges) > df <- read.table(local_file, col.names=COLNAMES) > rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, > seqnames.field="genoName", > start.field="genoStart", > end.field="genoEnd", > strand.field="strand", > starts.in.df.are.0based=TRUE) > > Then: > > > head(rmsk) > GRanges with 6 ranges and 13 metadata columns: > seqnames ranges strand | bin swScore milliDiv > milliDel > <rle> <iranges> <rle> | <integer> <integer> <integer> > <integer> > [1] chr1 [10001, 10468] + | 585 1504 13 > 4 > [2] chr1 [10469, 11447] - | 585 3612 114 > 270 > [3] chr1 [11504, 11675] - | 585 437 235 > 186 > [4] chr1 [11678, 11780] - | 585 239 294 > 19 > [5] chr1 [15265, 15355] - | 585 318 230 > 38 > [6] chr1 [16713, 16749] + | 585 203 162 > 0 > milliIns genoLeft repName repClass repFamily repStart > <integer> <integer> <factor> <factor> <factor> > <integer> > [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat 1 > [2] 13 -249239174 TAR1 Satellite telo -399 > [3] 35 -249238946 L1MC LINE L1 -2236 > [4] 10 -249238841 MER5B DNA hAT-Charlie -74 > [5] 0 -249235266 MIR3 SINE MIR -119 > [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat 1 > repEnd repLeft id > <integer> <integer> <integer> > [1] 463 0 1 > [2] 1712 483 2 > [3] 5646 5449 3 > [4] 104 1 4 > [5] 143 49 5 > [6] 37 0 6 > > I also tried with rtracklayer but got the following error: > > > library(rtracklayer) > > session <- browserSession() > > genome(session) <- "hg19" > > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") > > system.time(rmsk <- getTable(query)) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 3813855 did not have 17 elements > Timing stopped at: 551.027 7.24 1304.386 > > Could be due to my flaky internet connection though... > > This is just hitting up against the row count limit of the table browser. rtracklayer uses the table browser as an abstraction, but it has its limitations. Anyway, it would be great if the AnnotationHub track could be improved. Those extra fields are indeed useful. > Cheers, > H. > > > sessionInfo() > R version 3.1.1 (2014-07-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 > [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 > [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 > [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 > [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 > [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 > [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 > > >> Best, >> >> Jim >> >> >> >> >> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at="" gmail.com=""> >> wrote: >> >> Hello, >>> >>> I would like to have repeat sequences as GRanges object >>> I started with ... >>> >>> library (BSgenome.Hsapiens.UCSC.hg19) >>> ch1 <- Hsapiens$chr1 >>> active (masks (ch1)) >>> AGAPS AMB RM TRF >>> TRUE TRUE FALSE FALSE >>> active (masks(ch1))["RM"] <- TRUE >>> active (masks (ch1)) >>> AGAPS AMB RM TRF >>> TRUE TRUE TRUE FALSE >>> >>> Can anyboldy give me a hint how to continue. >>> >>> Thanks >>> hermann >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Michael, On 09/12/2014 01:47 PM, Michael Lawrence wrote: > > > On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi, > > > On 09/12/2014 06:30 AM, James W. MacDonald wrote: > > Hi Hermann, > > How about this: > > library(AnnotationHub) > hub <- AnnotationHub() > hub$goldenpath.hg19.database.__rmsk_0.0.1.RData > > GRanges with 5298130 ranges and 2 metadata columns: > seqnames ranges strand | > name > <rle> <iranges> <rle> | > <character> > [1] chr1 [16777161, 16777470] + | > AluSp > [2] chr1 [25165801, 25166089] - | > AluY > [3] chr1 [33553607, 33554646] + | > L2b > [4] chr1 [50330064, 50332153] + | > L1PA10 > [5] chr1 [58720068, 58720973] - | > L1PA2 > ... ... ... ... ... > ... > [5298126] chr21_gl000210_random [25379, 25875] + | > MER74B > [5298127] chr21_gl000210_random [26438, 26596] - | > MIRc > [5298128] chr21_gl000210_random [26882, 27022] - | > MIRc > [5298129] chr21_gl000210_random [27297, 27447] + | > HAL1-2a_MD > [5298130] chr21_gl000210_random [27469, 27682] + | > HAL1-2a_MD > score > <numeric> > [1] 2147 > [2] 2626 > [3] 626 > [4] 12545 > [5] 8050 > ... ... > [5298126] 1674 > [5298127] 308 > [5298128] 475 > [5298129] 371 > [5298130] 370 > --- > seqlengths: > chr1 chr2 ... > chr18_gl000207_random > 249250621 243199373 ... > 4262 > > This is a GRanges of all features from UCSC's Repeat Masker table. > > > Nice to see the "name" and "score" metadata cols but I wonder why the > original UCSC names for these cols (which are "repName" and "swScore") > were not preserved. Also, other UCSC cols in the rmsk table at UCSC > might be of interest (e.g. "repClass" and "repFamily"). > > FWIW, here is one way to get these cols: > > local_file <- tempfile() > > download.file("http://__hgdownload.soe.ucsc.edu/__goldenPath/hg1 9/database/rmsk.__txt.gz > <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.tx="" t.gz="">", > local_file) > > ## Get the col names from > ## > http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql > <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.sql=""> > COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", > "genoName", "genoStart", "genoEnd", "genoLeft", > "strand", "repName", "repClass", "repFamily", > "repStart", "repEnd", "repLeft", "id") > > library(GenomicRanges) > df <- read.table(local_file, col.names=COLNAMES) > rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, > seqnames.field="genoName", > start.field="genoStart", > end.field="genoEnd", > strand.field="strand", > starts.in.df.are.0based=TRUE) > > Then: > > > head(rmsk) > GRanges with 6 ranges and 13 metadata columns: > seqnames ranges strand | bin swScore > milliDiv milliDel > <rle> <iranges> <rle> | <integer> <integer> > <integer> <integer> > [1] chr1 [10001, 10468] + | 585 1504 > 13 4 > [2] chr1 [10469, 11447] - | 585 3612 > 114 270 > [3] chr1 [11504, 11675] - | 585 437 > 235 186 > [4] chr1 [11678, 11780] - | 585 239 > 294 19 > [5] chr1 [15265, 15355] - | 585 318 > 230 38 > [6] chr1 [16713, 16749] + | 585 203 > 162 0 > milliIns genoLeft repName repClass repFamily > repStart > <integer> <integer> <factor> <factor> <factor> > <integer> > [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat > 1 > [2] 13 -249239174 TAR1 Satellite telo > -399 > [3] 35 -249238946 L1MC LINE L1 > -2236 > [4] 10 -249238841 MER5B DNA hAT- Charlie > -74 > [5] 0 -249235266 MIR3 SINE MIR > -119 > [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat > 1 > repEnd repLeft id > <integer> <integer> <integer> > [1] 463 0 1 > [2] 1712 483 2 > [3] 5646 5449 3 > [4] 104 1 4 > [5] 143 49 5 > [6] 37 0 6 > > I also tried with rtracklayer but got the following error: > > > library(rtracklayer) > > session <- browserSession() > > genome(session) <- "hg19" > > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") > > system.time(rmsk <- getTable(query)) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 3813855 did not have 17 elements > Timing stopped at: 551.027 7.24 1304.386 > > Could be due to my flaky internet connection though... > > > > This is just hitting up against the row count limit of the table > browser. rtracklayer uses the table browser as an abstraction, but it > has its limitations. Thanks for the reminder about the row count limit of UCSC table browser. How hard would it be to modify getTable() to query directly the MySQL server at UCSC? That would not only work around the row count limit of the table browser, but would also make getTable() at least 10x or 20x faster. According to my timings, that would make getTable() almost as fast as doing hub$goldenpath.hg19.database.rmsk_0.0.1.RData (46.5 s vs 40.5 s) even though the comparison is unfair because getTable() retrieves the 17 cols instead of only 6 when downloading from AnnotationHub. That would also benefit things like makeTranscriptDbFromUCSC() and makeFeatureDbFromUCSC(), which spend most of their time in getTable(). Thanks, H. > > Anyway, it would be great if the AnnotationHub track could be improved. > Those extra fields are indeed useful. > > Cheers, > H. > > > sessionInfo() > R version 3.1.1 (2014-07-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 > [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 > [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 > [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 > [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 > [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 > [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 > > > Best, > > Jim > > > > > On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois > <hnorpois at="" gmail.com="" <mailto:hnorpois="" at="" gmail.com="">> wrote: > > Hello, > > I would like to have repeat sequences as GRanges object > I started with ... > > library (BSgenome.Hsapiens.UCSC.hg19) > ch1 <- Hsapiens$chr1 > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE FALSE FALSE > active (masks(ch1))["RM"] <- TRUE > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE TRUE FALSE > > Can anyboldy give me a hint how to continue. > > Thanks > hermann > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
The table browser stuff just grew out of the HTTP interface to the actual browser component. Of course, I wouldn't be opposed to an extension of rtracklayer that accessed the database via SQL. On Fri, Sep 12, 2014 at 3:06 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Michael, > > On 09/12/2014 01:47 PM, Michael Lawrence wrote: > >> >> >> On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at="" fhcrc.org="">> <mailto:hpages at="" fhcrc.org="">> wrote: >> >> Hi, >> >> >> On 09/12/2014 06:30 AM, James W. MacDonald wrote: >> >> Hi Hermann, >> >> How about this: >> >> library(AnnotationHub) >> hub <- AnnotationHub() >> hub$goldenpath.hg19.database.__rmsk_0.0.1.RData >> >> >> GRanges with 5298130 ranges and 2 metadata columns: >> seqnames ranges strand >> | >> name >> <rle> <iranges> <rle> >> | >> <character> >> [1] chr1 [16777161, 16777470] + >> | >> AluSp >> [2] chr1 [25165801, 25166089] - >> | >> AluY >> [3] chr1 [33553607, 33554646] + >> | >> L2b >> [4] chr1 [50330064, 50332153] + >> | >> L1PA10 >> [5] chr1 [58720068, 58720973] - >> | >> L1PA2 >> ... ... ... ... >> ... >> ... >> [5298126] chr21_gl000210_random [25379, 25875] + >> | >> MER74B >> [5298127] chr21_gl000210_random [26438, 26596] - >> | >> MIRc >> [5298128] chr21_gl000210_random [26882, 27022] - >> | >> MIRc >> [5298129] chr21_gl000210_random [27297, 27447] + >> | >> HAL1-2a_MD >> [5298130] chr21_gl000210_random [27469, 27682] + >> | >> HAL1-2a_MD >> score >> <numeric> >> [1] 2147 >> [2] 2626 >> [3] 626 >> [4] 12545 >> [5] 8050 >> ... ... >> [5298126] 1674 >> [5298127] 308 >> [5298128] 475 >> [5298129] 371 >> [5298130] 370 >> --- >> seqlengths: >> chr1 chr2 ... >> chr18_gl000207_random >> 249250621 243199373 ... >> 4262 >> >> This is a GRanges of all features from UCSC's Repeat Masker table. >> >> >> Nice to see the "name" and "score" metadata cols but I wonder why the >> original UCSC names for these cols (which are "repName" and "swScore") >> were not preserved. Also, other UCSC cols in the rmsk table at UCSC >> might be of interest (e.g. "repClass" and "repFamily"). >> >> FWIW, here is one way to get these cols: >> >> local_file <- tempfile() >> >> download.file("http://__hgdownload.soe.ucsc.edu/__ >> goldenPath/hg19/database/rmsk.__txt.gz >> <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.txt.gz="">> >", >> local_file) >> >> ## Get the col names from >> ## >> http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql >> >> <http: hgdownload.soe.ucsc.edu="" goldenpath="" hg19="" database="" rmsk.sql=""> >> COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns", >> "genoName", "genoStart", "genoEnd", "genoLeft", >> "strand", "repName", "repClass", "repFamily", >> "repStart", "repEnd", "repLeft", "id") >> >> library(GenomicRanges) >> df <- read.table(local_file, col.names=COLNAMES) >> rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, >> seqnames.field="genoName", >> start.field="genoStart", >> end.field="genoEnd", >> strand.field="strand", >> starts.in.df.are.0based=TRUE) >> >> Then: >> >> > head(rmsk) >> GRanges with 6 ranges and 13 metadata columns: >> seqnames ranges strand | bin swScore >> milliDiv milliDel >> <rle> <iranges> <rle> | <integer> <integer> >> <integer> <integer> >> [1] chr1 [10001, 10468] + | 585 1504 >> 13 4 >> [2] chr1 [10469, 11447] - | 585 3612 >> 114 270 >> [3] chr1 [11504, 11675] - | 585 437 >> 235 186 >> [4] chr1 [11678, 11780] - | 585 239 >> 294 19 >> [5] chr1 [15265, 15355] - | 585 318 >> 230 38 >> [6] chr1 [16713, 16749] + | 585 203 >> 162 0 >> milliIns genoLeft repName repClass repFamily >> repStart >> <integer> <integer> <factor> <factor> <factor> >> <integer> >> [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat >> 1 >> [2] 13 -249239174 TAR1 Satellite telo >> -399 >> [3] 35 -249238946 L1MC LINE L1 >> -2236 >> [4] 10 -249238841 MER5B DNA hAT- Charlie >> -74 >> [5] 0 -249235266 MIR3 SINE MIR >> -119 >> [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat >> 1 >> repEnd repLeft id >> <integer> <integer> <integer> >> [1] 463 0 1 >> [2] 1712 483 2 >> [3] 5646 5449 3 >> [4] 104 1 4 >> [5] 143 49 5 >> [6] 37 0 6 >> >> I also tried with rtracklayer but got the following error: >> >> > library(rtracklayer) >> > session <- browserSession() >> > genome(session) <- "hg19" >> > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk") >> > system.time(rmsk <- getTable(query)) >> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, >> na.strings, : >> line 3813855 did not have 17 elements >> Timing stopped at: 551.027 7.24 1304.386 >> >> Could be due to my flaky internet connection though... >> >> >> >> This is just hitting up against the row count limit of the table >> browser. rtracklayer uses the table browser as an abstraction, but it >> has its limitations. >> > > Thanks for the reminder about the row count limit of UCSC table browser. > > How hard would it be to modify getTable() to query directly the MySQL > server at UCSC? That would not only work around the row count limit of > the table browser, but would also make getTable() at least 10x or 20x > faster. According to my timings, that would make getTable() almost as > fast as doing hub$goldenpath.hg19.database.rmsk_0.0.1.RData (46.5 s > vs 40.5 s) even though the comparison is unfair because getTable() > retrieves the 17 cols instead of only 6 when downloading from > AnnotationHub. > > That would also benefit things like makeTranscriptDbFromUCSC() > and makeFeatureDbFromUCSC(), which spend most of their time in > getTable(). > > Thanks, > H. > > >> Anyway, it would be great if the AnnotationHub track could be improved. >> Those extra fields are indeed useful. >> >> Cheers, >> H. >> >> > sessionInfo() >> R version 3.1.1 (2014-07-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods >> [8] base >> >> other attached packages: >> [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19 >> [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19 >> [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6 >> [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0 >> [10] digest_0.6.4 fail_1.2 foreach_1.4.2 >> [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3 >> [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2 >> [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 >> [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1 >> >> >> Best, >> >> Jim >> >> >> >> >> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois >> <hnorpois at="" gmail.com="" <mailto:hnorpois="" at="" gmail.com="">> wrote: >> >> Hello, >> >> I would like to have repeat sequences as GRanges object >> I started with ... >> >> library (BSgenome.Hsapiens.UCSC.hg19) >> ch1 <- Hsapiens$chr1 >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE FALSE FALSE >> active (masks(ch1))["RM"] <- TRUE >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE TRUE FALSE >> >> Can anyboldy give me a hint how to continue. >> >> Thanks >> hermann >> >> [[alternative HTML version deleted]] >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">> > >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__ >> conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.="">> conductor> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
@hermann-norpois-5726
Last seen 9.0 years ago
Germany
I continued ... active (masks (ch1))["AMP"] <- FALSE masks (ch1) MaskCollection of length 4 and width 249250621 masks: maskedwidth maskedratio active names desc 1 23970000 0.09616827 FALSE AGAPS assembly gaps 2 0 0.00000000 TRUE AMB intra-contig ambiguities (empty) 3 114014472 0.45742904 TRUE RM RepeatMasker 4 1581889 0.00634658 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 138071094 0.5539448 all active masks together: maskedwidth maskedratio 114014472 0.457429 Reversed the masks gg <- gaps (ch1) > gg <- collapse (gg) > gu <- as (gg, "Views") > sum (width (gu)) [1] 114014472 # same as maskedwidth (masks (ch1)) So I just have to transform gu into an GRanges object. I hope my solution is correct. Thanks Hermann 2014-09-11 12:16 GMT+02:00 Hermann Norpois <hnorpois at="" gmail.com="">: > Hello, > > I would like to have repeat sequences as GRanges object > I started with ... > > library (BSgenome.Hsapiens.UCSC.hg19) > ch1 <- Hsapiens$chr1 > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE FALSE FALSE > active (masks(ch1))["RM"] <- TRUE > active (masks (ch1)) > AGAPS AMB RM TRF > TRUE TRUE TRUE FALSE > > Can anyboldy give me a hint how to continue. > > Thanks > hermann > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
@hermann-norpois-5726
Last seen 9.0 years ago
Germany
Thanks, I followed your instructions but I got an error message: library (AnnotationHub) > hub <- AnnotationHub() > seq.masked <- hub$goldenpath.hg19.database.rmsk_0.0.1.RData Retrieving ‘goldenpath/hg19/database/rmsk_0.0.1.RData’ Fehler: Lesefehler aus Verbindung # 'Fehler' means error and 'Verbindung' connection But I was able to download the package. So, principally a "connection" is possible. sessionInfo () R version 3.1.0 (2014-04-10) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] AnnotationHub_1.0.2 GenomicRanges_1.12.5 IRanges_1.18.4 [4] BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.6 Biobase_2.20.1 BiocInstaller_1.10.4 [4] compiler_3.1.0 DBI_0.3.0 rjson_0.2.14 [7] RSQLite_0.11.4 stats4_3.1.0 tools_3.1.0 Whats wrong with my connection? Thanks Hermann 2014-09-12 15:30 GMT+02:00 James W. MacDonald <jmacdon@uw.edu>: > Hi Hermann, > > How about this: > > > library(AnnotationHub) > > hub <- AnnotationHub() > > hub$goldenpath.hg19.database.rmsk_0.0.1.RData > GRanges with 5298130 ranges and 2 metadata columns: > seqnames ranges strand | > name > <rle> <iranges> <rle> | > <character> > [1] chr1 [16777161, 16777470] + | > AluSp > [2] chr1 [25165801, 25166089] - | > AluY > [3] chr1 [33553607, 33554646] + | > L2b > [4] chr1 [50330064, 50332153] + | > L1PA10 > [5] chr1 [58720068, 58720973] - | > L1PA2 > ... ... ... ... ... > ... > [5298126] chr21_gl000210_random [25379, 25875] + | > MER74B > [5298127] chr21_gl000210_random [26438, 26596] - | > MIRc > [5298128] chr21_gl000210_random [26882, 27022] - | > MIRc > [5298129] chr21_gl000210_random [27297, 27447] + | > HAL1-2a_MD > [5298130] chr21_gl000210_random [27469, 27682] + | > HAL1-2a_MD > score > <numeric> > [1] 2147 > [2] 2626 > [3] 626 > [4] 12545 > [5] 8050 > ... ... > [5298126] 1674 > [5298127] 308 > [5298128] 475 > [5298129] 371 > [5298130] 370 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > This is a GRanges of all features from UCSC's Repeat Masker table. > > Best, > > Jim > > > > > On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois@gmail.com> > wrote: > >> Hello, >> >> I would like to have repeat sequences as GRanges object >> I started with ... >> >> library (BSgenome.Hsapiens.UCSC.hg19) >> ch1 <- Hsapiens$chr1 >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE FALSE FALSE >> active (masks(ch1))["RM"] <- TRUE >> active (masks (ch1)) >> AGAPS AMB RM TRF >> TRUE TRUE TRUE FALSE >> >> Can anyboldy give me a hint how to continue. >> >> Thanks >> hermann >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT

Login before adding your answer.

Traffic: 1058 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6