saving GRanges objects - resulting file size issue

0

Entering edit mode

Andrew Jaffe ▴ 120

@andrew-jaffe-4820

Last seen 11.4 years ago

Hey, I'm trying to save several GRanges objects in the same rda file, but for some reason, one of the smaller GRanges objects (~23Mb) makes the saved file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and have never seen this problem before: > class(exons) [1] "GRanges" attr(,"package") [1] "GenomicRanges" > class(utr5seg) [1] "GRanges" attr(,"package") [1] "GenomicRanges" > class(utr3seg) [1] "GRanges" attr(,"package") [1] "GenomicRanges" > class(introns) [1] "GRanges" attr(,"package") [1] "GenomicRanges" > save(exons, utr5seg, utr3seg, trans, tInfo, t2g, # introns, file="annotation-tables-forRNAseq.rda") the resulting saved object size in my directory is 23mb > print(object.size(introns),units="Mb") 22.9 Mb but when i include this `introns` GRanges object by running: save(exons, utr5seg, utr3seg, trans, tInfo, t2g, introns, file="annotation-tables-forRNAseq.rda") the object size in my directory got up to 9Gb, and I have to kill the save. any idea whats going on?? > introns GRanges with 691807 ranges and 2 metadata columns: seqnames ranges strand | tx_id <rle> <iranges> <rle> | <character> [1] chr1 [ 12228, 12645] + | 1,2,3 [2] chr1 [ 12698, 13402] + | 1,2,3 [3] chr1 [322229, 324287] + | 7 [4] chr1 [324346, 324438] + | 7 [5] chr1 [324061, 324287] + | 8 ... ... ... ... ... ... [691803] chrY [27216989, 27218792] - | 76941 [691804] chrY [27218869, 27245878] - | 76941 [691805] chrY [27329896, 27330860] - | 76942 [691806] chrY [59359509, 59360006] - | 76950 [691807] chrY [59360116, 59360500] - | 76950 tx_name <character> [1] uc001aaa.3,uc010nxq.1,uc010nxr.1 [2] uc001aaa.3,uc010nxq.1,uc010nxr.1 [3] uc009vjk.2 [4] uc009vjk.2 [5] uc001aau.3 ... ... [691803] uc011nbv.2 [691804] uc011nbv.2 [691805] uc004fwt.3 [691806] uc011ncc.1 [691807] uc011ncc.1 --- seqlengths: chr1 chr2 chr3 chr4 ... chr22 chrX chrY 249250621 243199373 198022430 191154276 ... 51304566 155270560 59373566 Thanks a ton, Andrew [[alternative HTML version deleted]]

GO GO • 4.2k views

ADD COMMENT • link updated 12.4 years ago by Valerie Obenchain ★ 6.8k • written 12.4 years ago by Andrew Jaffe ▴ 120

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hi Andrew, I can't reproduce this problem. Can you successfully save the files in a different order, i.e., is it just the addition of 'introns' that is giving you trouble? I assume these are custom annotation GRanges that did not come from a TxDb in Bioconductor. Can you make the files available for testing? Valerie On 08/16/2013 10:13 AM, Andrew Jaffe wrote: > Hey, > > I'm trying to save several GRanges objects in the same rda file, but for > some reason, one of the smaller GRanges objects (~23Mb) makes the saved > file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and have never > seen this problem before: > >> class(exons) > [1] "GRanges" > attr(,"package") > [1] "GenomicRanges" >> class(utr5seg) > [1] "GRanges" > attr(,"package") > [1] "GenomicRanges" >> class(utr3seg) > [1] "GRanges" > attr(,"package") > [1] "GenomicRanges" >> class(introns) > [1] "GRanges" > attr(,"package") > [1] "GenomicRanges" >> save(exons, utr5seg, utr3seg, trans, tInfo, t2g, # introns, > file="annotation-tables-forRNAseq.rda") > > the resulting saved object size in my directory is 23mb > >> print(object.size(introns),units="Mb") > 22.9 Mb > > but when i include this `introns` GRanges object by running: > > save(exons, utr5seg, utr3seg, trans, tInfo, t2g, introns, > file="annotation-tables-forRNAseq.rda") > > the object size in my directory got up to 9Gb, and I have to kill the save. > any idea whats going on?? > >> introns > GRanges with 691807 ranges and 2 metadata columns: > seqnames ranges strand | tx_id > <rle> <iranges> <rle> | <character> > [1] chr1 [ 12228, 12645] + | 1,2,3 > [2] chr1 [ 12698, 13402] + | 1,2,3 > [3] chr1 [322229, 324287] + | 7 > [4] chr1 [324346, 324438] + | 7 > [5] chr1 [324061, 324287] + | 8 > ... ... ... ... ... ... > [691803] chrY [27216989, 27218792] - | 76941 > [691804] chrY [27218869, 27245878] - | 76941 > [691805] chrY [27329896, 27330860] - | 76942 > [691806] chrY [59359509, 59360006] - | 76950 > [691807] chrY [59360116, 59360500] - | 76950 > tx_name > <character> > [1] uc001aaa.3,uc010nxq.1,uc010nxr.1 > [2] uc001aaa.3,uc010nxq.1,uc010nxr.1 > [3] uc009vjk.2 > [4] uc009vjk.2 > [5] uc001aau.3 > ... ... > [691803] uc011nbv.2 > [691804] uc011nbv.2 > [691805] uc004fwt.3 > [691806] uc011ncc.1 > [691807] uc011ncc.1 > --- > seqlengths: > chr1 chr2 chr3 chr4 ... chr22 chrX chrY > 249250621 243199373 198022430 191154276 ... 51304566 155270560 59373566 > > > Thanks a ton, > Andrew > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 12.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Charles Berry ▴ 290

@charles-berry-5754

Last seen 6.8 years ago

United States

Valerie Obenchain <vobencha at="" ...=""> writes: > > Hi Andrew, > > I can't reproduce this problem. Can you successfully save the files in a > different order, i.e., is it just the addition of 'introns' that is > giving you trouble? > > I assume these are custom annotation GRanges that did not come from a > TxDb in Bioconductor. Can you make the files available for testing? > One thing to watch for is an object in the mcols() or metadata() with an attr(*, ".Environment") that includes large objects. formula objects are often culprits in this regard. Try str() on the GRanges and look for stuff like attr(*, ".Environment")=<environment: 0x10e28c118=""> There were some discussions on R-devel about inexplicably large save()'s in the last year or so. IIRC, these were the result of saving an object containing a formula whose environment (containing some large objects) was added to the .rda. > Valerie > > On 08/16/2013 10:13 AM, Andrew Jaffe wrote: > > Hey, > > > > I'm trying to save several GRanges objects in the same rda file, but for > > some reason, one of the smaller GRanges objects (~23Mb) makes the saved > > file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and have never > > seen this problem before: [snip]

ADD COMMENT • link 12.4 years ago Charles Berry ▴ 290

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hi Andrew, I'm copying the discussion back to the list. It looks like you're interested in grouping introns by gene with metadata columns of 'tx_id' and 'tx_name'. Below I've included some code that I think is a more direct approach than the example you were using. It uses select for mapping between ids and creates CharacterLists for the metadata. The final object is 15 Mb and I had no trouble saving. I haven't had time to reproduce your object and check for problems saving. I'll post back when I've had a chance to look into it. Valerie library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene introns <- intronsByTranscript(txdb) ## Get gene_id and tx_name associated with tx_id map <- select(txdb, keys=names(introns), keytype="TXID", columns=c("TXID", "TXNAME", "GENEID")) ## Subset on transcripts that have a GENEID map <- map[!is.na(map$GENEID),] ## Group introns by gene tx_names <- CharacterList(split(map$TXNAME, map$GENEID)) tx_id <- CharacterList(split(map$TXID, map$GENEID)) tx2gene <- data.frame(tx=unlist(unname(tx_id)), gene=rep(names(tx_id), elementLengths(tx_id))) txidRep <- rep(names(introns), elementLengths(introns)) geneRep <- tx2gene$gene[match(txidRep, tx2gene$tx)] ## GRangesList of introns by gene intronsByGene <- split(introns at unlistData, geneRep) intronsByGene <- reduce(intronsByGene) ## Add metadata to the list elements. ## Access with mcols on the GRangesList. values(intronsByGene) <- DataFrame(tx_id, tx_names) mcols(intronsByGene) > print(object.size(intronsByGene), units="Mb") 14.9 Mb On 08/19/2013 11:10 AM, Andrew Jaffe wrote: > The `introns` object did originally come from the TxDb, via the > intronsByTranscript wrapper function, but I was trying to just get the > unique locations of introns without double counting across different > transcripts sharing introns. I'm trying to rerun the code from scratch > to see if it reproduces the large file size problem. The code is > attached. The order by which I saved the R objects did not fix the problem. > > Basically, I'm trying to annotate genomic regions from RNAseq based on > exons, introns, and UTRs, but I am trying to make it gene-centric > instead of transcript-centric like the TxDb hg19 known gene object > currently is. So I try to do some reducing of exons within transcripts > first. The general idea is, for a region of interest (chr:start- end), > how many unique exons, introns, and UTRs does the region cross, where > overlapping regions between these annotations are given priority to > exons (e.g. if one transcript of a gene is short and its UTR is where > there is an exon of the another transcript for that gene, then the > region should only add towards the exon overlap count). > > > >> > >>> *From: *Andrew Jaffe <ajaffe at="" jhsph.edu=""> >>> > >>> On Aug 16, 2013, at 1:13 PM, Andrew Jaffe wrote: > >>> > >>> > Hey, > >>> > > >>> > I'm trying to save several GRanges objects in the same rda > file, > >>> but for > >>> > some reason, one of the smaller GRanges objects (~23Mb) makes > >>> the saved > >>> > file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and > >>> have never > >>> > seen this problem before: > >>> > > >>> >> class(exons) > >>> > [1] "GRanges" > >>> > attr(,"package") > >>> > [1] "GenomicRanges" > >>> >> class(utr5seg) > >>> > [1] "GRanges" > >>> > attr(,"package") > >>> > [1] "GenomicRanges" > >>> >> class(utr3seg) > >>> > [1] "GRanges" > >>> > attr(,"package") > >>> > [1] "GenomicRanges" > >>> >> class(introns) > >>> > [1] "GRanges" > >>> > attr(,"package") > >>> > [1] "GenomicRanges" > >>> >> save(exons, utr5seg, utr3seg, trans, tInfo, t2g, # introns, > >>> > file="annotation-tables-forRNAseq.rda") > >>> > > >>> > the resulting saved object size in my directory is 23mb > >>> > > >>> >> print(object.size(introns),units="Mb") > >>> > 22.9 Mb > >>> > > >>> > but when i include this `introns` GRanges object by running: > >>> > > >>> > save(exons, utr5seg, utr3seg, trans, tInfo, t2g, introns, > >>> > file="annotation-tables-forRNAseq.rda") > >>> > > >>> > the object size in my directory got up to 9Gb, and I have to > >>> kill the save. > >>> > any idea whats going on?? > >>> > > >>> >> introns > >>> > GRanges with 691807 ranges and 2 metadata columns: > >>> > seqnames ranges strand | tx_id > >>> > <rle> <iranges> <rle> | <character> > >>> > [1] chr1 [ 12228, 12645] + | 1,2,3 > >>> > [2] chr1 [ 12698, 13402] + | 1,2,3 > >>> > [3] chr1 [322229, 324287] + | 7 > >>> > [4] chr1 [324346, 324438] + | 7 > >>> > [5] chr1 [324061, 324287] + | 8 > >>> > ... ... ... ... ... ... > >>> > [691803] chrY [27216989, 27218792] - | 76941 > >>> > [691804] chrY [27218869, 27245878] - | 76941 > >>> > [691805] chrY [27329896, 27330860] - | 76942 > >>> > [691806] chrY [59359509, 59360006] - | 76950 > >>> > [691807] chrY [59360116, 59360500] - | 76950 > >>> > tx_name > >>> > <character> > >>> > [1] uc001aaa.3,uc010nxq.1,uc010nxr.1 > >>> > [2] uc001aaa.3,uc010nxq.1,uc010nxr.1 > >>> > [3] uc009vjk.2 > >>> > [4] uc009vjk.2 > >>> > [5] uc001aau.3 > >>> > ... ... > >>> > [691803] uc011nbv.2 > >>> > [691804] uc011nbv.2 > >>> > [691805] uc004fwt.3 > >>> > [691806] uc011ncc.1 > >>> > [691807] uc011ncc.1 > >>> > --- > >>> > seqlengths: > >>> > chr1 chr2 chr3 chr4 ... chr22 > >>> chrX chrY > >>> > 249250621 243199373 198022430 191154276 ... 51304566 > >>> 155270560 59373566 > >>> > > >>> > > >>> > Thanks a ton, > >>> > Andrew > >>> > > >>> > [[alternative HTML version deleted]] > >>> > > >>> > _______________________________________________ > >>> > Bioconductor mailing list > >>> > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org="" <mailto:bioconductor="" at="" r-project.org="">> > >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> > Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >> > > > >

ADD COMMENT • link 12.4 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.