Question: converting a GRangesList object with different lengths into a data frame
12 days ago by
Assa Yeroslaviz1.3k
Munich, Germany
Assa Yeroslaviz1.3k wrote:

Following my earlier question on summarizing scores of GRanges into bins, and advancing one step at a time, I would now like to convert a GRangesList object into a data.frame in which the score column (metadata column) of each GRanges in the list becomes a separate column. The object looks like this:

GRangesList object of length 3:
GRanges object with 100 ranges and 1 metadata column:
                    seqnames       ranges strand |            score
                       <Rle>    <IRanges>  <Rle> |        <numeric>
  15S_rRNA.15S_rRNA       MT [6546, 6561]      * | 47.0025219774636
  15S_rRNA.15S_rRNA       MT [6562, 6577]      * | 52.4692503895184
                ...      ...          ...    ... .              ...
  15S_rRNA.15S_rRNA       MT [8162, 8177]      * | 131.070537758245
  15S_rRNA.15S_rRNA       MT [8178, 8193]      * | 133.993728100123
GRanges object with 100 ranges and 1 metadata column:
                    seqnames         ranges strand |            score
                       <Rle>      <IRanges>  <Rle> |        <numeric>
  21S_rRNA.21S_rRNA       MT [58009, 58052]      * |   11.61435429513
  21S_rRNA.21S_rRNA       MT [58053, 58096]      * | 13.9056586769545
                ...      ...            ...    ... .              ...
  21S_rRNA.21S_rRNA       MT [62359, 62402]      * | 65.9285146503723
  21S_rRNA.21S_rRNA       MT [62403, 62447]      * | 113.348199738504
GRanges object with 93 ranges and 1 metadata column:
                      seqnames         ranges strand |            score
                         <Rle>      <IRanges>  <Rle> |        <numeric>
  YAL037C-A.YAL037C-A        I [73426, 73426]      * | 242.417848776282
  YAL037C-A.YAL037C-A        I [73427, 73427]      * | 246.146507583353
                  ...      ...            ...    ... .              ...
  YAL037C-A.YAL037C-A        I [73517, 73517]      * | 221.726874447293
  YAL037C-A.YAL037C-A        I [73518, 73518]      * | 220.070233632405

seqinfo: 17 sequences from an unspecified genome; no seqlengths

Each of the GRanges in the GRangesList object has a metadata column with scores. I would like to convert this list into a matrix with one column of scores per GRanges and rows numbered 1-100, so it should look like this:

               15S_rRNA          21S_rRNA           YAL037C-A
1      47.0025219774636    11.61435429513    242.417848776282
2      52.4692503895184  13.9056586769545    246.146507583353
...                 ...               ...                 ...
99     131.070537758245  65.9285146503723                  NA
100    133.993728100123  113.348199738504                  NA

The last GRanges object, which has only 93 ranges, should be padded with NA (or 0) when converting to the data.frame.

I know how to do it when they all have 100 ranges (for example, by combining the elements of tiles.list and then deleting the unwanted columns), but how do I combine a list of GRanges with different lengths into one big data frame?

Any help would be appreciated.

Thanks Assa


The dput(tiles.tiles) can be found here

11 days ago by
United States
Michael Lawrence9.6k wrote:

I don't immediately see how arranging the data in this way is useful. But the best way would be to coerce to data.frame and then use reshape() to move to wide form. I guess the tricky part is getting a variable representing the subscript within each GRanges. I've called that "row" below.

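# Coerce the GRangesList to a long data.frame, one row per range
# (as.data.frame() is assumed here, following the text above)
df <- as.data.frame(tiles.list)
# Index of each range within its own GRanges: 1..n for every list element
df$row <- as.integer(IRanges(1L, width=lengths(tiles.list)))
# Long to wide: one score column per group_name, NA where a group is shorter
wide <- reshape(df[c("row", "group_name", "score")], direction="wide", 
                timevar="group_name", idvar="row")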
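
To see how the wide reshape handles groups of different lengths without needing the real tiles.list, here is a minimal sketch on a toy data frame. The column names group_name and score mirror those used above; the data and the use of sequence() with rle() to build the row index are assumptions for illustration only.

```r
# Toy long-form data: group "a" has 3 rows, group "b" only 2
df <- data.frame(
  group_name = c("a", "a", "a", "b", "b"),
  score      = c(1.5, 2.5, 3.5, 10, 20)
)

# Position of each row within its group (1..n per group);
# assumes the rows of each group are contiguous
df$row <- sequence(rle(df$group_name)$lengths)

# Long to wide: one score column per group, padded with NA
wide <- reshape(df[c("row", "group_name", "score")], direction = "wide",
                timevar = "group_name", idvar = "row")
print(wide)
```

The shorter group is padded automatically: reshape() matches on row, so any row index that a group lacks ends up as NA in that group's column.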



Thanks Michael, this is really smooth. I know this is a weird presentation of the data. I need this big data.frame of scores to plot the gene intensities on top of each other (either as a heat map or as a line plot), which is why I needed the "gene lengths" to be identical. On the x-axis I have the gene positions (in my case 1-100) and on the y-axis the intensities (in my case the averaged scores per region).

Unfortunately I couldn't find a better way of plotting the gene intensities over all genes per sample.

The idea is to get something similar to this one here:

ADD REPLYlink written 11 days ago by Assa Yeroslaviz1.3k

Ok. I think you could make a plot like the above using the long form. Certainly in ggplot2 or lattice, and probably in base. One issue that may not apply in your case is splicing. The simple code above will not handle the case of an intron within the first 100 bp. For that, you'll want to look into pmapToTranscripts().
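
As a rough illustration of plotting straight from the long form, here is a sketch using lattice. The toy data frame stands in for the coerced GRangesList; its gene names and values are made up, and only the column names (group_name, row, score) follow the code above.

```r
library(lattice)

# Toy long-form data standing in for the coerced GRangesList
df <- data.frame(
  group_name = rep(c("geneA", "geneB"), times = c(5, 3)),
  row        = c(1:5, 1:3),
  score      = c(2, 4, 6, 8, 10, 1, 3, 5)
)

# One line per gene; a shorter gene simply stops early, no NA padding needed
p <- xyplot(score ~ row, data = df, groups = group_name, type = "l",
            auto.key = TRUE, xlab = "Position (bin)", ylab = "Score")
print(p)
```

Because each gene is drawn from its own rows, the unequal lengths never have to be reconciled into a rectangular table.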

ADD REPLYlink written 11 days ago by Michael Lawrence9.6k

Thanks for suggesting this function; it is worth knowing for later cases. I am aware of the exon problem. Luckily we are working on S. cerevisiae and have no intron problem, as we are interested in the complete transcript. But this function looks very interesting.

ADD REPLYlink written 10 days ago by Assa Yeroslaviz1.3k
11 days ago by
Marcel Ramos ♦♦ 60
United States
Marcel Ramos ♦♦ 60 wrote:

Hi Assa Yeroslaviz,

If you collapse into a single data.frame, each row will represent a different genomic location. You may not want this.

Nevertheless, if you do want to go ahead and do this, you can try this:

# Take all the score values
scoreList <- lapply(tiles.list, function(x) mcols(x))
# Impute NA
scoreList[[3]][94:100, ] <- NA
# Bind into DataFrame
Reduce(cbind, scoreList)

I suggest the use of RaggedExperiment for matrix representation of ragged metadata columns. This will take into account any matching row ranges in your data.

library(RaggedExperiment)
# Convert GRangesList to RaggedExperiment
ragTile <- RaggedExperiment(tiles.list)
# Create matrix of all values across GRangesList elements
assay(ragTile, i = "score")
# Combine if possible, any matching ranges
compactAssay(ragTile, i = "score")

In this case, there are no matching ranges across the elements of the GRangesList.

Best Regards, Marcel


Thanks Marcel for suggesting RaggedExperiment, but this is not what I am looking for here, as I already know that there are no common regions. I had already thought of the first option. In my case I have over 3000 genomic regions; many of them have identical lengths, others have different lengths, so I can't hard-code a specific number of rows as you did. I have already managed to convert each element to a data.frame and Reduce()-cbind() them into one big data.frame, but I was hoping for a more efficient method.
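
One possible way to avoid hard-coding the number of rows is to pad every score vector to the length of the longest one. The sketch below uses plain toy vectors; with the real data the list could presumably be built with something like lapply(tiles.list, function(x) x$score), which is an assumption and not tested here.

```r
# Toy score vectors of unequal length, standing in for the per-gene scores
scores <- list(geneA = c(1, 2, 3, 4), geneB = c(10, 20), geneC = c(5, 6, 7))

# Length of the longest element decides the number of rows
n_max <- max(lengths(scores))

# `length<-` extends each vector to n_max, filling the tail with NA;
# sapply() then binds the equal-length vectors into a matrix
score_mat <- sapply(scores, `length<-`, n_max)
score_df  <- as.data.frame(score_mat)
print(score_df)
```

This does the padding in a single pass over the list instead of growing a data.frame with repeated cbind() calls, which may matter with several thousand regions.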

ADD REPLYlink modified 11 days ago • written 11 days ago by Assa Yeroslaviz1.3k