Question

converting a GRangesList object with different lengths into a data frame

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 8 weeks ago

Germany

Following my earlier question on summarizing scores of GRanges into bins, and advancing one step at a time, I would now like to convert a GRangesList object into a data.frame in which the score columns (metadata columns) of the different GRanges in the list become separate columns. The object looks like this:

>tiles.list
GRangesList object of length 3:
$15S_rRNA 
GRanges object with 100 ranges and 1 metadata column:
                    seqnames       ranges strand |            score
                       <Rle>    <IRanges>  <Rle> |        <numeric>
  15S_rRNA.15S_rRNA       MT [6546, 6561]      * | 47.0025219774636
  15S_rRNA.15S_rRNA       MT [6562, 6577]      * | 52.4692503895184
                ...      ...          ...    ... .              ...
  15S_rRNA.15S_rRNA       MT [8162, 8177]      * | 131.070537758245
  15S_rRNA.15S_rRNA       MT [8178, 8193]      * | 133.993728100123
$21S_rRNA 
GRanges object with 100 ranges and 1 metadata column:
                    seqnames         ranges strand |            score
                       <Rle>      <IRanges>  <Rle> |        <numeric>
  21S_rRNA.21S_rRNA       MT [58009, 58052]      * |   11.61435429513
  21S_rRNA.21S_rRNA       MT [58053, 58096]      * | 13.9056586769545
                ...      ...            ...    ... .              ...
  21S_rRNA.21S_rRNA       MT [62359, 62402]      * | 65.9285146503723
  21S_rRNA.21S_rRNA       MT [62403, 62447]      * | 113.348199738504
$YAL037C-A 
GRanges object with 93 ranges and 1 metadata column:
                      seqnames         ranges strand |            score
                         <Rle>      <IRanges>  <Rle> |        <numeric>
  YAL037C-A.YAL037C-A        I [73426, 73426]      * | 242.417848776282
  YAL037C-A.YAL037C-A        I [73427, 73427]      * | 246.146507583353
                  ...      ...            ...    ... .              ...
  YAL037C-A.YAL037C-A        I [73517, 73517]      * | 221.726874447293
  YAL037C-A.YAL037C-A        I [73518, 73518]      * | 220.070233632405

-------
seqinfo: 17 sequences from an unspecified genome; no seqlengths

Each GRanges in the GRangesList object has a metadata column with scores. I would like to convert this list into a matrix whose columns hold the scores and whose row names are numbered 1-100, so it should look like this:

               15S_rRNA          21S_rRNA           YAL037C-A
1      47.0025219774636    11.61435429513    242.417848776282
2      52.4692503895184  13.9056586769545    246.146507583353
...
99    131.070537758245   65.9285146503723                 NA
100   133.993728100123   113.348199738504                 NA

The last GRanges object, which has only 93 ranges, should be padded with NA (or 0) in the remaining rows when converting to the data.frame.

I know how to do this when all elements have 100 ranges, for example with do.call(cbind.data.frame, tiles.list) and then deleting the unwanted columns. But how do I combine a list of GRanges with different lengths into one big data frame?

Any help would be appreciated.

Thanks Assa

P.S.

The dput(tiles.tiles) can be found here

genomicranges grangeslist dataframe irangeslist lists • 6.9k views

ADD COMMENT • link updated 7.5 years ago by Michael Lawrence ★ 11k • written 7.5 years ago by Assa Yeroslaviz ★ 1.5k

Marcel Ramos 700

@marcel-ramos-7325

Last seen 8 weeks ago

United States

Hi Assa Yeroslaviz,

If you collapse into a single data.frame, each row will represent a different genomic location. You may not want this.

Nevertheless, if you do want to go ahead and do this, you can try this:

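# Extract the metadata columns (here only "score") from each GRanges
scoreList <- lapply(tiles.list, function(x) mcols(x))
# Pad the third element (93 ranges) with NA up to 100 rows
scoreList[[3]][94:100, ] <- NA
# Column-bind the padded elements into one DataFrame
Reduce(cbind, scoreList)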
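
When the element lengths are not known in advance, the padding step can be generalized instead of hard-coding `94:100`. The following is a minimal base R sketch, not part of the original answer. It assumes the scores have already been pulled out into a plain list of numeric vectors, for example with `lapply(tiles.list, function(x) x$score)`; the toy `scores` list and its names are made up for illustration.

```r
# Toy stand-in for lapply(tiles.list, function(x) x$score);
# names and values are illustrative only.
scores <- list(geneA = c(1.5, 2.5, 3.5),
               geneB = c(10, 20),
               geneC = c(7, 8, 9, 10))

# The longest element decides the number of rows.
n <- max(lengths(scores))

# Indexing a vector past its end yields NA, so s[seq_len(n)]
# pads every shorter vector with trailing NAs.
padded <- lapply(scores, function(s) s[seq_len(n)])

# Column-bind into a numeric matrix; the list names become column names.
score_mat <- do.call(cbind, padded)
```

Because the padding length is computed from the data, the same code should work for any number of list elements, including the case where all of them already have the same length.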

I suggest the use of RaggedExperiment for matrix representation of ragged metadata columns. This will take into account any matching row ranges in your data.

library(RaggedExperiment)
# Convert GRangesList to RaggedExperiment
ragTile <- RaggedExperiment(tiles.list)
# Create matrix of all values across GRangesList elements
assay(ragTile, i = "score")
# Combine if possible, any matching ranges
compactAssay(ragTile, i = "score")

In this case, there are no matching ranges across the elements of the GRangesList.

Best Regards, Marcel

ADD COMMENT • link 7.5 years ago Marcel Ramos 700

Thanks, Marcel, for suggesting RaggedExperiment, but it is not what I need here: I already know that there are no common regions. I had already thought of the first option. In my case I have over 3000 genomic regions; many of them have identical lengths and others differ, so I can't hard-code a specific index range as you did. I have already managed to convert each element to a data.frame and combine them into one big data.frame with Reduce() and cbind(), but I was hoping for a more efficient method.

ADD REPLY • link 7.5 years ago Assa Yeroslaviz ★ 1.5k

score 4 · Accepted Answer · 2017-09-13

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

I don't immediately see how arranging the data in this way is useful. But the best way would be to coerce to data.frame and then use reshape() to move to wide form. I guess the tricky part is getting a variable representing the subscript within each GRanges. I've called that "row" below.

df <- as.data.frame(tiles.list)
df$row <- as.integer(IRanges(1L, width=lengths(tiles.list)))
wide <- reshape(df[c("row", "group_name", "score")], direction="wide", 
                timevar="group_name", idvar="row")
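
To see what the reshape step produces without loading any Bioconductor package, here is a small base R sketch. The toy `df` is a made-up stand-in for `as.data.frame(tiles.list)` that reproduces only the `group_name` and `score` columns. The within-group index is built with `sequence()`, which on the real object should be equivalent to the `IRanges`-based line above when called as `sequence(lengths(tiles.list))`; that equivalence is an assumption, not something stated in the answer.

```r
# Toy long-format data.frame standing in for as.data.frame(tiles.list);
# only the two columns used below are reproduced, values are made up.
df <- data.frame(
  group_name = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  score      = c(1.5, 2.5, 3.5, 10, 20),
  stringsAsFactors = FALSE
)

# Position within each group: 1..n, restarting for every group.
# Assumes the rows of each group are contiguous, as in the toy data.
n_per_group <- rle(df$group_name)$lengths
df$row <- sequence(n_per_group)

# Long to wide: one row per position, one score column per group.
wide <- reshape(df[c("row", "group_name", "score")], direction = "wide",
                timevar = "group_name", idvar = "row")

# reshape() prefixes the new columns with "score."; strip the prefix.
names(wide) <- sub("^score\\.", "", names(wide))
```

Positions that do not exist in a shorter group are filled with NA by `reshape()`, which is the padding asked for in the question.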

ADD COMMENT • link 7.5 years ago Michael Lawrence ★ 11k

Thanks Michael, this is really smooth. I know this is a weird presentation of the data. I need this big data.frame of scores so that I can plot the gene intensities on top of each other, either as a heat map or as a line plot. For that reason I need the "gene lengths" to be identical. The x-axis shows the gene positions (in my case 1-100) and the y-axis shows the intensities (in my case the averaged scores per region).

Unfortunately, I couldn't find a better way of plotting the gene intensities over all genes per sample.

The idea is to get something similar to this one here:

ADD REPLY • link 7.5 years ago Assa Yeroslaviz ★ 1.5k

Ok. I think you could make a plot like the above using the long form. Certainly in ggplot2 or lattice, and probably in base. One issue that may not apply in your case is splicing. The simple code above will not handle the case of an intron within the first 100 bp. For that, you'll want to look into pmapToTranscripts().
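
As a rough illustration of plotting directly from the long form, here is a sketch in base graphics (chosen only to avoid extra dependencies; ggplot2 or lattice would work along the same lines). The toy `df` again stands in for `as.data.frame(tiles.list)` with a within-group `row` index added, and all values are made up.

```r
# Toy long-format data: one row per (gene, position) pair.
df <- data.frame(
  group_name = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  row        = c(1L, 2L, 3L, 1L, 2L),
  score      = c(1.5, 2.5, 3.5, 10, 20),
  stringsAsFactors = FALSE
)

# One data.frame per gene; genes may have different numbers of rows.
groups <- split(df, df$group_name)

# Empty plot spanning all positions and scores, then one line per gene.
plot(range(df$row), range(df$score), type = "n",
     xlab = "Position within gene", ylab = "Score")
for (i in seq_along(groups)) {
  lines(groups[[i]]$row, groups[[i]]$score, col = i)
}
legend("topleft", legend = names(groups), col = seq_along(groups), lty = 1)
```

No NA padding is needed here, because each gene is drawn from its own rows and shorter genes simply produce shorter lines.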

ADD REPLY • link 7.5 years ago Michael Lawrence ★ 11k

Thanks for suggesting this function; it is worth knowing for later cases. I am aware of the exon problem. Luckily, we are working on S. cerevisiae and are interested in the complete transcript, so introns are not an issue for us. But this function looks very interesting.

ADD REPLY • link 7.5 years ago Assa Yeroslaviz ★ 1.5k