How to merge SummarizedExperiment datasets
3
0
Entering edit mode
@annkolman78-21980
Last seen 4.6 years ago

I would like to merge two SummarizedExperiment datasets. Both have the same assays, rownames, and rowData names, but different colnames and colData names. Thank you.

batch1

class: RangedSummarizedExperiment 
dim: 49736 20 
metadata(0):
assays(13): fpkm posterior_mean_count ... fpkm_ci_upperbound fpkm_coefficient_quartile_variation
rownames(49736): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData names(9): gene hgnc ... transcript_count entrezgene
colnames(27): 24-21064 24-21067 ... 24-21242 24-21245
colData names(216): INDIVIDUAL_ID SAMPLE_ID ... gender age

batch2

class: RangedSummarizedExperiment 
dim: 49736 80 
metadata(0):
assays(13): fpkm posterior_mean_count ... fpkm_ci_upperbound fpkm_coefficient_quartile_variation
rownames(49736): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData names(9): gene hgnc ... transcript_count entrezgene
colnames(83): SAMPLE1-77-005-60 SAMPLE2-47-006-60  ... SAMPLE3-95-004-70 SAMPLE4-47-045-60
colData names(223): INDIVIDUAL_ID LIBRARY_ID ... kit status
summarizedexperiment • 5.3k views
ADD COMMENT
0
Entering edit mode

You’ve tagged this with DESeq2 but really it’s a SummarizedExperiment post.

ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States

Do you mean binding the two SummarizedExperiment objects together so that you have a combined 100 samples? Try

cbind(batch1, batch2)
ADD COMMENT
0
Entering edit mode

Thank you for your answer, but unfortunately, I have this error:

Error in FUN(X[[i]], ...) : column(s) 'external_gene_source' in ‘mcols’ are duplicated and the data do not match

batch1 external_gene_source <character>
ENSG00000000003 HGNC Symbol
ENSG00000000005 HGNC Symbol
ENSG00000000419 HGNC Symbol
ENSG00000000457 HGNC Symbol
ENSG00000000460 HGNC Symbol
... ... ... ... ... ... ENSG00000273487 Clone-based (Vega)

batch2 external_gene_source <character> ENSG00000000003 HGNC Symbol
ENSG00000000005 HGNC Symbol
ENSG00000000419 HGNC Symbol
ENSG00000000457 HGNC Symbol
ENSG00000000460 HGNC Symbol
... ... ... ... ... ... ENSG00000273487 Clone-based (Vega) gene

Thank you.

ADD REPLY
0
Entering edit mode

Please take a few seconds to select each code chunk and then click the button with 1's and 0's in the editing toolbar to format your code chunks so that they are more readable.

ADD REPLY
0
Entering edit mode

So why do the external_gene_source columns differ and what do you want to do about it? One possibility is that the rows of the summarized experiment are all present, but in different order. You could then place them into the same order and cbind

setequal(rowData(batch1)$external_gene_source, rowData(batch2)$external_gene_source) ## TRUE
idx <- match(rowData(batch1)$external_gene_source, rowData(batch2)$external_gene_source)
cbind(batch1, batch2[idx,])

(I never get the match() / reorder correct, maybe the batch1 and batch2 in the second line should be reversed...)

But it might be that different rows are present in the different SE, and then you might want to combine only the shared rows, or fill the non-shared rows in each batch with NA...
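
For readers unsure about the direction of match(), here is a minimal sketch with two toy matrices standing in for the assays (the matrices and the names m1, m2 are made up for illustration). match(x, table) returns positions in table, so the order used above (first object as x, second as table, then index the second object) is the right one. The sketch matches on row names rather than on an annotation column, on the assumption that a column such as external_gene_source contains repeated values, in which case match() would only ever return the first hit for each value.

```r
# Toy stand-ins for the assay matrices of batch1 and batch2 (hypothetical data).
m1 <- matrix(1:6, nrow = 3, dimnames = list(c("g1", "g2", "g3"), c("a", "b")))
m2 <- matrix(7:12, nrow = 3, dimnames = list(c("g3", "g1", "g2"), c("c", "d")))

# match(x, table) gives, for each element of x, its position in table,
# so indexing the second object with idx puts it into the first object's order.
idx <- match(rownames(m1), rownames(m2))
stopifnot(!anyNA(idx))                      # every row of m1 must exist in m2
m2_reordered <- m2[idx, , drop = FALSE]

# Only bind once the row order is confirmed to be identical.
stopifnot(identical(rownames(m1), rownames(m2_reordered)))
combined <- cbind(m1, m2_reordered)
```

The same row-name indexing should carry over to SummarizedExperiment objects, e.g. batch2[match(rownames(batch1), rownames(batch2)), ], although cbind() will still insist that the rowData columns agree.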

ADD REPLY
0
Entering edit mode

Thank you for suggestions. As I need data from one assay at the moment I used

combined <- cbind(batch1@assay[[1]], batch2@[[1]])

which worked fine and created a matrix object. However, I want to add some clinical information to this matrix. I have a CSV file, read in as a data frame, with a column of sample IDs that match those in combined plus a few more columns of clinical data (including age). I have tried the following, but neither attempt works:

combined$age <- clinical$age[match(combined [ ,1], clinical$ID)]

Warning message: Coercing LHS to a list

and

combined_clinical <- cbind(combined, clinical)

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 49736, 100

Thank you for help!
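
Both attempts fail for the same underlying reason: combined is a genes-by-samples matrix, so the clinical data describe its columns, not its rows. `$<-` is not defined for a matrix (hence the coercion to a list), and cbind() tries to add the 100 clinical rows as extra columns next to 49736 gene rows. A minimal sketch of aligning the clinical table to the matrix columns instead, using toy data; the column names ID and age follow the post, everything else is hypothetical:

```r
# Hypothetical stand-ins: 'combined' is genes x samples,
# 'clinical' has one row per sample, in a different order.
combined <- matrix(1:6, nrow = 2,
                   dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
clinical <- data.frame(ID = c("s3", "s1", "s2"), age = c(50, 30, 40))

# Reorder the clinical rows so that row i describes column i of 'combined'.
idx <- match(colnames(combined), clinical$ID)
stopifnot(!anyNA(idx))                      # every sample needs clinical data
clinical_aligned <- clinical[idx, , drop = FALSE]
rownames(clinical_aligned) <- clinical_aligned$ID

# Sanity check before using the table as sample annotation.
stopifnot(identical(rownames(clinical_aligned), colnames(combined)))
```

A table aligned this way, with one row per matrix column, is the shape that the colData argument of SummarizedExperiment() expects.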

ADD REPLY
0
Entering edit mode

Use assay(batch1) or assays(batch1)[[1]] rather than @.

I misread your earlier post about

Error in FUN(X[[i]], ...) : column(s) 'external_gene_source' in ‘mcols’ are duplicated and the data do not match

thinking that the values displayed were different, but now I see that in batch2 the first row is displayed at the end of the heading line, which confused me. The error still indicates that you cannot simply cbind the SummarizedExperiment objects because these columns differ from one another, and your best bet is to figure out why. So compare them

all.equal(
    rowData(batch1)$external_gene_source,
    rowData(batch2)$external_gene_source
)

and to make them the same, e.g., by brute force if that is correct

rowData(batch2)$external_gene_source <- rowData(batch1)$external_gene_source

or by re-ordering rows in the overall SummarizedExperiment as I suggested earlier.
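
all.equal() only summarises how many values differ; to see which rows are responsible, one possible sketch is shown below. The two character vectors are toy stand-ins for the rowData columns (with real objects they would be rowData(batch1)$external_gene_source and rowData(batch2)$external_gene_source), and their values are made up for illustration.

```r
# Toy stand-ins for the two rowData columns (hypothetical values).
x <- c("HGNC Symbol", "HGNC Symbol", "Clone-based (Vega)")
y <- c("HGNC Symbol", "HGNC Symbol", "Clone-based (Vega) gene")

# Positions where the values differ, treating NA-vs-value as a difference.
differs <- which(x != y | is.na(x) != is.na(y))

# Side-by-side view of the discordant entries.
mismatches <- data.frame(row = differs, batch1 = x[differs], batch2 = y[differs])
print(mismatches)
```

If the mismatches turn out to be a systematic relabelling rather than a real disagreement, overwriting one column with the other, as above, is easier to justify.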

ADD REPLY
0
Entering edit mode

Hi Martin, thank you for answering all my questions!

Since I created a matrix using only the first assay

 combined <- cbind(assays(batch1)[[1]], assays(batch2)[[1]])

I also wanted to add the clinical information, so I ran

combined_se <- SummarizedExperiment(assays=list(counts=combined), colData=clinical)

and I think it worked. Is that the correct way of doing it?

Thanks again!

ADD REPLY
1
Entering edit mode

To get a reproducible example, I ran

example(SummarizedExperiment)

This created an object that I'll add row names to

> rownames(se) = seq_len(nrow(se))
> se
class: RangedSummarizedExperiment
dim: 200 6
metadata(0):
assays(1): counts
rownames(200): 1 2 ... 199 200
rowData names(1): feature_id
colnames(6): A B ... E F
colData names(1): Treatment

You can see that if the row and column data are compatible, I can cbind() the entire objects

> cbind(se, se)
class: RangedSummarizedExperiment
dim: 200 12
metadata(0):
assays(1): counts
rownames(200): 1 2 ... 199 200
rowData names(1): feature_id
colnames(12): A B ... E F
colData names(1): Treatment

but if there is a mismatch then I run into an error

> cbind(se, se[sample(nrow(se)),])
Error in .cbind.SummarizedExperiment(args) :
  '...' object ranges (rows) are not compatible

If I take your approach of cbind()ing the assays, then there are no errors (because base R does not check for identical row names)

> res <- cbind(assay(se), assay(se)[sample(nrow(se)),])
>

Earlier you showed that cbind()ing the SummarizedExperiment resulted in an error. This indicates to me that you need to figure out why the row annotations of the SummarizedExperiment are discordant, otherwise by cbinding the assays you could be propagating errors as well as making work for yourself (by having to add the clinical data that I guess are already in batch1 and batch2, and presumably losing the row annotations that are present in the batches).

Let's try to reproduce the exact error in your effort by creating two SummarizedExperiment with the same rowData column names but with different content

> se1 = se2 = se
> rowData(se1)$foo = sample(nrow(se1))
> rowData(se2)$foo = sample(nrow(se2))
> cbind(se1, se2)
Error in FUN(X[[i]], ...) :
  column(s) 'foo' in 'mcols' are duplicated and the data do not match

So again one really wants to figure out why the rowData differ, and if this is important.

I mentioned one option above, that the difference is not important somehow (seems unlikely!) so just replace the row data column of one object with the row data column of the other object.

Another possibility is that the row data are actually supposed to be different, so that the columns could be renamed in one of the SummarizedExperiment, e.g., from foo to bar in se2.

> rowData(se2)$bar <- rowData(se2)$foo
> rowData(se2)$foo <- NULL
> se3 = cbind(se1, se2)
> head(rowData(se3))
DataFrame with 6 rows and 3 columns
   feature_id       foo       bar
  <character> <integer> <integer>
1       ID001       125        27
2       ID002        91        94
3       ID003       102        61
4       ID004         8       160
5       ID005        72        64
6       ID006        45        89

My guess is that the inconsistency was introduced in an earlier step in your processing pipeline, and that the failure to cbind the SummarizedExperiment is actually telling you to look closely at the point where the inconsistency was introduced -- that step is probably not correct!

ADD REPLY
0
Entering edit mode

Hi Martin,

"But it might be that different rows are present in the different SE, and then you might want to combine only the shared rows, or fill the non-shared rows in each batch with NA..."

How would I do this part? Sorry, I am fairly new to SummarizedExperiment.

Regards,
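
Both options can be sketched on plain matrices with row names; the toy matrices m1 and m2 and the helper fill_rows() are hypothetical and only illustrate the idea, assuming that row names identify the features.

```r
# Toy stand-ins for the assays of two batches with partly different rows.
m1 <- matrix(1:6, nrow = 3, dimnames = list(c("g1", "g2", "g3"), c("a", "b")))
m2 <- matrix(7:12, nrow = 3, dimnames = list(c("g2", "g3", "g4"), c("c", "d")))

# Option 1: keep only the rows present in both batches.
shared <- intersect(rownames(m1), rownames(m2))
combined_shared <- cbind(m1[shared, , drop = FALSE], m2[shared, , drop = FALSE])

# Option 2: keep every row and fill the rows a batch lacks with NA.
# fill_rows() is a hypothetical helper, not part of any package.
all_rows <- union(rownames(m1), rownames(m2))
fill_rows <- function(m, rows) {
    out <- matrix(NA_integer_, nrow = length(rows), ncol = ncol(m),
                  dimnames = list(rows, colnames(m)))
    out[rownames(m), ] <- m               # copy the rows this batch does have
    out
}
combined_all <- cbind(fill_rows(m1, all_rows), fill_rows(m2, all_rows))
```

Option 1 should carry over to SummarizedExperiment objects with row names, e.g. cbind(batch1[shared, ], batch2[shared, ]), provided the rowData of the shared rows agree. Option 2 is harder to do by hand on whole objects, because the row annotations have to be reconciled as well.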

ADD REPLY
0
Entering edit mode
manwar • 0
@52a937a0
Last seen 17 months ago
South Korea

This is just to add, for anyone else who arrives here via a Google search, that SummarizedExperiment objects can be merged using "SEtools":

https://bioconductor.org/packages/release/bioc/html/SEtools.html

It can be installed with BiocManager, and further information is available in the package's help pages and PDF manual.

ADD COMMENT
