How to merge SummarizedExperiment datasets
@annkolman78-21980

I would like to merge two SummarizedExperiment datasets. Both have the same assays, rownames, and rowData names, but different colnames and colData names. Thank you.

batch1

class: RangedSummarizedExperiment
dim: 49736 20
assays(13): fpkm posterior_mean_count ... fpkm_ci_upperbound fpkm_coefficient_quartile_variation
rownames(49736): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData names(9): gene hgnc ... transcript_count entrezgene
colnames(27): 24-21064 24-21067 ... 24-21242 24-21245
colData names(216): INDIVIDUAL_ID SAMPLE_ID ... gender age


batch2

class: RangedSummarizedExperiment
dim: 49736 80
assays(13): fpkm posterior_mean_count ... fpkm_ci_upperbound fpkm_coefficient_quartile_variation
rownames(49736): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData names(9): gene hgnc ... transcript_count entrezgene
colnames(83): SAMPLE1-77-005-60 SAMPLE2-47-006-60  ... SAMPLE3-95-004-70 SAMPLE4-47-045-60
colData names(223): INDIVIDUAL_ID LIBRARY_ID ... kit status

summarizedexperiment • 2.6k views

You’ve tagged this with DESeq2 but really it’s a SummarizedExperiment post.

@martin-morgan-1513

Do you mean binding the two SummarizedExperiment objects together so that you have a combined 100 samples? Try

cbind(batch1, batch2)
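
To illustrate what cbind() does in the compatible case, here is a minimal sketch on toy data. The helper make_toy_se() and the objects toy1 / toy2 are made up for this example and are not the poster's data; the sketch assumes both objects carry identical row names, identical rowData, and the same colData columns.

```r
library(SummarizedExperiment)

# Hypothetical helper that builds a small SummarizedExperiment for the given
# sample names; every object shares the same four genes and row annotation.
make_toy_se <- function(samples, batch) {
  genes <- paste0("gene", 1:4)
  counts <- matrix(
    seq_len(length(genes) * length(samples)),
    nrow = length(genes),
    dimnames = list(genes, samples)
  )
  SummarizedExperiment(
    assays  = list(counts = counts),
    rowData = DataFrame(symbol = toupper(genes), row.names = genes),
    colData = DataFrame(batch = rep(batch, length(samples)), row.names = samples)
  )
}

toy1 <- make_toy_se(c("s1", "s2"), batch = "batch1")
toy2 <- make_toy_se(c("s3", "s4", "s5"), batch = "batch2")

# cbind() appends the samples (columns) of toy2 to those of toy1;
# the shared row annotation is kept once.
toy_combined <- cbind(toy1, toy2)
dim(toy_combined)
colnames(toy_combined)
```

The assays, rowData and colData travel together here, which is the main reason to prefer binding the whole objects over binding bare matrices.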

Thank you for your answer, but unfortunately, I have this error:

Error in FUN(X[[i]], ...) : column(s) 'externalgenesource' in ‘mcols’ are duplicated and the data do not match

batch1 externalgenesource <character>
ENSG00000000003 HGNC Symbol
ENSG00000000005 HGNC Symbol
ENSG00000000419 HGNC Symbol
ENSG00000000457 HGNC Symbol
ENSG00000000460 HGNC Symbol
... ... ... ... ... ... ENSG00000273487 Clone-based (Vega)

batch2. externalgenesource <character> ENSG00000000003 HGNC Symbol
ENSG00000000005 HGNC Symbol
ENSG00000000419 HGNC Symbol
ENSG00000000457 HGNC Symbol
ENSG00000000460 HGNC Symbol
... ... ... ... ... ... ENSG00000273487 Clone-based (Vega) gene

Thank you.

Please take a few seconds to select each code chunk and then click the button with 1's and 0's in the editing toolbar to format your code chunks so that they are more readable.

So why do the external_gene_source columns differ, and what do you want to do about it? One possibility is that the rows of the two SummarizedExperiment objects are all present, but in a different order. You could then place them into the same order and cbind:

setequal(rowData(batch1)$external_gene_source, rowData(batch2)$external_gene_source) ## TRUE
idx <- match(rowData(batch1)$external_gene_source, rowData(batch2)$external_gene_source)
cbind(batch1, batch2[idx,])


(I never get the match() / reorder correct, maybe the batch1 and batch2 in the second line should be reversed...)

But it might be that different rows are present in the different SE, and then you might want to combine only the shared rows, or fill the non-shared rows in each batch with NA...
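
As a rough sketch of both ideas (re-ordering and keeping only shared rows), the toy example below works on row names rather than on an annotation column, since row names are unique while values such as "HGNC Symbol" repeat. The objects se_a and se_b are invented for illustration; the sketch assumes both objects have unique, non-NULL row names.

```r
library(SummarizedExperiment)

# Toy data: se_b holds the genes in a different order plus one extra gene (g9).
m_a <- matrix(1:6, nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
m_b <- matrix(1:8, nrow = 4,
              dimnames = list(c("g3", "g9", "g1", "g2"), c("s3", "s4")))
se_a <- SummarizedExperiment(assays = list(counts = m_a))
se_b <- SummarizedExperiment(assays = list(counts = m_b))

# match(x, table) returns, for each element of x, its position in table,
# so indexing se_b with idx puts se_b into the row order of se_a.
idx <- match(rownames(se_a), rownames(se_b))
stopifnot(identical(rownames(se_b[idx, ]), rownames(se_a)))

# Keep only the rows present in both objects, in the order used by se_a.
shared <- intersect(rownames(se_a), rownames(se_b))
se_a_shared <- se_a[shared, ]
se_b_shared <- se_b[shared, ]

combined <- cbind(se_a_shared, se_b_shared)
assay(combined)
```

In this toy case the match() argument order used above (first object first, second object as the table) is the one that re-orders the second object. For the NA-fill variant, it may be worth checking ?combineCols in your installed SummarizedExperiment version, which recent releases document for combining objects whose rows only partly overlap.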

Thank you for the suggestions. As I only need data from one assay at the moment, I used

combined <- cbind(batch1@assays[[1]], batch2@assays[[1]])


which worked fine and created a matrix object. However, I want to add some clinical information to this matrix. I have a CSV file (read in as a data frame) with a column of sample IDs that match those in combined, plus a few more columns of clinical data (including age). I have tried the commands below, but they are not working

combined$age <- clinical$age[match(combined [ ,1], clinical$ID)]
Warning message: Coercing LHS to a list

and

combined_clinical <- cbind(combined, clinical)
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 49736, 100

Thank you for help!

Use assay(batch1) or assays(batch1)[[1]] rather than @. I misread your earlier post about

Error in FUN(X[[i]], ...) : column(s) 'external_gene_source' in ‘mcols’ are duplicated and the data do not match

thinking that the values displayed were different, but now I see that in batch2 the first row is displayed at the end of the heading line, so this confused me. The error is still indicating that the reason you cannot simply cbind the SummarizedExperiment is because these columns differ from one another, and your best bet is to figure out why. So compare them

all.equal(rowData(batch1)$external_gene_source, rowData(batch2)$external_gene_source)

and to make them the same, e.g., by brute force if that is correct

rowData(batch2)$external_gene_source <- rowData(batch1)$external_gene_source

or by re-ordering rows in the overall SummarizedExperiment as I suggested earlier.

Hi Martin, thank you for answering all my questions! As I created a matrix using only 1st assay

combined <- cbind(assays(batch1)[[1]], assays(batch2)[[1]])

I also wanted to add clinical info so I run

combined_se <- SummarizedExperiment(assays=list(counts=combined), colData=clinical)

and I think it worked. Is that a correct way of doing it? Thanks again!

To get a reproducible example, I ran

example(SummarizedExperiment)

This created an object that I'll add row names to

> rownames(se) = seq_len(nrow(se))
> se
class: RangedSummarizedExperiment
dim: 200 6
metadata(0):
assays(1): counts
rownames(200): 1 2 ... 199 200
rowData names(1): feature_id
colnames(6): A B ... E F
colData names(1): Treatment

You can see that if the row and column data are compatible, I can cbind() the entire objects

> cbind(se, se)
class: RangedSummarizedExperiment
dim: 200 12
metadata(0):
assays(1): counts
rownames(200): 1 2 ... 199 200
rowData names(1): feature_id
colnames(12): A B ... E F
colData names(1): Treatment

but if there is a mismatch then I run into an error

> cbind(se, se[sample(nrow(se)),])
Error in .cbind.SummarizedExperiment(args) : '...' object ranges (rows) are not compatible

If I take your approach of cbind()ing the assays, then there are no errors (because base R does not check for identical row names)

> res <- cbind(assay(se), assay(se)[sample(nrow(se)),])
>

Earlier you showed that cbind()ing the SummarizedExperiment resulted in an error. This indicates to me that you need to figure out why the row annotations of the SummarizedExperiment are discordant, otherwise by cbinding the assays you could be propagating errors as well as making work for yourself (by having to add the clinical data that I guess are already in batch1 and batch2, and presumably losing the row annotations that are present in the batches).

Let's try to reproduce the exact error in your effort by creating two SummarizedExperiment with the same rowData column names but with different content

> rowData(se1)$foo = sample(nrow(se1))
> rowData(se2)$foo = sample(nrow(se2))
> cbind(se1, se2)
Error in FUN(X[[i]], ...) : column(s) 'foo' in 'mcols' are duplicated and the data do not match

So again one really wants to figure out why the rowData differ, and if this is important. I mentioned one option above, that the difference is not important somehow (seems unlikely!) so just replace the row data column of one object with the row data column of the other object. Another possibility is that the row data are actually supposed to be different, so that the columns could be renamed in one of the SummarizedExperiment, e.g., from foo to bar in se2.

> rowData(se2)$bar <- rowData(se2)$foo
> rowData(se2)$foo <- NULL
> se3 = cbind(se1, se2)
DataFrame with 6 rows and 3 columns
feature_id       foo       bar
<character> <integer> <integer>
1       ID001       125        27
2       ID002        91        94
3       ID003       102        61
4       ID004         8       160
5       ID005        72        64
6       ID006        45        89


My guess is that the inconsistency was introduced in an earlier step in your processing pipeline, and that the failure to cbind the SummarizedExperiment is actually telling you to look closely at the point where the inconsistency was introduced -- that step is probably not correct!
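
One way to track down such an inconsistency is to compare the shared rowData columns one at a time and then list the rows where they disagree. The sketch below uses invented toy objects se1 and se2 with a deliberately planted mismatch; the column names symbol and source are placeholders. It assumes both objects have the same rows in the same order. The printed output earlier in the thread ends in "Clone-based (Vega)" for batch1 and "Clone-based (Vega) gene" for batch2, which hints that some labels themselves may differ, and the toy mismatch imitates that.

```r
library(SummarizedExperiment)

genes <- c("g1", "g2", "g3")
m <- matrix(1:6, nrow = 3, dimnames = list(genes, c("s1", "s2")))
se1 <- SummarizedExperiment(
  assays  = list(counts = m),
  rowData = DataFrame(symbol = c("A", "B", "C"),
                      source = c("x", "x", "y"),
                      row.names = genes)
)

# Second toy object: same rows, different samples, one altered annotation.
se2 <- se1
colnames(se2) <- c("s3", "s4")
src <- rowData(se2)$source
src[3] <- "y gene"                 # planted mismatch in row g3
rowData(se2)$source <- src

# Which shared rowData columns are not identical between the two objects?
shared_cols <- intersect(colnames(rowData(se1)), colnames(rowData(se2)))
differs <- vapply(shared_cols, function(nm) {
  !identical(rowData(se1)[[nm]], rowData(se2)[[nm]])
}, logical(1))
names(differs)[differs]

# Which rows disagree in the offending column?
bad_rows <- which(rowData(se1)$source != rowData(se2)$source)
rownames(se1)[bad_rows]
```

Looking at the actual values in the disagreeing rows usually makes it clear whether the two batches were annotated against different releases or whether rows were shuffled.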

Hi Martin,

"But it might be that different rows are present in the different SE, and then you might want to combine only the shared rows, or fill the non-shared rows in each batch with NA..."

How would I do this part? Sorry, I am pretty new to SummarizedExperiment.

Regards,

manwar • 0
@52a937a0

Just to add, for anyone else who arrives here via a Google search: SummarizedExperiment objects can also be merged using the "SEtools" package:

https://bioconductor.org/packages/release/bioc/html/SEtools.html

It can be installed with BiocManager::install("SEtools"), and further information is available in the package's help pages and PDF manual.