Combining and using 10x Genomics output in SingleCellExperiment Objects
1
0
Entering edit mode
@matthew-thornton-5564
Last seen 4 weeks ago
USA, Los Angeles, USC

Hello! I am trying to get used to the new 'SingleCellExperiment' objects. Specifically how to rename samples and add metadata, swap out Ensembl IDs for Gene Names. For example, If I read in my 10x genomics data using 'DropletUtils' i have to specify a path to the data which is then incorporated into the Sample Name.

> library(DropletUtils)
> tmp <- "/home/met/data/Hong_Lab/02Apr19/S11M"
> sce <- read10xCounts(tmp, version="3")
> dim(sce)
[1] 33538  3466
> colData(sce)
DataFrame with 3466 rows and 2 columns
                                   Sample            Barcode
                              <character>        <character>
1    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCAGGCTA-1
2    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCTTCGCG-1
3    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGGAATGGA-1
4    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCAACTGCGC-1
5    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCATCAGTAC-1
...                                   ...                ...
3462 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCAGTCACACGC-1
3463 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCAGTGGTCTCG-1
3464 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCAGGTAAA-1
3465 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCATGCATG-1
3466 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCCTACAGA-1
> rowData(sce)
DataFrame with 33538 rows and 3 columns
                             ID      Symbol            Type
                    <character> <character>     <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613     FAM138A Gene Expression
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression
...                         ...         ...             ...
ENSG00000277856 ENSG00000277856  AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063  AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254  AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475  AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674     FAM231C Gene Expression

Is there a way to directly rename samples, without the path?

I have another data set that is a different sample that I would like to merge with it and the merging instructions are a bit vague. What I found was in the 'SingleCellExperiment' reference. just use rbind or cbind, rbind can't work because there are more cells in one dataset than the other.

> tmp2 <- "/home/met/data/Hong_Lab/02Apr19/S6M/"
> sce2 <- read10xCounts(tmp2, version="3")
> rowData(sce2)
DataFrame with 33538 rows and 3 columns
                             ID      Symbol            Type
                    <character> <character>     <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613     FAM138A Gene Expression
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression
...                         ...         ...             ...
ENSG00000277856 ENSG00000277856  AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063  AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254  AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475  AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674     FAM231C Gene Expression
> colData(sce2)
DataFrame with 5379 rows and 2 columns
                                   Sample            Barcode
                              <character>        <character>
1    /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGCTGTCTA-1
2    /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGGCAAAGA-1
3    /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGTTGAGTA-1
4    /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGCAAAGCGGT-1
5    /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGCAATAGAGT-1
...                                   ...                ...
5375 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCACATGCCTAA-1
5376 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTAGGGACT-1
5377 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTCCGAATT-1
5378 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTTGTTTGG-1
5379 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCATCCGTACAA-1
> test <- cbind(sce, sce2)
> colData(test)
DataFrame with 8845 rows and 2 columns
                                   Sample            Barcode
                              <character>        <character>
1    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCAGGCTA-1
2    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCTTCGCG-1
3    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGGAATGGA-1
4    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCAACTGCGC-1
5    /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCATCAGTAC-1
...                                   ...                ...
8841 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCACATGCCTAA-1
8842 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTAGGGACT-1
8843 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTCCGAATT-1
8844 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTTGTTTGG-1
8845 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCATCCGTACAA-1
> rowData(test)
DataFrame with 33538 rows and 3 columns
                             ID      Symbol            Type
                    <character> <character>     <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613     FAM138A Gene Expression
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression
...                         ...         ...             ...
ENSG00000277856 ENSG00000277856  AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063  AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254  AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475  AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674     FAM231C Gene Expression

Is this correct? It seems like it would be more proper to use merge and by="ID". So If I wanted to use the 'isOutlier' how would I access the sample name to input as "batch"??

If I wanted to add annotation from biomaRt and I have a data.frame of annotation, gene length, NCBI gene ID, RNA Central. What would be the best way to add it to the data with merge?

> library(biomaRt)
> ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
> symbs <- getBM(attributes=c('ensembl_gene_id', 'chromosome_name', 'external_gene_name', 'transcript_biotype', 'entrezgene', 'rnacentral'), filters='ensembl_gene_id', values=rownames(sce), mart=ensembl)

Thank you for your time!

SingleCellExperiment • 1.8k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 11 hours ago
The city by the bay

Is there a way to directly rename samples, without the path?

Any way you like.

# Change to base name
sce$Sample <- basename(sce$Sample)

# Or perhaps to something different?
sample2condition <- c(S11M="WT", S12M="KO") # etc., etc.
sce$Sample <- sample2condition[sce$Sample] # uses the basenames

Manipulate the object to your heart's desire.

rbind can't work because there are more cells in one dataset than the other.

Obviously.

It seems like it would be more proper to use merge and by="ID".

Why? You have two objects. They have rows in the same order. So just cbind them!

If they're not in the same order, or if only a subset of genes are overlapping, you can do something like this:

common <- intersect(rownames(sce), rownames(sce2))
combined <- cbind(sce[common,], sce2[common,])

More complicated scenarios require some explicit specification, e.g., to deal with remapping of identifiers from Ensembl to Entrez. You might think that we could write a function to do this, and indeed, scater used to have a merge function to combine pairs of the old SCESet objects. But this was a real pain to write - what happens with missing rows? Which metadata column contains the keys? How do we handle duplicate keys? We had to specify so many parameters to get it to work in an unambiguously correct manner, it was less confusing to just write it out explicitly in user code:

sce # uses Ensembl IDs
sce2 # uses Entrez IDs

# Demonstration code:
library(org.Hs.eg.db)
new.ids <- mapIds(org.Hs.eg.db, keys=rownames(sce2), key.type="ENTREZID", column="ENSEMBL")
common <- intersect(new.ids, rownames(sce))
combined <- cbind(sce[common,], sce2[match(common, new.ids),]) 

In any case, differences in annotation between your object imply that the data were generated in separate experiments. If so, you will have much bigger statistical and analytical problems that make combining the objects look relatively trivial.

So If I wanted to use the 'isOutlier' how would I access the sample name to input as "batch"??

Not entirely clear how this question is relevant, but here goes:

sce$Sample
# or
combined$Sample

What would be the best way to add it to the data with merge?

Just stick it in the rowData:

# Defensive code in case 'symbs' doesn't respect the ordering of keys in the 'getBM()' call.
# I don't recall BiomaRt being particularly smart about this, hence the need for the match().
rowData(sce) <- cbind(rowData(sce), symbs[match(rownames(sce), symbs$ensembl_gene_id),])
ADD COMMENT
0
Entering edit mode

Thanks! That was very helpful!

ADD REPLY

Login before adding your answer.

Traffic: 858 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6