Hello! I am trying to get used to the new 'SingleCellExperiment' objects, specifically how to rename samples, add metadata, and swap out Ensembl IDs for gene names. For example, if I read in my 10x Genomics data using 'DropletUtils', I have to specify a path to the data, which is then used as the sample name.
> library(DropletUtils)
> tmp <- "/home/met/data/Hong_Lab/02Apr19/S11M"
> sce <- read10xCounts(tmp, version="3")
> dim(sce)
[1] 33538 3466
> colData(sce)
DataFrame with 3466 rows and 2 columns
Sample Barcode
<character> <character>
1 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCAGGCTA-1
2 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCTTCGCG-1
3 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGGAATGGA-1
4 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCAACTGCGC-1
5 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCATCAGTAC-1
... ... ...
3462 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCAGTCACACGC-1
3463 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCAGTGGTCTCG-1
3464 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCAGGTAAA-1
3465 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCATGCATG-1
3466 /home/met/data/Hong_Lab/02Apr19/S11M TTTGTCATCCTACAGA-1
> rowData(sce)
DataFrame with 33538 rows and 3 columns
ID Symbol Type
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
... ... ... ...
ENSG00000277856 ENSG00000277856 AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063 AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254 AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475 AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674 FAM231C Gene Expression
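On the Ensembl-ID-to-gene-name part: since rowData already carries both an ID and a Symbol column, one possible approach is to overwrite the row names with the symbols while leaving the IDs in rowData. The sketch below is only an illustration on a toy object (the IDs, symbols and the name `toy` are invented, not taken from the real data). It falls back to the ID where a symbol is missing and uses base R's `make.unique()` to handle duplicated symbols. scater also ships a helper called `uniquifyFeatureNames()` aimed at the same problem, which may be preferable.

```r
library(SingleCellExperiment)

# Toy object mimicking the rowData layout above; IDs and symbols are invented.
ids <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
counts <- matrix(1:12, nrow = 4, dimnames = list(ids, NULL))
toy <- SingleCellExperiment(
    assays = list(counts = counts),
    rowData = DataFrame(
        ID = ids,
        Symbol = c("GENE1", "GENE2", "GENE2", NA),
        row.names = ids
    )
)

# Fall back to the Ensembl ID where the symbol is missing or empty,
# then make duplicated symbols unique.
sym <- rowData(toy)$Symbol
new_names <- ifelse(is.na(sym) | sym == "", rowData(toy)$ID, sym)
rownames(toy) <- make.unique(new_names)

rownames(toy)
```

The Ensembl IDs stay available in `rowData(toy)$ID`, so nothing is lost by switching the row names.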
Is there a way to directly rename samples, without the path?
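For what it is worth, the Sample column is ordinary per-cell metadata, so it can probably just be overwritten after loading. A minimal sketch, assuming a toy object in place of the real `read10xCounts()` result (the counts, barcodes and the `Condition` column are made up for illustration):

```r
library(SingleCellExperiment)

# Toy stand-in for the object returned by read10xCounts(); values are invented.
counts <- matrix(rpois(20, lambda = 5), nrow = 4,
                 dimnames = list(paste0("ENSG", 1:4), NULL))
sce_toy <- SingleCellExperiment(
    assays = list(counts = counts),
    colData = DataFrame(
        Sample = rep("/home/met/data/Hong_Lab/02Apr19/S11M", 5),
        Barcode = paste0("BC", 1:5)
    )
)

# Keep only the last path component as the sample name.
sce_toy$Sample <- basename(sce_toy$Sample)

# Further per-cell metadata can be added the same way (hypothetical column).
sce_toy$Condition <- "treated"

colData(sce_toy)
```

`basename()` also drops a trailing slash, so a path such as ".../S6M/" should reduce to "S6M".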
I have another data set from a different sample that I would like to merge with it, and the merging instructions are a bit vague. What I found in the 'SingleCellExperiment' reference was to just use rbind or cbind. rbind can't work because there are more cells in one data set than the other, so I used cbind.
> tmp2 <- "/home/met/data/Hong_Lab/02Apr19/S6M/"
> sce2 <- read10xCounts(tmp2, version="3")
> rowData(sce2)
DataFrame with 33538 rows and 3 columns
ID Symbol Type
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
... ... ... ...
ENSG00000277856 ENSG00000277856 AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063 AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254 AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475 AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674 FAM231C Gene Expression
> colData(sce2)
DataFrame with 5379 rows and 2 columns
Sample Barcode
<character> <character>
1 /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGCTGTCTA-1
2 /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGGCAAAGA-1
3 /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGAGTTGAGTA-1
4 /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGCAAAGCGGT-1
5 /home/met/data/Hong_Lab/02Apr19/S6M/ AAACCTGCAATAGAGT-1
... ... ...
5375 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCACATGCCTAA-1
5376 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTAGGGACT-1
5377 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTCCGAATT-1
5378 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTTGTTTGG-1
5379 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCATCCGTACAA-1
> test <- cbind(sce, sce2)
> colData(test)
DataFrame with 8845 rows and 2 columns
Sample Barcode
<character> <character>
1 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCAGGCTA-1
2 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGCTTCGCG-1
3 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGAGGAATGGA-1
4 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCAACTGCGC-1
5 /home/met/data/Hong_Lab/02Apr19/S11M AAACCTGCATCAGTAC-1
... ... ...
8841 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCACATGCCTAA-1
8842 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTAGGGACT-1
8843 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTCCGAATT-1
8844 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCAGTTGTTTGG-1
8845 /home/met/data/Hong_Lab/02Apr19/S6M/ TTTGTCATCCGTACAA-1
> rowData(test)
DataFrame with 33538 rows and 3 columns
ID Symbol Type
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
... ... ... ...
ENSG00000277856 ENSG00000277856 AC233755.2 Gene Expression
ENSG00000275063 ENSG00000275063 AC233755.1 Gene Expression
ENSG00000271254 ENSG00000271254 AC240274.1 Gene Expression
ENSG00000277475 ENSG00000277475 AC213203.1 Gene Expression
ENSG00000268674 ENSG00000268674 FAM231C Gene Expression
Is this correct? It seems like it would be more proper to use merge with by="ID". Also, if I wanted to use 'isOutlier', how would I access the sample name to pass as "batch"?
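As far as I can tell, `cbind()` is the intended route here: it appends cells (columns) and expects both objects to have the same genes in the same order, which is worth checking explicitly. The combined Sample column can then serve as the batch factor. The sketch below uses two toy objects built by a hypothetical helper `make_toy()` with invented paths and counts. `isOutlier()` is taken from scuttle, which is where current Bioconductor releases keep it (older releases exported it from scater).

```r
library(SingleCellExperiment)
library(scuttle)  # provides isOutlier(); older releases exported it from scater

set.seed(1)

# Hypothetical helper: a small object with the same genes but n_cells cells.
make_toy <- function(n_cells, path) {
    m <- matrix(rpois(4 * n_cells, lambda = 5), nrow = 4,
                dimnames = list(paste0("ENSG", 1:4), NULL))
    SingleCellExperiment(
        assays = list(counts = m),
        colData = DataFrame(Sample = rep(path, n_cells))
    )
}
a <- make_toy(6, "/some/path/S11M")
b <- make_toy(9, "/some/path/S6M/")

# cbind() matches rows by position, so confirm the genes line up first.
stopifnot(identical(rownames(a), rownames(b)))
combined <- cbind(a, b)
combined$Sample <- basename(combined$Sample)

# Flag cells with unusually low library size, judged within each sample.
lib_size <- colSums(counts(combined))
low_lib <- isOutlier(lib_size, nmads = 3, type = "lower", log = TRUE,
                     batch = combined$Sample)

table(combined$Sample, as.logical(low_lib))
```

Passing `batch = combined$Sample` means the median and MAD are computed per sample rather than across all cells pooled together.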
If I wanted to add annotation from biomaRt, and I have a data.frame of annotation (gene length, NCBI gene ID, RNAcentral ID), what would be the best way to add it to the data with merge?
> library(biomaRt)
> ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
> symbs <- getBM(attributes=c('ensembl_gene_id', 'chromosome_name', 'external_gene_name', 'transcript_biotype', 'entrezgene', 'rnacentral'), filters='ensembl_gene_id', values=rownames(sce), mart=ensembl)
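One caveat with `merge()` is that it can reorder rows and will return several rows for a gene that has several annotation entries, whereas rowData has to stay in the same order as the count matrix. A possible alternative is to look up each row name with `match()`, which keeps the original order, takes the first hit for duplicated IDs, and leaves NA where a gene has no annotation. A minimal sketch, assuming a toy object and a made-up stand-in (`symbs_toy`) for the `getBM()` result:

```r
library(SingleCellExperiment)

# Toy object with three invented gene IDs.
counts <- matrix(0L, nrow = 3, ncol = 2,
                 dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"), NULL))
toy <- SingleCellExperiment(assays = list(counts = counts))

# Stand-in for the getBM() output: different row order, one ID listed twice,
# and no entry at all for ENSG_B.
symbs_toy <- data.frame(
    ensembl_gene_id = c("ENSG_C", "ENSG_A", "ENSG_A"),
    chromosome_name = c("3", "1", "1"),
    transcript_biotype = c("lncRNA", "protein_coding", "retained_intron"),
    stringsAsFactors = FALSE
)

# Position of each row name in the annotation (first match; NA if absent).
idx <- match(rownames(toy), symbs_toy$ensembl_gene_id)

rowData(toy)$chromosome_name <- symbs_toy$chromosome_name[idx]
rowData(toy)$transcript_biotype <- symbs_toy$transcript_biotype[idx]

rowData(toy)
```

Taking only the first match silently drops the other transcript biotypes for a gene, so it may be worth collapsing the annotation to one row per gene beforehand if that information matters.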
Thank you for your time!
Thanks! That was very helpful!