Question

removing rows from SingleCellExperiment container

0

Entering edit mode

TJ • 0

@tj-18752

Last seen 4.6 years ago

Stanford University

Hi,

I'm trying to figure out how to remove a set of genes from my SingleCellExperiment container (sce). I have a data.frame called mito.genes with a list of genes as shown here:

      Entrez_ID Symbol         Ensembl_ID
    1     66445   Cyc1 ENSMUSG00000022551
    2     18597  Pdha1 ENSMUSG00000031299
    3     66043  Atp5d ENSMUSG00000003072
    4     74316  Isca2 ENSMUSG00000021241
    5     22273 Uqcrc1 ENSMUSG00000025651
    6     68263   Pdhb ENSMUSG00000021748

My sce has the following:

    rowData(sce)$Ensembl_ID <- ids #row metadata
    rownames(sce) <- Gene_Names #row names as gene symbols

I tried this:

    gene.symbols <- mito.genes$Symbol
    sce <- sce[,!gene.symbols]

But get an error:

    "Error: logical subscript contains NAs
    In addition: Warning message:
    In Ops.factor(gene.symbols) : ‘!’ not meaningful for factors"

The list of genes does not contain any NAs (it's a relatively short list). When I tried changing the data.frame Symbol to rownames then running the same script above, I get Error in !gene.symbols : invalid argument type.

I also tried going about this prior to creating the sce container by creating a new column with my gene symbols, then using dyplr's anti_join() function to successfully remove the rows from the main count matrix, however, when I remove the column with the gene symbols and create an sce, I end up with the sce containing what looks like lists and not numeric values.

Any guidance would be much appreciated.

sce singlecellexperiment scRNAseq subsetting • 6.5k views

ADD COMMENT • link updated 7.1 years ago by Aaron Lun ★ 29k • written 7.1 years ago by TJ • 0

0

Entering edit mode

Fix the formatting in your post, the markdown editor isn't working properly at the moment.

ADD REPLY • link 7.1 years ago Aaron Lun ★ 29k

0

Entering edit mode

Hopefully fixed now (I had to edit 2-3 times because it didn't look proper).

ADD REPLY • link 7.1 years ago TJ • 0

score 3 · Accepted Answer · 2019-01-04

3

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 6 hours ago

The city by the bay

This has nothing to do with SingleCellExperiment, but rather your understanding of how subsetting works in R. For example, let's consider this bit of code:

!factor(LETTERS)

... which gives a warning with the same text as your error. And fair enough; what do you expect to get when you ! a factor? It's not a logical value, it doesn't have an obvious complement value. What's the opposite of "A"? In fact, what does that even mean?

(Discerning readers may note that, in several other languages, an empty string is considered to be equivalent to FALSE, and a non-empty string is TRUE, in which case doing something like !"A" would be give you FALSE. From another interpretation, factors are technically integers, and all non-zero integers evaluate to TRUE, and under this interpretation the above example would give me a vector of FALSEs. However, this is purely academic as neither of these interpretations are used in R, and wouldn't help to solve your problem anyway.)

It seems that you're expecting too much of the language. You're probably reading sce[!gene.symbols,] (I've corrected it to do row subsetting, your code does column subsetting instead) as "give me all rows that are not in gene.symbols". But R doesn't read it that way. R sees it as "okay, I've got to compute !gene.symbols first, and then use the result to row-subset sce". The first step fails for reasons explained above, resulting in your error message. What you should be doing instead is something like:

sce[! rownames(sce) %in% gene.symbols,]

... which conveys the actual intent much more clearly. Note that %in% takes precedence over ! so I'm not !ing the row names here.

P.S. Some general coding advice - get into the habit of not subsetting with factors. Factors are actually encoded as integers; they behave like strings for certain operations (e.g., comparisons) but behave like integers for other operations (e.g., subsetting). This provides great opportunities to mess with the person reading your code later:

LETTERS[factor(1:5)] # gives me A B C D E, as expected.
LETTERS[factor(2:6)] # huh - still gives me A B C D E, not B C D E F. Why?
as.integer(factor(2:6)) # ah - this is 1 to 5, integer behaviour during subsetting.

Not so amusing, of course, when the person reading your code is you.

ADD COMMENT • link 7.1 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you, Aaron! This definitely clarifies my misunderstanding and confusion with regards to factors. And thank you for the solution and advice here!

ADD REPLY • link 7.1 years ago TJ • 0

0

Entering edit mode

This is a follow up question regarding factors (sorry if this is not the right forum for this, I can re-post to stacksoverflow or BioStars if that's more appropriate).
If I have my SingleCellExperiment container, mySCE with counts and normcounts, and with a factor for cell clusters under colData (for example, 4 clusters assigned as: S1, S2, S3, S4), what's the best way to subset multiple clusters into two groups? I want to take out clusters S1+S2, call them CellType1, and compare them to cluster S4, called CellType2 such that I have a matrix with cells from Group1 and Group2 for further analysis (i.e. DE expression)

I came up with the following way to do this, and it works fine, but seems super convoluted:

# Subset the pertinent cell groups
sce.S1.S2 <- mySCE[, mySCE$states == "S1" | mySCE$states == "S2"]. #50 cells
sce_S4 <- mySCE[, mySCE$states == "S4"] #75 cells

# combine the two SCEs
my.new.SCE <- cbind(sce.S1.S2, sce.S4) # 125 cells
# my.new.SCE used for extracting counts and normcounts for further analysis

# Create a factor as labels for the two groups of cells
a <- as.factor(mySCE$states == "S1" | mySCE$states == "S2")
Group1 <- a[a == TRUE]
levels(Group1)[Group1==TRUE] <- c("CellType1") # rename each group, making it the new factor name

b <- as.factor(mySCE$states == "S4")
Group2 <- b[b == TRUE]
levels(Group2)[Group2==TRUE] <- c("CellType2")

# combine the two factors
x <- c(as.character(Group1), as.character(Group2)) # group the two factors as characters
my.2.cell.types <- as.factor(x) # convert back to factor, to have two levels called Group1 and Group2

Now I can use my.new.SCE for counts and normcounts matrices, and can compare CellType1 vs CellType2

ADD REPLY • link 7.1 years ago TJ • 0

1

Entering edit mode

It seems you're trying to do:

mySCE$states <- as.character(mySCE$states) # avoid factors
my.new.SCE <- mySCE[,mySCE$states %in% c("S1", "S2", "S4")]
my.2.cell.types <- c(S1="CellType1", S2="CellType1", S4="CellType2")[my.new.SCE$states]