RUVseq to remove a library size artefact in a zebrafish brain RNA-seq dataset
1
1
Entering edit mode
@karissabarthelson-24193
Last seen 4.2 years ago

I have an RNA-seq dataset where the largest source of variation appears to be due to library size (principal component 1, which captures ~55% of the variance correlates with library size,B), even after TMM normalisation. Library sizes ranged from 13,481,192 to 22,737,049 and was spread relatively evenly between experimental groups (Top graph).

We used RUVSeq to try remove this unwanted variation. We used RUVg setting k = 1 and using the 5000 least DE genes in an initial DE analysis using edgeR as the negative control genes (as described in the RUVseq vignette). This generated a W1 term which tracked with PC1 and library size (C & D). PCA on the RUVseq normalised counts (F) revealed that this artefact was gone (G). We then continued with differential gene expression analysis including the W1 covariate in the design matrix.

Is there something we are missing here about why this isn't a valid way to do this? A reviewer of the manuscript has voiced concerns.

libsize

PCA-all-v1

RUVSeq RUV RNA-seq • 1.9k views
ADD COMMENT
0
Entering edit mode

I guess this is the same as here? As already suggested over at biostars, it would be good to know what the reviewer said and to show your code. https://www.biostars.org/p/459918/

ADD REPLY
0
Entering edit mode

The reviewer was concerned that using RUVg to remove unwanted variation due to library size was non-standard and would require substantial evidence using known data sets with known effects to gauge the impact of the approach. Below is how I performed the analysis :)

# raw abundances 
counts <- list.files(path = here::here("kallisto_out/"), full.names = TRUE) %>%
  catchKallisto()

colnames(counts$counts) %<>%
  basename() %>%
  str_replace_all(pattern = "_S(.+)_merged_R1_001", "") %>%
  str_replace_all(pattern = "^(0|1|2).-", replacement = "") %>%
  str_replace_all(pattern = "_T1.fastq.gz", replacement = "")

# Make the DGE
geneDGE <-
  counts$counts %>%
  as.data.frame() %>%
  rownames_to_column("transcript_id") %>%
  dplyr::filter(!grepl("unspliced", transcript_id)) %>%
  as_tibble() %>%
  mutate(
    transcript_id = str_remove_all(transcript_id, "\\.[0-9]+")
    ) %>%
  gather(key = "Sample", value = "Counts", 2:25) %>%
  left_join(tr2gn) %>%
  group_by(Sample, gene_id) %>%
  summarise(Counts = sum(Counts)) %>%
  spread(key = "Sample", value = "Counts") %>%
  column_to_rownames("gene_id") %>%
  DGEList() %>%
  calcNormFactors(method = "TMM")

geneDGE$samples %<>% 
  rownames_to_column("sample") %>% 
  left_join(meta, by = "sample") 

#Set the genotype WT as the reference level for DE analysis later
geneDGE$samples$Genotype <- relevel(geneDGE$samples$Genotype, "WT")

geneDGE$genes <-
  grGenes %>% 
  as_tibble() %>% 
  dplyr::select(gene_id, gene_id, gc_content, length, symbol, description) %>% 
  dplyr::filter(gene_id %in% rownames(geneDGE$counts)) %>% 
  arrange(gene_id) %>% 
  mutate(ensembl_gene_id = gene_id) %>% 
  column_to_rownames("ensembl_gene_id")

# Filter lowly expressed genes
keepThesegenes <- (rowSums(cpm(geneDGE) > 0.66) >= 12) 
geneDGE <- geneDGE[keepThesegenes,, keep.lib.sizes=FALSE]

# design matrix with WT as the intercept (i.e. reference level)
design <- model.matrix(~Genotype, data = geneDGE$samples) %>% 
  set_colnames(gsub("Genotype", "", colnames(.)))

# fit the GLM
fit <- geneDGE %>% 
  estimateDisp(design) %>% 
  glmFit(design)

# Call the top tables for each of the sorl1 genotypes. 
topTables_tmm <- design %>% colnames() %>% .[2:4] %>% 
   sapply(function(x){
    glmLRT(fit, coef = x) %>%
      topTags(n = Inf) %>%
      .[["table"]] %>%
      as_tibble() %>%
      arrange(PValue) %>%
       mutate(
        coef = x,
        DE = FDR < 0.05
      ) %>% 
      dplyr::select(
        symbol, logFC, logCPM, PValue, FDR, DE, everything()  
      )

  }, simplify = FALSE)

# RUVSeq
#Define the negative control genes (i.e. the least DE genes over all sorl1 genotypes.)
RUVneg <- 
  geneDGE %>% 
  estimateGLMCommonDisp(design) %>%
  estimateGLMTagwiseDisp(design) %>%
  glmFit(design) %>%
  glmLRT(coef = 2:4) %>% # use all 3 sorl1 genotype coefs
  topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>% 
  .$table %>% 
  arrange(desc(PValue)) %>% 
  .[1:5000,] %>% 
  .$gene_id

# Perform the RUVg method with 1 factor of unwanted variation removed, 
RUV_k1 <- geneDGE$counts %>% 
  round %>% 
  RUVg(RUVneg, 1)

# Add the W_1 coef to the meta tibble so a new design matrix can be defined
geneDGE$samples %<>% 
  cbind(RUV_k1$W)

design_RUVk1 <- 
  model.matrix(~Genotype + W_1, data = geneDGE$samples) %>% 
  set_colnames(gsub("Genotype", "", colnames(.)))

## repeat the fit and toptables
fit_RUVk1 <- geneDGE %>% 
  estimateDisp(design_RUVk1) %>% 
  glmFit(design_RUVk1)

topTables_RUVk1 <- design_RUVk1 %>% colnames() %>% .[2:4] %>% 
   sapply(function(x){
    glmLRT(fit_RUVk1, coef = x) %>%
      topTags(n = Inf) %>%
      .[["table"]] %>%
      as_tibble() %>%
      arrange(PValue) %>%
       mutate(
        coef = x,
        DE = FDR < 0.05
      ) %>% 
      dplyr::select(
        symbol, logFC, logCPM, PValue, FDR, DE, everything()  
      )

  }, simplify = FALSE)
ADD REPLY
0
Entering edit mode

Not saying you did anything wrong, but this is quite a chunk of code, and I find it odd to see a read count effect after normalization in bulk RNA-seq dominating a genotype effect. Can you, just for confidence, run a standard PCA on the top1000 most variable genes using the logcounts as input as here in part 2 of this post: https://www.biostars.org/p/461026/ and then show the resulting biplot?

ADD REPLY
0
Entering edit mode

This has the same pattern as the PCA on the TMM normalised counts. (before RUVseq, Fig A in the post). For this particular experiment, it makes sense that the genotype effect is small. I am looking at the effect of mutations which model human Alzheimer's disease mutations in young zebrafish, so effects may be small.

library(PCAtools)
library(matrixStats)

## take the top 1000 most variable genes. What we obtain here is the row index of the most variable genes.
## For this we use rowVars() to get the row-wise variance of the logCPMs and then use head on the 
## decreasingly-ordered output.
logcpms <- cpm(geneDGE, log = TRUE)
top1000.order <- head(order(matrixStats::rowVars(logcpms), decreasing = TRUE), 1000)

## Now run PCA:
## PCAtools:
PCA <- PCAtools::pca(mat = logcpms[top1000.order,])

## add metadata from DGEList (the information which sample belongs to which treatment/kit):
PCA$metadata <-geneDGE$samples

## make a biplot:
bp1 <- 
  PCAtools::biplot(pcaobj = PCA, 
                   colby = "Genotype",
                   shape = "Sex", 
                   legendPosition = "bottom",
                   title = "PCA on the top-1000 most variable logCPMs")

print(bp1)

image

ADD REPLY
3
Entering edit mode
@james-w-macdonald-5106
Last seen 3 days ago
United States

There's a difference between saying 'we did this specific thing to try to remove this specific source of variation' and 'we used RUVg to remove any unobserved variability, and when we looked the factor thus generated is highly correlated with the first PC of the logCPM values, which also happens to be correlated with library size'.

I don't think using RUV or sva to remove unwanted variability is a particularly contentious issue. That said, I also don't think it's particularly unexpected for library size to be correlated with one or more PCs either (the smaller the library size, the lower the counts/gene, and the higher expected variability of the logCPM values, which is what PCA is intended to capture). So I'm not sure I would have used that as a criterion for whether or not there is unexpected variation, given that it's IMO expected variation.

Anyway, this may simply be a framing issue on your part. You by definition didn't specifically try to adjust for the library size by using RUVg. So if you are saying that you did, then you are not describing what you did correctly, and are framing it in a way that makes it seem you might have done something less than ideal. I am not sure about that part - the GLM uses library sizes as the offset to control for differences in library size, and including something in the linear model that also controls for library sizes might be problematic? I don't know.

Although using estimateCommonDisp followed by estimateTagwiseDisp for the negative controls is different from what you did when fitting the model, and people might nit-pick about using the 5000 'bottom' DE genes for the negative controls rather than their favorite number. But to me it seems more of a framing issue than anything else.

ADD COMMENT
1
Entering edit mode

Hi James,

Thanks so much for your advice! I see now how this has come across. You are right that the way I have worded it makes it seem like I used RUVg to normalise for library size, but actually, I used RUVg and this happened to remove the unwanted variation due to library size.

Also, good pickup on the different methods of estimating the dispersions. I guess I've been staring at the code for so long, that I didn't realise I started doing it the another way.

Again, thanks so much for your input! Very grateful! I'm hoping some re-wording in the paper will get it over the line for publication Best wishes Karissa

ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6