Question

How to determine a set of empirical control 'genes' using RUVSeq

0

Entering edit mode

cguzman.roma • 0

@cguzmanroma-8981

Last seen 8.1 years ago

United States

I'm attempting to identify differentially expressed repeats (in this case those of the LTR family) in human genome 38. The data I am working with is 4sU-seq data, which essentially captures RNAs that are produced within a 10 minute time period. The data however, contains many intronic and exonic reads that we are interested in, and thus is aligned using a non-splice aware aligner (bowtie2).

I am using RUVSeq + edgeR to perform differential expression of these repeats after counting using FeatureCounts. There are approximately 750k repeats, however not all of them show expression.

Because I do not have a set of spikeins nor do I know prior if any of these 'loci' are not influenced by my knock-out, I wanted to identify empirical control 'genes' using a first-pass DE analysis prior to RUVg normalization as described in the RUVSeq vignette. However, in the vignette, they consider all but the top 5,000 genes as ranked by edgeR p-values as empirical genes, and i'm not entirely sure this number will work on my dataset, but i'm also not exactly sure what number to use or how to figure out which number to use. I'm here looking for a bit of help:

Here is my current script (I chose everything but the top 30k repeats as empirical loci):

#!/usr/bin/env Rscript

# <PAPER> DGE Script
# Authors: Carlos Guzman, Curtis Bacon
# 4sU-seq

# Command line processing for arguments
args <- commandArgs(trailingOnly=TRUE)
args < args[-1]

# Load library
if (!require("RUVSeq")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("RUVSeq", suppressUpdates=TRUE)
  library("RUVSeq")
}

# Read in count data from all files into a dataframe
# Row names are geneids
temp <- lapply(args, read.delim, skip=1, header=TRUE)

# Merge into a single data frame
merge.all <- function(x, y) {
  merge(x, y, all=TRUE, by="Geneid")
}
data <- data.frame(Reduce(merge.all, temp))

# Set GeneID as row name
rownames(data) <- data[,1]
data[,1] <- NULL

data <- data[,c(6, 12, 18, 24)]

# Filter out non-expressed genes, require more than 5 reads in at least two samples for each gene (after filtering there are only 141k of the 750k repeats left)
filter <- apply(data, 1, function(x) length(x[x>5])>=2)
filtered <- data[filter,]

# Design and normalization using EdgeR
x <- as.factor(rep(c("Ctrl", "KO"), each=2))
set <- newSeqExpressionSet(as.matrix(filtered),
                           phenoData = data.frame(x, row.names=colnames(filtered)))
set <- betweenLaneNormalization(set, which="upper") # normalize data using upper quartile normalization

# No ERCC spikeins so we obtain a set of empirical negative controls (i.e. the least significantly DE genes based on a first-pass DE analysis prior to normalization)
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)

top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:30000]))]

set2 <- RUVg(set, empirical, k=1)

# DGE
design <- model.matrix(~x + W_1, data=pData(set2))
y <- DGEList(counts=counts(set2), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)

top.df <- topTags(lrt, n=Inf, sort.by="logFC", p.value = 0.05)

write.table(top.df, "LTR.4su.dgeList.txt", row.names = T, quote = F, sep = '\t')

Forgive the terrible code, i'm fairly new. Any advice would be appreciated.

RUVSeq • 2.5k views

ADD COMMENT • link updated 2.8 years ago by Mohamed ▴ 30 • written 8.1 years ago by cguzman.roma • 0

score 2 · Accepted Answer · 2017-03-01

2

Entering edit mode

davide risso ▴ 980

@davide-risso-5075

Last seen 13 months ago

University of Padova

Of course 5000 should not be considered as a one-fits-all number. We just empirically found that that number of genes was OK in that dataset.

The first question is whether there are stable repeats to start with. Could it be that all the repeats are influenced by the KO?

Assuming the answer is no, I'd suggest that you perform some exploratory data analysis in order to get a better sense of which elements (if any) can be used as negative controls. For instance, you could start from the bottom 1000 or so and check whether using only those repeats your samples cluster in KO vs control. And potentially you can increase that number to include as many genes as possible that won't make your samples cluster by biology.

Also, do you have access to gene expression? A simpler alternative would be to look for empirical controls among exons/genes rather than repeats (as it may be easier to find stable genes than stable repeats).

These are my two cents.

ADD COMMENT • link 8.1 years ago davide risso ▴ 980

0

Entering edit mode

This is sort of what I expected. Definetely not all repeats are influenced by the KO, and I do have access to KO gene expression RNA-seq data, I could potentially look at that.

ADD REPLY • link 8.1 years ago cguzman.roma • 0

0

Entering edit mode

Am I true in my understanding that with this formula (empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))] will normalize the whole matrix against the matrix except for the top 5000 DE genes ? Actually peoples argue that event those least DE genes might be having a stable expression because of the unwanted effect. How about selction of a set of house keeping genes ?

ADD REPLY • link 2.8 years ago Mohamed ▴ 30