I'm attempting to identify differentially expressed repeats (in this case those of the LTR family) in human genome 38. The data I am working with is 4sU-seq data, which essentially captures RNAs that are produced within a 10 minute time period. The data however, contains many intronic and exonic reads that we are interested in, and thus is aligned using a non-splice aware aligner (bowtie2).
I am using RUVSeq + edgeR to perform differential expression of these repeats after counting using FeatureCounts. There are approximately 750k repeats, however not all of them show expression.
Because I do not have a set of spikeins nor do I know prior if any of these 'loci' are not influenced by my knock-out, I wanted to identify empirical control 'genes' using a first-pass DE analysis prior to RUVg normalization as described in the RUVSeq vignette. However, in the vignette, they consider all but the top 5,000 genes as ranked by edgeR p-values as empirical genes, and i'm not entirely sure this number will work on my dataset, but i'm also not exactly sure what number to use or how to figure out which number to use. I'm here looking for a bit of help:
Here is my current script (I chose everything but the top 30k repeats as empirical loci):
#!/usr/bin/env Rscript # <PAPER> DGE Script # Authors: Carlos Guzman, Curtis Bacon # 4sU-seq # Command line processing for arguments args <- commandArgs(trailingOnly=TRUE) args < args[-1] # Load library if (!require("RUVSeq")){ source("http://bioconductor.org/biocLite.R") biocLite("RUVSeq", suppressUpdates=TRUE) library("RUVSeq") } # Read in count data from all files into a dataframe # Row names are geneids temp <- lapply(args, read.delim, skip=1, header=TRUE) # Merge into a single data frame merge.all <- function(x, y) { merge(x, y, all=TRUE, by="Geneid") } data <- data.frame(Reduce(merge.all, temp)) # Set GeneID as row name rownames(data) <- data[,1] data[,1] <- NULL data <- data[,c(6, 12, 18, 24)] # Filter out non-expressed genes, require more than 5 reads in at least two samples for each gene (after filtering there are only 141k of the 750k repeats left) filter <- apply(data, 1, function(x) length(x[x>5])>=2) filtered <- data[filter,] # Design and normalization using EdgeR x <- as.factor(rep(c("Ctrl", "KO"), each=2)) set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(x, row.names=colnames(filtered))) set <- betweenLaneNormalization(set, which="upper") # normalize data using upper quartile normalization # No ERCC spikeins so we obtain a set of empirical negative controls (i.e. the least significantly DE genes based on a first-pass DE analysis prior to normalization) design <- model.matrix(~x, data=pData(set)) y <- DGEList(counts=counts(set), group=x) y <- calcNormFactors(y, method="upperquartile") y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit, coef=2) top <- topTags(lrt, n=nrow(set))$table empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:30000]))] set2 <- RUVg(set, empirical, k=1) # DGE design <- model.matrix(~x + W_1, data=pData(set2)) y <- DGEList(counts=counts(set2), group=x) y <- calcNormFactors(y, method="upperquartile") y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit, coef=2) top.df <- topTags(lrt, n=Inf, sort.by="logFC", p.value = 0.05) write.table(top.df, "LTR.4su.dgeList.txt", row.names = T, quote = F, sep = '\t')
Forgive the terrible code, i'm fairly new. Any advice would be appreciated.
This is sort of what I expected. Definetely not all repeats are influenced by the KO, and I do have access to KO gene expression RNA-seq data, I could potentially look at that.
Am I true in my understanding that with this formula (
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
will normalize the whole matrix against the matrix except for the top 5000 DE genes ? Actually peoples argue that event those least DE genes might be having a stable expression because of the unwanted effect. How about selction of a set of house keeping genes ?