I'm running an RNA-seq analysis comparing stimulated vs. unstimulated samples and have read counts for, let's say, 5000 genes. I would like to use RUVg with a set of negative control genes chosen as the least DE genes between the two groups (following the RUVSeq vignette). Below is my code for checking how the PCA looks when the 100 least DE genes are used as controls (the data set with filtered reads is named set):
top <- topTags(tr, n = nrow(set))$table
empirica_100 <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:4900]))]
set2_100 <- RUVg(set, empirica_100, k = 1)
plotPCA(set2_100, col=col.cell, cex=1)
I actually do not see much difference between the PCA plots before and after normalization.
I have some naïve questions that are confusing for me as a beginner on this topic:
Am I right in understanding that with the line below I am normalizing the whole matrix against the 100 least DE genes (i.e. the ones that are supposed to be non-DE)? I went through the RUVg guidelines and I am just not sure that this is what the code does:
empirica_100 <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:4900]))]
How does selecting the least DE genes compare with simply using a set of housekeeping genes, in terms of adjusting for unknown variation? I looked at those 100 least DE genes and could not spot any of the well-known stable housekeeping genes (e.g. beta-actin or RPL13).
- If I normalize the counts with RUVg, can I use them for DE analysis? I know the vignette says the normalized counts are only for exploration, but then how does the normalization affect the DE analysis? I am using edgeR. My understanding was that the normalization produces counts to be used for DE analysis.
- Do I always need to apply quantile normalization before running RUVg? The vignette does that. Is it only for illustration, or is it required before the RUVg normalization?
Thanks in advance
I was working with these a few weeks ago.
2) I would recommend using a set of general housekeeping genes instead of running a DE analysis and taking the least differentially expressed genes as controls. That approach is strongly biased: you want to remove unwanted variation, but you are selecting genes that may appear non-DE precisely because of that unwanted variation. You could use this list:
# Download the housekeeping genes
HK_genes <- read.table(url("https://m.tau.ac.il/~elieis/HKG/HK_genes.txt"))
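Once the list is loaded, it still has to be matched against the genes present in your data before it can be passed to RUVg. A minimal sketch, assuming the first column of HK_genes (V1) holds gene symbols and that rownames(set) are gene symbols as well; the toy objects below are hypothetical stand-ins for the real ones:

```r
# Toy stand-ins (hypothetical): replace with the real HK_genes and rownames(set)
HK_genes <- data.frame(V1 = c("ACTB ", "RPL13", "GAPDH", "PSMB2"),
                       V2 = c("id1", "id2", "id3", "id4"),
                       stringsAsFactors = FALSE)
gene_ids <- c("ACTB", "IL6", "RPL13", "TNF", "CXCL8")  # stands in for rownames(set)

# Strip stray whitespace, then keep only housekeeping genes present in the data
hk_symbols <- trimws(HK_genes$V1)
controls <- intersect(gene_ids, hk_symbols)
print(controls)

# With the real objects, the control set would then be used as, e.g.:
# set_hk <- RUVg(set, controls, k = 1)
```

If your row names are Ensembl or Entrez IDs rather than symbols, they would need to be mapped to a common identifier first.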
3) Most DE methods take counts as input, so check the help page or vignette of each method to be sure which input it expects.
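For edgeR specifically, the RUVSeq vignette suggests keeping the original counts and adding the estimated factor(s) of unwanted variation as covariates in the design, rather than testing on the normalized counts. A rough sketch on simulated data: counts, x and W_1 are made-up stand-ins (with real data they would be counts(set), your condition factor and pData(set2_100)$W_1), and the edgeR part only runs if edgeR is installed:

```r
set.seed(1)
# Simulated stand-ins (hypothetical): 200 genes x 6 samples
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
x   <- factor(rep(c("unstim", "stim"), each = 3), levels = c("unstim", "stim"))
W_1 <- c(-0.5, 0.1, 0.4, -0.3, 0.5, -0.2)  # stands in for pData(set2_100)$W_1

# Condition of interest plus the unwanted-variation factor as a covariate
design <- model.matrix(~ x + W_1)
print(colnames(design))

if (requireNamespace("edgeR", quietly = TRUE)) {
  y   <- edgeR::DGEList(counts = counts, group = x)  # original, un-adjusted counts
  y   <- edgeR::calcNormFactors(y)
  y   <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmFit(y, design)
  lrt <- edgeR::glmLRT(fit, coef = 2)  # coef 2: stim vs unstim, adjusted for W_1
  print(edgeR::topTags(lrt, n = 5))
}
```

With k > 1 the extra factors (W_2, ...) would be added to the formula in the same way.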
I would also loop over several values of k (e.g. with a for loop) to see which one gives the best separation or clustering by condition; if it does not produce good clustering it is basically useless. But be aware that the bigger the k, the harsher the transformation and the more the result will differ from the original data. It is quite an aggressive method, which can even collapse all points of the same condition onto a single point in the PCA (you can check this by using a large enough k).
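To illustrate the effect of k without depending on any package, here is a rough base-R sketch on simulated log-scale data. The function remove_k is a simplified, hypothetical stand-in for the idea behind RUVg (estimate factors from the control genes only, then regress them out of all genes); it is not the package's code, and the data, the control set and the separation score are all assumptions made for the example:

```r
set.seed(42)
n_genes <- 500; n_samp <- 8
condition <- rep(c("unstim", "stim"), each = 4)
batch     <- rep(c(0, 1), times = 4)   # hypothetical unwanted factor

# Simulated log-scale expression: noise plus a batch effect on every gene
logY <- matrix(rnorm(n_genes * n_samp, mean = 5, sd = 0.3), nrow = n_genes)
logY <- logY + outer(rnorm(n_genes, sd = 1), batch)
# Condition effect on the first 50 genes only; the last 100 serve as controls
logY[1:50, condition == "stim"] <- logY[1:50, condition == "stim"] + 2
controls <- 401:500

# Simplified RUVg-like step: factors from control genes, removed from all genes
remove_k <- function(logY, controls, k) {
  Y  <- t(logY)                                   # samples x genes
  Yc <- scale(Y, center = TRUE, scale = FALSE)    # center each gene
  W  <- svd(Yc[, controls])$u[, seq_len(k), drop = FALSE]
  alpha <- solve(t(W) %*% W, t(W) %*% Y)
  t(Y - W %*% alpha)                              # back to genes x samples
}

# Crude separation score: gap between condition means on PC1, in SD units
separation <- function(mat) {
  pc1 <- prcomp(t(mat))$x[, 1]
  unname(abs(diff(tapply(pc1, condition, mean))) / sd(pc1))
}

total_var <- sum(apply(logY, 1, var))
results <- do.call(rbind, lapply(1:4, function(k) {
  adj <- remove_k(logY, controls, k)
  data.frame(k = k,
             var_removed = 1 - sum(apply(adj, 1, var)) / total_var,
             separation  = separation(adj))
}))
print(results)
```

With real data the same loop would call RUVg(set, controls, k = k) and plotPCA on each result instead; the share of variance removed can only grow with k, which is why large values should be treated with caution.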