Question

Confused about using RUVg and selecting negative control genes

Mohamed ▴ 30

I'm running an RNA-seq analysis comparing stimulated vs unstimulated samples and have read counts for, let's say, 5000 genes. I would like to use RUVg with a group of negative control genes chosen as the ones that are least DE between the two groups (following the RUVSeq vignette). Below is my code for checking how the PCA looks when using the 100 least DE genes (the data set with filtered reads is named set):

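# Full ranking of all genes from the edgeR test result `tr`
top <- topTags(tr, n = nrow(set))$table
# Genes that are NOT among the 4900 top-ranked ones, i.e. the 100 least DE
empirica_100 <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:4900]))]
set2_100 <- RUVg(set, empirica_100, k = 1)
plotPCA(set2_100, col = col.cell, cex = 1)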

I actually do not see much of a difference between the PCA plots before and after normalization.

I have some naïve questions that confuse me as a beginner on this topic:

1) Am I right in my understanding that with the code below I am normalizing the whole matrix against the 100 least DE genes (the ones that are supposed to be non-DE)? I went through the RUVg guidelines and am just not sure that this is what the code does:

empirica_100 <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:4900]))]

2) How does selecting the least DE genes compare with just using a set of housekeeping genes, in terms of adjusting for unknown variation? I looked through those 100 least DE genes and could not spot any of the known stable housekeeping genes (e.g. beta-actin or RPL13).

3) If I normalize the counts using RUVg, can I use them for the DE analysis? I know the vignette says the normalized counts are only for exploration, but how would the normalization then affect the DE analysis? I am using edgeR. My understanding was that normalization produces counts to be used for DE analysis.

4) Do I always need to do quantile normalization before performing the RUVg normalization? The vignette did that. Is that just for illustration, or do I need to do it before the RUVg normalization?

Thanks in advance

RUVnormalize R edgeR RUVSeq RUVnormalizeData • 3.7k views

ADD COMMENT • link 3.4 years ago Mohamed ▴ 30

I was working with these a few weeks ago.

2) I would recommend using a set of general housekeeping genes instead of running a DE analysis and extracting the least differentially expressed genes as controls. The latter is heavily biased: you want to remove unwanted variation, but the genes you pick may appear non-DE precisely because of that unwanted variation. You could use this list:

# Download the list of housekeeping genes
HK_genes <- read.table(url("https://m.tau.ac.il/~elieis/HKG/HK_genes.txt"))
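
RUVg looks up control genes by row name, so only the housekeeping genes that are actually present in the count matrix can be used as controls. A minimal sketch on toy data, assuming the first column of the list (V1) holds gene symbols and that the rows of the data are named by gene symbol; `hk_toy` and `counts_toy` are hypothetical stand-ins, not the real list or data:

```r
# Toy stand-in for the downloaded list; assumption: column V1 holds gene symbols.
hk_toy <- data.frame(V1 = c("ACTB", "RPL13", "GAPDH", "PSMB2"),
                     V2 = c("id1", "id2", "id3", "id4"),
                     stringsAsFactors = FALSE)

# Toy count matrix whose row names are gene symbols (hypothetical data).
set.seed(1)
counts_toy <- matrix(rpois(5 * 4, lambda = 50), nrow = 5,
                     dimnames = list(c("ACTB", "GAPDH", "IL6", "TNF", "CXCL8"),
                                     paste0("S", 1:4)))

# Keep only the housekeeping genes that are actually rows of the data.
hk_in_data <- intersect(hk_toy$V1, rownames(counts_toy))
hk_in_data
```

The resulting character vector could then be passed as the control-gene argument of RUVg in place of the empirically selected genes.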

3) Most DE methods take counts as input, so check the help page or vignette of each method to be sure which inputs it accepts.

I would also loop over several values of k to check which one gives the best separation or clustering by condition; if it does not produce good clustering, it is basically useless. But be aware that the larger the k, the harsher the transformation and the more the result differs from the original data. It is quite an aggressive method, which can even collapse all samples of the same condition onto a single point in the PCA (you can check this by using a large enough k).
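
A minimal sketch of such a loop, assuming the RUVSeq package is installed. The counts are simulated, the control genes are an arbitrary stand-in, and the PCA is drawn with base R's prcomp rather than the plotPCA call from the question, so all object names here are hypothetical:

```r
library(RUVSeq)  # provides RUVg()

set.seed(1)
# Simulated counts: 200 genes x 6 samples (hypothetical data).
counts_toy <- matrix(rnbinom(200 * 6, mu = 100, size = 5), nrow = 200,
                     dimnames = list(paste0("g", 1:200), paste0("S", 1:6)))
group <- factor(rep(c("unstim", "stim"), each = 3))
controls <- rownames(counts_toy)[101:200]  # stand-in control genes

# Run RUVg (matrix method) for k = 1, 2, 3.
fits <- lapply(1:3, function(k) RUVg(counts_toy, controls, k = k))

# One PCA plot of the log normalized counts per value of k, coloured by group.
for (k in seq_along(fits)) {
  pca <- prcomp(t(log1p(fits[[k]]$normalizedCounts)))
  plot(pca$x[, 1:2], col = as.integer(group), pch = 19,
       main = paste("RUVg, k =", k))
}
```

Comparing the plots side by side shows how quickly the samples of one condition start to collapse as k grows.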

ADD REPLY • link 3.4 years ago Pau • 0

score 0 · Answer 1 · 2022-06-24

Thanks for your answer. Can you comment on my first question: am I right that with this selection I am normalizing against the 100 least DE genes? Also, the list you provided is a big one (3804 genes). Shall I use all of them, or only those that are present in my data? Are these genes condition-dependent? People usually use far fewer genes, specific to their condition or tissue rather than a general list.
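
On the first question, a small base-R check on toy data suggests what the selection expression does. It assumes the table is sorted by increasing p-value (the default ordering of topTags); `top_toy` and the gene names are hypothetical:

```r
# Toy ranking standing in for topTags(...)$table: rows sorted by increasing p-value.
set.seed(1)
genes <- paste0("g", 1:5000)
top_toy <- data.frame(PValue = sort(runif(5000)), row.names = sample(genes))

# The expression from the question: genes NOT among the 4900 top-ranked rows.
empirical_q <- genes[which(!(genes %in% rownames(top_toy)[1:4900]))]

# Same gene set without hard-coding the total number of genes: the last 100 rows.
empirical_tail <- tail(rownames(top_toy), 100)

setequal(empirical_q, empirical_tail)
```

Both expressions pick the 100 rows at the bottom of the ranking, i.e. the genes with the largest p-values, and these are the genes RUVg then uses to estimate the unwanted factors.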

Thanks

score 0 · Answer 2 · 2022-06-24

1) If you test for differential expression and then extract the genes with high p-values, it can simply be that these genes are lowly expressed or very noisy rather than reliably "non-DE". It would be much better to use DESeq2 with the lessAbs test, as this enriches for genes with decent counts and low variation, so basically the best data-driven "housekeepers". If you have two groups, do the pairwise comparison with that test; if you have more than two, do all possible comparisons and intersect the significantly non-differential genes to get a good list of controls. I usually set the lfc threshold in DESeq2 to around log2(1.5) and use an FDR of 0.1. I think there is no "official" analog to this in edgeR/limma, but there are some suggestions on how to enrich for non-DE genes if you browse the support site a bit.

2) See 1). I would always do this in a data-driven way rather than using any predefined "housekeepers". Maybe this is my inner paranoia, but just because some genes are stable in some cells does not mean that this is true for your experimental system.
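
A minimal sketch of the lessAbs approach for two groups, assuming DESeq2 is installed. It runs on a simulated data set from DESeq2's own helper, so `dds` and the resulting `controls` are only illustrative:

```r
library(DESeq2)

set.seed(1)
# Simulated two-group data set (condition A vs B) from a DESeq2 helper function.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds <- DESeq(dds, quiet = TRUE)

# Null hypothesis: |log2FC| >= log2(1.5). Small adjusted p-values flag genes
# whose fold change is credibly smaller than that threshold.
res <- results(dds, lfcThreshold = log2(1.5), altHypothesis = "lessAbs",
               alpha = 0.1)

# Candidate negative controls: significantly non-differential genes.
controls <- rownames(res)[which(res$padj < 0.1)]
length(controls)
```

With more than two groups, the same call would be repeated per pairwise contrast and the resulting gene sets intersected.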

3) If you want to use the RUVg output, include the factors of unwanted variation it returns in the experimental design you pass to edgeR. Here is a case study from the DESeq2 developer using RUVSeq with DESeq2, which you can easily adapt to edgeR: https://github.com/mikelove/preNivolumabOnNivolumab/blob/main/preNivolumabOnNivolumab.knit.md
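
For edgeR, this could look roughly like the sketch below, which follows the pattern of the RUVSeq vignette. It assumes RUVSeq and edgeR are installed; the counts are simulated and the control genes are an arbitrary stand-in, so every object name is hypothetical:

```r
library(RUVSeq)  # RUVg(); also attaches EDASeq for newSeqExpressionSet()
library(edgeR)

set.seed(1)
# Simulated counts: 500 genes x 6 samples (hypothetical data).
counts_toy <- matrix(rnbinom(500 * 6, mu = 100, size = 5), nrow = 500,
                     dimnames = list(paste0("g", 1:500), paste0("S", 1:6)))
x <- factor(rep(c("unstim", "stim"), each = 3), levels = c("unstim", "stim"))
controls <- rownames(counts_toy)[401:500]  # stand-in negative controls

set <- newSeqExpressionSet(counts_toy,
         phenoData = data.frame(x, row.names = colnames(counts_toy)))
set1 <- RUVg(set, controls, k = 1)  # adds the column W_1 to pData(set1)

# The original counts go into edgeR; W_1 enters the design as a covariate.
design <- model.matrix(~ x + W_1, data = pData(set1))
y <- DGEList(counts = counts(set1), group = x)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)  # coefficient 2: stim vs unstim
topTags(lrt)
```

The RUVg-normalized counts are not used for testing here; only the estimated factor W_1 is carried into the model.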

4) No, not at all. QN is something commonly done for microarrays. See the link in 3): either do what he does there, or just pass the output of edgeR::cpm(y, log=TRUE) to RUVg with isLog set to TRUE.
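
A minimal sketch of the log-CPM route, assuming RUVSeq and edgeR are installed; the counts and control genes are simulated stand-ins and the object names are hypothetical:

```r
library(RUVSeq)  # provides RUVg()
library(edgeR)   # provides DGEList(), calcNormFactors(), cpm()

set.seed(1)
# Simulated counts: 500 genes x 6 samples (hypothetical data).
counts_toy <- matrix(rnbinom(500 * 6, mu = 100, size = 5), nrow = 500,
                     dimnames = list(paste0("g", 1:500), paste0("S", 1:6)))
controls <- rownames(counts_toy)[401:500]  # stand-in negative controls

y <- DGEList(counts = counts_toy)
y <- calcNormFactors(y)
logcpm <- cpm(y, log = TRUE)  # library-size normalized log2 CPM

# isLog = TRUE tells RUVg that the input is already on the log scale.
ruv <- RUVg(logcpm, controls, k = 1, isLog = TRUE)
head(ruv$W)                # estimated factor, one row per sample
dim(ruv$normalizedCounts)  # adjusted log values, same shape as logcpm
```

The matrix in `ruv$W` is what would go into the design, while `ruv$normalizedCounts` is mainly useful for PCA and other exploratory plots.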

Does that make sense to you?