Dear all,
I'm using raw HTSeq counts from TCGA RNA-seq data. To normalise the data, I first applied the voom() transformation to convert the counts to log-CPM values.
I used the voom function from the limma package for this. t_index holds the column indices of the samples assigned to condition 1. I found the function below through a Google search.
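library(limma)

vm <- function(x) {
  # 1 for columns listed in t_index, 0 for all other columns
  cond <- factor(ifelse(seq_len(ncol(x)) %in% t_index, 1, 0))
  d <- model.matrix(~ 1 + cond)
  # coerce every row to numeric; t() restores the genes-by-samples layout
  x <- t(apply(x, 1, as.numeric))
  ex <- voom(x, d, plot = FALSE)
  # keep only the logCPM matrix; the voom weights are discarded
  return(ex$E)
}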
I have a couple of questions about the function above and would appreciate an explanation.
Why do I see negative values after normalisation? What type of normalisation does voom apply? Is the normalisation applied across samples?
Any help is appreciated. Thank you.
Sorry, that is not what I wanted. Anyway, I found the answer to my first question here: [Voom on TCGA data shifts count distributions towards negative values]. But I still don't know which normalisation method voom applies. Is it quantile normalisation? And is it applied across samples?
This is also documented in the help page. See the appropriately-named normalize.method argument. The default is "none", which performs no additional normalization after the logCPM transformation.

OK, I see. I'm a little confused by this post [https://www.biostars.org/p/153013/#337075]. The voom function I mentioned in my question was taken from that link, where it is applied to raw counts. They say the function above performs the voom transformation and normalises the data, but I don't know which normalisation method they used.
Sorry, I'm trying to understand it, but I'm still not sure about it.
The use of voom in the context of that question is extraneous. The function is calling voom and then throwing away the weights that it calculated, keeping only the logCPM values. If all you want is the logCPM values, then use the cpm function as I've described in my answer.

Looking at the other code in that answer, I see many other mistakes. For example, they do not use calcNormFactors or any other method to normalize for composition bias, and I would expect significant composition bias to be present in a tumor-vs-normal comparison. I don't think that code is a very good example to base your work on.

Edit: Someone in the comments for that post mentions that the input data may already be quantile-normalized, but it's not clear. In any case, the code doesn't mention anything about that, so I doubt the author is aware of whether the input data has already been normalized.
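For readers following along, here is a minimal sketch of the kind of workflow the comment above points to: logCPM values computed with cpm after calcNormFactors, without going through voom. It is not the code from the linked answer. The count matrix is simulated and the group labels are made up, so both are placeholders for real HTSeq counts and sample annotations. It assumes the edgeR package is installed.

```r
library(edgeR)

# Simulated raw counts (200 genes x 6 samples); a placeholder for real HTSeq counts.
set.seed(1)
counts <- matrix(rnbinom(200 * 6, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

# Hypothetical grouping: first three samples normal, last three tumor.
group <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)        # TMM by default; adjusts for composition bias
logcpm <- cpm(y, log = TRUE)   # log2-CPM based on the effective library sizes
```

If the precision weights are needed later for linear modelling in limma, voom can be run on the same DGEList with a design matrix, in which case it should pick up the same normalization factors.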
Thanks a lot for the information. So, basically, running the voom function from my question on raw HTSeq counts gives logCPM values that can be negative or positive. Are logCPM values normalized expression data?
logCPM values are what they sound like: the logarithm (in base 2) of the counts for that gene divided by the total millions of counts. This normalizes for differences in sequencing depth between samples and nothing else. If you don't provide any further information, that's exactly what you get. If you use calcNormFactors, then the logCPM values will be additionally normalized for composition bias using the method that you chose when running that function. Either way, if you're not going to use the weights in your downstream analysis, then there's no reason to use voom.

Thank you for the explanation!
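As a small worked illustration of the logCPM definition discussed above, and of why negative values appear, the sketch below computes log2-CPM by hand in base R on a made-up 4 x 2 count matrix. The offsets of 0.5 on the counts and 1 on the library size follow what voom does internally as far as I know; treat that detail as an assumption and check the voom help page for your limma version.

```r
# Toy count matrix: 4 genes (rows) x 2 samples (columns); all values are made up.
counts <- matrix(c(0,      0,
                   5,      12,
                   100,    250,
                   999895, 1999738),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

lib.size <- colSums(counts)   # total counts per sample: 1e6 for s1, 2e6 for s2

# log2 counts-per-million with small offsets so that zero counts stay finite
logcpm <- t(log2(t(counts + 0.5) / (lib.size + 1) * 1e6))

print(round(logcpm, 3))
```

Any gene with fewer than one count per million in a sample ends up with a CPM below 1, and the base-2 logarithm of a number below 1 is negative. Here gene1 has zero counts in both samples, so its logCPM is negative, while the other genes are positive.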